Adding similarity() and similarity_op(), '%' to pg_bigm (Pgbigm-hackers) - pg_bigm

On Thu, Oct 3, 2013 at 3:48 PM, Beena Emerson <memis****@gmail*****> wrote:
> On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> wrote:
>>
>> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****>
>> wrote:
>> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****>
>> > wrote:
>> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson
>> >> <memis****@gmail*****> wrote:
>> >>> Please find attached the updated patch to rebase against the new HEAD
>> >>> and
>> >>> also implements the comments I had given before.
>>
>> I extracted the similarity function code from the patch as the separate
>> one
>> so that we can more easily review and commit it. I'd like to work on this
>> first.
>> After committing it, I'd like to work on the remaining similarity search
>> code.
>>
>> Attached patch just implements pg_bigm version of similarity function.
>> This is in WIP yet. The description of bigm_similarity function must be
>> added into the document. The regression test must be updated.
>>
>>
>> While reviewing the similartiy function, I found that there is one big
>> problem
>> in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but
>> pg_trgm
>> version of similarity function is not. Please see the following example:
>>
>> =# select similarity('wow', 'WOW');
>>  similarity
>> ------------
>>           1
>> (1 row)
>>
>> =# select bigm_similarity('wow', 'WOW');
>>  bigm_similarity
>> -----------------
>>                0
>> (1 row)
>>
>> Should we implement the *case-insensitive* bigm_similarity()?
>>
>
> The pg_bigm code is case sensitive, even the show_bigm and show_trgm behave
> differently.
>
> =# SELECT show_bigm ('ABC');
>      show_bigm
> -------------------
>  {" A",AB,BC,"C "}
> (1 row)
>
> =# SELECT show_trgm ('ABC');
>         show_trgm
> -------------------------
>  {"  a"," ab",abc,"bc "}
> (1 row)
>
> This is because of the difference in code of generate_trgm and
> generate_bigm.
>
> fn generate_trgm: trgm_op.c ln 213 - 228
>  while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen))
> != NULL)
>         {
> #ifdef IGNORECASE
>                 bword = lowerstr_with_len(bword, eword - bword);
>                 bytelen = strlen(bword);
> #else
>                 bytelen = eword - bword;
> #endif
>
>                 memcpy(buf + LPADDING, bword, bytelen);
>
> #ifdef IGNORECASE
>                 pfree(bword);
> #endif
>                 buf[LPADDING + bytelen] = ' ';
>                 buf[LPADDING + bytelen + 1] = ' ';
> ....
>
> fn generate_bigm: bigm_op.c ln 247 - 253
>  while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen))
> != NULL)
>         {
>                 bytelen = eword - bword;
>                 memcpy(buf + LPADDING, bword, bytelen);
>
>                 buf[LPADDING + bytelen] = ' ';
>                 buf[LPADDING + bytelen + 1] = ' ';
> ....
>
> Since similarity function uses this generate_bigm to get the bigrams and
> then compare it, there is a difference in behavior.
>
> So the way to make similarity function case-insensitive would be to change
> generate_bigm and not the similarity code itself. Also, the change will make
> the show_bigm function behave differently.

Yes, generate_bigm would need to be updated to make bigm_similarity
case-sensitive.

*From a user point of view*, bigm_similarity() and upcoming similarity
search should be case-sensitive? If yes, we should change generate_bigm,
but its change must not affect the behavior of the full-text search at all.

Or we should just implement both case-sensitive and -insensitive
bigm_similarity() and similarity search?

Thought?

Regards,

-- 
Fujii Masao

pg_bigm
Fork

[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

pg_bigm Fork

[Pgbigm-hackers] Adding similarity() and similarity_op(), '%' to pg_bigm

pg_bigm
Fork