Fujii Masao
masao****@gmail*****
2013年 10月 3日 (木) 17:41:39 JST
On Thu, Oct 3, 2013 at 3:48 PM, Beena Emerson <memis****@gmail*****> wrote: > On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> wrote: >> >> On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****> >> wrote: >> > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****> >> > wrote: >> >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson >> >> <memis****@gmail*****> wrote: >> >>> Please find attached the updated patch to rebase against the new HEAD >> >>> and >> >>> also implements the comments I had given before. >> >> I extracted the similarity function code from the patch as the separate >> one >> so that we can more easily review and commit it. I'd like to work on this >> first. >> After committing it, I'd like to work on the remaining similarity search >> code. >> >> Attached patch just implements pg_bigm version of similarity function. >> This is in WIP yet. The description of bigm_similarity function must be >> added into the document. The regression test must be updated. >> >> >> While reviewing the similartiy function, I found that there is one big >> problem >> in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but >> pg_trgm >> version of similarity function is not. Please see the following example: >> >> =# select similarity('wow', 'WOW'); >> similarity >> ------------ >> 1 >> (1 row) >> >> =# select bigm_similarity('wow', 'WOW'); >> bigm_similarity >> ----------------- >> 0 >> (1 row) >> >> Should we implement the *case-insensitive* bigm_similarity()? >> > > The pg_bigm code is case sensitive, even the show_bigm and show_trgm behave > differently. > > =# SELECT show_bigm ('ABC'); > show_bigm > ------------------- > {" A",AB,BC,"C "} > (1 row) > > =# SELECT show_trgm ('ABC'); > show_trgm > ------------------------- > {" a"," ab",abc,"bc "} > (1 row) > > This is because of the difference in code of generate_trgm and > generate_bigm. > > fn generate_trgm: trgm_op.c ln 213 - 228 > while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) > != NULL) > { > #ifdef IGNORECASE > bword = lowerstr_with_len(bword, eword - bword); > bytelen = strlen(bword); > #else > bytelen = eword - bword; > #endif > > memcpy(buf + LPADDING, bword, bytelen); > > #ifdef IGNORECASE > pfree(bword); > #endif > buf[LPADDING + bytelen] = ' '; > buf[LPADDING + bytelen + 1] = ' '; > .... > > fn generate_bigm: bigm_op.c ln 247 - 253 > while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) > != NULL) > { > bytelen = eword - bword; > memcpy(buf + LPADDING, bword, bytelen); > > buf[LPADDING + bytelen] = ' '; > buf[LPADDING + bytelen + 1] = ' '; > .... > > Since similarity function uses this generate_bigm to get the bigrams and > then compare it, there is a difference in behavior. > > So the way to make similarity function case-insensitive would be to change > generate_bigm and not the similarity code itself. Also, the change will make > the show_bigm function behave differently. Yes, generate_bigm would need to be updated to make bigm_similarity case-sensitive. *From a user point of view*, bigm_similarity() and upcoming similarity search should be case-sensitive? If yes, we should change generate_bigm, but its change must not affect the behavior of the full-text search at all. Or we should just implement both case-sensitive and -insensitive bigm_similarity() and similarity search? Thought? Regards, -- Fujii Masao