Beena Emerson
memis****@gmail*****
2013年 10月 3日 (木) 15:48:26 JST
On Thu, Oct 3, 2013 at 7:26 AM, Fujii Masao <masao****@gmail*****> wrote: > On Mon, Sep 30, 2013 at 12:34 PM, Amit Langote <amitl****@gmail*****> > wrote: > > On Mon, Sep 30, 2013 at 12:22 PM, Fujii Masao <masao****@gmail*****> > wrote: > >> On Sat, Sep 28, 2013 at 6:16 PM, Beena Emerson <memis****@gmail*****> > wrote: > >>> Please find attached the updated patch to rebase against the new HEAD > and > >>> also implements the comments I had given before. > > I extracted the similarity function code from the patch as the separate one > so that we can more easily review and commit it. I'd like to work on this > first. > After committing it, I'd like to work on the remaining similarity search > code. > > Attached patch just implements pg_bigm version of similarity function. > This is in WIP yet. The description of bigm_similarity function must be > added into the document. The regression test must be updated. > > While reviewing the similartiy function, I found that there is one big > problem > in bigm_similarity(). That is, bigm_similarity() is case-sensitive, but > pg_trgm > version of similarity function is not. Please see the following example: > > =# select similarity('wow', 'WOW'); > similarity > ------------ > 1 > (1 row) > > =# select bigm_similarity('wow', 'WOW'); > bigm_similarity > ----------------- > 0 > (1 row) > > Should we implement the *case-insensitive* bigm_similarity()? > > The pg_bigm code is case sensitive, even the show_bigm and show_trgm behave differently. =# SELECT show_bigm ('ABC'); show_bigm ------------------- {" A",AB,BC,"C "} (1 row) =# SELECT show_trgm ('ABC'); show_trgm ------------------------- {" a"," ab",abc,"bc "} (1 row) This is because of the difference in code of generate_trgm and generate_bigm. fn generate_trgm: trgm_op.c ln 213 - 228 while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL) { #ifdef IGNORECASE bword = lowerstr_with_len(bword, eword - bword); bytelen = strlen(bword); #else bytelen = eword - bword; #endif memcpy(buf + LPADDING, bword, bytelen); #ifdef IGNORECASE pfree(bword); #endif buf[LPADDING + bytelen] = ' '; buf[LPADDING + bytelen + 1] = ' '; .... fn generate_bigm: bigm_op.c ln 247 - 253 while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL) { bytelen = eword - bword; memcpy(buf + LPADDING, bword, bytelen); buf[LPADDING + bytelen] = ' '; buf[LPADDING + bytelen + 1] = ' '; .... Since similarity function uses this generate_bigm to get the bigrams and then compare it, there is a difference in behavior. So the way to make similarity function case-insensitive would be to change generate_bigm and not the similarity code itself. Also, the change will make the show_bigm function behave differently. Beena Emerson -------------- next part -------------- HTML$B$NE:IU%U%!%$%k$rJ]4I$7$^$7$?(B...Descargar