Re: strange behavior of pg_trgm's similarity function
От | Merlin Moncure |
---|---|
Тема | Re: strange behavior of pg_trgm's similarity function |
Дата | |
Msg-id | CAHyXU0xndWEMb0+3J9V1LFXzuNaWnm_4aWGW8NC4ZBOsaT3SEQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: strange behavior of pg_trgm's similarity function (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Ответы |
Re: strange behavior of pg_trgm's similarity function
|
Список | pgsql-hackers |
On Thu, Oct 10, 2013 at 7:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 10.10.2013 15:03, Fujii Masao wrote: >> >> Hi, >> >> The behavior of pg_trgm's similarity function seems strange. Is this >> intentional? >> >> I was thinking that the following three calls of the similarity function >> return >> the same number because the second argument is just the three characters >> contained in the first argument in every calls. >> >> =# SELECT similarity('12345', '123'); >> =# SELECT similarity('12345', '234'); >> =# SELECT similarity('12345', '345'); >> >> But that's not true. Each returns the different number. >> >> =# SELECT similarity('12345', '123'); >> similarity >> ------------ >> 0.428571 >> (1 row) >> >> =# SELECT similarity('12345', '234'); >> similarity >> ------------ >> 0.111111 >> (1 row) >> >> =# SELECT similarity('12345', '345'); >> similarity >> ------------ >> 0.25 >> (1 row) >> >> This happens because, for example, similarity('12345', '123') returns >> the similarity number of '**12345*' and '**123*' (* means the blank >> character), >> NOT '12345' and '123'. IOW, two and one blank characters are added into >> the heading and tailing of each argument, respectively. I wonder why >> pg_trgm's similarity function works in this way. We should change this >> so that no blank characters are added into the arguments? > > > Well, you could also argue that "111111" and "222222" are quite similar, > even though pg_trgm's similarity will not think so. It comes down to the > definition of similarity, and how well that definition matches your > intuition. > > FWIW, it feels right to me that a match in the beginning of a word is worth > more than one in the middle of a string. -1 on changing that. I'm not so sure that the assumption that leading trigrams should effectively weight > 3x is a good one to build into the library. However, the behavior is clearly documented and can't be changed. I think you'd need to improvise an alternate set of "trigram ops" if you wanted to rig an alternate matching behavior. merlin
В списке pgsql-hackers по дате отправления: