Re: pg_trgm
From | Peter Eisentraut |
---|---|
Subject | Re: pg_trgm |
Date | |
Msg-id | 1274983261.18581.14.camel@vanquo.pezone.net |
In response to | Re: pg_trgm (Tatsuo Ishii <ishii@postgresql.org>) |
Responses | Re: pg_trgm, Re: pg_trgm, Re: pg_trgm |
List | pgsql-hackers |
On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
> > I don't know about Japanese, but the locale approach works just fine for
> > other agglutinative languages.  I would rather suspect that it is the
> > trigram approach that might be rather useless for such languages,
> > because you are going to get a lot of similarity hits for the affixes.
>
> I'm not sure what you mean by "affixes". But I will explain...
>
> A Japanese sentence consists of words. Problem is, each word is not
> separated by space (agglutinative). So most text tools such as text
> search need preprocess which finds word boundaries by looking up
> dictionaries (and smart grammar analysis routine). In the process
> "affixes" can be determined and perhaps removed from the target word
> group to be used for text search (note that removing affixes is no
> relevant to locale). Once we get space separated sentence, it can be
> processed by text search or by pg_trgm just same as English. (Note
> that these preprocessing are done outside PostgreSQL world). The
> difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues.  You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use.  E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E \u043D\u044B');
 similarity
────────────
        NaN
(1 row)
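
One plausible reading of the NaN above, sketched in Python rather than taken from the pg_trgm source: if the locale classifies only ASCII letters and digits as word characters, both Cyrillic strings produce an empty trigram set, and the similarity ratio becomes 0/0. The function names below are hypothetical, and the ASCII-only character test is an assumption standing in for locale C. The padding rule (two spaces before each word, one after) follows the pg_trgm documentation.

```python
import math


def is_word_char_c_locale(ch):
    # Assumption: under locale C only ASCII letters and digits are word characters.
    return ch.isascii() and ch.isalnum()


def trigrams(text):
    # Split the text into words at every non-word character.
    words, current = [], []
    for ch in text.lower():
        if is_word_char_c_locale(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    # Pad each word with two leading spaces and one trailing space,
    # then collect every run of three consecutive characters.
    result = set()
    for word in words:
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    shared = len(ta & tb)
    total = len(ta) + len(tb) - shared
    # In C, a floating-point 0/0 yields NaN; Python would raise instead,
    # so the empty case is modelled explicitly here.
    if total == 0:
        return math.nan
    return shared / total


print(similarity("word", "words"))
print(similarity("\u0441\u043b\u043e\u043d", "\u0441\u043b\u043e \u043d\u044b"))
```

Under these assumptions the ASCII pair gets an ordinary score, while the Russian pair has no word characters at all, so there are no trigrams to compare and the result is NaN regardless of how similar the strings look.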