Re: tsearch2: enable non ascii stop words with C locale
From: Tatsuo Ishii
Subject: Re: tsearch2: enable non ascii stop words with C locale
Date:
Msg-id: 20070213.082314.74752487.t-ishii@sraoss.co.jp
In reply to: Re: tsearch2: enable non ascii stop words with C locale (Teodor Sigaev <teodor@sigaev.ru>)
Responses: Re: tsearch2: enable non ascii stop words with C locale
List: pgsql-hackers
> > Currently tsearch2 does not accept non ascii stop words if locale is
> > C. Included patches should fix the problem. Patches against PostgreSQL
> > 8.2.3.
>
> I'm not sure about correctness of patch's description.
>
> First, p_islatin() function is used only in words/lexemes parser, not
> stop-word code.

I know. My guess is that the parser does not read the stop word file, at least with the default configuration.

> Second, p_islatin() function is used for catching lexemes like URL or HTML
> entities, so it's important to define real latin characters. And it works
> right: it calls p_isalpha (already patched for your case), then it calls
> p_isascii which should be correct for any encodings with C-locale.

The original p_islatin is defined as follows:

static int
p_islatin(TParser *prs)
{
	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}

So if a character is not ASCII, it returns 0 even if p_isalpha returns 1. Is this what you expect?

> Third (and last):
> contrib_regression=# show server_encoding;
>  server_encoding
> -----------------
>  UTF8
> contrib_regression=# show lc_ctype;
>  lc_ctype
> ----------
>  C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
>  lexize
> --------
>  {}
>
> Russian characters with UTF8 take two bytes.

In our case, we added JAPANESE_STOP_WORD into english.stop, then ran:

select to_tsvector(JAPANESE_STOP_WORD)

which returns words even though they are in JAPANESE_STOP_WORD. With the patches, the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan