Re: Extending range of to_tsvector et al
| From | john knightley |
|---|---|
| Subject | Re: Extending range of to_tsvector et al |
| Date | |
| Msg-id | CA+nPCM-YXLLSszLW9Q_urCjzwnfkvFJNWYxcsfvvsB86fVJa-A@mail.gmail.com |
| In response to | Re: Extending range of to_tsvector et al (Dan Scott <denials@gmail.com>) |
| Responses | Re: Extending range of to_tsvector et al |
| List | pgsql-hackers |
On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:
> Hi John:
>
> On Sun, Sep 30, 2012 at 11:45 PM, john knightley
> <john.knightley@gmail.com> wrote:
>> Dear Dan,
>>
>> thank you for your reply.
>>
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
>> a utf8 locale
>>
>> A short 5 line dictionary file is sufficient to test:-
>>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>>
>>
>> line 1 "raeuz" Zhuang word written using English letters and show up
>> under ts_vector ok
>> line 2 "我们" uses everyday Chinese word and show up under ts_vector ok
>> line 3 "𦘭𥎵" Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1 which came in about the year 2000 and show up
>> under ts_vector ok
>> line 4 "𪽖𫖂" Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2 which came in about the year 2009 but do not show
>> up under ts_vector ok
>> line 5 "" Zhuang word written using rather old Chinese characters
>> found in PUA area of the font Sawndip.ttf but do not show up under
>> ts_vector ok (Font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>>
>> The last two words even though included in a dictionary do not get
>> accepted by ts_vector.
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
> work using the default text search configuration (albeit with one
> crucial note: I created the database with the "lc_ctype=C
> lc_collate=C" options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
> foobar=# select ts_debug('');
> ts_debug
> ----------------------------------------------------------------
> (word,"Word, all letters",,{english_stem},english_stem,{})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>
>
> foobaz=# SHOW LC_CTYPE;
> lc_ctype
> -------------
> en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('');
> ts_debug
> ---------------------------------
> (blank,"Space symbols",,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround - this database needs to be in
utf8, because the full text search is to be used for a MediaWiki.

Is this a bug that is being worked on?

Regards
John
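
For anyone trying to reproduce the dictionary test above, a small script can show which Unicode block each test character falls in, which may help check whether the failures line up with the Unicode version that introduced the characters. This is only a rough sketch: the `BLOCKS` table and the helpers `classify` and `describe` are hypothetical names written for illustration, the ranges come from the Unicode code charts, and nothing here reflects how the PostgreSQL parser or the system locale actually classifies characters.

```python
# Illustrative sketch only: block ranges come from the Unicode code charts.
# This does NOT model PostgreSQL's text search parser or any libc locale.
BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Extension B (Unicode 3.1)"),
    (0x2A700, 0x2B73F, "CJK Extension C (Unicode 5.2)"),
    (0xE000, 0xF8FF, "Private Use Area"),
]


def classify(ch):
    """Return the name of the listed block containing ch, or 'other'."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def describe(word):
    """Return (code point, block name) pairs for each character in word."""
    return [(f"U+{ord(ch):04X}", classify(ch)) for ch in word]


if __name__ == "__main__":
    # Test words taken from the dictionary file quoted in the message.
    for word in ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂"]:
        print(word, describe(word))
```

Running it against the words from a dictionary file gives a quick per-character breakdown that can be pasted alongside the `ts_debug` output when reporting the problem.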