Re: Extending range of to_tsvector et al
From | john knightley |
---|---|
Subject | Re: Extending range of to_tsvector et al |
Date | |
Msg-id | CA+nPCM9mTszOyEda7SPwothev_0=45sgeTGOYOH3QVrf8RwAVQ@mail.gmail.com |
In reply to | Re: Extending range of to_tsvector et al (Dan Scott <denials@gmail.com>) |
Responses |
Re: Extending range of to_tsvector et al
Re: Extending range of to_tsvector et al |
List | pgsql-hackers |
Dear Dan,

Thank you for your reply. The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on a UTF-8 locale.

A short 5 line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂

line 1 "raeuz": Zhuang word written using English letters; shows up under to_tsvector OK
line 2 "我们": everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under to_tsvector OK
line 4 "𪽖𫖂": Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under to_tsvector
line 5 "": Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under to_tsvector

(The font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though they are included in a dictionary, are not accepted by to_tsvector.

Regards
John

On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott <denials@gmail.com> wrote:
> On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:
>> When using to_tsvector a number of newer unicode characters and pua
>> characters are not included. How do I add the characters which I desire to
>> be found?
>
> I've just started digging into this code a bit, but from what I've
> found src/backend/tsearch/wparser_def.c defines much of the parser
> functionality, and in the area of Unicode includes a number of
> comments like:
>
> * with multibyte encoding and C-locale isw* function may fail or give
> wrong result.
> * multibyte encoding and C-locale often are used for Asian languages.
> * any non-ascii symbol with multibyte encoding with C-locale is an
> alpha character
>
> ... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
> WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
> :)
>
> Also note that src/test/regress/sql/tsearch.sql and
> regress/sql/tsdicts.sql currently focus on English, ASCII-only data.
>
> Perhaps this is a good opportunity for you to describe what your
> environment looks like (OS, PostgreSQL version, encoding and locale
> settings for the database) and show some sample to_tsquery() @@
> to_tsvector() queries that don't behave the way you think they should
> behave - and we could start building some test cases as a first step?
>
> --
> Dan Scott
> Laurentian University
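As a possible first step toward the test cases suggested above, it may help to record where each sample word sits in Unicode. What follows is only a rough sketch in Python, not PostgreSQL code: it reports the code point, plane and general category of every character, and whether Python's own Unicode tables treat it as alphabetic. Those tables are independent of the system locale that the parser's isw* calls consult, so this does not reproduce the parser's behaviour; it only shows how the words differ. The helper names `describe` and `is_private_use` are made up for illustration, and U+E000 is an arbitrary stand-in for a PUA character, not one taken from Sawndip.ttf.

```python
import unicodedata


def describe(word):
    """Return (code point, plane, general category, is_alpha) per character."""
    rows = []
    for ch in word:
        cp = ord(ch)
        rows.append((
            "U+%04X" % cp,             # code point in the usual U+XXXX notation
            cp >> 16,                  # plane number: 0 = BMP, 2 = supplementary ideographs
            unicodedata.category(ch),  # e.g. "Ll", "Lo", or "Co" for private use
            ch.isalpha(),              # Python's view, not the C library's iswalpha()
        ))
    return rows


def is_private_use(ch):
    """True if the character lies in a Private Use Area (general category Co)."""
    return unicodedata.category(ch) == "Co"


if __name__ == "__main__":
    # The first four words come from the dictionary file in this message.
    # The last entry is a hypothetical PUA stand-in (U+E000).
    samples = ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂", "\ue000"]
    for number, word in enumerate(samples, start=1):
        print("line", number, describe(word))
```

If the failing words turn out to be exactly those whose characters are either private use or were added to Unicode after the system's locale tables were built, that would point at the character classification step rather than at the dictionary itself. That is a guess to be checked against the parser, not a conclusion.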