Discussion: fulltext search and hunspell
Hey,
I want to use hunspell as a dictionary for the full text search by
* using PostgreSQL 8.4.7
* installing hunspell-de-de, hunspell-de-med
* creating a dictionary:
CREATE TEXT SEARCH DICTIONARY german_hunspell (
TEMPLATE = ispell,
DictFile = de_de,
AffFile = de_de,
StopWords = german
);
* changing the config
ALTER TEXT SEARCH CONFIGURATION german
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_hunspell, german_stem;
* now testing the dictionary with ts_lexize:
SELECT ts_lexize('german_hunspell', 'Schokaladenfarik');
 ts_lexize
-----------

(1 row)
Shouldn't it be something like this:
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
(from the 8.4 documentation of PostgreSQL)
The dict and affix files in the tsearch_data directory were
automatically generated by pg_updatedicts.
Is this a problem with the compound-word splitting functionality? Should
I use ispell instead of hunspell?
Thanks
Jens,
could you check whether the affix file contains the line
compoundwords controlled z
Also, can you provide a link to the dictionary files, so we can check whether
they are supported, since we have only rudimentary support for hunspell.
Btw, it'd be nice to see the output of ts_debug() to make sure the dictionaries
are actually used.
Oleg
On Mon, 7 Feb 2011, Jens Sauer wrote:
> [...]
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
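The check Oleg asks for might look like the following sketch. It assumes the german configuration and the german_hunspell dictionary defined in the first message. pg_ts_dict is the system catalog of text search dictionaries, and the selected columns are the standard ts_debug() output columns.

```sql
-- Confirm that the dictionary exists and see which options it was created with.
SELECT dictname, dictinitoption
FROM pg_ts_dict
WHERE dictname = 'german_hunspell';

-- Show, per token, which dictionaries were consulted and which one
-- actually produced the lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'Schokoladenfabrik');
```

If the dictionary column shows german_stem rather than german_hunspell, the hunspell dictionary did not recognize the token and it fell through to the stemmer.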
Hey,
thanks for your answer.
First I checked the links in the tsearch_data directory:
de_de.affix and de_de.dict are symlinks to the corresponding files in
/var/cache/postgresql/dicts/
Then I recreated them by using pg_updatedicts.
This is an extract of the de_de.affix file:
# this is the affix file of the de_DE Hunspell dictionary
# derived from the igerman98 dictionary
#
# Version: 20091006 (build 20100127)
#
# Copyright (C) 1998-2009 Bjoern Jacke <bjoern@j3e.de>
#
# License: GPLv2, GPLv3 or OASIS distribution license agreement
# There should be a copy of both of this licenses included
# with every distribution of this dictionary. Modified
# versions using the GPL may only include the GPL
SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
PFX U Y 1
PFX U 0 un .
PFX V Y 1
PFX V 0 ver .
SFX F Y 35
[...]
I cannot find "compoundwords controlled z" there, so I manually added it.
[...]
# versions using the GPL may only include the GPL
compoundwords controlled z
SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
[...]
Then I restarted PostgreSQL.
Now I get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
FEHLER: falsches Affixdateiformat für Flag
CONTEXT: Zeile 18 in Konfigurationsdatei
»/usr/share/postgresql/8.4/tsearch_data/de_de.affix«: »PFX U Y 1
«
SQL-Funktion »ts_debug« Anweisung 1
SQL-Funktion »ts_debug« Anweisung 1
Which means:
ERROR: wrong affix file format for flag
CONTEXT: line 18 of configuration file ...
If I add
COMPOUNDFLAG Z
ONLYINCOMPOUND L
instead of "compoundwords controlled z"
I don't get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
   alias   |   description   |       token       |         dictionaries          | dictionary  |      lexemes
-----------+-----------------+-------------------+-------------------------------+-------------+-------------------
 asciiword | Word, all ASCII | Schokoladenfabrik | {german_hunspell,german_stem} | german_stem | {schokoladenfabr}
(1 row)
But it seems that the hunspell dictionary is not working for compound words.
Maybe pg_updatedicts has a bug and generates affix files in the wrong format?
Jens
2011/2/7 Oleg Bartunov <oleg@sai.msu.su>:
> [...]
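A side note on the restart step: a session that has already loaded a dictionary keeps using the cached copy, so later edits to the affix file are not picked up by that session. The PostgreSQL documentation for ALTER TEXT SEARCH DICTIONARY describes re-issuing an option as a way to force the files to be re-read, which may make a full server restart unnecessary. A minimal sketch, assuming the german_hunspell dictionary and the de_de file name used in this thread:

```sql
-- Re-set an option to its current value; the definition itself does not
-- change, but the dictionary files are re-read on the next use.
ALTER TEXT SEARCH DICTIONARY german_hunspell (AffFile = de_de);

-- Re-test the compound word directly against the dictionary.
SELECT ts_lexize('german_hunspell', 'Schokoladenfabrik');
```

If the affix file still contains a line the parser rejects, the same "wrong affix file format" error should be expected here as well.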
Jens,
have you tried the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
Oleg
On Tue, 8 Feb 2011, Jens Sauer wrote:
> [...]
Regards,
Oleg
Thanks for this tip, the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
works fine.
I think the problem was the rudimentary support of hunspell dictionaries.
Thanks for your help and your great software!
On 08.02.2011 at 11:34, Oleg Bartunov wrote:
> [...]
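For readers who want to reproduce the working setup, the following is a rough sketch, not the exact commands used in this thread. It assumes the ispell-format files from the URL above were copied into the tsearch_data directory under the hypothetical names german_compound.dict and german_compound.affix. The names german_ispell and german_compound are also illustrative.

```sql
-- Hypothetical file names: german_compound.dict and german_compound.affix
-- are assumed to exist in $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE  = ispell,
    DictFile  = german_compound,
    AffFile   = german_compound,
    StopWords = german
);

-- Work on a copy so the built-in german configuration stays untouched.
CREATE TEXT SEARCH CONFIGURATION german_compound (COPY = pg_catalog.german);

ALTER TEXT SEARCH CONFIGURATION german_compound
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_ispell, german_stem;

-- With a compound-aware affix file, this should return the compound
-- together with its parts instead of an empty result.
SELECT ts_lexize('german_ispell', 'Schokoladenfabrik');
```

Keeping german_stem as the last dictionary in the mapping means that words unknown to the ispell dictionary are still stemmed rather than dropped.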