Re: TSearch2: Auto identify document language?
| From | Michael Fuhr |
|---|---|
| Subject | Re: TSearch2: Auto identify document language? |
| Date | |
| Msg-id | 20051211170438.GA77947@winnie.fuhr.org |
| In reply to | TSearch2: Auto identify document language? (Hannes Dorbath <light@theendofthetunnel.de>) |
| List | pgsql-general |

On Sun, Dec 11, 2005 at 01:17:42PM +0100, Hannes Dorbath wrote:
> Is there a practical way to make a guess what language a document is
> written in and auto magically use the adequate TSearch config? I thought
> of looking up the document's words in various dicts and use the one with
> the most matches.. doesn't matter if performance will be bad.

I don't know how easily you could incorporate this into tsearch2,
but for the general problem of language identification you could
try something like Perl's Lingua::Identify module.

http://search.cpan.org/dist/Lingua-Identify/lib/Lingua/Identify.pm

    CREATE FUNCTION langof(text) RETURNS text AS $$
    use Lingua::Identify qw(:language_identification);
    return langof($_[0]);
    $$ LANGUAGE plperlu IMMUTABLE STRICT;

    SELECT langof('The quick brown fox jumped over the lazy dog.');
     langof
    --------
     en
    (1 row)

    SELECT langof('Der schnelle braune Fuchs sprang über den faulen Hund.');
     langof
    --------
     de
    (1 row)

    SELECT langof('El zorro marrón rápido saltó sobre el perro perezoso.');
     langof
    --------
     es
    (1 row)

    SELECT langof('La volpe marrone rapida ha saltato sopra il cane pigro.');
     langof
    --------
     it
    (1 row)

    SELECT langof('Le renard brun rapide a sauté par-dessus le chien paresseux.');
     langof
    --------
     fi
    (1 row)

Language identification isn't always accurate -- in this example
the function thinks the last text is Finnish instead of French --
but it might get better with more text to examine, and you can tell
Lingua::Identify which languages to consider or ignore.

-- 
Michael Fuhr
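
As a rough sketch of how the langof() result might be tied back to a tsearch2 configuration, a plain SQL wrapper could translate the language code into a configuration name. This is untested and rests on assumptions: the function name tsconfig_for, the table docs, and the configuration names 'default_german' and 'default_french' are all hypothetical (such configurations would have to be created first), while 'default' and 'simple' are the ones tsearch2 normally installs.

```sql
-- Sketch only: map a Lingua::Identify language code to a tsearch2
-- configuration name. Relies on the langof() function defined above.
CREATE FUNCTION tsconfig_for(text) RETURNS text AS $$
    SELECT CASE langof($1)
               WHEN 'en' THEN 'default'
               WHEN 'de' THEN 'default_german'   -- hypothetical config
               WHEN 'fr' THEN 'default_french'   -- hypothetical config
               ELSE 'simple'                     -- fallback: no stemming
           END;
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Usage, assuming a hypothetical table docs(body text) and the
-- two-argument tsearch2 form to_tsvector(config_name, document):
SELECT to_tsvector(tsconfig_for(body), body) FROM docs;
```

Falling back to 'simple' keeps a misidentified document searchable, only without stemming. Queries would need to be parsed with the same configuration that was used to index each document, so storing the chosen configuration name next to the tsvector is probably worthwhile.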