Re: n-gram search function

Поиск
Список
Период
Сортировка
От Oleg Bartunov
Тема Re: n-gram search function
Дата
Msg-id Pine.LNX.4.64.0702190737420.400@sn.sai.msu.ru
обсуждение исходный текст
Ответ на Re: n-gram search function  ("Guillaume Smet" <guillaume.smet@gmail.com>)
Ответы Re: n-gram search function  ("Guillaume Smet" <guillaume.smet@gmail.com>)
Список pgsql-hackers
On Sun, 18 Feb 2007, Guillaume Smet wrote:

> Hi Oleg,
>
> On 2/17/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
>> 3-gram is implemented as a contrib/pg_trgm. It currently uses GiST index,
>> but may be enhanced with the GiN.
>
> As I'm facing the same problem, I've taken a look to pg_trgm. At the
> moment, my opinion is quite mixed but perhaps I did something wrong.
>
> I have a table (100k rows) with a location name in it generally
> composed of several words but not that long. I created the index
> directly on this column (ie I don't create a table with each word of
> the location name). Then I tried a few queries.
>
> Here is an example:
>
> prod=# explain analyze select nomlieu from lieu where nomlieu ilike 
> '%gaumont%';
>                                            QUERY PLAN
> -----------------------------------------------------------------------------------------------------
> Seq Scan on lieu  (cost=0.00..7230.20 rows=7 width=21) (actual
> time=7.768..556.930 rows=39 loops=1)
>  Filter: ((nomlieu)::text ~~* '%gaumont%'::text)
> Total runtime: 557.066 ms
> (3 rows)
>
> _prod=# explain analyze select nomlieu from lieu where nomlieu % 'gaumont';
>                                                            QUERY
> PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on lieu  (cost=3.37..200.80 rows=106 width=21)
> (actual time=689.799..690.035 rows=36 loops=1)
>  Recheck Cond: ((nomlieu)::text % 'gaumont'::text)
>  ->  Bitmap Index Scan on idx_lieu_nomlieu_trgm  (cost=0.00..3.37
> rows=106 width=0) (actual time=689.749..689.749 rows=36 loops=1)
>        Index Cond: ((nomlieu)::text % 'gaumont'::text)
> Total runtime: 690.195 ms
> (5 rows)
>
> The trigram version is slower and doesn't return 3 results I should
> have. The 3 results it doesn't return have the word gaumont in them at
> the start of the string exactly like the others.
>
> Is there anything I can do to improve the performances and investigate
> why I don't have these 3 results?

pg_trgm was developed for spelling corrrection and there is a threshold of
similarity, which is 0.3 by default. Readme explains what does it means.

Similarity could be very low, since you didn't make separate column and length
of the full string is used to normalize similarity.

pg_trgm as is isn't well suited for wild card search, but the idea is there.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Gregory Stark
Дата:
Сообщение: Re: TopPlan, again
Следующее
От: tomas@tuxteam.de
Дата:
Сообщение: Re: RFC: Temporal Extensions for PostgreSQL