Tsearch docs question

From	Jeff Davis
Subject	Tsearch docs question
Date	October 26, 2007, 15:25:56
Msg-id	1193423136.7624.56.camel@dogma.ljc.laika.com
Responses	Re: Tsearch docs question
List	pgsql-docs

The Tsearch docs, under the GiST and GIN section, say:

"Lossiness [of GiST] causes serious performance degradation since random
access of heap records is slow and limits the usefulness of GiST
indexes."

The docs do go into some detail, but I think this sentence also causes
some confusion.

Let me digress to state how I understand the relationship between GIN,
GiST, and RECHECK:

The benefit of avoiding RECHECK is to avoid the need to re-evaluate the
predicate after finding the entry in the index. This can be valuable in
tsearch, because the functions are much more expensive than (for
example) integer equality. We (currently) have to visit the heap anyway,
to see the visibility information. So avoiding a RECHECK clause doesn't
do anything to prevent random heap I/O (although, a less-lossy index
will have fewer false positives, by definition).
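
The point above can be illustrated with a toy model in Python. This is
only a sketch under stated assumptions, not how PostgreSQL implements
GiST or GIN: ToyTable, word_bit and the length-based "hash" are all
hypothetical, and the hash is deliberately weak so that distinct words
collide. It shows that every index candidate costs one heap fetch either
way, while RECHECK adds one predicate evaluation per candidate and
filters out the false positives.

```python
SIG_BITS = 4  # deliberately tiny signature so that distinct words collide


def word_bit(word):
    # Toy hash (an assumption for illustration): word length modulo SIG_BITS.
    return 1 << (len(word) % SIG_BITS)


def signature(text):
    # OR together one bit per word, giving a lossy summary of the row.
    sig = 0
    for word in text.split():
        sig |= word_bit(word)
    return sig


class ToyTable:
    def __init__(self, rows):
        self.heap = list(rows)
        self.index = [signature(row) for row in self.heap]
        self.heap_fetches = 0
        self.predicate_evals = 0

    def search(self, word, recheck):
        bit = word_bit(word)
        result = []
        for rowid, sig in enumerate(self.index):
            if not sig & bit:
                continue  # the index rules this row out
            # The heap is visited for every candidate (visibility check),
            # whether or not the predicate is rechecked.
            row = self.heap[rowid]
            self.heap_fetches += 1
            if recheck:
                # RECHECK: re-evaluate the real predicate on the heap row.
                self.predicate_evals += 1
                if word not in row.split():
                    continue  # false positive from the lossy index
            result.append(row)
        return result
```

With the rows "fat cat", "big dog" and "supernovae", searching for "cat"
fetches two heap rows in both modes; only the rechecking search drops
"big dog", at the price of two predicate evaluations.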

GIN (as used with tsearch) is lossy for more sophisticated tsqueries
(those involving labels) and non-lossy for simpler tsqueries. There's
only one tsquery type, so PostgreSQL has no way of differentiating
between these two cases.

GiST (as used with tsearch) is lossy for large tsvectors or tsqueries
containing labels; and non-lossy for small tsvectors matched against a
tsquery that contains no labels. PostgreSQL can't differentiate between
these two cases.

So, with GiST the match operator always RECHECKs (so you're always sure
to get the right result), while with GIN the default operator, "@@",
does not RECHECK (for performance). If you suspect that you might be
using labels in your tsqueries, you need to use a special RECHECKing
operator, "@@@", to be accurate.
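
As a rough illustration of why labels matter for GIN, here is a toy
inverted index in Python. It is a sketch that assumes the index keeps
only the words and drops their labels; build_inverted_index, matches and
search are hypothetical names, not PostgreSQL functions. A query for a
word with a label then needs a recheck against the row itself to be
accurate.

```python
from collections import defaultdict


def build_inverted_index(rows):
    # rows: list of dicts mapping word -> label ("A".."D").
    # Only the words are indexed; the labels are dropped (assumption).
    index = defaultdict(set)
    for rowid, doc in enumerate(rows):
        for word in doc:
            index[word].add(rowid)
    return index


def matches(doc, word, label=None):
    # The real predicate: the word must be present and, if a label is
    # requested, the word must carry that label in this row.
    if word not in doc:
        return False
    return label is None or doc[word] == label


def search(rows, index, word, label=None, recheck=False):
    # The index lookup alone cannot see labels.
    hits = sorted(index.get(word, ()))
    if recheck:
        hits = [rowid for rowid in hits if matches(rows[rowid], word, label)]
    return hits
```

For rows where "cat" carries label A in row 0 and label D in row 1, a
search for "cat" with label A returns both rows without a recheck and
only row 0 with one. A search without a label is exact either way.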

Is the above accurate?

Back to the docs: I think the docs could clear this issue up somewhat.
The current wording suggests that GIN performs better because it avoids
a trip to the heap, when in reality it seems the benefit is avoiding the
need to re-evaluate the expensive tsearch functions (which might need to
access TOASTed data).

There's also a related issue: I think a RECHECK would be less costly if
you have the tsvectors materialized in the table (using triggers) and
index that. Maybe that could be a tip for using GiST indexes.
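
A small Python sketch of that idea, under the assumption that the cost
of a RECHECK is dominated by rebuilding the tsvector from the raw text.
ToyParser and both recheck functions are hypothetical stand-ins, not
PostgreSQL code; the counter simply records how often the expensive
parsing step runs.

```python
class ToyParser:
    """Stand-in for an expensive text-to-tsvector conversion."""

    def __init__(self):
        self.calls = 0

    def to_tsvector(self, text):
        self.calls += 1
        return frozenset(text.lower().split())


def recheck_on_the_fly(parser, rows, candidates, word):
    # Index built on an expression: each recheck re-parses the raw text.
    return [r for r in candidates if word in parser.to_tsvector(rows[r])]


def recheck_materialized(vectors, candidates, word):
    # Index built on a stored tsvector column: the recheck only reads it.
    return [r for r in candidates if word in vectors[r]]
```

In this model the materialized variant pays the parsing cost once per
row when the row is written (the job a trigger would do), and nothing
extra at recheck time.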

Regards,
    Jeff Davis
