Tsearch docs question

From: Jeff Davis
Subject: Tsearch docs question
Msg-id: 1193423136.7624.56.camel@dogma.ljc.laika.com
Responses: Re: Tsearch docs question  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-docs

The Tsearch docs, under the GiST and GIN section, say:

"Lossiness [of GiST] causes serious performance degradation since random
access of heap records is slow and limits the usefulness of GiST
indexes."

The docs do go into some detail, but I think this statement also causes
some confusion.

Let me digress to state how I understand the relationship between GIN,
GiST, and RECHECK:

The benefit of avoiding RECHECK is to avoid the need to re-evaluate the
predicate after finding the entry in the index. This can be valuable in
tsearch, because the functions are much more expensive than (for
example) integer equality. We (currently) have to visit the heap anyway,
to see the visibility information. So avoiding a RECHECK clause doesn't
do anything to prevent random heap I/O (although, a less-lossy index
will have fewer false positives, by definition).
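
To make the distinction concrete, here is a minimal sketch of a toy model,
not PostgreSQL internals. The names (Table, signature, SIG_BITS) are
hypothetical, and the byte-sum "hash" is chosen only so that "ab" and "ba"
collide. It counts heap visits and predicate evaluations separately:

```python
# Toy model, not PostgreSQL internals: a lossy "index" keeps only a small
# bit signature per row, so distinct words can map to the same bit.
SIG_BITS = 8  # deliberately tiny so collisions (false positives) occur


def signature(words):
    """Fold a collection of words into one bit signature (lossy)."""
    sig = 0
    for w in words:
        sig |= 1 << (sum(w.encode()) % SIG_BITS)
    return sig


class Table:
    def __init__(self, docs):
        self.heap = [set(d.split()) for d in docs]         # the real rows
        self.index = [signature(ws) for ws in self.heap]   # lossy index
        self.heap_visits = 0
        self.predicate_evals = 0

    def search(self, word, recheck):
        want = signature([word])
        hits = []
        for rowid, sig in enumerate(self.index):
            if sig & want != want:
                continue                  # the index rules this row out
            self.heap_visits += 1         # visibility check: always needed
            if recheck:
                self.predicate_evals += 1  # the expensive part
                if word not in self.heap[rowid]:
                    continue              # false positive removed
            hits.append(rowid)
        return hits


docs = ["ab cat", "ba dog", "fish"]

no_recheck = Table(docs)
print(no_recheck.search("ab", recheck=False))  # [0, 1]: row 1 is a false positive
print(no_recheck.heap_visits, no_recheck.predicate_evals)  # 2 0

with_recheck = Table(docs)
print(with_recheck.search("ab", recheck=True))  # [0]
print(with_recheck.heap_visits, with_recheck.predicate_evals)  # 2 2
```

In this model the number of heap visits is the same either way; what the
recheck adds is the predicate evaluations, and what it buys is the removal
of the false positive.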

GIN (as used with tsearch) is lossy for more sophisticated tsqueries
(those involving labels) and non-lossy for simpler tsqueries. There's
only one tsquery type, so PostgreSQL has no way of differentiating
between these two cases.

GiST (as used with tsearch) is lossy for large tsvectors or tsqueries
containing labels; and non-lossy for small tsvectors matched against a
tsquery that contains no labels. PostgreSQL can't differentiate between
these two cases.

So, for GiST, matches are always RECHECKed (so you're always sure to get
the right result), and for GIN the default operator does not RECHECK (for
performance). If you suspect that you might be using labels in your
tsqueries, you need to use a special RECHECKing operator, "@@@", to be
accurate.

Is the above accurate?

Back to the docs: I think the docs could clear this issue up somewhat.
The current wording suggests that GIN performs better because it avoids
a trip to the heap, when in reality it seems the benefit is avoiding the
need to re-evaluate the expensive tsearch functions (which might need to
access TOASTed data).

There's also a related issue: I think a RECHECK would be less costly if
you have the tsvectors materialized in the table (using triggers) and
index that. Maybe that could be a tip for using GiST indexes.
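
A rough illustration of why that might help, again as a toy model rather
than anything PostgreSQL does internally: to_words is a hypothetical
stand-in for an expensive parsing function such as to_tsvector, and the
counter only tracks how often it runs.

```python
# Toy model: compare rechecking against a value recomputed from the raw
# text with rechecking against a value stored (materialized) in the row.
calls = {"parse": 0}


def to_words(text):
    """Hypothetical stand-in for an expensive text-parsing function."""
    calls["parse"] += 1
    return set(text.lower().split())


rows = [{"body": "GiST is lossy"}, {"body": "GIN stores lexemes"}]

# Materialize once at write time, as a trigger would on INSERT or UPDATE.
for row in rows:
    row["words"] = to_words(row["body"])


def recheck_recomputed(row, word):
    return word in to_words(row["body"])  # parses again on every recheck


def recheck_materialized(row, word):
    return word in row["words"]           # just reads the stored value


before = calls["parse"]
materialized = [recheck_materialized(r, "lossy") for r in rows]
materialized_cost = calls["parse"] - before   # no extra parsing

before = calls["parse"]
recomputed = [recheck_recomputed(r, "lossy") for r in rows]
recomputed_cost = calls["parse"] - before     # one parse per rechecked row
```

Both rechecks give the same answers; the materialized one pays the parsing
cost once per write instead of once per rechecked row.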

Regards,
    Jeff Davis

