Tsearch docs question
From | Jeff Davis |
---|---|
Subject | Tsearch docs question |
Date | |
Msg-id | 1193423136.7624.56.camel@dogma.ljc.laika.com |
Responses | Re: Tsearch docs question |
List | pgsql-docs |
The Tsearch docs, under the GiST and GIN section, say:

"Lossiness [of GiST] causes serious performance degradation since random access of heap records is slow and limits the usefulness of GiST indexes."

The docs do go into some detail, but I think this statement also causes some confusion.

Let me digress to state how I understand the relationship between GIN, GiST, and RECHECK:

The benefit of avoiding RECHECK is to avoid the need to re-evaluate the predicate after finding the entry in the index. This can be valuable in tsearch, because the functions are much more expensive than (for example) integer equality.

We (currently) have to visit the heap anyway, to see the visibility information. So avoiding a RECHECK clause doesn't do anything to prevent random heap I/O (although a less lossy index will, by definition, have fewer false positives).

GIN (as used with tsearch) is lossy for more sophisticated tsqueries (those involving labels) and non-lossy for simpler tsqueries. There's only one tsquery type, so PostgreSQL has no way of differentiating between these two cases.

GiST (as used with tsearch) is lossy for large tsvectors or tsqueries containing labels, and non-lossy for small tsvectors matched against a tsquery that contains no labels. PostgreSQL can't differentiate between these two cases.

So, GiST always does a RECHECK (so you're always sure to get the right result), and for GIN the default operator does not RECHECK (for performance), but if you suspect that you might be using labels in your tsqueries you need to use a special RECHECKing operator, "@@@", to be accurate.

Is the above accurate?

Back to the docs: I think the docs could clear this issue up somewhat. The current wording suggests that GIN performs better because it avoids a trip to the heap, when in reality it seems the benefit is avoiding the need to re-evaluate the expensive tsearch functions (which might need to access TOASTed data).

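To make the distinction between heap visits and predicate re-evaluation concrete, here is a toy Python model. It is only a sketch of the understanding stated above, not PostgreSQL's implementation: the signature width, the `index_scan` function, and the counters are all hypothetical names invented for illustration. It counts heap visits and predicate calls for the same index scan with and without a recheck.

```python
import zlib
from dataclasses import dataclass

SIGNATURE_BITS = 16  # hypothetical width, small on purpose so collisions can occur


def signature(words):
    """Fold a set of words into a bitmask, as a lossy signature index might."""
    sig = 0
    for word in words:
        sig |= 1 << (zlib.crc32(word.encode()) % SIGNATURE_BITS)
    return sig


@dataclass
class Counters:
    heap_visits: int = 0
    predicate_calls: int = 0


def matches(doc_words, query_words):
    """Stand-in for the expensive predicate: every query word is in the document."""
    return query_words <= doc_words


def index_scan(heap, index, query_words, recheck):
    """Scan the toy index; optionally re-evaluate the predicate on each candidate."""
    counters = Counters()
    query_sig = signature(query_words)
    result = []
    for row_id, doc_sig in index:
        if doc_sig & query_sig != query_sig:
            continue  # the index proves this row cannot match
        counters.heap_visits += 1  # assumed necessary for visibility in both modes
        doc_words = heap[row_id]
        if recheck:
            counters.predicate_calls += 1
            if not matches(doc_words, query_words):
                continue  # false positive removed by the recheck
        result.append(row_id)
    return result, counters


heap = [
    {"gin", "index", "fast"},
    {"gist", "index", "lossy"},
    {"heap", "visibility", "tuple"},
    {"lossy", "signature", "recheck"},
]
index = [(row_id, signature(words)) for row_id, words in enumerate(heap)]
query = {"lossy", "index"}

with_recheck, rechecked = index_scan(heap, index, query, recheck=True)
without_recheck, unchecked = index_scan(heap, index, query, recheck=False)
print(with_recheck, rechecked)
print(without_recheck, unchecked)
```

In this model the number of heap visits is identical in both modes. The recheck adds only predicate calls, one per candidate, and it is the step that removes any false positives the signature lets through.
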
There's also a related issue: I think a RECHECK would be less costly if you have the tsvectors materialized in the table (using triggers) and index that column. Maybe that could be a tip for using GiST indexes.

Regards,
Jeff Davis