Re: tsvector pg_stats seems quite a bit off.
From | Tom Lane |
---|---|
Subject | Re: tsvector pg_stats seems quite a bit off. |
Date | |
Msg-id | 19403.1275145960@sss.pgh.pa.us |
In response to | Re: tsvector pg_stats seems quite a bit off. (Jan Urbański <wulczer@wulczer.org>) |
List | pgsql-hackers |
Jan Urbański <wulczer@wulczer.org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem, not sure if such "squashing" of
> occurrences as tsvectors do skews the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong is
that we end up with an MCE list shorter than our original target. I
suggest we could try this and see if we notice that happening a lot.

regards, tom lane
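For anyone who wants to poke at the two points in this message, here is a rough Python sketch, not the ts_typanalyze code. It draws words from an assumed 1/rank (Zipf-like) vocabulary, counts each word at most once per document the way a tsvector would, and then builds a most-common-elements list using a minimum-frequency cutoff. The function names, the cutoff rule, and all parameter values are made up for illustration; the real analyzer may select elements differently.

```python
import random
from collections import Counter


def simulate_doc_frequencies(num_docs, words_per_doc, vocab_size, seed=0):
    """Count, for each word id, how many documents contain it at least once.

    Assumption: word id r (0-based) is drawn with weight 1 / (r + 1),
    i.e. a plain Zipf-like law over the vocabulary.
    """
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    doc_freq = Counter()
    for _ in range(num_docs):
        words = rng.choices(vocab, weights=weights, k=words_per_doc)
        # A tsvector keeps each distinct word once, so deduplicate per document.
        doc_freq.update(set(words))
    return doc_freq


def most_common_elements(doc_freq, num_docs, target, min_fraction):
    """Keep at most `target` elements that occur in at least
    `min_fraction` of the documents (hypothetical cutoff rule)."""
    cutoff = min_fraction * num_docs
    kept = [(word, count) for word, count in doc_freq.most_common()
            if count >= cutoff]
    return kept[:target]


if __name__ == "__main__":
    num_docs = 2000
    freqs = simulate_doc_frequencies(num_docs, words_per_doc=50, vocab_size=1000)
    mce = most_common_elements(freqs, num_docs, target=100, min_fraction=0.05)
    # If the cutoff is too optimistic for the real distribution,
    # len(mce) simply comes out below the target.
    print(len(mce), "elements kept out of a target of 100")
    for word, count in mce[:10]:
        print(word, count / num_docs)
```

Varying `words_per_doc` shows how much the per-document deduplication flattens the head of the distribution, and varying `min_fraction` shows the failure mode described above: a cutoff derived from a wrong distributional assumption yields a shorter list, nothing worse.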