Re: tsvector pg_stats seems quite a bit off.
From | Jan Urbański
---|---
Subject | Re: tsvector pg_stats seems quite a bit off.
Date |
Msg-id | 4C012FD3.1070609@wulczer.org
In reply to | Re: tsvector pg_stats seems quite a bit off. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: tsvector pg_stats seems quite a bit off.
List | pgsql-hackers
On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański <wulczer@wulczer.org> writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))...

So s would be 1/(110 * 6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and a final prune value
of 6787.

Jan
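For concreteness, the corrected arithmetic can be checked with a few lines of Python. This is a sketch only: the variable names and the bucket-width relation w = error_factor / s are one reading of the Lossy Counting setup discussed earlier in the thread, not code from the patch.

    # Sketch of the corrected arithmetic (illustrative names, not patch code).
    stats_target = 100   # statistics target assumed in the thread
    top_stopwords = 10   # top 10 English words assumed to be stopwords
    H_W = 6.5            # harmonic-number estimate for ~1e6 words, per the thread

    # Zipfian estimate: s = 1 / ((stats_target + top_stopwords) * H(W))
    s = 1.0 / ((stats_target + top_stopwords) * H_W)
    print(s)             # 0.001398... ~= 0.0014

    # Lossy Counting with error bound e = s / error_factor gives a
    # bucket width of w = 1 / e = error_factor / s.
    error_factor = 10
    bucket_width = round(error_factor / s)
    print(bucket_width)  # 7150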