pgsql: Fix misuse of Lossy Counting (LC) algorithm in

From tgl@postgresql.org (Tom Lane)
Subject pgsql: Fix misuse of Lossy Counting (LC) algorithm in
Date
Msg-id 20100530215910.033417541D2@cvs.postgresql.org
List pgsql-committers
Log Message:
-----------
Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().

We must filter out hashtable entries with frequencies less than those
specified by the algorithm, else we risk emitting junk entries whose
actual frequency is much less than other lexemes that did not get
tabulated.  This is bad enough by itself, but even worse is that
tsquerysel() believes that the minimum frequency seen in pg_statistic is a
hard upper bound for lexemes not included, and was thus underestimating
the frequency of non-MCEs.
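
For readers unfamiliar with the algorithm, here is a minimal Python sketch of textbook Lossy Counting, including the final filtering step that the message says was missing. It is an illustration only, not the code in ts_typanalyze.c: the function name lossy_count and the parameters s (support threshold) and eps (error bound) are assumptions chosen for the example.

```python
from math import ceil


def lossy_count(stream, s, eps):
    """Textbook Lossy Counting (illustrative sketch, not the PostgreSQL code).

    s   -- support threshold, as a fraction of the stream length
    eps -- permitted error, with 0 < eps < s
    Returns {item: tracked_count} for items that pass the final filter.
    """
    if not 0 < eps < s <= 1:
        raise ValueError("need 0 < eps < s <= 1")

    width = ceil(1 / eps)   # bucket width
    entries = {}            # item -> [count, max_undercount]
    n = 0                   # items seen so far
    bucket = 1              # current bucket number

    for item in stream:
        n += 1
        if item in entries:
            entries[item][0] += 1
        else:
            # A new entry may have been missed in up to (bucket - 1) buckets.
            entries[item] = [1, bucket - 1]

        if n % width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
            bucket += 1

    # Final filter: without it, entries inserted since the last prune
    # (count 1 or so) would be reported alongside the real frequent items.
    cutoff = (s - eps) * n
    return {k: v[0] for k, v in entries.items() if v[0] >= cutoff}
```

With a stream of 50 "a", 30 "b", 20 distinct rare items and then 5 more distinct rare items, s=0.1 and eps=0.01, the last five items are still in the table when the stream ends (no bucket boundary has pruned them), and only the final filter keeps them out of the result.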

Also, set the threshold frequency to something with a little bit of theory
behind it, to wit assume that the input distribution is approximately
Zipfian.  This might need adjustment in future, but some preliminary
experiments suggest that it's not too unreasonable.
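
As a rough illustration of how a Zipfian assumption can yield a threshold, the sketch below computes the expected relative frequency of the k-th most common item under Zipf's law with exponent 1. This is an assumption-laden example: the function names and the choice of "frequency of the k-th item" as the threshold are hypothetical, and the constants actually used by the patch are not stated in this message.

```python
def harmonic(w):
    """Return the w-th harmonic number H(w) = 1 + 1/2 + ... + 1/w."""
    return sum(1.0 / i for i in range(1, w + 1))


def zipf_threshold(k, vocabulary_size):
    """Expected relative frequency of the k-th most common item.

    Assumes Zipf's law with exponent 1 over vocabulary_size distinct items:
    f(k) = 1 / (k * H(vocabulary_size)).
    """
    if not 1 <= k <= vocabulary_size:
        raise ValueError("need 1 <= k <= vocabulary_size")
    return 1.0 / (k * harmonic(vocabulary_size))
```

Under these assumptions, keeping roughly the k most common items corresponds to a threshold frequency of zipf_threshold(k, W), where W is an estimate of the number of distinct items.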

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom

Tags:
----
REL8_4_STABLE

Modified Files:
--------------
    pgsql/src/backend/tsearch:
        ts_typanalyze.c (r1.7 -> r1.7.2.1)
        (http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/ts_typanalyze.c?r1=1.7&r2=1.7.2.1)
