Re: benchmarking the query planner
From | Greg Stark |
---|---|
Subject | Re: benchmarking the query planner |
Date | |
Msg-id | 4136ffa0812121033o105f8d6bie0f2a87082c77df7@mail.gmail.com |
In response to | Re: benchmarking the query planner (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I seem to recall Greg suggesting that there were ways to estimate
> ndistinct without sorting, but short of a fundamental algorithm change
> there's not going to be a win here.

Not sure if you meant me, but I can say the approach I saw documented
before involved keeping a hash of all values seen in a full table scan.
When the hash reached a maximum size you discard half the hash key space
and from then on ignore any values which hash into that space. When you
reach the end you have a hash table with a random sample of 1/2^n
distinct values (where n is the number of times the hash table
overflowed) and the counts for those values.

If you're just interested in counting distinct values in the sample I
suppose you could do it using a Bloom filter just as we've talked about
for other hash tables. You scan through the list and increment an
ndistinct counter for every value you find where the bits aren't all
already set (then set those bits). That would undercount slightly and
I'm not sure how much faster than qsort it would actually be. The log(n)
factor in qsort's average case is pretty small when we're talking about
small samples.

--
greg
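
A minimal C sketch of the adaptive distinct-sampling idea Greg describes in the first paragraph: keep a bounded set of (value, count) entries, and whenever the set overflows, discard half of the remaining hash key space and ignore any further values hashing into the discarded half. All names, sizes, and the hash function here are made up for illustration; this is not PostgreSQL code, just one way the described scheme could look.

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ENTRIES 4           /* tiny on purpose, to force overflows */

    typedef struct
    {
        uint32_t    hash;
        int         value;
        int         count;
    } Entry;

    static Entry entries[MAX_ENTRIES];
    static int   nentries = 0;
    static int   level = 0;         /* n: times the key space has been halved */

    /* toy integer hash; any reasonable hash would do */
    static uint32_t
    hash_int(int v)
    {
        uint32_t    h = (uint32_t) v;

        h ^= h >> 16;
        h *= 0x45d9f3b;
        h ^= h >> 16;
        return h;
    }

    /* a hash "survives" level n if its low n bits are all zero */
    static int
    survives(uint32_t h)
    {
        return (h & (((uint64_t) 1 << level) - 1)) == 0;
    }

    static void
    add_value(int v)
    {
        uint32_t    h = hash_int(v);
        int         i;

        if (!survives(h))
            return;                 /* hashes into the discarded key space */

        for (i = 0; i < nentries; i++)
        {
            if (entries[i].hash == h && entries[i].value == v)
            {
                entries[i].count++;
                return;
            }
        }

        /* new distinct value: halve the key space until there is room */
        while (nentries >= MAX_ENTRIES)
        {
            int         kept = 0;

            level++;
            for (i = 0; i < nentries; i++)
            {
                if (survives(entries[i].hash))
                    entries[kept++] = entries[i];
            }
            nentries = kept;
        }

        if (survives(h))            /* the new value may itself be discarded */
        {
            entries[nentries].hash = h;
            entries[nentries].value = v;
            entries[nentries].count = 1;
            nentries++;
        }
    }

    int
    main(void)
    {
        int         data[] = {1, 2, 3, 4, 5, 1, 2, 6, 7, 8, 9, 10, 3, 3};
        int         i;

        for (i = 0; i < (int) (sizeof(data) / sizeof(data[0])); i++)
            add_value(data[i]);

        /* each surviving entry stands for roughly 2^level distinct values */
        printf("estimated ndistinct = %d (level %d, %d entries kept)\n",
               nentries << level, level, nentries);
        return 0;
    }

The end state is what the message describes: a random sample of about 1/2^n of the distinct values together with their counts, from which ndistinct can be scaled back up by 2^n.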
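An equally rough sketch of the Bloom-filter idea in the second paragraph: walk the sampled values once and bump an ndistinct counter whenever a value's bits are not already all set in the filter, then set them. As the message notes, this can only undercount, since a false positive makes a genuinely new value look already seen. Filter size, hash count, and the mixing function are again illustrative assumptions, not anything from the PostgreSQL tree.

    #include <stdio.h>
    #include <stdint.h>

    #define FILTER_BITS  (1 << 16)      /* 64 Kbit filter for the sketch */
    #define NUM_HASHES   3

    static uint8_t bloom[FILTER_BITS / 8];

    static uint32_t
    mix(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x7feb352d;
        h ^= h >> 15;
        h *= 0x846ca68b;
        h ^= h >> 16;
        return h;
    }

    /* returns 1 if all bits for v were already set, 0 otherwise; sets them */
    static int
    bloom_test_and_set(int v)
    {
        int         seen = 1;
        uint32_t    h = (uint32_t) v;
        int         k;

        for (k = 0; k < NUM_HASHES; k++)
        {
            uint32_t    bit;

            h = mix(h + (uint32_t) k);
            bit = h % FILTER_BITS;
            if (!(bloom[bit / 8] & (1 << (bit % 8))))
            {
                seen = 0;
                bloom[bit / 8] |= 1 << (bit % 8);
            }
        }
        return seen;
    }

    int
    main(void)
    {
        int         sample[] = {4, 8, 15, 16, 23, 42, 4, 8, 42, 7};
        int         ndistinct = 0;
        int         i;

        for (i = 0; i < (int) (sizeof(sample) / sizeof(sample[0])); i++)
        {
            if (!bloom_test_and_set(sample[i]))
                ndistinct++;        /* some bit was unset: count as new */
        }

        printf("ndistinct estimate over sample = %d\n", ndistinct);
        return 0;
    }

Whether this beats a qsort-based count in practice is exactly the open question in the message: the filter pass is linear, but qsort's log(n) factor is small for the sample sizes involved.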