Re: benchmarking the query planner

Поиск

Список

Период

Сортировка

От	Greg Stark
Тема	Re: benchmarking the query planner
Дата	12 декабря 2008 г. 14:04:30
Msg-id	4136ffa0812121001x3fde1871t558fd8c58070b7d@mail.gmail.com обсуждение исходный текст
Ответ на	Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>)
Ответы	Re: benchmarking the query planner
Список	pgsql-hackers

Дерево обсуждения

On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally we *do* do block sampling. We pick random blocks and then pick
>> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime
>> around then. It dramatically reduced the i/o requirements but there were long
>> discussions of how to do it without introducing biases.
>
> No, we pick random rows. On bigger tables, they get further apart
> typically and so we miss any clustering. I mean that we should pick a
> random block and read all rows on it.

I think what's happening here is that our sampling method is based on
the assumption that a records location is not related to its value.

Consider a table which is clustered and has two copies of each value.
When we look at a block and see n/2 values and we know there are 1000
blocks then a human would conjecture that there are 1000*n/2 distinct
values throughout the table and every value is represented twice
throughout the whole table. But if we're assuming the records are
randomly distributed then looking at the whole block will actually
throw us off completely. We'll deduce from the fact that we saw every
value twice that there must be hundreds of copies spread throughout
the database and there must be a lot less than 1000*n/2 distinct
values.

I think you need to find two different formulas, one which represents
a clustered table and one which represents randomly distributed data.
Then you need a way to measure just how clustered the data is so you
know how much weight to give each formula. Perhaps comparing the
number of duplicates in whole-block samples versus overall random
selections would give that measure.

-- 
greg

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: benchmarking the query planner