Re: MCV lists for highly skewed distributions

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: MCV lists for highly skewed distributions
Дата
Msg-id CAJVSVGWm9QgkbpgbZ74GCLD=UdSrYo4aq+tBdAU9foR-N+1NNw@mail.gmail.com
обсуждение исходный текст
Ответ на MCV lists for highly skewed distributions  (Jeff Janes <jeff.janes@gmail.com>)
Ответы Re: MCV lists for highly skewed distributions  (John Naylor <jcnaylor@gmail.com>)
Список pgsql-hackers
On 12/28/17, Jeff Janes <jeff.janes@gmail.com> wrote:
> I want to revive a patch I sent  couple years ago to the performance list,
> as I have seen the issue pop up repeatedly since then.

> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.
>
> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.

After reading the thread you mentioned, I think that's a good approach.

On master, the numerator of avg_count is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?

Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.

This part doesn't use any loop variables, so should happen before the loop:

+                if (num_mcv > track_cnt)
+                    num_mcv = track_cnt;

I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avg_count.

> I think that perhaps maxmincount should also use the dynamic
> values_cnt_remaining rather than the static one.  After all, things
> included in the MCV don' get represented in the histogram.  When I've seen
> planning problems due to skewed distributions I also usually see redundant
> values in the histogram boundary list which I think should be in the MCV
> list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.

-John Naylor


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Vladimir Svedov
Дата:
Сообщение: array_ndims never returns zero
Следующее
От: David Steele
Дата:
Сообщение: Re: Basebackups reported as idle