Re: planner chooses incremental but not the best one

Поиск

Список

Период

Сортировка

От	Tomas Vondra
Тема	Re: planner chooses incremental but not the best one
Дата	18 декабря 2023 г. 12:53:48
Msg-id	b3e92c2a-7fb0-fae2-a4aa-5541ae19fd38@enterprisedb.com обсуждение исходный текст
Ответ на	Re: planner chooses incremental but not the best one (Richard Guo <guofenglinux@gmail.com>)
Ответы	Re: planner chooses incremental but not the best one Re: planner chooses incremental but not the best one
Список	pgsql-hackers

Дерево обсуждения

On 12/18/23 11:40, Richard Guo wrote:
> 
> On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     Oh! Now I see what you meant by using the new formula in 84f9a35e3
>     depending on how we sum tuples. I agree that seems like the right thing.
> 
>     I'm not sure it'll actually help with the issue, though - if I apply the
>     patch, the plan does not actually change (and the cost changes just a
>     little bit).
> 
>     I looked at this only very briefly, but I believe it's due to the
>     assumption of independence I mentioned earlier - we end up using the new
>     formula introduced in 84f9a35e3, but it assumes it assumes the
>     selectivity and number of groups are independent. But that'd not the
>     case here, because the groups are very clearly correlated (with the
>     condition on "b").
> 
> 
> You're right.  The patch allows us to adjust the estimate of distinct
> values for appendrels using the new formula introduced in 84f9a35e3.
> But if the restrictions are correlated with the grouping expressions,
> the new formula does not behave well.  That's why the patch does not
> help in case [1], where 'b' and 'c' are correlated.
> 
> OTOH, if the restrictions are not correlated with the grouping
> expressions, the new formula would perform quite well.  And in this case
> the patch would help a lot, as shown in [2] where estimate_num_groups()
> gives a much more accurate estimate with the help of this patch.
> 
> So this patch could be useful in certain situations.  I'm wondering if
> we should at least have this patch (if it is right).
>

I do agree the patch seems to do the right thing, and it's worth pushing
on it's own.

> 
>     If that's the case, I'm not sure how to fix this :-(
> 
> 
> The commit message of 84f9a35e3 says
> 
>     This could possibly be improved upon in the future by identifying
>     correlated restrictions and using a hybrid of the old and new
>     formulae.
> 
> Maybe this is something we can consider trying.  But anyhow this is not
> an easy task I suppose.

Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)

The challenge is where to get usable information about correlation
between columns. I only have a couple very rought ideas of what might
try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

    ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

    1 * ndistinct(b,c)

I'm well aware this is only a very trivial example, and for more complex
examples it's likely way more complicated. But hopefully it illustrates
the general idea.

The other idea would be to maybe look at multi-column MCV, and try using
it to deduce cross-column correlation too (it could be more accurate for
arbitrary predicates).

And finally, we might try being a bit more pessimistic and look at what
the "worst case" behavior could be. So for example instead of trying to
estimate the real number of groups, we'd ask "What's the minimum number
of groups we're likely to get?". And use that to cost the incremental
sort. But I don't think we do this elsewhere, and I'm not sure we want
to start with this risk-based approach here. It might be OK, because we
usually expect the incremental sort to be much cheaper, ...

If this something would be interested in exploring? I don't have
capacity to work on this myself, but I'd be available for reviews,
feedback and so on.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: planner chooses incremental but not the best one