Re: planner chooses incremental but not the best one

From: ywgrit
Subject: Re: planner chooses incremental but not the best one
Date:
Msg-id: CAPt2h2b683yMzVDVtCWiC_FNEym19tt4cdcxhUsQVCVEZ8z2dQ@mail.gmail.com
In reply to: Re: planner chooses incremental but not the best one (ywgrit <yw987194828@gmail.com>)
List: pgsql-hackers

Hi, Tomas,

Recently I looked at papers related to cardinality estimation under selection. I'm inclined towards the scheme proposed in the paper "Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports". That paper uses distinct sampling as opposed to the uniform sampling used today, and the main differences between the two are as follows.

1) It not only counts the distinct values on each column (or combination of columns), but also saves some rows for each distinct value, i.e. it keeps part of the rows of the original relation as a sample. The purpose of saving these rows is to accommodate restrictions on the queries, such as WHERE clauses.

2) The queries are executed on the samples, and the result of that execution is used as the cardinality estimate.
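
To illustrate the mechanism, here is a minimal Python sketch of how I understand the distinct-sampling structure. The hash function, the size bound and the rows-per-value limit below are simplifications of mine, not details taken from the paper:

    import hashlib
    from collections import defaultdict

    class DistinctSample:
        # Keep up to rows_per_value sample rows for every distinct value
        # whose hash level is >= the current level; the level implies a
        # sampling rate of 2**-level, used to scale estimates back up.
        def __init__(self, max_rows=1000, rows_per_value=4):
            self.level = 0
            self.max_rows = max_rows
            self.rows_per_value = rows_per_value
            self.sample = defaultdict(list)  # distinct value -> sample rows

        def _hash_level(self, value):
            # Trailing zero bits of a hash, so Pr[level >= l] = 2**-l.
            h = int.from_bytes(
                hashlib.sha1(repr(value).encode()).digest()[:8], "big")
            level = 0
            while level < 64 and (h >> level) & 1 == 0:
                level += 1
            return level

        def _size(self):
            return sum(len(rows) for rows in self.sample.values())

        def add(self, value, row):
            if self._hash_level(value) < self.level:
                return                       # value not sampled at this level
            rows = self.sample[value]
            if len(rows) < self.rows_per_value:
                rows.append(row)
            while self._size() > self.max_rows:
                # Overflow: halve the sampling rate and evict the values
                # that no longer qualify.
                self.level += 1
                for v in [v for v in self.sample
                          if self._hash_level(v) < self.level]:
                    del self.sample[v]

        def estimate_distinct(self, predicate):
            # Distinct values with at least one sampled row passing the
            # predicate, scaled by the inverse sampling rate.
            d = sum(1 for rows in self.sample.values()
                    if any(predicate(r) for r in rows))
            return d * (2 ** self.level)

Estimating the number of distinct values that survive a restriction is then just estimate_distinct(lambda row: row["b"] == 3), which is exactly the "execute the query on the samples" step from 2) above.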

The advantages of this scheme over existing practice are as follows.

1) The collected samples can be applied to arbitrary predicates, e.g. predicates that are correlated with the GROUP BY columns.

2) The accuracy is very high; in some scenarios the estimation error can be hundreds of times smaller than with uniform sampling.

However, the scheme in this paper also has some drawbacks. As mentioned above, it relies on the collected samples, which leads to a significant increase in the storage overhead of the statistics.

I'd like to hear your opinions.

ywgrit.


On Fri, Dec 22, 2023 at 16:20 ywgrit <yw987194828@gmail.com> wrote:

The only scenario I can come up with a possible solution for so far is the one where the query's predicate columns and grouping columns belong to one table.

For a query that contains a WHERE clause, perhaps num_groups could be estimated according to the following formula:

    num_groups = ndistinct(pred_col_1, ..., pred_col_n) with where clause
                 * ndistinct(pred_col_1, ..., pred_col_n, sort_col_1, ..., sort_col_m)
                 / ndistinct(pred_col_1, ..., pred_col_n)

where

    ndistinct(pred_col_1, ..., pred_col_n) with where clause
                 = ndistinct(pred_col_1, ..., pred_col_n) * selectivity of rel

The pred_col_* are the columns involved in the WHERE clause and the sort_col_* are the columns involved in the GROUP BY clause.

The reason we can multiply by the selectivity of rel directly is that the selectivity depends only on the pred_col columns, not on the sort_col columns. So the above formula can be simplified as follows:

    num_groups = ndistinct(pred_col_1, ..., pred_col_n, sort_col_1, ..., sort_col_m)
                 * selectivity of rel
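
As a toy numeric example of how the terms combine (the statistics values below are invented for illustration, they are not taken from the test case), consider a query like SELECT ... WHERE b = 3 GROUP BY c:

    # Invented statistics for a table t(b, c):
    ndistinct_pred      = 10      # ndistinct(b)
    ndistinct_pred_sort = 30      # ndistinct(b, c)
    selectivity         = 0.1     # fraction of rows passing "b = 3"

    # ndistinct(b) with where clause = ndistinct(b) * selectivity = 1.0,
    # so the full formula and the simplified one agree:
    num_groups = ndistinct_pred_sort * selectivity    # = 3.0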

The correctness of the above formula depends on the following conditions.

  • The ndistinct(pred_col_1, ..., pred_col_n) and ndistinct(pred_col_1, ..., pred_col_n, sort_col_1, ..., sort_col_m) statistics already exist, and they need to be accurate.

  • Both the pred_col and sort_col columns are uniformly distributed; if not, statistics such as the MCV lists are needed for correction.

  • The row count of rel is the total number of tuples of the table, not the number of filtered tuples.

I experimented with the scenario mentioned in the previous thread, and the estimated num_groups is 3. The accuracy of the result relies strongly on b being uniformly distributed, which is what allows ndistinct(pred_col_1, ..., pred_col_n) with the where clause to be estimated accurately.

I'd like to hear your opinions.

Regards.

ywgrit.


On Mon, Dec 18, 2023 at 20:53 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 12/18/23 11:40, Richard Guo wrote:
>
> On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>     Oh! Now I see what you meant by using the new formula in 84f9a35e3
>     depending on how we sum tuples. I agree that seems like the right thing.
>
>     I'm not sure it'll actually help with the issue, though - if I apply the
>     patch, the plan does not actually change (and the cost changes just a
>     little bit).
>
>     I looked at this only very briefly, but I believe it's due to the
>     assumption of independence I mentioned earlier - we end up using the new
>     formula introduced in 84f9a35e3, but it assumes the
>     selectivity and number of groups are independent. But that's not the
>     case here, because the groups are very clearly correlated (with the
>     condition on "b").
>
>
> You're right.  The patch allows us to adjust the estimate of distinct
> values for appendrels using the new formula introduced in 84f9a35e3.
> But if the restrictions are correlated with the grouping expressions,
> the new formula does not behave well.  That's why the patch does not
> help in case [1], where 'b' and 'c' are correlated.
>
> OTOH, if the restrictions are not correlated with the grouping
> expressions, the new formula would perform quite well.  And in this case
> the patch would help a lot, as shown in [2] where estimate_num_groups()
> gives a much more accurate estimate with the help of this patch.
>
> So this patch could be useful in certain situations.  I'm wondering if
> we should at least have this patch (if it is right).
>

I do agree the patch seems to do the right thing, and it's worth pushing
on its own.

>
>     If that's the case, I'm not sure how to fix this :-(
>
>
> The commit message of 84f9a35e3 says
>
>     This could possibly be improved upon in the future by identifying
>     correlated restrictions and using a hybrid of the old and new
>     formulae.
>
> Maybe this is something we can consider trying.  But anyhow this is not
> an easy task I suppose.

Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)

The challenge is where to get usable information about the correlation
between columns. I only have a couple of very rough ideas of what we
might try. For example, if we have multi-column ndistinct statistics, we might
look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from

    ndistinct(b,c,d) / ndistinct(b,c)

If we know how many distinct values we have for the predicate column, we
could then estimate the number of groups. I mean, we know that for the
restriction "WHERE b = 3" we only have 1 distinct value, so we could
estimate the number of groups as

    1 * ndistinct(b,c)

I'm well aware this is only a very trivial example, and for more complex
examples it's likely way more complicated. But hopefully it illustrates
the general idea.
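
To make that arithmetic concrete (the statistics values here are made up,
and this is only a sketch of the deduction, not planner code):

    # Invented extended-statistics values:
    nd_bc  = 20.0    # ndistinct(b,c)
    nd_bcd = 80.0    # ndistinct(b,c,d)

    # On average, each distinct (b,c) pair co-occurs with this many
    # distinct values of d:
    per_pair = nd_bcd / nd_bc              # = 4.0

    # A restriction that leaves k distinct values on the predicate
    # columns then suggests roughly k * per_pair groups:
    k = 1                                  # e.g. "WHERE b = 3"
    est_groups = k * per_pair              # = 4.0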

The other idea would be to maybe look at multi-column MCV, and try using
it to deduce cross-column correlation too (it could be more accurate for
arbitrary predicates).

And finally, we might try being a bit more pessimistic and look at what
the "worst case" behavior could be. So for example instead of trying to
estimate the real number of groups, we'd ask "What's the minimum number
of groups we're likely to get?". And use that to cost the incremental
sort. But I don't think we do this elsewhere, and I'm not sure we want
to start with this risk-based approach here. It might be OK, because we
usually expect the incremental sort to be much cheaper, ...

Is this something you would be interested in exploring? I don't have the
capacity to work on this myself, but I'd be available for reviews,
feedback and so on.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

