Re: PoC: Using Count-Min Sketch for join cardinality estimation

From: Tomas Vondra
Subject: Re: PoC: Using Count-Min Sketch for join cardinality estimation
Date:
Msg-id: 95428c26-353e-9643-d76a-8764f68fb894@enterprisedb.com
In reply to: Re: PoC: Using Count-Min Sketch for join cardinality estimation  (John Naylor <john.naylor@enterprisedb.com>)
Responses: Re: PoC: Using Count-Min Sketch for join cardinality estimation  (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On 6/18/21 7:03 PM, John Naylor wrote:
> On Wed, Jun 16, 2021 at 8:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>  > Not really, but to be fair even for the histograms it's only really
>  > supported by "seems to work in practice" :-(
> 
> Hmm, we cite a theoretical result in analyze.c, but I don't know if 
> there is something better out there:
> 
>   * The following choice of minrows is based on the paper
>   * "Random sampling for histogram construction: how much is enough?"
>   * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
> 

True. I read that paper (a long time ago), and it certainly gives some
very interesting guidance and guarantees regarding relative error. And
now that I look at it, Theorems 5 & 6 and Corollary 1 do provide a way
to calculate the probability of a lower error (essentially, vary the f
and get the probability).

I still think there's a lot of reliance on practical experience,
because even with such strong limits, delta=0.5 for a histogram with
100 buckets representing 1e9 rows still leaves plenty of room for
error.
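
For reference, Corollary 1 (as quoted in the analyze.c comment) puts
the minimum sample size at

    r = 4 * k * ln(2*n/gamma) / f^2

Here's a quick back-of-the-envelope calculation for exactly that case,
just an illustration (reading the delta above as the relative bin-size
error f), not code from the patch:

#include <math.h>
#include <stdio.h>

/*
 * Minimum sample size per Corollary 1 of Chaudhuri, Motwani &
 * Narasayya: table size n, histogram size k, maximum relative
 * bin-size error f, error probability gamma.
 */
static double
min_sample_size(double n, double k, double f, double gamma)
{
    return 4.0 * k * log(2.0 * n / gamma) / (f * f);
}

int
main(void)
{
    /* k = 100 buckets, n = 1e9 rows, f = 0.5, gamma = 0.01 */
    printf("r = %.0f\n", min_sample_size(1e9, 100, 0.5, 0.01));

    return 0;
}

That's only about 41,600 sampled rows for a billion-row table, and
even then the guarantee is just "each bucket within 50% of its true
size, with 99% probability".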

> What is more troubling to me is that we set the number of MCVs to the 
> number of histograms. Since b5db1d93d2a6 we have a pretty good method of 
> finding the MCVs that are justified. When that first went in, I 
> experimented with removing the MCV limit and found it easy to create 
> value distributions that lead to thousands of MCVs. I guess the best 
> justification now for the limit is plan time, but if we have a sketch 
> also, we can choose one or the other based on a plan-time speed vs 
> accuracy tradeoff (another use for a planner effort guc). In that 
> scenario, for tables with many MCVs we would only use them for 
> restriction clauses.
> 

Sorry, I'm not sure what you mean by "we set the number of MCVs to the 
number of histograms" :-(

When you say "MCV limit" you mean that we limit the number of items to
the statistics target, right? I agree plan time is one concern - but
it's also about analyze, as we need a larger sample to build a larger
MCV list or histogram (as the paper you referenced shows).

I think the sketch is quite interesting for skewed data sets, where
the MCV list can represent only a small fraction of the data - exactly
because of the limit. For (close to) uniform data distributions we can
just use ndistinct estimates to get estimates that are better than
those from a sketch, I think.

So I think we should try using the MCV first, and then use sketches
for the rest of the data (or possibly for all the data, if one side
does not have an MCV).
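
To illustrate what I mean, this is the usual inner-product estimator
from the Cormode & Muthukrishnan paper - a minimal standalone sketch,
not code from the PoC patch (the depth/width constants are made up):

#include <stdint.h>

#define CM_DEPTH 8      /* number of hash functions */
#define CM_WIDTH 128    /* counters per hash function */

typedef struct CountMinSketch
{
    uint64_t    counters[CM_DEPTH][CM_WIDTH];
} CountMinSketch;

/*
 * Estimate the join size as the minimum, over the hash functions,
 * of the dot product of the corresponding counter rows. Both
 * sketches have to be built with the same hash functions.
 */
static uint64_t
cm_estimate_join(const CountMinSketch *a, const CountMinSketch *b)
{
    uint64_t    best = UINT64_MAX;

    for (int i = 0; i < CM_DEPTH; i++)
    {
        uint64_t    dot = 0;

        for (int j = 0; j < CM_WIDTH; j++)
            dot += a->counters[i][j] * b->counters[i][j];

        if (dot < best)
            best = dot;
    }

    return best;
}

With the MCV-first approach, the counters would only be built from the
non-MCV part of each side, so the dot product estimates just the
non-MCV contribution to the join.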

FWIW I think the sketch may be useful even for restriction clauses,
which is what the paper calls "point queries". Again, maybe this
should use the same correction depending on the ndistinct estimate.
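
For completeness, the point-query estimate is simply the smallest
counter the value hashes to (same struct as above; cm_hash here is a
toy stand-in for the pairwise-independent hash family a real sketch
would use):

/* Toy stand-in for the i-th hash of the sketch's hash family. */
static uint32_t
cm_hash(int row, uint64_t value)
{
    uint64_t    h = (value + (uint64_t) row + 1) * UINT64_C(0x9E3779B97F4A7C15);

    return (uint32_t) (h >> 32) % CM_WIDTH;
}

/*
 * Point query: every counter the value maps to overcounts (never
 * undercounts), so the minimum across the hash functions is the
 * tightest upper bound on the value's frequency.
 */
static uint64_t
cm_estimate_point(const CountMinSketch *cm, uint64_t value)
{
    uint64_t    best = UINT64_MAX;

    for (int i = 0; i < CM_DEPTH; i++)
    {
        uint64_t    c = cm->counters[i][cm_hash(i, value)];

        if (c < best)
            best = c;
    }

    return best;
}

The correction would then scale this upper bound based on the
ndistinct estimate, the same way as for joins.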


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


