Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
От | Tomas Vondra |
---|---|
Тема | Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog |
Дата | |
Msg-id | 76f80a24-fe89-c6f9-5e79-239a02e8a2b7@enterprisedb.com обсуждение исходный текст |
Ответ на | Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog (Bruce Momjian <bruce@momjian.us>) |
Ответы |
Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog
|
Список | pgsql-hackers |
On 5/25/22 00:16, Bruce Momjian wrote: > On Mon, May 16, 2022 at 12:09:41AM +0200, Tomas Vondra wrote: >> I think it's an interesting idea. In principle it allows deducing the >> multi-column MCV for arbitrary combination of columns, not determined in >> advance. We'd have the MCV with HLL instead of frequencies for columns >> A, B and C: >> >> (a1, hll(a1)) >> (a2, hll(a2)) >> (...) >> (aK, hll(aK)) >> >> >> (b1, hll(b1)) >> (b2, hll(b2)) >> (...) >> (bL, hll(bL)) >> >> (c1, hll(c1)) >> (c2, hll(c2)) >> (...) >> (cM, hll(cM)) >> >> and from this we'd be able to build MCV for any combination of those >> three columns. > > Sorry, but I am lost here. I read about HLL here: > > https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869 > > However, I don't see how they can be combined for multiple columns. It's the same as combining multiple HLL filters. HLL is essentially just an array of counters, and to calculate a union (i.e. HLL for elements in at least one of the input HLL sketches), you can just do Max() of the counters. For intersection, you have to use inclusion-exclusion principle, i.e. intersection(HLL1, HLL2) = estimate(HLL1) + estimate(HLL2) - estimate(union(HLL1,HLL2)) which is exactly the same as P(A & B) = P(A) + P(B) - P(A | B) There's more in: https://github.com/citusdata/postgresql-hll https://agkn.wordpress.com/2012/12/17/hll-intersections-2/ which also mentions the weakness - error is proportional to the size of the union, while the intersection may be much smaller. Which might be an issue, especially when combining multiple columns. > Above, I know A,B,C are columns, but what is a1, a2, etc? a1 is a value in column A, common enough to make it into the MCV. But instead of just a frequency, we store a HLL tracking unique rows (by adding CTID to the HLL). So for example for a "city" column, you'd have ("NY", HLL of CTIDs for rows with city = NY) ("Philadephia", HLL of CTIDs for rows with city = Philadelphia) ... etc. -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
В списке pgsql-hackers по дате отправления: