Re: Use merge-based matching for MCVs in eqjoinsel
| От | David Geier |
|---|---|
| Тема | Re: Use merge-based matching for MCVs in eqjoinsel |
| Дата | |
| Msg-id | 3a4211f6-c6d6-4ff3-a5c5-231179e4e519@gmail.com обсуждение исходный текст |
| Ответ на | Re: Use merge-based matching for MCVs in eqjoinsel (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>) |
| Ответы |
Re: Use merge-based matching for MCVs in eqjoinsel
|
| Список | pgsql-hackers |
Hi Ilia! On 16.09.2025 17:52, Ilia Evdokimov wrote: > Hi hackers, > > On 10.09.2025 16:56, Ilia Evdokimov wrote: >> Unfortunately, the JOB benchmark does not contain semi join nodes. >> However, TPC-DS does. I'll look for the queries with slowest planner >> times there and check them. >> >> I'll need some time to check both join and semi join cases with small >> and large default_statistics_target. I'll share the results later. > > JOIN > ============================== > > I’ve benchmarked the new implementation of eqjoinsel() with different > values of default_statistics_target. On small targets (1, 5, 10, 25, 50, > 75, 100) the results are all within statistical noise, and I did not > observe any regressions. In my view, it’s reasonable to keep the current > condition that the hash table is not used for default_statistics_target > = 1. Raising that threshold does not seem useful. > > Here are the results for JOB queries (where the effect of semi join is > not visible due to different data distributions): > > default_statistics_target | Planner Speedup (×) | Planner Before (ms) | > Planner After (ms) > ------------------------------------------------------------------------------------------ > 1 | 1.00 | 1846.643 | > 1847.409 > 5 | 1.00 | 1836.391 | > 1828.318 > 10 | 0.95 | 1841.750 | > 1929.722 > 25 | 0.99 | 1873.172 | > 1890.741 > 50 | 0.98 | 1869.897 | > 1898.470 > 75 | 1.02 | 1969.368 | > 1929.521 > 100 | 0.97 | 1857.890 | > 1921.207 > 1000 | 1.14 | 2279.700 | > 1997.102 > 2500 | 1.78 | 4682.658 | > 2636.202 > 5000 | 6.45 | 15943.696 | > 2471.242 > 7500 | 12.45 | 34350.855 | > 2758.565 > 10000 | 20.52 | 62519.342 | > 3046.819 > Good that we've confirmed that. > SEMI JOIN > ============================== > > Unfortunately, in TPC-DS it is not possible to clearly see improvements > for semi joins. To address this, I designed a synthetic example where > the data distribution forces the loop to run fully, without exiting > early, which makes the effect on semi joins more visible. In this setup, > I also ensured that the length of the MCV array is equal to the chosen > default_statistics_target. > > CREATE TABLE t1 AS > SELECT CASE > WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1 > ELSE (g % 1000000) + 10000 > END AS id > FROM generate_series(1, 3000000) g; > > CREATE TABLE t2 AS > SELECT CASE > WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001 > ELSE (g % 1000000) + 20000 > END AS id > FROM generate_series(1, 3000000) g; > > ANALYZE t1, t2; > > The results of the query are: > > SELECT * FROM t1 > WHERE id IN (SELECT id FROM t2); > > default_statistics_target | Planner Speedup (×) | Planner Before (ms) | > Planner After (ms) > ------------------------------------------------------------------------------------------ > 1 | 1.12 | 1.191 | > 1.062 > 5 | 1.02 | 0.493 | > 0.481 > 10 | 0.92 | 0.431 | > 0.471 > 25 | 1.27 | 0.393 | > 0.309 > 50 | 1.04 | 0.432 | > 0.416 > 75 | 0.96 | 0.398 | > 0.415 > 100 | 0.95 | 0.450 | > 0.473 > 1000 | 9.42 | 6.742 | > 0.716 > 2500 | 19.15 | 21.621 | > 1.129 > 5000 | 46.74 | 85.667 | > 1.833 > 7500 | 73.26 | 194.806 | > 2.659 > 10000 | 107.95 | 349.981 | > 3.242 > That's some decent speedups, considering that it's planning time. Thanks for testing the code! -- David Geier
В списке pgsql-hackers по дате отправления: