Discussion: Use merge-based matching for MCVs in eqjoinsel
Hi hackers,
While analyzing planner performance on JOB with default_statistics_target = 1000, I noticed that a significant portion of planning time is spent inside the eqjoinsel() function. According to perf, in most JOB queries at default_statistics_target = 1000, eqjoinsel() is the most expensive function during planning, accounting for approximately 8% of total CPU time. At default_statistics_target = 10000, the planner spends up to 75% of its time inside eqjoinsel(), making it one of the primary bottlenecks.
This overhead is caused by the O(N^2) nested-loop comparison of MCVs in var1 = var2 clauses.
I propose an optimization: when the column datatype supports ordering (i.e., has < and >), we can sort both MCV lists and apply a merge-style algorithm to detect matches. This reduces runtime from O(N^2) to O(N log N), where N is the number of MCV entries. The patch also applies the same optimization to semi-join clauses, which show similar performance behavior.
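To illustrate the idea, here is a simplified, self-contained sketch (plain C, with ints standing in for Datums; in the real code the comparisons go through the type's comparison support function, and the frequencies of the matched entries are then combined exactly as eqjoinsel() does today):

/*
 * Toy illustration of the merge step: both MCV value lists are assumed
 * to be sorted ascending and duplicate-free (true for the MCV list of a
 * single column).  One pass marks which entries have a partner on the
 * other side, replacing the O(N^2) nested loop.
 */
#include <stdbool.h>
#include <stddef.h>

static void
merge_mark_matches(const int *vals1, bool *hasmatch1, size_t n1,
                   const int *vals2, bool *hasmatch2, size_t n2)
{
    size_t  i = 0,
            j = 0;

    while (i < n1 && j < n2)
    {
        if (vals1[i] < vals2[j])
            i++;
        else if (vals1[i] > vals2[j])
            j++;
        else
        {
            /* equal values: remember the match on both sides */
            hasmatch1[i++] = true;
            hasmatch2[j++] = true;
        }
    }
}

Sorting the two lists first costs O(N log N); the merge pass itself is linear.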
On JOB, this change reduces planner time in most queries with complex joins and large MCV lists, with no observable effect on plan quality. I’ve also attached bar charts showing per-query planner time before and after the patch for default_statistics_target = 100, 1000, and 10000, along with query numbers for reference.
Any feedback or suggestions are welcome!
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Attachments
Following up on my previous message about optimizing eqjoinsel() for Var1 = Var2 and semijoin clauses, I’d like to share more detailed performance results across different values of default_statistics_target on the JOB benchmark.
The performance improvement grows as the number of MCV entries increases (i.e., with higher default_statistics_target). The table below shows total planner time summed over all 113 queries in JOB for each setting of default_statistics_target, before and after applying the patch:
Total planner time across all JOB queries
=========================================
default_statistics_target | Before Patch (ms) | After Patch (ms)
--------------------------+-------------------+------------------
100 | 1828.433 | 1820.556
1000 | 2194.282 | 1963.110
2500 | 4606.705 | 2140.126
5000 | 16661.581 | 2616.109
7500 | 35988.569 | 3061.161
10000 | 66616.620 | 3504.144
For default_statistics_target below 1000, planning time is essentially unchanged before and after the patch. At higher settings the optimization reduces planner time substantially - roughly 19x at default_statistics_target = 10000 - while the generated plans and selectivity calculations remain unchanged. To illustrate this, the table below shows the 10 slowest JOB queries (by planning time), along with their planning times before and after applying the patch.
Top 10 slowest queries at default_statistics_target = 10000
===========================================================
Query | Before Patch (ms) | After Patch (ms)
------+--------------------+-------------------
29c | 1939.282 | 111.219
29b | 1939.237 | 100.109
29a | 1931.870 | 100.224
31c | 1622.255 | 67.609
30c | 1602.156 | 70.942
28b | 1521.026 | 84.058
30b | 1519.972 | 68.777
30a | 1518.014 | 70.529
28a | 1514.908 | 86.662
31a | 1507.303 | 63.579
As shown, the total planner time for these top 10 queries drops substantially with the optimization.
I’ve identified and fixed two issues in the original v1 patch: in eqjoinsel_semi() the second MCV array was allocated with an incorrect size, and the initialization of FunctionCallInfoData is now done once, outside the comparator compare_mcv_items(), to avoid repeated setup during sorting. The updated v2 patch is attached.
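For context, the pattern described above looks roughly like this (a simplified sketch, not the actual patch; it assumes the values are sorted through the type's btree comparison support function, and the helper name sort_mcv_values() is made up):

/* Comparator reused by qsort_arg(); the FunctionCallInfo is set up once
 * by the caller, and only the arguments change per comparison. */
static int
compare_mcv_items(const void *a, const void *b, void *arg)
{
    FunctionCallInfo fcinfo = (FunctionCallInfo) arg;

    fcinfo->args[0].value = *(const Datum *) a;
    fcinfo->args[0].isnull = false;
    fcinfo->args[1].value = *(const Datum *) b;
    fcinfo->args[1].isnull = false;
    fcinfo->isnull = false;

    return DatumGetInt32(FunctionCallInvoke(fcinfo));
}

/* Hypothetical helper: initialize the fcinfo once, then sort. */
static void
sort_mcv_values(Datum *values, int nvalues, Oid cmp_proc_oid, Oid collation)
{
    FmgrInfo        cmp_flinfo;
    FunctionCallInfo fcinfo;

    fmgr_info(cmp_proc_oid, &cmp_flinfo);
    fcinfo = palloc0(SizeForFunctionCallInfo(2));
    InitFunctionCallInfoData(*fcinfo, &cmp_flinfo, 2, collation, NULL, NULL);

    qsort_arg(values, nvalues, sizeof(Datum), compare_mcv_items, fcinfo);

    pfree(fcinfo);
}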
Any suggestions?
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Attachments
Following up on my previous messages about optimizing eqjoinsel() and eqjoinsel_semi() for Var1 = Var2 clauses, I’d like to share detailed profiling results showing the effect of the patch on JOB for different values of default_statistics_target.
The first table shows the total planner time (summed over all 113 queries) before and after applying the patch, along with the speedup achieved:
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------+---------------------+---------------------+--------------------
100 | 1.00x | 1828.433 | 1820.556
1000 | 1.12x | 2194.282 | 1963.110
2500 | 2.15x | 4606.705 | 2140.126
5000 | 6.37x | 16661.581 | 2616.109
7500 | 11.76x | 35988.569 | 3061.161
10000 | 19.01x | 66616.620 | 3504.144
The second table shows perf profiles of eqjoinsel(), demonstrating that the function, which dominates planning time at high statistics targets, becomes essentially negligible after the patch:
default_statistics_target | eqjoinsel() Before (perf) | eqjoinsel() After (perf)
--------------------------+---------------------------+--------------------------
100 | 0.01% | 0.04%
1000 | 6.23% | 0.06%
2500 | 35.45% | 0.23%
5000 | 66.14% | 0.53%
7500 | 72.70% | 0.97%
10000 | 75.42% | 1.25%
I’ve attached v3 of the patch. This version adds a check for NULL values when comparing MCV entries, ensuring correctness in edge cases.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Attachments
Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> writes:
> I’ve attached v3 of the patch. This version adds a check for NULL values
> when comparing MCV entries, ensuring correctness in edge cases.

Um ... what edge cases would those be?  We do not put NULL into
MCV arrays.

			regards, tom lane
On 03.09.2025 23:26, Tom Lane wrote:
> Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> writes:
>> I’ve attached v3 of the patch. This version adds a check for NULL values
>> when comparing MCV entries, ensuring correctness in edge cases.
> Um ... what edge cases would those be? We do not put NULL into
> MCV arrays.

You're right - MCV arrays never contain NULLs. However, comparing two MCV
values could theoretically return NULL, even though this is very unlikely.
This check existed even before my changes, and similar checks are used in
other selectivity-estimation functions in 'selfuncs.c'.

...
fcinfo->isnull = false;
fresult = FunctionCallInvoke(fcinfo);
if (!fcinfo->isnull && DatumGetBool(fresult))
...

By "edge cases" I was referring to this situation; I probably did not
choose the best wording.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi Ilia!

On 29.07.2025 16:07, Ilia Evdokimov wrote:
>
> On 21.07.2025 16:55, Ilia Evdokimov wrote:
>>
>> While analyzing planner performance on JOB with
>> default_statistics_target = 1000, I noticed that a significant portion
>> of planning time is spent inside the eqjoinsel() function. According
>> to perf, in most JOB queries at default_statistics_target = 1000,
>> eqjoinsel() is the most expensive function during planning, accounting
>> for approximately 8% of total CPU time. At default_statistics_target =
>> 10000, the planner spends up to 75% of its time inside eqjoinsel(),
>> making it one of the primary bottlenecks.
>>
>> This overhead is caused by the O(N^2) nested-loop comparison of MCVs
>> in var1 = var2 clauses.

Thanks for working on this. I've wanted to submit a patch for the very
same issue for a while. I've come across this issue multiple times in
the field.

>> I propose an optimization: when the column datatype supports
>> ordering (i.e., has < and >), we can sort both MCV lists and apply a
>> merge-style algorithm to detect matches. This reduces runtime from
>> O(N^2) to O(N log N), where N is the number of MCV entries. The patch
>> also applies the same optimization to semi-join clauses, which show
>> similar performance behavior.

Why do you sort both lists and then merge instead of putting the smaller
list into a hash map and then doing hash lookups (if the type is
hashable)?

There are more problems like this in the planner. For example
col IN (many values) is also quadratic because for every value in the
IN list all MCVs are checked. It would be great to fix this as well.

--
David Geier
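To make the hash-lookup suggestion concrete, here is a minimal self-contained sketch (plain C, with ints standing in for Datums; names are illustrative, and an actual patch would go through the datatype's hash and equality support functions and one of the backend hash table implementations):

/*
 * Toy illustration: insert the (smaller) first MCV list into an
 * open-addressing hash table, then probe it with the second list,
 * marking matched entries on both sides.  Expected cost is O(N1 + N2)
 * instead of O(N1 * N2).
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct McvSlot
{
    int     value;
    size_t  index;              /* position in the first MCV list */
    bool    used;
} McvSlot;

static size_t
slot_for(int value, size_t nslots)
{
    /* toy hash; real code would call the type's hash support function */
    return ((unsigned int) value * 2654435761u) % nslots;
}

static void
hash_mark_matches(const int *vals1, bool *hasmatch1, size_t n1,
                  const int *vals2, bool *hasmatch2, size_t n2)
{
    size_t   nslots = n1 * 2 + 1;   /* keep the load factor below 0.5 */
    McvSlot *slots = calloc(nslots, sizeof(McvSlot));

    /* build phase: MCV lists are duplicate-free, so every value is unique */
    for (size_t i = 0; i < n1; i++)
    {
        size_t  s = slot_for(vals1[i], nslots);

        while (slots[s].used)
            s = (s + 1) % nslots;
        slots[s].value = vals1[i];
        slots[s].index = i;
        slots[s].used = true;
    }

    /* probe phase: stop at the first empty slot, i.e. no match */
    for (size_t j = 0; j < n2; j++)
    {
        size_t  s = slot_for(vals2[j], nslots);

        while (slots[s].used)
        {
            if (slots[s].value == vals2[j])
            {
                hasmatch1[slots[s].index] = true;
                hasmatch2[j] = true;
                break;
            }
            s = (s + 1) % nslots;
        }
    }

    free(slots);
}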
Hi!

On 03.09.2025 18:53, Ilia Evdokimov wrote:
> Following up on my previous messages about optimizing eqjoinsel() and
> eqjoinsel_semi() for Var1 = Var2 clauses, I’d like to share detailed
> profiling results showing the effect of the patch on JOB for different
> values of default_statistics_target.
>
> The first table shows the total planner time (summed over all 113
> queries) before and after applying the patch, along with the speedup
> achieved:
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------+---------------------+---------------------+--------------------
> 100                       | 1.00x               | 1828.433            | 1820.556
> 1000                      | 1.12x               | 2194.282            | 1963.110
> 2500                      | 2.15x               | 4606.705            | 2140.126
> 5000                      | 6.37x               | 16661.581           | 2616.109
> 7500                      | 11.76x              | 35988.569           | 3061.161
> 10000                     | 19.01x              | 66616.620           | 3504.144

It looks to me like these results are with optimizations disabled? Can
you share the SQL script you used for testing?

--
David Geier
Hi Ilia!

On 05.09.2025 16:03, David Geier wrote:
>>> I propose an optimization: when the column datatype supports
>>> ordering (i.e., has < and >), we can sort both MCV lists and apply a
>>> merge-style algorithm to detect matches. This reduces runtime from
>>> O(N^2) to O(N log N), where N is the number of MCV entries. The patch
>>> also applies the same optimization to semi-join clauses, which show
>>> similar performance behavior.
>
> Why do you sort both lists and then merge instead of putting the smaller
> list into a hash map and then doing hash lookups (if the type is hashable)?

I've tested your and my code with the following script:

CREATE TABLE foo(col TEXT);
CREATE TABLE bar(col TEXT);
SET default_statistics_target = 10000;

-- Generate MCV values. PostgreSQL doesn't store MCVs if the table has
-- only a single value or every value has exactly the same cardinality.
DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        FOR j IN 1..least(i, 50) LOOP
            INSERT INTO foo VALUES ('aaaaaaaaaaaaaaaaaaaa' || i::TEXT);
            INSERT INTO bar VALUES ('aaaaaaaaaaaaaaaaaaaa' || (i + 100000)::TEXT);
        END LOOP;
    END LOOP;
END;
$$;

ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo f, bar b WHERE f.col = b.col;

Results are:

- master:      433 ms
- Order+Merge:  11 ms
- Hash map:      4 ms

I've attached my draft patch.

--
David Geier
Attachments
On 08.09.2025 13:08, David Geier wrote:
> Hi Ilia!
>
> On 05.09.2025 16:03, David Geier wrote:
>>>> I propose an optimization: when the column datatype supports
>>>> ordering (i.e., has < and >), we can sort both MCV lists and apply a
>>>> merge-style algorithm to detect matches. This reduces runtime from
>>>> O(N^2) to O(N log N), where N is the number of MCV entries. The patch
>>>> also applies the same optimization to semi-join clauses, which show
>>>> similar performance behavior.
>> Why do you sort both lists and then merge instead of putting the smaller
>> list into a hash map and then doing hash lookups (if the type is hashable)?
> I've tested your and my code with the following script:
>
> CREATE TABLE foo(col TEXT);
> CREATE TABLE bar(col TEXT);
> SET default_statistics_target = 10000;
>
> -- Generate MCV values. PostgreSQL doesn't store MCVs if the table has
> -- only a single value or every value has exactly the same cardinality.
> DO $$
> BEGIN
>     FOR i IN 1..10000 LOOP
>         FOR j IN 1..least(i, 50) LOOP
>             INSERT INTO foo VALUES ('aaaaaaaaaaaaaaaaaaaa' || i::TEXT);
>             INSERT INTO bar VALUES ('aaaaaaaaaaaaaaaaaaaa' || (i + 100000)::TEXT);
>         END LOOP;
>     END LOOP;
> END;
> $$;
>
> ANALYZE foo, bar;
> \timing on
> EXPLAIN SELECT * FROM foo f, bar b WHERE f.col = b.col;
>
> Results are:
>
> - master:      433 ms
> - Order+Merge:  11 ms
> - Hash map:      4 ms
>
> I've attached my draft patch.
>
> --
> David Geier

Hi David,

Thank you for reviewing.

I have read all the previous messages - and yes, you are right. I don’t
know why I didn’t consider using a hash table approach initially. Your
idea makes a lot of sense.

To evaluate it, I ran benchmarks on JOB with three variants:

$ ./benchmark.sh master
$ ./benchmark.sh merge
$ ./benchmark.sh hash

I compared total planning time across all 113 queries.

$ python3 planning_time.py master hash

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1886.969
1000                      | 1.09                | 2286.922            | 2100.099
2500                      | 1.94                | 4647.167            | 2400.711
5000                      | 6.15                | 17964.779           | 2919.914
7500                      | 10.58               | 38622.443           | 3650.375
10000                     | 16.33               | 69538.085           | 4257.864

$ python3 planning_time.py master merge

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1898.622
1000                      | 1.12                | 2286.922            | 2033.553
2500                      | 1.92                | 4647.167            | 2423.552
5000                      | 5.94                | 17964.779           | 3025.739
7500                      | 10.48               | 38622.443           | 3684.262
10000                     | 16.72               | 69538.085           | 4159.418

Based on these results, I’d prefer the hash lookup implementation, so I
think it makes sense to improve your patch further and bring it into
good shape. Shall I take care of that, or would you prefer to do it
yourself?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Attachments
On 08.09.2025 13:35, Ilia Evdokimov wrote:
> Based on these results, I’d prefer the hash lookup implementation, so
> I think it makes sense to improve your patch further and bring it into
> good shape. Shall I take care of that, or would you prefer to do it
> yourself?

I realized I mistakenly copied the wrong results for the hash-map
version in my previous draft. Sorry about that. Here are the correct
benchmark results:

Merge

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1898.622
1000                      | 1.12                | 2286.922            | 2033.553
2500                      | 1.92                | 4647.167            | 2423.552
5000                      | 5.94                | 17964.779           | 3025.739
7500                      | 10.48               | 38622.443           | 3684.262
10000                     | 16.72               | 69538.085           | 4159.418

Hash-Map

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1886.969
1000                      | 1.09                | 2286.922            | 2100.099
2500                      | 1.94                | 4647.167            | 2400.711
5000                      | 6.15                | 17964.779           | 2919.914
7500                      | 10.58               | 38622.443           | 3650.375
10000                     | 16.33               | 69538.085           | 4257.864

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi Ilia!

> I have read all the previous messages - and yes, you are right. I don’t
> know why I didn’t consider using a hash table approach initially. Your
> idea makes a lot of sense.

Your solution would be beneficial on top, for cases where the data type
is not hashable. But I think that's overkill for a v1. I would start
with the hash-based version.

> To evaluate it, I ran benchmarks on JOB with three variants:
>
> $ ./benchmark.sh master
> $ ./benchmark.sh merge
> $ ./benchmark.sh hash
>
> I compared total planning time across all 113 queries.

Was this running with optimizations? How did you extract the planning
time?

> $ python3 planning_time.py master hash
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1886.969
> 1000                      | 1.09                | 2286.922            | 2100.099
> 2500                      | 1.94                | 4647.167            | 2400.711
> 5000                      | 6.15                | 17964.779           | 2919.914
> 7500                      | 10.58               | 38622.443           | 3650.375
> 10000                     | 16.33               | 69538.085           | 4257.864
>
> $ python3 planning_time.py master merge
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1898.622
> 1000                      | 1.12                | 2286.922            | 2033.553
> 2500                      | 1.92                | 4647.167            | 2423.552
> 5000                      | 5.94                | 17964.779           | 3025.739
> 7500                      | 10.48               | 38622.443           | 3684.262
> 10000                     | 16.72               | 69538.085           | 4159.418

I would have expected the delta between the "merge" and "hash" variant
to be bigger, especially for default_statistics_target=10000. My small
test also showed that. Any idea why this is not showing in your results?

> Based on these results, I’d prefer the hash lookup implementation, so I
> think it makes sense to improve your patch further and bring it into
> good shape. Shall I take care of that, or would you prefer to do it
> yourself?

I think process-wise it's best if you review my code and I do the
changes.

Could you as part of your review test tables with just a few MCVs to
make sure we're not regressing "small" cases? For now I'm only bailing
if one of the two MCV lists has just a single value. I'm expecting the
gains from fine tuning this value to be not measurable but let's double
check.

--
David Geier
On 08.09.2025 12:45, Ilia Evdokimov wrote:
>
> I realized I mistakenly copied the wrong results for the hash-map
> version in my previous draft. Sorry about that. Here are the correct
> benchmark results:
>
> Merge
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1898.622
> 1000                      | 1.12                | 2286.922            | 2033.553
> 2500                      | 1.92                | 4647.167            | 2423.552
> 5000                      | 5.94                | 17964.779           | 3025.739
> 7500                      | 10.48               | 38622.443           | 3684.262
> 10000                     | 16.72               | 69538.085           | 4159.418
>
> Hash-Map
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1886.969
> 1000                      | 1.09                | 2286.922            | 2100.099
> 2500                      | 1.94                | 4647.167            | 2400.711
> 5000                      | 6.15                | 17964.779           | 2919.914
> 7500                      | 10.58               | 38622.443           | 3650.375
> 10000                     | 16.33               | 69538.085           | 4257.864

It still seems to me like something is fishy with the numbers, or
something in the benchmark adds a lot of overhead so that small
differences in eqjoinsel_inner() don't show here. The delta between
"hash" and "merge" for default_statistics_target=10000 should be the
biggest but it's actually slower.

--
David Geier
>> To evaluate it, I ran benchmarks on JOB with three variants:
>>
>> $ ./benchmark.sh master
>> $ ./benchmark.sh merge
>> $ ./benchmark.sh hash
>>
>> I compared total planning time across all 113 queries.
>
> Was this running with optimizations? How did you extract the planning
> time?
I save all query plans using EXPLAIN (SUMMARY ON), then go through all the plans, read the 'Planning Time' from each, and sum them up.
> I would have expected the delta between the "merge" and "hash" variant
> to be bigger, especially for default_statistics_target=10000. My small
> test also showed that. Any idea why this is not showing in your results?
So would I. With default_statistics_target = 10000 and the selectivity in the JOB queries being close to zero, the difference should be noticeable. I can only explain the previous results by cache-related effects on my machine.
I reran the benchmark on a clean cluster and collected the top slowest JOB queries — now the effect is clearly visible.
Merge (sum of all JOB queries)
==================
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100 | 1.00 | 1888.105 | 1879.431
1000 | 1.14 | 2282.239 | 2009.114
2500 | 2.10 | 5595.030 | 2668.530
5000 | 5.56 | 18544.933 | 3333.252
7500 | 9.17 | 37390.956 | 4076.390
10000 | 16.10 | 69319.479 | 4306.417
HashMap (sum of all JOB queries)
==================
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100 | 1.03 | 1888.105 | 1828.088
1000 | 1.18 | 2282.239 | 1939.884
2500 | 2.64 | 5595.030 | 2117.872
5000 | 7.80 | 18544.933 | 2377.206
7500 | 13.80 | 37390.956 | 2709.973
10000 | 23.32 | 69319.479 | 2973.073
Top 10 slowest JOB queries (default_statistics_target = 10000)
Query | master (ms) | merge (ms) | Hash (ms)
------+-------------+------------+-----------
29c | 1904.586 | 144.135 | 100.473
29b | 1881.392 | 117.891 | 89.028
29a | 1868.805 | 112.242 | 83.913
31c | 1867.234 | 76.498 | 56.140
30c | 1646.630 | 88.494 | 62.549
30b | 1608.820 | 84.821 | 64.603
31a | 1573.964 | 75.978 | 56.140
28a | 1457.738 | 95.939 | 77.309
28b | 1455.052 | 99.383 | 73.065
30a | 1416.699 | 91.057 | 62.549
BTW, the hashmap from your patch could also be applied to the eqjoinsel_semi() function.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi!

On 08.09.2025 15:45, Ilia Evdokimov wrote:
> I reran the benchmark on a clean cluster and collected the top slowest
> JOB queries — now the effect is clearly visible.
>
> Merge (sum of all JOB queries)
> ==================
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.00                | 1888.105            | 1879.431
> 1000                      | 1.14                | 2282.239            | 2009.114
> 2500                      | 2.10                | 5595.030            | 2668.530
> 5000                      | 5.56                | 18544.933           | 3333.252
> 7500                      | 9.17                | 37390.956           | 4076.390
> 10000                     | 16.10               | 69319.479           | 4306.417
>
> HashMap (sum of all JOB queries)
> ==================
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------------------------------------------------------------
> 100                       | 1.03                | 1888.105            | 1828.088
> 1000                      | 1.18                | 2282.239            | 1939.884
> 2500                      | 2.64                | 5595.030            | 2117.872
> 5000                      | 7.80                | 18544.933           | 2377.206
> 7500                      | 13.80               | 37390.956           | 2709.973
> 10000                     | 23.32               | 69319.479           | 2973.073
>
> Top 10 slowest JOB queries (default_statistics_target = 10000)
> Query | master (ms) | merge (ms) | Hash (ms)
> ------+-------------+------------+-----------
> 29c   | 1904.586    | 144.135    | 100.473
> 29b   | 1881.392    | 117.891    | 89.028
> 29a   | 1868.805    | 112.242    | 83.913
> 31c   | 1867.234    | 76.498     | 56.140
> 30c   | 1646.630    | 88.494     | 62.549
> 30b   | 1608.820    | 84.821     | 64.603
> 31a   | 1573.964    | 75.978     | 56.140
> 28a   | 1457.738    | 95.939     | 77.309
> 28b   | 1455.052    | 99.383     | 73.065
> 30a   | 1416.699    | 91.057     | 62.549

This looks much better. Very nice!

> BTW, the hashmap from your patch could also be applied to the
> eqjoinsel_semi() function.

Yep. The inner loop only runs until clamped_nvalues2 and it doesn't
compute matchprodfreq. I'll try to modify the patch such that it
accounts for these differences without being too hard to read.

Do you think anything else needs changes in the patch? Did you have a
chance to check tables with just a few MCVs or are there any queries in
the JOB which regress with very small default_statistics_target?

--
David Geier
Hi David,
> Do you think anything else needs changes in the patch?
From an architectural perspective, I think the patch is already in good shape. However, I have some suggestions regarding code style:
- I would move McvHashEntry, McvHashContext, the new hash table definition, hash_mcv and are_mcvs_equal to the top.
- I’m not sure get_hash_func_oid() is needed at all – it seems we could do without it.
- It would be better to name the parameters matchProductFrequencies -> matchprodfreq, nMatches -> nmatches, hasMatch1 -> hasmatch1, hasMatch2 -> hasmatch2 in eqjoinsel_inner_with_hashtable().
- As I wrote earlier, since we now have a dedicated function eqjoinsel_inner_with_hashtable(), perhaps it could be used in both eqjoinsel_inner() and eqjoinsel_semi(). And since the hash-based search was factored out, maybe it would make sense to also factor out the O(N^2) nested loop implementation?
- I think it would be helpful to add a comment explaining that using a hash table is not efficient when the MCV array length equals 1:
if (Min(statsSlot1->nvalues, statsSlot2->nvalues) == 1)
return false;
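For example, something along these lines (suggested wording only):

/*
 * Setting up and probing a hash table has a fixed cost, which cannot pay
 * off when one side contributes only a single MCV entry; fall back to the
 * plain nested-loop comparison in that case.
 */
if (Min(statsSlot1->nvalues, statsSlot2->nvalues) == 1)
    return false;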
> Did you have a chance to check tables with just a few MCVs or are there
> any queries in the JOB which regress with very small default_statistics_target?
Sure. I need some time to check this.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi!

Thanks for the review.

On 09.09.2025 09:29, Ilia Evdokimov wrote:
> From an architectural perspective, I think the patch is already in good
> shape. However, I have some suggestions regarding code style:
>
> 1. I would move McvHashEntry, McvHashContext, the new hash table
>    definition, hash_mcv and are_mcvs_equal to the top.

Done. I've also moved up try_eqjoinsel_with_hashtable().

> 2. I’m not sure get_hash_func_oid() is needed at all – it seems we
>    could do without it.

Removed.

> 3. It would be better to name the parameters matchProductFrequencies ->
>    matchprodfreq, nMatches -> nmatches, hasMatch1 -> hasmatch1,
>    hasMatch2 -> hasmatch2 in eqjoinsel_inner_with_hashtable().

Done.

> 4. As I wrote earlier, since we now have a dedicated function
>    eqjoinsel_inner_with_hashtable(), perhaps it could be used in both
>    eqjoinsel_inner() and eqjoinsel_semi().

Done. The gains for SEMI join are even bigger because the function is
called 3 times for e.g.

EXPLAIN SELECT * FROM foo f WHERE EXISTS (SELECT 1 FROM bar b WHERE f.col = b.col);

For that query the planning time for me goes from ~1300 ms -> 12 ms.

>    And since the hash-based search was factored out, maybe it would make
>    sense to also factor out the O(N^2) nested loop implementation?

Generally agreed and while tempting, I've refrained from doing the
refactoring. Let's better do this in a separate patch to keep things
simple.

> 5. I think it would be helpful to add a comment explaining that using a
>    hash table is not efficient when the MCV array length equals 1:
>
>    if (Min(statsSlot1->nvalues, statsSlot2->nvalues) == 1)
>        return false;

Done.

>> Did you have a chance to check tables with just a few MCVs or are there
>> any queries in the JOB which regress with very small default_statistics_target?
>
> Sure. I need some time to check this.

Could you please do that with the latest attached patch so that we test
it once more?

Could you also run once more the JOB benchmark to get some test coverage
on the SEMI join code (assuming it also uses SEMI joins)?

Once we've concluded on the above and there are no objections, I will
register this patch in the commit fest.

--
David Geier
Attachments
Hi!

On 09.09.2025 12:22, David Geier wrote:
> Could you please do that with the latest attached patch so that we test
> it once more?

LGTM. Yes, I'll test this patch.

> Could you also run once more the JOB benchmark to get some test coverage
> on the SEMI join code (assuming it also uses SEMI joins)?

Unfortunately, the JOB benchmark does not contain semi join nodes.
However, TPC-DS does. I'll look for the queries with the slowest planner
times there and check them.

I'll need some time to check both join and semi join cases with small
and large default_statistics_target. I'll share the results later.

> Once we've concluded on the above and there are no objections, I will
> register this patch in the commit fest.

Sure. No problem.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi hackers,

On 10.09.2025 16:56, Ilia Evdokimov wrote:
> Unfortunately, the JOB benchmark does not contain semi join nodes.
> However, TPC-DS does. I'll look for the queries with the slowest planner
> times there and check them.
>
> I'll need some time to check both join and semi join cases with small
> and large default_statistics_target. I'll share the results later.

JOIN
==============================

I’ve benchmarked the new implementation of eqjoinsel() with different
values of default_statistics_target. On small targets (1, 5, 10, 25, 50,
75, 100) the results are all within statistical noise, and I did not
observe any regressions. In my view, it’s reasonable to keep the current
condition that the hash table is not used for default_statistics_target
= 1. Raising that threshold does not seem useful.

Here are the results for JOB queries (where the effect of semi join is
not visible due to different data distributions):

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
------------------------------------------------------------------------------------------
1                         | 1.00                | 1846.643            | 1847.409
5                         | 1.00                | 1836.391            | 1828.318
10                        | 0.95                | 1841.750            | 1929.722
25                        | 0.99                | 1873.172            | 1890.741
50                        | 0.98                | 1869.897            | 1898.470
75                        | 1.02                | 1969.368            | 1929.521
100                       | 0.97                | 1857.890            | 1921.207
1000                      | 1.14                | 2279.700            | 1997.102
2500                      | 1.78                | 4682.658            | 2636.202
5000                      | 6.45                | 15943.696           | 2471.242
7500                      | 12.45               | 34350.855           | 2758.565
10000                     | 20.52               | 62519.342           | 3046.819

SEMI JOIN
==============================

Unfortunately, in TPC-DS it is not possible to clearly see improvements
for semi joins. To address this, I designed a synthetic example where
the data distribution forces the loop to run fully, without exiting
early, which makes the effect on semi joins more visible. In this setup,
I also ensured that the length of the MCV array is equal to the chosen
default_statistics_target.

CREATE TABLE t1 AS
SELECT CASE
           WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1
           ELSE (g % 1000000) + 10000
       END AS id
FROM generate_series(1, 3000000) g;

CREATE TABLE t2 AS
SELECT CASE
           WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001
           ELSE (g % 1000000) + 20000
       END AS id
FROM generate_series(1, 3000000) g;

ANALYZE t1, t2;

The results of the query are:

SELECT * FROM t1 WHERE id IN (SELECT id FROM t2);

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
------------------------------------------------------------------------------------------
1                         | 1.12                | 1.191               | 1.062
5                         | 1.02                | 0.493               | 0.481
10                        | 0.92                | 0.431               | 0.471
25                        | 1.27                | 0.393               | 0.309
50                        | 1.04                | 0.432               | 0.416
75                        | 0.96                | 0.398               | 0.415
100                       | 0.95                | 0.450               | 0.473
1000                      | 9.42                | 6.742               | 0.716
2500                      | 19.15               | 21.621              | 1.129
5000                      | 46.74               | 85.667              | 1.833
7500                      | 73.26               | 194.806             | 2.659
10000                     | 107.95              | 349.981             | 3.242

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
Hi David,

In the v2 patch, when the join is reversed we pass the commutator
operator Oid to eqjoinsel_semi(), and inside that function we
immediately call get_opcode(<commutator operator Oid>). Did you mean for
the function to take an operator Oid instead of an opfuncoid here?

If that was unintentional, perhaps the cleanest fix is to add a new
'operator' parameter to eqjoinsel_semi() so we can keep passing
'opfuncoid' as before and avoid changing the behavior.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com