Discussion: Support loser tree for k-way merge
Hi hackers,
Background
==========
Currently we use a binary heap during the k-way merge, which is O(n log k). The loser tree is
also O(n log k), but it usually needs fewer comparisons than the heap. When the tuple
comparator is expensive, the loser tree can significantly speed up the k-way merge.
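For readers who haven't met the structure before, here is a minimal self-contained sketch of a
loser tree merging K sorted runs. This is generic illustration code working on plain ints, not
the attached patch (which operates on SortTuples and tapes):
```
#include <stdio.h>
#include <limits.h>

#define K 4                     /* number of runs (a power of 2, for simplicity) */
#define DUMMY K                 /* sentinel run index, used only while building */

static int keys[K + 1];         /* current head key of each run; slot K is -infinity */
static int tree[K];             /* tree[0]: overall winner; tree[1..K-1]: losers */

/*
 * Replay the matches from leaf 's' up to the root after run 's' got a new
 * head key.  Exactly one key comparison per level, with no early exit.
 */
static void
adjust(int s)
{
    int         winner = s;

    for (int t = (s + K) / 2; t > 0; t /= 2)
    {
        if (keys[tree[t]] < keys[winner])
        {
            int         tmp = tree[t];  /* the node's old loser wins the match; */

            tree[t] = winner;           /* the previous winner is parked here */
            winner = tmp;
        }
    }
    tree[0] = winner;
}

int
main(void)
{
    int         runs[K][3] = {{5, 8, 9}, {3, 6, 20}, {7, 11, 12}, {2, 4, 10}};
    int         pos[K] = {0};

    /* build: every internal node starts out "won" by the -infinity sentinel */
    keys[DUMMY] = INT_MIN;
    for (int t = 1; t < K; t++)
        tree[t] = DUMMY;
    for (int s = 0; s < K; s++)
    {
        keys[s] = runs[s][0];
        adjust(s);
    }

    /* merge: pop the winner, refill from its run, replay only its path */
    for (int n = 0; n < K * 3; n++)
    {
        int         w = tree[0];

        printf("%d ", keys[w]);
        keys[w] = (++pos[w] < 3) ? runs[w][pos[w]] : INT_MAX;  /* run exhausted */
        adjust(w);
    }
    printf("\n");               /* prints: 2 3 4 5 6 7 8 9 10 11 12 20 */
    return 0;
}
```
The win over a heap is that each replacement touches only the single leaf-to-root path and
needs one comparison per level, where a heap's sift-down needs up to two.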
Test
====
With the WIP patch (v1-0001), I got a 3% ~ 13% speedup (depending on work_mem) in the following test case:
SET max_parallel_workers_per_gather = 0;
CREATE UNLOGGED TABLE t AS SELECT generate_series(1, 20000000) AS a, md5(random()::text) AS b;
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('t');
SET enable_loser_tree = OFF;
SET work_mem = '4MB';  -- also tested with '8MB', '16MB', '32MB', '64MB', ...
EXPLAIN ANALYZE SELECT * FROM t ORDER BY b;
SET enable_loser_tree = ON;
SET work_mem = '4MB';  -- also tested with '8MB', '16MB', '32MB', '64MB', ...
EXPLAIN ANALYZE SELECT * FROM t ORDER BY b;
Open questions
==============
1) For now I added a GUC 'enable_loser_tree' to control the use of the loser tree. Should we
instead decide based on the value of 'k' or the complexity of the tuple comparator, or just
always use the loser tree?

Looking forward to your replies and comments.
--
Regards,
ChangAo Chen
Attachments
On 03/12/2025 13:48, cca5507 wrote:

> With the WIP patch (v1-0001), I got a 3% ~ 13% speedup (depending on work_mem) in the following test case:

Nice speedup!

> 1) For now I added a GUC 'enable_loser_tree' to control the use of the loser tree. Should we
> instead decide based on the value of 'k' or the complexity of the tuple comparator, or just
> always use the loser tree?

What is the worst case scenario for the loser tree, where the heap is
faster? How big is the difference?

- Heikki
Hi Heikki,

> What is the worst case scenario for the loser tree, where the heap is
> faster? How big is the difference?

In tuplesort_heap_replace_top(), there are 2 comparisons per level, but it can return early
if the parent is less than both children. In tuplesort_loser_tree_adjust(), there is 1
comparison per level, but it can't return early.

So on specific data, the heap may be better than the loser tree. But I think the possibility
is very small.

--
Regards,
ChangAo Chen
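To make the contrast concrete, here is the heap side in the same generic-int style as the
loser-tree sketch upthread (again illustration code, not tuplesort.c itself): up to two key
comparisons per level, with the early exit ChangAo describes.
```
/*
 * Generic sift-down after replacing the heap's top, mirroring the shape of
 * tuplesort_heap_replace_top(): compare the two children to find the smaller
 * one, compare it against the parent, and stop as soon as the heap property
 * already holds.  On all-equal keys it exits after the very first level.
 */
void
heap_replace_top(int *heap, int n, int newkey)
{
    int         i = 0;

    heap[0] = newkey;
    for (;;)
    {
        int         child = 2 * i + 1;
        int         tmp;

        if (child >= n)
            break;
        /* comparison 1: pick the smaller child (when there are two) */
        if (child + 1 < n && heap[child + 1] < heap[child])
            child++;
        /* comparison 2: early exit once the parent is small enough */
        if (heap[i] <= heap[child])
            break;
        tmp = heap[i];
        heap[i] = heap[child];
        heap[child] = tmp;
        i = child;
    }
}
```
That early exit is exactly why all-equal input favors the heap, as the next message points out.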
> So on specific data, the heap may be better than the loser tree. But I think the possibility
> is very small.

For example, if a column contains all the same values, the heap can always return early when
replacing the top.

--
Regards,
ChangAo Chen
Hi,
Thanks for raising this.
> Currently we use a binary heap during the k-way merge, which is O(n log k). The loser tree is
> also O(n log k), but it usually needs fewer comparisons than the heap. When the tuple
> comparator is expensive, the loser tree can significantly speed up the k-way merge.
I was playing around with this and I also see the performance benefits
that you mention with the loser tree.
```
DROP TABLE IF EXISTS t;
CREATE TABLE t (
col1 INT,
col2 INT
);
INSERT INTO t (col1, col2)
SELECT
(i % 10000000),
(i % 100)
FROM generate_series(1, 10000000) AS s(i);
```
Using something like the above to test an "external merge" sort, there is
a modest improvement with the loser tree on the high-cardinality column,
but a modest loss on the low-cardinality one.
In general though, I see a bit of an issue with allowing a GUC to
change strategies. What happens if there are multiple "external merge"
nodes and they can each benefit from different strategies? Maybe that's
not an issue since all the other enable_ GUCs have that problem, but
it is something to consider.
Can we drive the decision for what to do based on optimizer
stats, i.e. n_distinct and row counts? Not sure what the calculation would
be specifically, but something else to consider.
We can still provide the GUC to override the optimizer decisions,
but at least the optimizer, given up-to-date stats, may get it right most
of the time.
--
Sami Imseih
Amazon Web Services (AWS)
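For concreteness, here is one purely hypothetical shape a stats-driven choice could take. None
of the names, signatures, or thresholds below come from the patch or from PostgreSQL; the 1%
cutoff in particular is an arbitrary placeholder:
```
#include <stdbool.h>

/*
 * Hypothetical decision function -- the thread has not settled on any
 * formula, and every constant here is a guess.
 */
bool
choose_loser_tree(int nruns, double n_distinct, double reltuples)
{
    /* small merges: the loser tree is never worse (see the analysis downthread) */
    if (nruns < 5)
        return true;

    /* low cardinality favors the heap's early exit on equal keys */
    if (reltuples > 0 && n_distinct / reltuples < 0.01)
        return false;

    return true;
}
```
As the reply below notes, the catch is that cardinality estimates are often badly wrong, which
is why this kind of approach has been rejected before.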
Hi,

Thank you for your reply.

> Can we drive the decision for what to do based on optimizer
> stats, i.e. n_distinct and row counts? Not sure what the calculation would
> be specifically, but something else to consider.
>
> We can still provide the GUC to override the optimizer decisions,
> but at least the optimizer, given up-to-date stats, may get it right most
> of the time.

That makes sense to me.

TODO
====
1) Consider optimizer statistics when deciding whether to use the heap or the loser tree.
2) Do we need a USEMEM() call for the array of losers?
3) Now the array length of losers is MAXORDER * 2, when in fact MAXORDER is enough; this
needs some refactoring of the code. (Is it worth doing?)
4) Add more code comments and docs.

Help is welcome!

--
Regards,
ChangAo Chen
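On TODO item 2, the accounting would presumably follow the pattern tuplesort.c already uses
for its other allocations: palloc the array, then charge its actual chunk size against the
sort's memory budget. A sketch only; "state->losers" is a guess at the patch's field name:
```
/*
 * Sketch: charge the loser array against the sort's memory accounting,
 * using the existing USEMEM()/GetMemoryChunkSpace() pattern from
 * tuplesort.c.  The field name "losers" is hypothetical.
 */
state->losers = (int *) palloc(MAXORDER * 2 * sizeof(int));
USEMEM(state, GetMemoryChunkSpace(state->losers));
```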
On Thu, Dec 4, 2025 at 1:14 AM Sami Imseih <samimseih@gmail.com> wrote:

> Can we drive the decision for what to do based on optimizer
> stats, i.e. n_distinct and row counts? Not sure what the calculation would
> be specifically, but something else to consider.

It's happened multiple times before that someone proposes a change that
makes sorting faster on some inputs, but turns out to regress on low
cardinality (I've done it myself). It seems to be pretty hard not to
regress that case. Occasionally the author proposes to take optimizer
stats into account, and that was rejected because cardinality stats are
often wildly wrong. Further, underestimation is far more common than
overestimation, in which case IIUC the planner would just continue to
choose the existing heap method.

> We can still provide the GUC to override the optimizer decisions,
> but at least the optimizer, given up-to-date stats, may get it right most
> of the time.

I don't have much faith that people will properly set a GUC whose
effects depend on the input characteristics and memory settings.

The new method might be a better overall trade-off, but we'd need some
more comprehensive measurements to know what we're dealing with.

--
John Naylor
Amazon Web Services
Hi,

I summarized the number of comparisons needed for different 'k':

k = 2, heap: 1,      loser tree: 1
k = 3, heap: 2,      loser tree: [1, 2]
k = 4, heap: [2, 3], loser tree: 2
k = 5, heap: [2, 4], loser tree: [2, 3]

So if k < 5, the loser tree is never worse than the heap for any input data. Thoughts?

--
Regards,
ChangAo Chen
Hi,

Can we support the loser tree first and set the default value of enable_loser_tree to off?

Anyway, I attach 2 patches (they pass check-world):

v2-0001: support tuplesort_heap_build(), which can build a valid heap in O(n), better than the
current O(n log n). This also makes beginmerge() clearer and simpler when supporting the
loser tree.

v2-0002: support the loser tree.

--
Regards,
ChangAo Chen
Attachments
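The O(n) build in v2-0001 is presumably the classic bottom-up heapify (Floyd's algorithm). A
generic sketch of that technique, not the patch's actual tuplesort_heap_build():
```
/*
 * Bottom-up heapify (Floyd, 1964): sift down each internal node, starting
 * from the last one.  Most nodes sit near the leaves and sift only a short
 * distance, so the total work is O(n), versus O(n log n) for inserting the
 * elements one at a time.
 */
void
heap_build(int *heap, int n)
{
    for (int i = n / 2 - 1; i >= 0; i--)
    {
        int         parent = i;

        for (;;)
        {
            int         child = 2 * parent + 1;
            int         tmp;

            if (child >= n)
                break;
            if (child + 1 < n && heap[child + 1] < heap[child])
                child++;
            if (heap[parent] <= heap[child])
                break;
            tmp = heap[parent];
            heap[parent] = heap[child];
            heap[child] = tmp;
            parent = child;
        }
    }
}
```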
On Thu, Dec 4, 2025 at 1:11 PM cca5507 <cca5507@qq.com> wrote:

> I summarized the number of comparisons needed for different 'k':
>
> k = 2, heap: 1,      loser tree: 1
> k = 3, heap: 2,      loser tree: [1, 2]
> k = 4, heap: [2, 3], loser tree: 2
> k = 5, heap: [2, 4], loser tree: [2, 3]
>
> So if k < 5, the loser tree is never worse than the heap for any input data.

Please explain your notation. For starters, does "comparison" refer to
sortkey comparison or does it include checking for sentinel? If the loser
tree can't return early, why is the number not always a constant?

If "k" is very small, I'm guessing the merge step is small compared to
sorting the individual runs, in which case it matters less which one
to use. That's just a guess, though -- we need structured testing.

On Fri, Dec 5, 2025 at 11:11 PM cca5507 <cca5507@qq.com> wrote:

> Can we support the loser tree first and set the default value of enable_loser_tree to off?

If you, the patch author, cannot demonstrate how to choose this
setting, what makes you think our users can? (That said, a temporary
GUC is useful for testing.)

Here's a half-baked idea: if the regressions are mostly on
low-cardinality inputs, is it possible to add a fast path that just
checks whether the current key is the same as the last one?

--
John Naylor
Amazon Web Services
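Extending the generic loser-tree sketch from earlier in the thread, that fast path could look
like this (hypothetical, not from the patch):
```
/* assumes keys[], tree[] and adjust() from the loser-tree sketch upthread */
extern int  keys[];
extern int  tree[];
extern void adjust(int s);

/*
 * Hypothetical fast path: if the refilled key equals the key we just
 * emitted, it is still <= every other run's head, so it remains the
 * winner and the leaf-to-root replay can be skipped entirely.  Costs
 * one extra comparison whenever the keys differ.
 */
void
refill_and_adjust(int run, int newkey, int lastkey)
{
    keys[run] = newkey;
    if (newkey == lastkey)
    {
        tree[0] = run;          /* still the global minimum */
        return;
    }
    adjust(run);                /* otherwise, replay the path as usual */
}
```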
>> I summarized the number of comparisons needed for different 'k':
>>
>> k = 2, heap: 1,      loser tree: 1
>> k = 3, heap: 2,      loser tree: [1, 2]
>> k = 4, heap: [2, 3], loser tree: 2
>> k = 5, heap: [2, 4], loser tree: [2, 3]
>>
>> So if k < 5, the loser tree is never worse than the heap for any input data.
>
> Please explain your notation. For starters, does "comparison" refer to
> sortkey comparison or does it include checking for sentinel?

The "k" is the k in "k-way merge", and the "comparisons" are the number of tuple comparisons
during one heap sift-down or one loser tree adjustment.

> If the loser tree can't return early, why is the number not always a constant?

Because the leaf nodes are in the bottom two levels of the loser tree, and different levels
have different numbers of comparisons.

> If "k" is very small, I'm guessing the merge step is small compared to
> sorting the individual runs, in which case it matters less which one
> to use. That's just a guess, though -- we need structured testing.

Yeah, the loser tree saves at most one tuple comparison per adjustment if k < 5.

> If you, the patch author, cannot demonstrate how to choose this
> setting, what makes you think our users can? (That said, a temporary
> GUC is useful for testing.)

Yeah, users may have no idea about the characteristics of their data, so it's hard for them
to choose the setting.

> Here's a half-baked idea: if the regressions are mostly on
> low-cardinality inputs, is it possible to add a fast path that just
> checks whether the current key is the same as the last one?

For the heap, it saves one tuple comparison if the keys are the same and adds one if not. For
the loser tree, it saves many tuple comparisons (maybe tree height - 1?) if the keys are the
same and adds one if not. The bad case is when all keys are different, so we still need to
decide when to use the fast path, which is hard I think.

--
Regards,
ChangAo Chen