Обсуждение: Support loser tree for k-way merge

Поиск
Список
Период
Сортировка

Support loser tree for k-way merge

От
"cca5507"
Дата:
Hi hackers,

Background
==========
Now we use 'heap' during the k-way merge, it's O(n log k). The 'loser tree' is also O(n log k), but
it's usually has fewer comparisons than the 'heap'. When the tuple comparator is complex, the
'loser tree' can significantly speed up the k-way merge.

Test
====
With the WIP patch(v1-0001), I got a 3% ~ 13%(different work_mem) speed up in the following test case:

SET max_parallel_workers_per_gather = 0;
CREATE UNLOGGED TABLE t AS SELECT generate_series(1, 20000000) AS a, md5(random()::text) AS b;
create extension if not exists pg_prewarm;
select pg_prewarm('t');

SET enable_loser_tree = OFF;
# SET work_mem = '4MB'; ('8MB' '16MB' '32MB' '64MB' ...)
explain analyze select * from t order by b;

SET enable_loser_tree = ON;
# SET work_mem = '4MB'; ('8MB' '16MB' '32MB' '64MB' ...)
explain analyze select * from t order by b;

Open questions
==============
1) Now I add a GUC 'enable_loser_tree' to control the use of loser tree, maybe we should
decide whether to use the 'loser tree' based on the value of 'k', the complexity of tuple
comparators or just always use the 'loser tree'?

Looking forward to your reply and comment.

--
Regards,
ChangAo Chen

Вложения

Re: Support loser tree for k-way merge

От
Heikki Linnakangas
Дата:
On 03/12/2025 13:48, cca5507 wrote:
> With the WIP patch(v1-0001), I got a 3% ~ 13%(different work_mem) speed up in the following test case:

Nice speedup!

> 1) Now I add a GUC 'enable_loser_tree' to control the use of loser tree, maybe we should
> decide whether to use the 'loser tree' based on the value of 'k', the complexity of tuple
> comparators or just always use the 'loser tree'?

What is the worst case scenario for the loser tree, where the heap is 
faster? How big is the difference?

- Heikki




Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
Hi Heikki,

> What is the worst case scenario for the loser tree, where the heap is 
> faster? How big is the difference?

In tuplesort_heap_replace_top(), it has 2 comparisons each level, but it can early return
if the parent less than both child.

In tuplesort_loser_tree_adjust(), it has 1 comparison each level, but it can't early return.

So on specific data, the heap may be better than the loser tree. But I think the possibility
is very small.

--
Regards,
ChangAo Chen

Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
> So on specific data, the heap may be better than the loser tree. But I think the possibility
> is very small.

For example, a column contains all the same values, so the heap can always early return
when replacing top.

--
Regards,
ChangAo Chen

Re: Support loser tree for k-way merge

От
Sami Imseih
Дата:
Hi,

Thanks for raising this.

> Now we use 'heap' during the k-way merge, it's O(n log k). The 'loser tree' is also O(n log k), but
> it's usually has fewer comparisons than the 'heap'. When the tuple comparator is complex, the
> 'loser tree' can significantly speed up the k-way merge.

I was playing around with this and I do also see the performance
benefits that you
mention with loser tree.

```
DROP TABLE IF EXISTS t;
CREATE TABLE t (
    col1 INT,
    col2 INT
);


INSERT INTO t (col1, col2)
SELECT
    (i % 10000000),
    (i % 100)
FROM generate_series(1, 10000000) AS s(i);
```

Using something like the above, testing an "external merge" on the
high cardinality
column there is a bit of an improvement with loser tree, but there is
a bit of a loss on
low-cardinality.

In general though, I see a bit of an issue with allowing a GUC to
change strategies. What happens if there are multiple "external merge"
nodes and they can each benefit from different strategies? Maybe that's
not an issue since all the other enable_ GUCs have that problem, but
it is something to consider.

Can we drive the decision for what to do based on optimizer
stats, i.e. n_distinct and row counts? Not sure what the calculation would
be specifically, but something else to consider.

We can still provide the GUC to  override the optimizer decisions,
but at least the optimizer, given up-to-date stats, may get it right most
of the time.

--
Sami Imseih
Amazon Web Services (AWS)



Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
Hi,

Thank you for your reply.

> Can we drive the decision for what to do based on optimizer
> stats, i.e. n_distinct and row counts? Not sure what the calculation would
> be specifically, but something else to consider.
> 
> We can still provide the GUC to  override the optimizer decisions,
> but at least the optimizer, given up-to-date stats, may get it right most
> of the time.

That makes sense to me.

TODO
====
1) Consider optimizer statistics when deciding whether to use the heap or the loser tree.
2) Do we need a USEMEM() call to the array of losers?
3) Now the array length of losers is MAXORDER * 2, and in fact MAXORDER is enough, need some refactor of the code. (Is
itworth doing?)
 
4) Add more code comments and doc.

Help are welcome!

--
Regards,
ChangAo Chen

Re: Support loser tree for k-way merge

От
John Naylor
Дата:
On Thu, Dec 4, 2025 at 1:14 AM Sami Imseih <samimseih@gmail.com> wrote:
> Can we drive the decision for what to do based on optimizer
> stats, i.e. n_distinct and row counts? Not sure what the calculation would
> be specifically, but something else to consider.

It's happened multiple times before that someone proposes a change
that makes sorting faster on some inputs, but turns out to regress on
low cardinality (I've done it myself). It seems to be pretty hard not
to regress that case. Occasionally the author proposes to take
optimizer stats into account, and that was rejected because
cardinality stats are often wildly wrong.

Further, underestimation is far more common than overestimation, in
which case IIUC the planner would just continue to choose the existing
heap method.

> We can still provide the GUC to  override the optimizer decisions,
> but at least the optimizer, given up-to-date stats, may get it right most
> of the time.

I don't have much faith that people will properly set a GUC whose
effects depends on the input characteristics and memory settings.

The new method might be a better overall trade-off, but we'd need some
more comprehensive measurements to know what we're dealing with.

--
John Naylor
Amazon Web Services



Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
Hi,

I summarized the number of comparisons needed for different 'k':

k = 2, heap: 1, loser tree: 1
k = 3, heap: 2, loser tree: [1, 2]
k = 4, heap: [2, 3], loser tree: 2
k = 5, heap: [2, 4], loser tree: [2, 3]

So if k < 5, the loser tree is never worse than the heap for any input data. 

Thoughts?

--
Regards,
ChangAo Chen

Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
Hi,

Can we support loser tree first and set the default value of enable_loser_tree to off?

Anyway, I attach 2 patches. (pass check-world)

v2-0001: support tuplesort_heap_build() which can build a valid heap in O(n), better
than currently O(n log n). This also make beginmerge() more clear and simple when
supporting loser tree.

v2-0002: support the loser tree.

--
Regards,
ChangAo Chen

Вложения

Re: Support loser tree for k-way merge

От
John Naylor
Дата:
On Thu, Dec 4, 2025 at 1:11 PM cca5507 <cca5507@qq.com> wrote:
> I summarized the number of comparisons needed for different 'k':
>
> k = 2, heap: 1, loser tree: 1
> k = 3, heap: 2, loser tree: [1, 2]
> k = 4, heap: [2, 3], loser tree: 2
> k = 5, heap: [2, 4], loser tree: [2, 3]
>
> So if k < 5, the loser tree is never worse than the heap for any input data.

Please explain your notation. For starters, does "comparison" refer to
sortkey comparison or does it include checking for sentinel? If loser
tree can't early return, why is the number not always a constant?

If "k" is very small, I'm guessing the merge step is small compared to
sorting the individual runs, in which case it matters less which one
to use. That's just a guess, though -- we need structured testing.

On Fri, Dec 5, 2025 at 11:11 PM cca5507 <cca5507@qq.com> wrote:
> Can we support loser tree first and set the default value of enable_loser_tree to off?

If you, the patch author, cannot demonstrate how to choose this
setting, what makes you think our users can? (That said, a temporary
GUC is useful for testing)

Here's a half-baked idea: If the regressions are mostly in
low-cardinality inputs, is it possible to add a fast path that just
checks if the current key is the same as the last one?

--
John Naylor
Amazon Web Services



Re: Support loser tree for k-way merge

От
"cca5507"
Дата:
> I summarized the number of comparisons needed for different 'k':
>
> k = 2, heap: 1, loser tree: 1
> k = 3, heap: 2, loser tree: [1, 2]
> k = 4, heap: [2, 3], loser tree: 2
> k = 5, heap: [2, 4], loser tree: [2, 3]
>
> So if k < 5, the loser tree is never worse than the heap for any input data.

> Please explain your notation. For starters, does "comparison" refer to
> sortkey comparison or does it include checking for sentinel?

The "k" is "k-way merge", and the "comparisons" are the number of tuple comparisons
during one heap's sift down or loser tree's adjustment.

> If loser tree can't early return, why is the number not always a constant?

Because the leaf nodes are in the bottom two levels of the loser tree, and different levels
have different number of comparisons.

> If "k" is very small, I'm guessing the merge step is small compared to
> sorting the individual runs, in which case it matters less which one
> to use. That's just a guess, though -- we need structured testing.

Yeah, the loser tree at most reduce one tuple comparison in each adjustment if k < 5.

> If you, the patch author, cannot demonstrate how to choose this
> setting, what makes you think our users can? (That said, a temporary
> GUC is useful for testing)

Yeah, users may have no ideas about the characteristics of their data, so it's hard for them
to choose the setting.

> Here's a half-baked idea: If the regressions are mostly in
> low-cardinality inputs, is it possible to add a fast path that just
> checks if the current key is the same as the last one?

For heap, it reduces one tuple comparison if the keys are same and increase one if not.
For loser tree, it reduces many tuple comparisons (maybe tree's height - 1?) if the keys
are same and increase one if not. The bad case is all keys are different, so we still need
to decide when to use the fast path, it's hard I think.

--
Regards,
ChangAo Chen