Обсуждение: tuple radix sort

Поиск
Список
Период
Сортировка

tuple radix sort

От
John Naylor
Дата:
First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms

Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:

1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:

typedef struct
{
  void     *tuple;      /* the tuple itself */
  Datum    datum1;      /* value of first key column */

  union
  {
    struct
    {
      bool    isnull1;    /* is first key column NULL? */
      int     srctape;    /* source tape number */
    };
    Datum    cond_datum1; /* sort key for radix sort */
  };
} SortTuple;


4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.

Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:

drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint  -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;

select pg_prewarm('test');
set work_mem = '64MB';

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms


[1] https://github.com/skarupke/ska_sort/tree/master
[2] https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3] http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf

--
John Naylor
Amazon Web Services

Вложения

Re: tuple radix sort

От
Chao Li
Дата:

> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> I suspect the challenge
> will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves
that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values
evantest=# insert into test_multi select (random() * 4)::int as category, md5(random()::text) || md5(random()::text) as
namefrom generate_series(1, 1000000); 
evantest=# vacuum freeze test_multi;
evantest=# select count(*) from test_multi;
evantest=# set work_mem = '64MB’;

evantest-# \timing on
Timing is on.
evantest=# set wip_radix_sort = 'off';
Time: 0.403 ms
evantest=# \o /dev/null
evantest=# select * from test_multi order by category, name;
Time: 5607.336 ms (00:05.607)
evantest=# select * from test_multi order by category, name;
Time: 5703.555 ms (00:05.704)
evantest=# select * from test_multi order by category, name;
Time: 5692.644 ms (00:05.693)

evantest=# set wip_radix_sort = 'on';
Time: 0.859 ms
evantest=# select * from test_multi order by category, name;
Time: 5822.979 ms (00:05.823)
evantest=# select * from test_multi order by category, name;
Time: 5881.256 ms (00:05.881)
evantest=# select * from test_multi order by category, name;
Time: 5976.351 ms (00:05.976)
```

Roughly 5% slower for this corner case.

However, when I recreate the test table with high cardinality first column, wip_radix_sort seems still slower:

```
evantest=# \o
evantest=# drop table if exists test_multi;
DROP TABLE
evantest=# create unlogged table test_multi (category int, name text);
CREATE TABLE
evantest=# insert into test_multi
evantest-# select (random() * 1000000)::int as category,  md5(random()::text) || md5(random()::text) as name from
generate_series(1,1000000); 
INSERT 0 1000000
evantest=# vacuum freeze test_multi;
VACUUM
evantest=# select count(*) from test_multi;
  count
---------
 1000000
(1 row)

evantest=# select * from test_multi limit 5;
 category |                               name
----------+------------------------------------------------------------------
   607050 | c555126a5afea9f5ffe3880248c89944d211bc378f8c3b6d125b4360fe8619b7
   843579 | 69b5a1dba76f52ff238566a3f88315a7425116d5d271fb54701b6e49d4afd8ce
   106298 | a96e8674db219e12463ecdbb405b99c631767972e489093045c97976c17c6561
   621860 | 5e6739ea9f533f9cdb0b8db76e3d4ce63be6b2b612c8aff06c4b80451f8f2edc
   290110 | 56944320e5abd3a854fffdd185b969727e8d414448d440725a392cda4c6355c4
(5 rows)

evantest=# \timing on
Timing is on.

evantest=# \o /dev/null
evantest=# set wip_radix_sort = 'off';
Time: 0.904 ms
evantest=# select * from test_multi limit 5;
Time: 0.983 ms
evantest=# select * from test_multi order by category, name;
Time: 593.578 ms
evantest=# select * from test_multi order by category, name;
Time: 597.329 ms
evantest=# select * from test_multi order by category, name;
Time: 600.050 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.737 ms
evantest=# select * from test_multi order by category, name;
Time: 611.604 ms
evantest=# select * from test_multi order by category, name;
Time: 613.115 ms
evantest=# select * from test_multi order by category, name;
Time: 615.003 ms
```

This seems like a real regression.

Then I tried to only sort on the first column, yes, now radix is faster:

```
evantest=# set wip_radix_sort = 'off’;
evantest=# select * from test_multi order by category;
Time: 445.498 ms
evantest=# select * from test_multi order by category;
Time: 451.834 ms
evantest=# select * from test_multi order by category;
Time: 454.531 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.329 ms
evantest=# select * from test_multi order by category;
Time: 402.829 ms
evantest=# select * from test_multi order by category;
Time: 408.014 ms
evantest=# select * from test_multi order by category;
Time: 415.340 ms
evantest=# select * from test_multi order by category;
Time: 413.969 ms
```

Hope the test helps. (The test was run a MacBook M4. )

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

От
John Naylor
Дата:
On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
> > On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
> >
> > I suspect the challenge
> > will be multikey sorts when the first key has low cardinality.
>
> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that
provesthat: 
>
> ```
> evantest=# drop table if exists test_multi;
> evantest=# create unlogged table test_multi (category int, name text);
> — first column has only 4 distinct values

Thanks for testing. Note it's actually 5 because of rounding. Your
text also seems to have em-dashes and unicode apostrophes where it
should have dashes / single quotes. That's not great if you expect
others to try to reproduce. I'm also not thrilled about having to
remove your psql prompt.

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

Anyway, because this table is larger than my first example, the input
no longer fits into 64MB of work_mem and it switches to an external
merge sort. Normally I set work_mem to 1GB for testing sorts so I
don't have to think about it, but neglected to in my first email. I
don't know if that explains the disparity, but I've made that change
for my quick tests below.

> evantest=# \o /dev/null
> evantest=# select * from test_multi order by category, name;
[...]
> Roughly 5% slower for this corner case.

Seems fine for me on this old Intel laptop (min of 5 runs):

set wip_radix_sort = 'off';
2368.369

set wip_radix_sort = 'on';
2290.654

It's close enough that I'll want to test more closely at a range of
low-cardinality inputs. I haven't done any rigorous scripted testing
yet, so take this with a grain of salt.

> However, when I recreate the test table with high cardinality first column, wip_radix_sort seems still slower:

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 1000000)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

> evantest=# set wip_radix_sort = 'off';
> Time: 0.904 ms

> evantest=# select * from test_multi order by category, name;
> Time: 593.578 ms
> evantest=# select * from test_multi order by category, name;
> Time: 597.329 ms
> evantest=# select * from test_multi order by category, name;
> Time: 600.050 ms
>
> evantest=# set wip_radix_sort = 'on';
> Time: 0.737 ms
> evantest=# select * from test_multi order by category, name;
> Time: 611.604 ms
> evantest=# select * from test_multi order by category, name;
> Time: 613.115 ms
> evantest=# select * from test_multi order by category, name;
> Time: 615.003 ms
> ```
>
> This seems like a real regression.

It's better for me here (min of 5 again), although the time scanning
the table probably dominates:

off:
536.257

on:
471.345

--
John Naylor
Amazon Web Services



Re: tuple radix sort

От
Chao Li
Дата:

> On Oct 29, 2025, at 19:29, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>>>
>>> I suspect the challenge
>>> will be multikey sorts when the first key has low cardinality.
>>
>> As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that
provesthat: 
>>
>> ```
>> evantest=# drop table if exists test_multi;
>> evantest=# create unlogged table test_multi (category int, name text);
>> — first column has only 4 distinct values
>
> Thanks for testing. Note it's actually 5 because of rounding.

Yes, 0-4, totally 5.

> Your
> text also seems to have em-dashes and unicode apostrophes where it
> should have dashes / single quotes. That's not great if you expect
> others to try to reproduce.

I just copied the content from psql (running in iTerm). I did a Google search, and found that was because of Mac Mail’s
“smartquotes” substitution. Looks like even I manually type in a pair of single quotes, it still does the substitution.
Iwill try to see how to disable that, but I don’t want to switch to another mail app. 

> I'm also not thrilled about having to
> remove your psql prompt.
>

I just wanted to show my entire test process, so I simply copied all contents from psql. In future, I will remove psql
promptsfrom reproduce procedure. 

> drop table if exists test_multi;
> create unlogged table test_multi (category int, name text);
> insert into test_multi select (random() * 4)::int as category,
> md5(random()::text) || md5(random()::text) as name from
> generate_series(1, 1000000);
> vacuum freeze test_multi;
>
> Anyway, because this table is larger than my first example, the input
> no longer fits into 64MB of work_mem and it switches to an external
> merge sort. Normally I set work_mem to 1GB for testing sorts so I
> don't have to think about it, but neglected to in my first email.

I changed work_men to 1GB and reran the test. As the high cardinality data are still there, so I first reran with data:

```
evantest=# set work_mem = '1GB';
Time: 0.301 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 575.247 ms
evantest=# select * from test_multi order by category, name;
Time: 554.351 ms
evantest=# select * from test_multi order by category, name;
Time: 565.100 ms
evantest=#
evantest=# set wip_radix_sort = 'on';
Time: 0.752 ms
evantest=# select * from test_multi order by category, name;
Time: 558.057 ms
evantest=# select * from test_multi order by category, name;
Time: 565.542 ms
evantest=# select * from test_multi order by category, name;
Time: 559.973 ms
```

With radix_sort on and off, execution time are almost the same.

Then I restore the data to low cardinality, off is still faster than on:
```
evantest=# set wip_radix_sort = ‘off';
Time: 0.549 ms
evantest=# select * from test_multi order by category, name;
Time: 5509.075 ms (00:05.509)
evantest=# select * from test_multi order by category, name;
Time: 5553.566 ms (00:05.554)
evantest=# select * from test_multi order by category, name;
Time: 5598.595 ms (00:05.599)
evantest=# set wip_radix_sort = ‘on';
Time: 0.786 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5770.964 ms (00:05.771)
evantest=# select * from test_multi order by category, name;
Time: 5779.755 ms (00:05.780)
evantest=# select * from test_multi order by category, name;
Time: 5851.134 ms (00:05.851)
evantest=#
evantest=# set work_mem = '2GB’; # increasing work_mem to 2GB doesn’t help
Time: 0.404 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5781.005 ms (00:05.781)
evantest=# select * from test_multi order by category, name;
Time: 5826.025 ms (00:05.826)
evantest=# select * from test_multi order by category, name;
Time: 5937.919 ms (00:05.938)
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

От
Chao Li
Дата:

> On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:
>
> <v1-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch>

I just quick went through the code change. I guess I need more time to understand the entire logic, but I find a typo
thatmight effect the tests: 

```
+    Assert(last = first);
```

“=“ should be “=="

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: tuple radix sort

От
John Naylor
Дата:
On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:

> I changed work_men to 1GB and reran the test. As the high cardinality data are still there, so I first reran with
data:

> With radix_sort on and off, execution time are almost the same.

Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Speaking of tests, I forgot to mention that regression tests will fail
since in-place radix sort is an unstable sort, as qsort is as well,
but in a different way -- this is expected. In assert builds, the
patch verifies the sort after the fact with the standard comparator,
and will throw an error if it's wrong.

On Thu, Oct 30, 2025 at 9:19 AM Chao Li <li.evan.chao@gmail.com> wrote:
> I just quick went through the code change. I guess I need more time to understand the entire logic, but I find a typo
thatmight effect the tests: 
>
> ```
> +       Assert(last = first);
> ```
>
> “=“ should be “=="

Yes, you're quite right, thanks for spotting! I reran my stress test
that has randomly distributed NULLs and the assert still holds, so
nothing further to fix yet. The NULL partitioning part of the code
hasn't been well tested in its current form, and I may arrange things
so that that step and the datum conditioning step happen in a single
pass. I'm not yet sure if it's important enough to justify the
additional complexity.

--
John Naylor
Amazon Web Services



Re: tuple radix sort

От
Chao Li
Дата:

> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
>> I changed work_men to 1GB and reran the test. As the high cardinality data are still there, so I first reran with
data:
>
>> With radix_sort on and off, execution time are almost the same.
>
> Are you by chance running with asserts on? It's happened before, so I
> have to make sure. That makes a big difference here because I disabled
> diversion thresholds in assert builds so that regression tests (few
> cases with large inputs) cover the paths I want, in addition to my
> running a standalone stress test.
>

Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Re: tuple radix sort

От
David Rowley
Дата:
On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
> > On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
> > Are you by chance running with asserts on? It's happened before, so I
> > have to make sure. That makes a big difference here because I disabled
> > diversion thresholds in assert builds so that regression tests (few
> > cases with large inputs) cover the paths I want, in addition to my
> > running a standalone stress test.
>
> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Never expect anything meaningful to come from running performance
tests on Assert builds. You should always be rebuilding without
Asserts before doing performance tests.

David



Re: tuple radix sort

От
Chao Li
Дата:

> On Oct 30, 2025, at 13:01, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:
>>> On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
>>> Are you by chance running with asserts on? It's happened before, so I
>>> have to make sure. That makes a big difference here because I disabled
>>> diversion thresholds in assert builds so that regression tests (few
>>> cases with large inputs) cover the paths I want, in addition to my
>>> running a standalone stress test.
>>
>> Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.
>
> Never expect anything meaningful to come from running performance
> tests on Assert builds. You should always be rebuilding without
> Asserts before doing performance tests.
>

Sure, good to learn. Actually I am very new to PG development, so any guidance is greatly appreciated.

I just made a distclean, then configure without any parameter. Now, the overall execution time reduced ~10% than with
asserts.With the low cardinality data, off and on are very close: 

```
evantest=# set wip_radix_sort = 'off';
Time: 0.206 ms
evantest=# select * from test_multi order by category, name;
Time: 5070.277 ms (00:05.070)
evantest=# select * from test_multi order by category, name;
Time: 5158.748 ms (00:05.159)
evantest=# select * from test_multi order by category, name;
Time: 5072.708 ms (00:05.073)

evantest=# set wip_radix_sort = 'on';
Time: 0.177 ms
evantest=# select * from test_multi order by category, name;
Time: 4992.516 ms (00:04.993)
evantest=# select * from test_multi order by category, name;
Time: 5145.361 ms (00:05.145)
evantest=# select * from test_multi order by category, name;
Time: 5101.800 ms (00:05.102)

evantest=# \o
evantest=# show work_mem;
 work_mem
----------
 1GB
(1 row)

Time: 0.186 ms
evantest=# explain select * from test_multi order by category, name;
                                QUERY PLAN
---------------------------------------------------------------------------
 Sort  (cost=122003.84..124503.84 rows=1000000 width=69)
   Sort Key: category, name
   ->  Seq Scan on test_multi  (cost=0.00..22346.00 rows=1000000 width=69)
(3 rows)
```

And with high cardinality test data, on has a big win:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.174 ms
evantest=# select * from test_multi order by category, name;
Time: 353.702 ms
evantest=# select * from test_multi order by category, name;
Time: 375.549 ms
evantest=# select * from test_multi order by category, name;
Time: 367.967 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.147 ms
evantest=# select * from test_multi order by category, name;
Time: 279.537 ms
evantest=# select * from test_multi order by category, name;
Time: 278.114 ms
evantest=# select * from test_multi order by category, name;
Time: 284.273 ms
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: tuple radix sort

От
John Naylor
Дата:
I wrote:

> The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer.

I also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping branches for NULL comparisons and order reversal. I've done
this in v2. It makes sense to replace the three current
integer-comparison quicksorts with one.

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, requiring a pass-by-value datum that compares
like an integer. This is a bit of a risk, since it's possible a third
party fork could be doing something weird. Seems unlikely, but
something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a pass of radix sort on one byte to get a decent
threshold value. This is not quite fair, since the quicksort will then
be finished, but the radix sort could still need to recurse to the
next byte(s), so these number could underestimate the threshold. This
is just to get an idea.

The numbers are in RDTSC ticks per element sorted.

cardinality: 256
number of elements:  100   qsort: 35.4 radix: 49.2
number of elements:  200   qsort: 34.9 radix: 38.1
number of elements:  400   qsort: 42.4 radix: 34.4
number of elements:  800   qsort: 95.0 radix: 29.2
number of elements: 1600   qsort: 115.0 radix: 22.4
number of elements: 3200   qsort: 125.5 radix: 19.4
number of elements: 6400   qsort: 128.1 radix: 17.6

With the highest cardinality possible on a single byte, radix sort is
actually not bad at low inputs. Notice that the time per element is
consistently going down with larger inputs. Smaller inputs have large
constant overheads, made worse by my unrolling the counting step.

cardinality: 2
number of elements:  100   qsort: 09.2 radix: 28.0
number of elements:  200   qsort: 09.1 radix: 19.5
number of elements:  400   qsort: 10.4 radix: 15.7
number of elements:  800   qsort: 10.1 radix: 14.5
number of elements: 1600   qsort: 10.4 radix: 13.7
number of elements: 3200   qsort: 15.8 radix: 13.6
number of elements: 6400   qsort: 22.2 radix: 13.8

This is an extreme best case for B&M quicksort, which is basically
O(n) -- the point at which the per-element time goes up seems purely
due to exceeding L1 cache. Radix sort takes a big input to catch up,
but it doesn't seem awful, either.

cardinality: 16
number of elements:  100   qsort: 19.5 radix: 34.5
number of elements:  200   qsort: 18.7 radix: 22.6
number of elements:  400   qsort: 18.5 radix: 17.2
number of elements:  800   qsort: 25.0 radix: 14.8
number of elements: 1600   qsort: 43.8 radix: 13.8
number of elements: 3200   qsort: 51.2 radix: 13.2
number of elements: 6400   qsort: 59.0 radix: 12.8

This is still low cardinality, but behaves more like the high cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor. Unrolling makes smaller
inputs worse (which we can reach by recursing from larger inputs), but
unrolling seems important for large inputs with low cardinality (a few
percent, but I haven't shared numbers yet). We've found that a large
input with only 4-5 distinct values just barely wins with radix sort.
I'll be curious to see if unrolling is actually needed to prevent
regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULL is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation, similar to
https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable but perhaps it's worth it to add a
specialization that skips that.

--
John Naylor
Amazon Web Services

Вложения

Re: tuple radix sort

От
John Naylor
Дата:
I wrote:

> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum)

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements:  100   qsort: 34.8 radix: 38.3
number of elements:  200   qsort: 34.9 radix: 29.7
number of elements:  400   qsort: 40.8 radix: 29.2

cardinality: 16
number of elements:  100   qsort: 19.3 radix: 26.2
number of elements:  200   qsort: 18.9 radix: 18.2
number of elements:  400   qsort: 18.5 radix: 14.5

cardinality: 2
number of elements:  100   qsort: 09.3 radix: 21.6
number of elements:  200   qsort: 08.9 radix: 15.4
number of elements:  400   qsort: 10.3 radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turing
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms

on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms

on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling, reduces the threshold to 200.

[1]
https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services

Вложения