Re: Parallel CREATE INDEX for GIN indexes
From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date:
Msg-id: 148f0f59-55bf-40a9-ab28-51904aa8c325@enterprisedb.com
In reply to: Re: Parallel CREATE INDEX for GIN indexes (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for GIN indexes
List: pgsql-hackers
Hi,

I got to do the detailed benchmarking on the latest version of the patch series, so here are the results. My goal was to better understand the impact of each patch individually - especially the two parts introduced by Matthias, but not only those - so I ran the test on builds with each of the 0001-0009 patches applied.

This is the same test I did at the very beginning. The basic details are that I have a 22GB table with archives of our mailing lists (roughly 1.6M messages), and I build a couple of different GIN indexes on it:

  create index trgm on messages using gin (msg_body gin_trgm_ops);
  create index tsvector on messages using gin (msg_body_tsvector);
  create index jsonb on messages using gin (msg_headers);
  create index jsonb_hash on messages using gin (msg_headers jsonb_path_ops);

The indexes are 700MB-3GB, so not huge, but also not tiny. I did the test with a varying number of parallel workers for each patch, measuring the execution time and a couple more metrics (using pg_stat_statements). See the attached scripts for details, and also the conf/results from the two machines I use for these tests.

Attached is also a PDF with a summary of the tests - there are four sections with results in total, two for each machine with different work_mem values (more on this later). For each configuration, there are tables/charts for three metrics:

- total CREATE INDEX duration
- relative CREATE INDEX duration (relative to serial build)
- amount of temporary files written

Hopefully it's easy to understand/interpret, but feel free to ask. There are also CSVs with the raw results, in case you want to do your own analysis (there are more metrics than presented here).

While doing these tests, I realized there's a bug in how the patches handle collations - the code simply grabbed the collation of the indexed column, but if that's missing (e.g. for tsvector), it fell over. Instead the patch needs to fall back to the default collation, so that's fixed in 0001.

The other thing I realized while working on this is that it's probably wrong to tie the parallel build callback to work_mem - both conceptually, and also for performance reasons. I did the first run with the default work_mem (4MB), and that showed some serious regressions with the 0002 patch (where it took ~3.5x longer than a serial build). It seemed to be due to a lot of merges of small TID lists, so I tried re-running the tests with work_mem=32MB, and the regression pretty much disappeared. Also, with 4MB there were almost no benefits of parallelism on the smaller indexes (jsonb and jsonb_hash) - that's probably not unexpected, but 32MB did improve that a little bit (still not great, though).

In practice this would not be a huge issue, because the later patches make the regression go away - so unless we commit only the first couple of patches, users would not be affected by this. But it's annoying, and more importantly it's a bit bogus to use work_mem here - why should that be appropriate? It was more of a temporary hack because I didn't have a better idea, and the comment in ginBuildCallbackParallel() questions this too, after all.

My plan is to derive this limit from maintenance_work_mem, or rather from the fraction we "allocate" for each worker. The planner logic caps the number of workers at maintenance_work_mem / 32MB, which means each worker has >=32MB of maintenance_work_mem at its disposal. The worker needs to do the BuildAccumulator thing, and also the tuplesort, so it seems reasonable to use 1/2 of the budget (>=16MB) for each of those. That seems good enough, IMHO.
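Just to make the proposed budgeting concrete, here's a minimal standalone sketch of that arithmetic. Only the 32MB-per-worker planner cap and the 50/50 split between the accumulator and the tuplesort come from the description above; the names and example values are purely illustrative, not the patch code:

    /*
     * Illustrative only: derive the per-worker memory budget from
     * maintenance_work_mem, assuming the planner caps the worker count at
     * maintenance_work_mem / 32MB and each worker splits its share 50/50
     * between the BuildAccumulator and the tuplesort.
     */
    #include <stdio.h>

    #define MIN_WORKER_BUDGET_KB    (32 * 1024)     /* 32MB per-worker cap */

    int
    main(void)
    {
        int     maintenance_work_mem_kb = 256 * 1024;   /* e.g. 256MB */
        int     nworkers = 3;                           /* planned workers */

        /* the planner would not pick more workers than this */
        int     max_workers = maintenance_work_mem_kb / MIN_WORKER_BUDGET_KB;

        if (nworkers > max_workers)
            nworkers = max_workers;

        /* each worker's share of maintenance_work_mem, always >= 32MB */
        int     per_worker_kb = maintenance_work_mem_kb / nworkers;

        /* half for the BuildAccumulator, half for the tuplesort */
        int     accum_kb = per_worker_kb / 2;
        int     sort_kb = per_worker_kb - accum_kb;

        printf("workers = %d, accumulator = %d kB, tuplesort = %d kB\n",
               nworkers, accum_kb, sort_kb);
        return 0;
    }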
That's significantly more than 4MB, and the 32MB I used for the second round was rather arbitrary anyway. So for further discussion, let's focus on the results in the two sections for 32MB ...

And let's talk about the improvements by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less within noise. Maybe it would be different on machines with less RAM, but on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by ~40% or so. This is pretty nicely visible in the "trgm" case, which generates the most temp files of the four indexes. An example from the i5/32MB section looks like this:

  label      0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
  ----------------------------------------------------------------------------
  trgm / 3      0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

So we start with patches producing 2.6GB - 3.7GB of temp files. Then the compression of TID lists cuts that down to ~1.2GB, and the 0008 patch cuts it down further to just 700MB. That's pretty nice, even if it doesn't speed things up. The 0009 patch (GinTuple reduction) improves that a little bit more, but the difference is smaller. I'm still a bit unsure about the tuplesort changes, but producing fewer temporary files seems like a good thing.

Now, what's the 0010 patch about? For some indexes (e.g. trgm), the parallel builds help a lot, because they produce a lot of temporary data and the parallel sort is a substantial part of the work. But for other indexes (especially the "smaller" indexes on the jsonb headers), it's not that great. For example for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort, actually inserting data into the index. So I was wondering if we might parallelize that too, and how much time it would save - 0010 is an experimental patch doing that. It splits the processing into 3 phases:

1. workers feed data into the tuplesort
2. the leader finishes the sort and "repartitions" the data
3. workers insert their partition into the index

The patch is far from perfect (more of a PoC) - it implements these phases by introducing a barrier to coordinate the processes. Workers feed the data into the tuplesort as they do now, but instead of terminating they wait on a barrier. The leader reads the data from the tuplesort and partitions it evenly into a SharedFileSet with one file per worker. It then wakes up the workers through the barrier again, and they do the inserts.

This does help a little bit, reducing the duration by ~15-25%. I wonder if this might be improved by partitioning the data differently - not by shuffling everything from the tuplesort into a fileset (which increases the amount of temporary data in the charts) - and also by distributing the data differently: right now it's a bit of a round robin, because it wasn't clear whether we'd know in advance how many entries there are.
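Since the barrier-based phase coordination may be easier to see in code, here's a minimal standalone sketch of the three phases. It uses a POSIX-thread barrier purely for illustration - the actual 0010 patch coordinates separate worker processes through PostgreSQL's shared-memory infrastructure - and the printf calls stand in for the real tuplesort/index work:

    /*
     * Illustrative only: three-phase coordination sketched with a
     * pthread barrier. Workers feed the sort, wait while the leader
     * repartitions, then resume and insert their partition.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 3

    static pthread_barrier_t phase_barrier;

    static void *
    worker_main(void *arg)
    {
        long    id = (long) arg;

        /* phase 1: feed data into the shared tuplesort */
        printf("worker %ld: feeding tuplesort\n", id);

        /* wait for the leader to finish the sort and repartition */
        pthread_barrier_wait(&phase_barrier);   /* end of phase 1 */
        pthread_barrier_wait(&phase_barrier);   /* leader releases phase 3 */

        /* phase 3: insert this worker's partition into the index */
        printf("worker %ld: inserting its partition\n", id);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   workers[NWORKERS];

        /* all workers plus the leader participate in the barrier */
        pthread_barrier_init(&phase_barrier, NULL, NWORKERS + 1);

        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&workers[i], NULL, worker_main, (void *) i);

        /* leader waits until every worker has fed the tuplesort */
        pthread_barrier_wait(&phase_barrier);

        /* phase 2: read the sorted data, write one file per worker */
        printf("leader: repartitioning into one file per worker\n");

        /* wake the workers up again for phase 3 */
        pthread_barrier_wait(&phase_barrier);

        for (int i = 0; i < NWORKERS; i++)
            pthread_join(workers[i], NULL);

        pthread_barrier_destroy(&phase_barrier);
        return 0;
    }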
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
- gin-parallel-builds.pdf
- i5.tgz
- xeon.tgz
- v20240712-0001-Allow-parallel-create-for-GIN-indexes.patch
- v20240712-0002-Use-mergesort-in-the-leader-process.patch
- v20240712-0003-Remove-the-explicit-pg_qsort-in-workers.patch
- v20240712-0004-Compress-TID-lists-before-writing-tuples-t.patch
- v20240712-0005-Collect-and-print-compression-stats.patch
- v20240712-0006-Enforce-memory-limit-when-combining-tuples.patch
- v20240712-0007-Detect-wrap-around-in-parallel-callback.patch
- v20240712-0008-Use-a-single-GIN-tuplesort.patch
- v20240712-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch
- v20240712-0010-WIP-parallel-inserts-into-GIN-index.patch