Re: Parallel CREATE INDEX for GIN indexes
From: Tomas Vondra
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date:
Msg-id: 148f0f59-55bf-40a9-ab28-51904aa8c325@enterprisedb.com
In reply to: Re: Parallel CREATE INDEX for GIN indexes (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Parallel CREATE INDEX for GIN indexes
List: pgsql-hackers
Hi,

I got to do the detailed benchmarking on the latest version of the patch series, so here are the results. My goal was to better understand the impact of each patch individually - especially the two parts introduced by Matthias, but not only those - so I ran the test on builds with each of the 0001-0009 patches applied.

This is the same test I did at the very beginning. The basic details are that I have a 22GB table with archives of our mailing lists (roughly 1.6M messages), and I build a couple of different GIN indexes on it:

  create index trgm on messages using gin (msg_body gin_trgm_ops);
  create index tsvector on messages using gin (msg_body_tsvector);
  create index jsonb on messages using gin (msg_headers);
  create index jsonb_hash on messages using gin (msg_headers jsonb_path_ops);

The indexes are 700MB-3GB, so not huge, but also not tiny. I did the test with a varying number of parallel workers for each patch, measuring the execution time and a couple more metrics (using pg_stat_statements). See the attached scripts for details, and also the conf/results from the two machines I use for these tests.

Attached is also a PDF with a summary of the tests - there are four sections with results in total, two for each machine with different work_mem values (more on this later). For each configuration, there are tables/charts for three metrics:

- total CREATE INDEX duration
- relative CREATE INDEX duration (relative to serial build)
- amount of temporary files written

Hopefully it's easy to understand/interpret, but feel free to ask. There are also CSVs with the raw results, in case you want to do your own analysis (there are more metrics than presented here).

While doing these tests, I realized there's a bug in how the patches handle collations - the code simply grabbed the collation of the indexed column, but if that's missing (e.g. for tsvector), it fell over. Instead the patch needs to fall back to the default collation, so that's fixed in 0001.

The other thing I realized while working on this is that it's probably wrong to tie the parallel build callback to work_mem - both conceptually, and also for performance reasons. I did the first run with the default work_mem (4MB), and that showed some serious regressions with the 0002 patch (where it took ~3.5x longer than a serial build). It seemed to be due to a lot of merges of small TID lists, so I tried re-running the tests with work_mem=32MB, and the regression pretty much disappeared. Also, with 4MB there were almost no benefits of parallelism on the smaller indexes (jsonb and jsonb_hash) - that's probably not unexpected, but 32MB did improve that a little bit (still not great, though).

In practice this would not be a huge issue, because the later patches make the regression go away - so unless we commit only the first couple of patches, users would not be affected by this. But it's annoying, and more importantly it's a bit bogus to use work_mem here - why should that be appropriate? It was more of a temporary hack because I didn't have a better idea, and the comment in ginBuildCallbackParallel() questions this too, after all.

My plan is to derive this limit from maintenance_work_mem, or rather from the fraction we "allocate" for each worker. The planner logic caps the number of workers at maintenance_work_mem / 32MB, which means each worker has >=32MB of maintenance_work_mem at its disposal. The worker needs to do the BuildAccumulator thing, and also the tuplesort, so it seems reasonable to use 1/2 of the budget (>=16MB) for each of those. That seems good enough, IMHO.
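Just to make the proposed budgeting concrete, here's a minimal standalone sketch of that arithmetic. Only the 32MB-per-worker planner cap and the 50/50 split between the accumulator and the tuplesort come from the description above; the names and example values are purely illustrative, not the patch code:

    /*
     * Illustrative only: derive the per-worker memory budget from
     * maintenance_work_mem, assuming the planner caps the worker count at
     * maintenance_work_mem / 32MB and each worker splits its share 50/50
     * between the BuildAccumulator and the tuplesort.
     */
    #include <stdio.h>

    #define MIN_WORKER_BUDGET_KB    (32 * 1024)     /* 32MB per-worker cap */

    int
    main(void)
    {
        int     maintenance_work_mem_kb = 256 * 1024;   /* e.g. 256MB */
        int     nworkers = 3;                           /* planned workers */

        /* the planner would not pick more workers than this */
        int     max_workers = maintenance_work_mem_kb / MIN_WORKER_BUDGET_KB;

        if (nworkers > max_workers)
            nworkers = max_workers;

        /* each worker's share of maintenance_work_mem, always >= 32MB */
        int     per_worker_kb = maintenance_work_mem_kb / nworkers;

        /* half for the BuildAccumulator, half for the tuplesort */
        int     accum_kb = per_worker_kb / 2;
        int     sort_kb = per_worker_kb - accum_kb;

        printf("workers = %d, accumulator = %d kB, tuplesort = %d kB\n",
               nworkers, accum_kb, sort_kb);
        return 0;
    }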
That's significantly more than 4MB, and the 32MB I used for the second round was rather arbitrary anyway. So for further discussion, let's focus on the results in the two sections for 32MB ...

And let's talk about the improvements by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less within noise. Maybe it would be different on machines with less RAM, but on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by ~40% or so. This is pretty nicely visible in the "trgm" case, which generates the most temp files of the four indexes. An example from the i5/32MB section looks like this:

  label      0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
  ----------------------------------------------------------------------------
  trgm / 3      0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

So we start with patches producing 2.6GB - 3.7GB of temp files. Then the compression of TID lists cuts that down to ~1.2GB, and the 0008 patch cuts it down further to just 700MB. That's pretty nice, even if it doesn't speed things up. The 0009 patch (GinTuple reduction) improves that a little bit more, but the difference is smaller. I'm still a bit unsure about the tuplesort changes, but producing fewer temporary files seems like a good thing.

Now, what's the 0010 patch about? For some indexes (e.g. trgm), the parallel builds help a lot, because they produce a lot of temporary data and the parallel sort is a substantial part of the work. But for other indexes (especially the "smaller" indexes on the jsonb headers), it's not that great. For example for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort, actually inserting data into the index. So I was wondering if we might parallelize that too, and how much time it would save - 0010 is an experimental patch doing that. It splits the processing into 3 phases:

1. workers feed data into the tuplesort
2. the leader finishes the sort and "repartitions" the data
3. workers insert their partition into the index

The patch is far from perfect (more of a PoC) - it implements these phases by introducing a barrier to coordinate the processes. Workers feed the data into the tuplesort as they do now, but instead of terminating they wait on a barrier. The leader reads the data from the tuplesort and partitions it evenly into a SharedFileSet with one file per worker. It then wakes up the workers through the barrier again, and they do the inserts.

This does help a little bit, reducing the duration by ~15-25%. I wonder if this might be improved by partitioning the data differently - not by shuffling everything from the tuplesort into a fileset (which increases the amount of temporary data in the charts) - and also by distributing the data differently: right now it's a bit of a round robin, because it wasn't clear whether we'd know in advance how many entries there are.
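Since the barrier-based phase coordination may be easier to see in code, here's a minimal standalone sketch of the three phases. It uses a POSIX-thread barrier purely for illustration - the actual 0010 patch coordinates separate worker processes through PostgreSQL's shared-memory infrastructure - and the printf calls stand in for the real tuplesort/index work:

    /*
     * Illustrative only: three-phase coordination sketched with a
     * pthread barrier. Workers feed the sort, wait while the leader
     * repartitions, then resume and insert their partition.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 3

    static pthread_barrier_t phase_barrier;

    static void *
    worker_main(void *arg)
    {
        long    id = (long) arg;

        /* phase 1: feed data into the shared tuplesort */
        printf("worker %ld: feeding tuplesort\n", id);

        /* wait for the leader to finish the sort and repartition */
        pthread_barrier_wait(&phase_barrier);   /* end of phase 1 */
        pthread_barrier_wait(&phase_barrier);   /* leader releases phase 3 */

        /* phase 3: insert this worker's partition into the index */
        printf("worker %ld: inserting its partition\n", id);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   workers[NWORKERS];

        /* all workers plus the leader participate in the barrier */
        pthread_barrier_init(&phase_barrier, NULL, NWORKERS + 1);

        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&workers[i], NULL, worker_main, (void *) i);

        /* leader waits until every worker has fed the tuplesort */
        pthread_barrier_wait(&phase_barrier);

        /* phase 2: read the sorted data, write one file per worker */
        printf("leader: repartitioning into one file per worker\n");

        /* wake the workers up again for phase 3 */
        pthread_barrier_wait(&phase_barrier);

        for (int i = 0; i < NWORKERS; i++)
            pthread_join(workers[i], NULL);

        pthread_barrier_destroy(&phase_barrier);
        return 0;
    }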
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
- gin-parallel-builds.pdf
- i5.tgz
- xeon.tgz
- v20240712-0001-Allow-parallel-create-for-GIN-indexes.patch
- v20240712-0002-Use-mergesort-in-the-leader-process.patch
- v20240712-0003-Remove-the-explicit-pg_qsort-in-workers.patch
- v20240712-0004-Compress-TID-lists-before-writing-tuples-t.patch
- v20240712-0005-Collect-and-print-compression-stats.patch
- v20240712-0006-Enforce-memory-limit-when-combining-tuples.patch
- v20240712-0007-Detect-wrap-around-in-parallel-callback.patch
- v20240712-0008-Use-a-single-GIN-tuplesort.patch
- v20240712-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch
- v20240712-0010-WIP-parallel-inserts-into-GIN-index.patch