Re: Parallel CREATE INDEX for BRIN indexes
From: Matthias van de Meent
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Msg-id: CAEze2WgJrU61Q12Q4UO0Xcgk1DdTegXss=r3uX1UabiMVT-8gg@mail.gmail.com
In reply to: Re: Parallel CREATE INDEX for BRIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/5/23 16:33, Matthias van de Meent wrote:
> > ...
> >
> >> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> >> do, and the little I knew I managed to forget since I wrote this patch.
> >> Which similar features do you have in mind?
> >
> > I was referring to the feature that is "emitting a single sorted run
> > of tuples at the leader backend based on data gathered in parallel
> > worker backends". It manages the sort state, on-disk runs etc. so that
> > you don't have to manage that yourself.
> >
> > Adding a new storage format for what is effectively a logical tape
> > (logtape.{c,h}) and manually merging it seems like a lot of changes if
> > that functionality is readily available, standardized and optimized in
> > sortsupport; and adds an additional place to manually go through for
> > disk-related changes like TDE.
>
> Here's a new version of the patch, with three main changes:

Thanks! I've done a review of the patch, and most of it looks good. Some
places need cleanup and polish, some need more documentation, and there
are some other issues, but so far it's looking OK.

> One thing I was wondering about is whether it might be better to allow
> the workers to process overlapping ranges, and then let the leader
> merge the summaries. That would mean we might not need the tableam.c
> changes at all, but the leader would need to do more work (although
> BRIN indexes tend to be fairly small). The main reason that got me
> thinking about this is that we have pretty much no tests for the union
> procedures, because triggering that is really difficult. But for
> parallel index builds that'd be much more common.

Hmm, that's a good point.
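To make the cost of that leader-side merging concrete: a union of two
summaries for the same page range is conceptually just widening one
summary to cover the other. A toy sketch with a hypothetical
minmax-style summary (this is not the actual brin_minmax_union() code,
just an illustration of the operation the leader would repeat per range):

```c
#include <assert.h>

/* Hypothetical stand-in for a per-range minmax summary. */
typedef struct ToySummary
{
	int		min;
	int		max;
	int		has_data;		/* 0 for an empty (all-NULL) range */
} ToySummary;

/*
 * Union "src" into "dst", the way a leader could merge summaries
 * produced by different workers for overlapping chunks of the same
 * page range.
 */
void
toy_summary_union(ToySummary *dst, const ToySummary *src)
{
	if (!src->has_data)
		return;					/* nothing to merge in */
	if (!dst->has_data)
	{
		*dst = *src;			/* adopt the other summary wholesale */
		return;
	}
	if (src->min < dst->min)
		dst->min = src->min;
	if (src->max > dst->max)
		dst->max = src->max;
}
```

With n workers touching a range, the leader would apply this up to n-1
times per range, which is where the per-range overhead comes from.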
I don't mind either way, but it would add overhead in the leader to do
all of that merging - especially when you configure pages_per_range >
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, as we'd need to merge up to
parallel_workers tuples. That could be a significant overhead.

... thinks a bit.

Hmm, but with the current PARALLEL_SEQSCAN_MAX_CHUNK_SIZE of 8192
blocks that's quite unlikely to be a serious problem - the per-backend
IO saved by keeping such large ranges on a single backend presumably
has much more impact than the merging of n_parallel_tasks max-sized
BRIN tuples. So, seems fine with me.

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

-----------

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

> + BrinShared *brinshared;

Needs some indentation fixes.

> + int bs_reltuples;
> [...]
> + state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >= 2^31
live tuples. Tables > 56GB could begin to be affected by this.

> + int bs_worker_id;

This variable seems to be unused.

> + BrinSpool *bs_spool;
> + BrinSpool *bs_spool_out;

Are both used? If so, could you add comments explaining why we have two
spools here, instead of only one?

> +/*
> + * A version of the callback, used by parallel index builds. The main difference
> + * is that instead of writing the BRIN tuples into the index, we write them into
> + * a shared tuplestore file, and leave the insertion up to the leader (which may

+ ... shared tuplesort, and ...

> brinbuildCallbackParallel(...)
> + while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

Shouldn't this be an 'if' now?
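On the reltuples point above: reltuples is a double, and narrowing a
double to int is only well-defined while the value fits in an int, so
any code doing that assignment would need a range check first. A
minimal standalone guard (a hypothetical helper, not code from the
patch) to illustrate:

```c
#include <assert.h>
#include <limits.h>

/*
 * reltuples is a double in PostgreSQL's index-build callbacks.
 * Converting a double larger than INT_MAX to int is undefined
 * behavior in C, so a caller must check the range before any
 * narrowing assignment like "int_field = reltuples".
 */
int
reltuples_fits_in_int(double reltuples)
{
	return reltuples >= 0.0 && reltuples <= (double) INT_MAX;
}
```

A double can represent integer counts exactly up to 2^53, so keeping
the accumulator as a double (as the rest of the callback API does)
avoids the problem entirely.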
> + while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> + state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply, like the following?

+ state->bs_currRangeStart +=
+     state->bs_pagesPerRange *
+         ((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> [...]
> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
> [...]
> - /* compare phs_syncscan initialization to similar logic in initscan */
> + bpscan->phs_chunk_factor = chunk_factor;
> + /* compare phs_syncscan initialization to similar logic in initscan
> + *
> + * Disable sync scans if the chunk factor is set (valid block number).
> + */

I think this needs some pgindent or other style work, both on comment
style and line lengths.

> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> [...]
> + Assert(false);

(x3) I think these can be cleaned up, right?

> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> [...]
> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> + * the length for r referenced by SortTuple. That's what BrinSortTuple is meant
> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
> + * BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Wouldn't that make
more sense?

> + tuplesort_puttuple_common(state, &stup,
> + base->sortKeys &&
> + base->sortKeys->abbrev_converter &&
> + !stup.isnull1);

Can't this last argument just be inlined, based on the knowledge that
we don't use sortKeys in brin?
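For what it's worth, the iterative loop and the single
divide-and-multiply (with the subtraction ordered as thisblock -
bs_currRangeStart, so the quotient is non-negative) advance the range
start to the same block. A standalone sketch with stand-in types, just
to show the equivalence:

```c
#include <assert.h>

typedef unsigned int BlockNumber;	/* stand-in for PostgreSQL's typedef */

/* Iterative version, as in the patch. */
BlockNumber
advance_loop(BlockNumber rangeStart, BlockNumber pagesPerRange,
			 BlockNumber thisblock)
{
	while (thisblock > rangeStart + pagesPerRange - 1)
		rangeStart += pagesPerRange;
	return rangeStart;
}

/* Single divide-and-multiply, as suggested above. */
BlockNumber
advance_arith(BlockNumber rangeStart, BlockNumber pagesPerRange,
			  BlockNumber thisblock)
{
	if (thisblock > rangeStart + pagesPerRange - 1)
		rangeStart += pagesPerRange *
			((thisblock - rangeStart) / pagesPerRange);
	return rangeStart;
}
```

The division rounds down, so the new range start is the largest
multiple of pagesPerRange (offset from the old start) that still
contains thisblock, exactly as the loop computes.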
> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                      Tuplesortstate *state)
> +{
> +    BrinTuple  *tuple1;
> [...]
> +    tuple1 = &((BrinSortTuple *) a)->tuple;
> [...]

I'm fairly sure that this cast (and its neighbour) is incorrect and
should be the following instead:

+    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;

Additionally, I think the following would be a better approach here, as
we wouldn't need to do pointer-chasing:

+ static int
+ comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                       Tuplesortstate *state)
+ {
+     Assert(TuplesortstateGetPublic(state)->haveDatum1);
+
+     if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
+         return 1;
+     if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
+         return -1;
+
+     /* silence compilers */
+     return 0;
+ }

---

Thanks for working on this!
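The datum1-based comparator suggested above is an ordinary three-way
compare on the range's start block number. The same logic in
qsort-compatible form (stand-in types, not the patch code), which also
shows why the branchy form is preferred over subtraction - unsigned
block numbers can't safely be compared by subtracting:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's typedef */

/*
 * Three-way compare of two block numbers, as qsort expects.
 * Subtraction (a - b) would be wrong here: for unsigned values it
 * wraps around instead of going negative.
 */
int
cmp_blkno(const void *pa, const void *pb)
{
	BlockNumber a = *(const BlockNumber *) pa;
	BlockNumber b = *(const BlockNumber *) pb;

	if (a > b)
		return 1;
	if (a < b)
		return -1;
	return 0;
}
```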