Re: Parallel CREATE INDEX for BRIN indexes
From: Matthias van de Meent
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Msg-id: CAEze2WgJrU61Q12Q4UO0Xcgk1DdTegXss=r3uX1UabiMVT-8gg@mail.gmail.com
In reply to: Re: Parallel CREATE INDEX for BRIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Parallel CREATE INDEX for BRIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/5/23 16:33, Matthias van de Meent wrote:
> > ...
> >
> >> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
> >> do, and the little I knew I managed to forget since I wrote this patch.
> >> Which similar features do you have in mind?
> >
> > I was referring to the feature that is "emitting a single sorted run
> > of tuples at the leader backend based on data gathered in parallel
> > worker backends". It manages the sort state, on-disk runs etc. so that
> > you don't have to manage that yourself.
> >
> > Adding a new storage format for what is effectively a logical tape
> > (logtape.{c,h}) and manually merging it seems like a lot of changes if
> > that functionality is readily available, standardized and optimized in
> > sortsupport; and adds an additional place to manually go through for
> > disk-related changes like TDE.
>
> Here's a new version of the patch, with three main changes:

Thanks! I've done a review of the patch, and most of it looks good. Some
places need cleanup and polish, some need more documentation, and there
are some other issues, but so far it's looking OK.

> One thing I was wondering about is whether it might be better to allow
> the workers to process overlapping ranges, and then let the leader
> merge the summaries. That would mean we might not need the tableam.c
> changes at all, but the leader would need to do more work (although
> BRIN indexes tend to be fairly small). The main reason that got me
> thinking about this is that we have pretty much no tests for the union
> procedures, because triggering that is really difficult. But for
> parallel index builds that'd be much more common.

Hmm, that's a good point.
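To make the cost of that leader-side merging concrete: a union of two
summaries for the same page range is conceptually just widening one
summary to cover the other. A toy sketch with a hypothetical
minmax-style summary (this is not the actual brin_minmax_union() code,
just an illustration of the operation the leader would repeat per range):

```c
#include <assert.h>

/* Hypothetical stand-in for a per-range minmax summary. */
typedef struct ToySummary
{
	int		min;
	int		max;
	int		has_data;		/* 0 for an empty (all-NULL) range */
} ToySummary;

/*
 * Union "src" into "dst", the way a leader could merge summaries
 * produced by different workers for overlapping chunks of the same
 * page range.
 */
void
toy_summary_union(ToySummary *dst, const ToySummary *src)
{
	if (!src->has_data)
		return;					/* nothing to merge in */
	if (!dst->has_data)
	{
		*dst = *src;			/* adopt the other summary wholesale */
		return;
	}
	if (src->min < dst->min)
		dst->min = src->min;
	if (src->max > dst->max)
		dst->max = src->max;
}
```

With n workers touching a range, the leader would apply this up to n-1
times per range, which is where the per-range overhead comes from.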
I don't mind either way, but it would add overhead in the leader to do
all of that merging - especially when you configure pages_per_range >
PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, as we'd need to merge up to
parallel_workers tuples. That could be a significant overhead.

... thinks a bit.

Hmm, but with the current PARALLEL_SEQSCAN_MAX_CHUNK_SIZE of 8192
blocks that's quite unlikely to be a serious problem - the per-backend
IO saved by keeping such large ranges on a single backend presumably
has much more impact than the merging of n_parallel_tasks max-sized
BRIN tuples. So, seems fine with me.

Review follows below.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)

-----------

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c

> + BrinShared *brinshared;

Needs some indentation fixes.

> + int bs_reltuples;
> [...]
> + state->bs_reltuples += reltuples;

My IDE warns me that reltuples is a double. Looking deeper into the
value, it contains the number of live tuples in the table, so this
conversion may not result in a meaningful value for tables with >= 2^31
live tuples. Tables > 56GB could begin to be affected by this.

> + int bs_worker_id;

This variable seems to be unused.

> + BrinSpool *bs_spool;
> + BrinSpool *bs_spool_out;

Are both used? If so, could you add comments explaining why we have two
spools here, instead of only one?

> +/*
> + * A version of the callback, used by parallel index builds. The main difference
> + * is that instead of writing the BRIN tuples into the index, we write them into
> + * a shared tuplestore file, and leave the insertion up to the leader (which may

+ ... shared tuplesort, and ...

> brinbuildCallbackParallel(...)
> + while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)

Shouldn't this be an 'if' now?
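On the reltuples point above: reltuples is a double, and narrowing a
double to int is only well-defined while the value fits in an int, so
any code doing that assignment would need a range check first. A
minimal standalone guard (a hypothetical helper, not code from the
patch) to illustrate:

```c
#include <assert.h>
#include <limits.h>

/*
 * reltuples is a double in PostgreSQL's index-build callbacks.
 * Converting a double larger than INT_MAX to int is undefined
 * behavior in C, so a caller must check the range before any
 * narrowing assignment like "int_field = reltuples".
 */
int
reltuples_fits_in_int(double reltuples)
{
	return reltuples >= 0.0 && reltuples <= (double) INT_MAX;
}
```

A double can represent integer counts exactly up to 2^53, so keeping
the accumulator as a double (as the rest of the callback API does)
avoids the problem entirely.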
> + while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1)
> + state->bs_currRangeStart += state->bs_pagesPerRange;

Is there a reason why you went with iterative addition instead of a
single divide-and-multiply, like the following?

+ state->bs_currRangeStart +=
+     state->bs_pagesPerRange *
+         ((thisblock - state->bs_currRangeStart) / state->bs_pagesPerRange);

> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> [...]
> -table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
> +table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, BlockNumber chunk_factor)
> [...]
> - /* compare phs_syncscan initialization to similar logic in initscan */
> + bpscan->phs_chunk_factor = chunk_factor;
> + /* compare phs_syncscan initialization to similar logic in initscan
> + *
> + * Disable sync scans if the chunk factor is set (valid block number).
> + */

I think this needs some pgindent or other style work, both on comment
style and line lengths.

> diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
> [...]
> + Assert(false);

(x3) I think these can be cleaned up, right?

> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> [...]
> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> + * the length for r referenced by SortTuple. That's what BrinSortTuple is meant
> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
> + * BrinTuple to the logtapes, though.

Why don't we write the full BrinSortTuple to disk? Wouldn't that make
more sense?

> + tuplesort_puttuple_common(state, &stup,
> + base->sortKeys &&
> + base->sortKeys->abbrev_converter &&
> + !stup.isnull1);

Can't this last argument just be inlined, based on the knowledge that
we don't use sortKeys in brin?
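For what it's worth, the iterative loop and the single
divide-and-multiply (with the subtraction ordered as thisblock -
bs_currRangeStart, so the quotient is non-negative) advance the range
start to the same block. A standalone sketch with stand-in types, just
to show the equivalence:

```c
#include <assert.h>

typedef unsigned int BlockNumber;	/* stand-in for PostgreSQL's typedef */

/* Iterative version, as in the patch. */
BlockNumber
advance_loop(BlockNumber rangeStart, BlockNumber pagesPerRange,
			 BlockNumber thisblock)
{
	while (thisblock > rangeStart + pagesPerRange - 1)
		rangeStart += pagesPerRange;
	return rangeStart;
}

/* Single divide-and-multiply, as suggested above. */
BlockNumber
advance_arith(BlockNumber rangeStart, BlockNumber pagesPerRange,
			  BlockNumber thisblock)
{
	if (thisblock > rangeStart + pagesPerRange - 1)
		rangeStart += pagesPerRange *
			((thisblock - rangeStart) / pagesPerRange);
	return rangeStart;
}
```

The division rounds down, so the new range start is the largest
multiple of pagesPerRange (offset from the old start) that still
contains thisblock, exactly as the loop computes.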
> +comparetup_index_brin(const SortTuple *a, const SortTuple *b,
> +                      Tuplesortstate *state)
> +{
> +    BrinTuple  *tuple1;
> [...]
> +    tuple1 = &((BrinSortTuple *) a)->tuple;
> [...]

I'm fairly sure that this cast (and its neighbour) is incorrect and
should be the following instead:

+    tuple1 = &((BrinSortTuple *) (a->tuple))->tuple;

Additionally, I think the following would be a better approach here, as
we wouldn't need to do pointer-chasing:

+ static int
+ comparetup_index_brin(const SortTuple *a, const SortTuple *b,
+                       Tuplesortstate *state)
+ {
+     Assert(TuplesortstateGetPublic(state)->haveDatum1);
+
+     if (DatumGetUInt32(a->datum1) > DatumGetUInt32(b->datum1))
+         return 1;
+     if (DatumGetUInt32(a->datum1) < DatumGetUInt32(b->datum1))
+         return -1;
+
+     /* silence compilers */
+     return 0;
+ }

---

Thanks for working on this!
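The datum1-based comparator suggested above is an ordinary three-way
compare on the range's start block number. The same logic in
qsort-compatible form (stand-in types, not the patch code), which also
shows why the branchy form is preferred over subtraction - unsigned
block numbers can't safely be compared by subtracting:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's typedef */

/*
 * Three-way compare of two block numbers, as qsort expects.
 * Subtraction (a - b) would be wrong here: for unsigned values it
 * wraps around instead of going negative.
 */
int
cmp_blkno(const void *pa, const void *pb)
{
	BlockNumber a = *(const BlockNumber *) pa;
	BlockNumber b = *(const BlockNumber *) pb;

	if (a > b)
		return 1;
	if (a < b)
		return -1;
	return 0;
}
```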