Discussion: Parallel heap vacuum
Hi all,

The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would help speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only parallel heap scans in lazy vacuum. We can extend this feature to support parallel heap vacuum as well, either in the future or in the same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch based on the table size, like other parallel query operations. The number of workers is capped by max_parallel_maintenance_workers. Once we decide to use a parallel heap scan, we prepare DSM to share data among the parallel workers and the leader. The shared information includes at least the vacuum options such as aggressive, the counters collected during lazy vacuum such as scanned_pages, the vacuum cutoffs such as VacuumCutoffs and GlobalVisState, and the parallel scan description.

Before starting the heap scan in lazy vacuum, we launch parallel workers, and then each worker (and the leader) processes different blocks. Each worker does HOT-pruning on pages and collects dead tuple TIDs. When adding dead tuple TIDs, workers need to hold an exclusive lock on the TidStore. At the end of the heap scan phase, workers exit and the leader waits for all workers to exit. After that, the leader process gathers the counters collected by the parallel workers and computes the oldest relfrozenxid (and relminmxid). Then, if parallel index vacuum is also enabled, we launch other parallel workers for parallel index vacuuming.

When it comes to parallel heap scan in lazy vacuum, I think we can use the table_block_parallelscan_XXX() family. One tricky thing we need to deal with is that if the TidStore memory usage reaches the limit, we stop the parallel scan, do index vacuum and table vacuum, and then resume the parallel scan from the previous state. In order to do that, in the patch, we store ParallelBlockTableScanWorker, the per-worker parallel scan state, in DSM so that different parallel workers can resume the scan using the same parallel scan state.

In addition to that, since we could end up launching fewer workers than requested, it could happen that some ParallelBlockTableScanWorker data is used once and never used again while blocks remain unprocessed. To handle this case, in the patch, the leader process checks at the end of the parallel scan whether there is an uncompleted parallel scan. If so, the leader process does the scan using the workers' ParallelBlockTableScanWorker data on behalf of the workers.

# Discussions

I'm fairly convinced of the basic design of this feature, but there are some points regarding the implementation we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table scan (and vacuum in the future). So I needed to add some table AM callbacks such as DSM size estimation, DSM initialization, and the actual table scans, etc. We need to verify that these APIs are appropriate. Specifically, if we want to support both parallel heap scan and parallel heap vacuum, do we want to add separate callbacks for them? It could be overkill since such a 2-pass vacuum strategy is specific to the heap AM.

As another implementation idea, we might want to implement parallel heap scan/vacuum in lazyvacuum.c while minimizing changes to vacuumparallel.c. That way, we would not need to add table AM callbacks. However, we would end up having duplicate code related to parallel operation in vacuum, such as vacuum delays. Also, we might need to add some functions to share GlobalVisState among parallel workers, since GlobalVisState is a private struct.

Other points I'm somewhat uncomfortable with, or that need to be discussed, remain in the code as XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, made garbage on the table, and ran vacuum while changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on the table with one index

I used a similar table to test case 1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each vacuum phase (from left: heap scan time, index vacuum time, and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent scalability; in both test-1 and test-2, the execution time of the heap scan got ~4x faster with 4 parallel workers. On the other hand, when it comes to the total vacuum execution time, I could not see much performance improvement in test-2 (45.11 vs. 35.84). Looking at the results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), and the heap scan in test-2 was not as fast as in test-1 with 1 parallel worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable since we have a single lock on it. In all cases in test-1, we don't use the shared TidStore since all dead tuples are removed during heap pruning. So the scalability was better overall than in test-2. In the parallel 0 case in test-2, we use the local TidStore, and from a parallel degree of 1 in test-2, we use the shared TidStore and parallel workers concurrently update it. Also, I guess that the lookup performance of the local TidStore is better than the shared TidStore's lookup performance because of the differences between a bump context and a DSA area. I think this difference contributed to the fact that index vacuuming got slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution time: (1) improve the shared TidStore scalability and (2) support parallel heap vacuum. For (1), several ideas are proposed by the ART authors[1]. I've not tried these ideas but they might be applicable to our ART implementation. But I prefer to start with (2) since it would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
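For reference, a quick way to confirm how much garbage the DELETE in the benchmark scripts leaves behind before timing the vacuum runs is to look at the cumulative statistics. A minimal sketch, assuming the 'test' table from the scripts above and that the statistics have been flushed (default track_counts):

```
-- roughly 20% of the rows should show up as dead after the DELETE
SELECT n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'test';
```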
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > # Benchmark results > > * Test-1: parallel heap scan on the table without indexes > > I created 20GB table, made garbage on the table, and run vacuum while > changing parallel degree: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 600000000); --- 20GB table > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > Here are the results (total time and heap scan time): > > PARALLEL 0: 21.99 s (single process) > PARALLEL 1: 11.39 s > PARALLEL 2: 8.36 s > PARALLEL 3: 6.14 s > PARALLEL 4: 5.08 s > > * Test-2: parallel heap scan on the table with one index > > I used a similar table to the test case 1 but created one btree index on it: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 600000000); --- 20GB table > create index on test (a); > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > I've measured the total execution time as well as the time of each > vacuum phase (from left heap scan time, index vacuum time, and heap > vacuum time): > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > Overall, I can see the parallel heap scan in lazy vacuum has a decent > scalability; In both test-1 and test-2, the execution time of heap > scan got ~4x faster with 4 parallel workers. On the other hand, when > it comes to the total vacuum execution time, I could not see much > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > and heap scan in case 2 was not as fast as in case 1 with 1 parallel > worker (12.75 vs. 11.39). > > I think the reason is the shared TidStore is not very scalable since > we have a single lock on it. In all cases in the test-1, we don't use > the shared TidStore since all dead tuples are removed during heap > pruning. So the scalability was better overall than in test-2. In > parallel 0 case in test-2, we use the local TidStore, and from > parallel degree of 1 in test-2, we use the shared TidStore and > parallel worker concurrently update it. Also, I guess that the lookup > performance of the local TidStore is better than the shared TidStore's > lookup performance because of the differences between a bump context > and an DSA area. I think that this difference contributed the fact > that index vacuuming got slower (16.74 vs. 22.04). > > There are two obvious improvement ideas to improve overall vacuum > execution time: (1) improve the shared TidStore scalability and (2) > support parallel heap vacuum. For (1), several ideas are proposed by > the ART authors[1]. I've not tried these ideas but it might be > applicable to our ART implementation. But I prefer to start with (2) > since it would be easier. Feedback is very welcome. > Starting with (2) sounds like a reasonable approach. We should study a few more things like (a) the performance results where there are 3-4 indexes, (b) What is the reason for performance improvement seen with only heap scans. We normally get benefits of parallelism because of using multiple CPUs but parallelizing scans (I/O) shouldn't give much benefits. 
Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables by restarting the nodes to ensure the data is not in memory.

--
With Regards,
Amit Kapila.
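One way to check that hypothesis before and after a restart is to count how many of the table's blocks are currently sitting in shared_buffers. A minimal sketch using the pg_buffercache extension, assuming the 'test' table and the default 8kB block size; note this only covers shared_buffers, so the OS page cache still needs a restart (or dropping the kernel cache) to rule out:

```
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- how many 8kB blocks of 'test' are cached in shared_buffers right now
SELECT count(*) AS cached_blocks,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache
WHERE reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND relfilenode = pg_relation_filenode('test');
```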
Dear Sawada-san,

> The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would help speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of postgres.

> I've attached a PoC patch for this feature. It implements only parallel heap scans in lazy vacuum. We can extend this feature to support parallel heap vacuum as well, either in the future or in the same patch.

Before diving in deep, I tested your PoC but found an unclear point. When vacuuming is requested with parallel > 0 and almost the same workload as yours, only the first page was scanned and cleaned up.

When parallel was set to zero, I got:

```
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were surely scanned and dead tuples were removed. However, when parallel was set to one, I got another result:

```
INFO:  vacuuming "postgres.public.test"
INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and 12 tuples were removed. It looks very strange to me...

The attached script emulates my test. IIUC it is almost the same as yours, but the instance was restarted before vacuuming.

Can you reproduce this and see the reason? Based on the requirement I can provide further information.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
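Besides the VERBOSE summary, the visibility map is another way to see whether a vacuum really processed the whole table. A minimal sketch using the pg_visibility extension (assuming the same 'test' table): with this workload one would expect most pages to be reported all-visible after the complete parallel-0 run, and far fewer after the truncated parallel-1 run.

```
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- number of pages marked all-visible / all-frozen in the visibility map
SELECT * FROM pg_visibility_map_summary('test');
```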
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > The parallel vacuum we have today supports only for index vacuuming. > > Therefore, while multiple workers can work on different indexes in > > parallel, the heap table is always processed by the single process. > > I'd like to propose $subject, which enables us to have multiple > > workers running on the single heap table. This would be helpful to > > speedup vacuuming for tables without indexes or tables with > > INDEX_CLENAUP = off. > > Sounds great. IIUC, vacuuming is still one of the main weak point of postgres. > > > I've attached a PoC patch for this feature. It implements only > > parallel heap scans in lazyvacum. We can extend this feature to > > support parallel heap vacuum as well in the future or in the same > > patch. > > Before diving into deep, I tested your PoC but found unclear point. > When the vacuuming is requested with parallel > 0 with almost the same workload > as yours, only the first page was scanned and cleaned up. > > When parallel was set to zero, I got: > ``` > INFO: vacuuming "postgres.public.test" > INFO: finished vacuuming "postgres.public.test": index scans: 0 > pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total) > tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable > removable cutoff: 752, which was 0 XIDs old when operation ended > new relfrozenxid: 739, which is 1 XIDs ahead of previous value > frozen: 0 pages from table (0.00% of total) had 0 tuples frozen > index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed > avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s > buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied > WAL usage: 1 records, 1 full page images, 937 bytes > system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s > ``` > > This meant that all pages were surely scanned and dead tuples were removed. > However, when parallel was set to one, I got another result: > > ``` > INFO: vacuuming "postgres.public.test" > INFO: launched 1 parallel vacuum worker for table scanning (planned: 1) > INFO: finished vacuuming "postgres.public.test": index scans: 0 > pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total) > tuples: 12 removed, 0 remain, 0 are dead but not yet removable > removable cutoff: 752, which was 0 XIDs old when operation ended > frozen: 0 pages from table (0.00% of total) had 0 tuples frozen > index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed > avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s > buffer usage: 96 hits, 660 misses, 6 dirtied > WAL usage: 1 records, 1 full page images, 937 bytes > system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s > ``` > > It looked like that only a page was scanned and 12 tuples were removed. > It looks very strange for me... > > Attached script emulate my test. IIUC it was almost the same as yours, but > the instance was restarted before vacuuming. > > Can you reproduce and see the reason? Based on the requirement I can > provide further information. Thank you for the test! I could reproduce this issue and it's a bug; it skipped even non-all-visible pages. I've attached the new version patch. BTW since we compute the number of parallel workers for the heap scan based on the table size, it's possible that we launch multiple workers even if most blocks are all-visible. 
It seems to be better if we calculate it based on (relpages - relallvisible).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
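For context, relpages and relallvisible are the pg_class statistics such a calculation could draw on; they are refreshed by VACUUM/ANALYZE, so they are estimates rather than exact counts. A quick way to see how many blocks would actually need scanning on the benchmark table:

```
SELECT relpages,
       relallvisible,
       relpages - relallvisible AS blocks_not_all_visible
FROM pg_class
WHERE relname = 'test';
```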
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > # Benchmark results > > > > * Test-1: parallel heap scan on the table without indexes > > > > I created 20GB table, made garbage on the table, and run vacuum while > > changing parallel degree: > > > > create unlogged table test (a int) with (autovacuum_enabled = off); > > insert into test select generate_series(1, 600000000); --- 20GB table > > delete from test where a % 5 = 0; > > vacuum (verbose, parallel 0) test; > > > > Here are the results (total time and heap scan time): > > > > PARALLEL 0: 21.99 s (single process) > > PARALLEL 1: 11.39 s > > PARALLEL 2: 8.36 s > > PARALLEL 3: 6.14 s > > PARALLEL 4: 5.08 s > > > > * Test-2: parallel heap scan on the table with one index > > > > I used a similar table to the test case 1 but created one btree index on it: > > > > create unlogged table test (a int) with (autovacuum_enabled = off); > > insert into test select generate_series(1, 600000000); --- 20GB table > > create index on test (a); > > delete from test where a % 5 = 0; > > vacuum (verbose, parallel 0) test; > > > > I've measured the total execution time as well as the time of each > > vacuum phase (from left heap scan time, index vacuum time, and heap > > vacuum time): > > > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > > > Overall, I can see the parallel heap scan in lazy vacuum has a decent > > scalability; In both test-1 and test-2, the execution time of heap > > scan got ~4x faster with 4 parallel workers. On the other hand, when > > it comes to the total vacuum execution time, I could not see much > > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > > and heap scan in case 2 was not as fast as in case 1 with 1 parallel > > worker (12.75 vs. 11.39). > > > > I think the reason is the shared TidStore is not very scalable since > > we have a single lock on it. In all cases in the test-1, we don't use > > the shared TidStore since all dead tuples are removed during heap > > pruning. So the scalability was better overall than in test-2. In > > parallel 0 case in test-2, we use the local TidStore, and from > > parallel degree of 1 in test-2, we use the shared TidStore and > > parallel worker concurrently update it. Also, I guess that the lookup > > performance of the local TidStore is better than the shared TidStore's > > lookup performance because of the differences between a bump context > > and an DSA area. I think that this difference contributed the fact > > that index vacuuming got slower (16.74 vs. 22.04). > > Thank you for the comments! > > There are two obvious improvement ideas to improve overall vacuum > > execution time: (1) improve the shared TidStore scalability and (2) > > support parallel heap vacuum. For (1), several ideas are proposed by > > the ART authors[1]. I've not tried these ideas but it might be > > applicable to our ART implementation. But I prefer to start with (2) > > since it would be easier. Feedback is very welcome. > > > > Starting with (2) sounds like a reasonable approach. 
> We should study a few more things like (a) the performance results where there are 3-4 indexes,

Here are the results with 4 indexes (and restarting the server before the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39)

(launched 4 workers for heap scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with only heap scans. We normally get benefits of parallelism because of using multiple CPUs but parallelizing scans (I/O) shouldn't give much benefits. Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables by restarting the nodes to ensure the data is not in memory.

I think it depends on the storage performance. FYI I use an EC2 instance (m6id.metal). I've run the same benchmark script (table with no index) while restarting the server before executing the vacuum, and here are the results:

PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s

For both of the above tests, I used the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
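To see which phase a given run is spending its time in without timing phases by hand, the built-in progress view can be polled from another session while the vacuum is running. A minimal sketch:

```
SELECT phase,
       heap_blks_total,
       heap_blks_scanned,
       heap_blks_vacuumed,
       index_vacuum_count
FROM pg_stat_progress_vacuum;
```

Repeated snapshots of heap_blks_scanned also give a rough idea of how evenly the leader and workers are progressing through the table.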
Dear Sawada-san,

> Thank you for the test!
>
> I could reproduce this issue and it's a bug; it skipped even non-all-visible pages. I've attached the new version patch.
>
> BTW since we compute the number of parallel workers for the heap scan based on the table size, it's possible that we launch multiple workers even if most blocks are all-visible. It seems to be better if we calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all pages are scanned. I used almost the same script (just changed max_parallel_maintenance_workers) and got the results below. I think the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```

I started to read your code, but it takes much time because I've never seen it before... The part below contains initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1]. IIUC, this is because NewRelminMxid was moved from struct LVRelState to PHVShared. So you should update it like "vacrel->counters->NewRelminMxid".

2.
I compared this with parallel heap scan and found that it does not have a compute_worker API. Can you clarify the reason why there is an inconsistency? (I feel it is intentional because the calculation logic seems to depend on the heap structure, so should we add the API for table scan as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
                                  ^~
....
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Thank you for the test! > > > > I could reproduce this issue and it's a bug; it skipped even > > non-all-visible pages. I've attached the new version patch. > > > > BTW since we compute the number of parallel workers for the heap scan > > based on the table size, it's possible that we launch multiple workers > > even if most blocks are all-visible. It seems to be better if we > > calculate it based on (relpages - relallvisible). > > Thanks for updating the patch. I applied and confirmed all pages are scanned. > I used almost the same script (just changed max_parallel_maintenance_workers) > and got below result. I think the tendency was the same as yours. > > ``` > parallel 0: 61114.369 ms > parallel 1: 34870.316 ms > parallel 2: 23598.115 ms > parallel 3: 17732.363 ms > parallel 4: 15203.271 ms > parallel 5: 13836.025 ms > ``` Thank you for testing! > > I started to read your codes but takes much time because I've never seen before... > Below part contains initial comments. > > 1. > This patch cannot be built when debug mode is enabled. See [1]. > IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared. > So you should update like " vacrel->counters->NewRelminMxid". Right, will fix. > 2. > I compared parallel heap scan and found that it does not have compute_worker API. > Can you clarify the reason why there is an inconsistency? > (I feel it is intentional because the calculation logic seems to depend on the heap structure, > so should we add the API for table scan as well?) There is room to consider a better API design, but yes, the reason is that the calculation logic depends on table AM implementation. For example, I thought it might make sense to consider taking the number of all-visible pages into account for the calculation of the number of parallel workers as we don't want to launch many workers on the table where most pages are all-visible. Which might not work for other table AMs. I'm updating the patch to implement parallel heap vacuum and will share the updated patch. It might take time as it requires to implement shared iteration support in radx tree. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Thank you for testing!

I tried to profile the vacuuming with a larger case (40 workers for the 20GB table), and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.

> > 2.
> > I compared this with parallel heap scan and found that it does not have a compute_worker API. Can you clarify the reason why there is an inconsistency? (I feel it is intentional because the calculation logic seems to depend on the heap structure, so should we add the API for table scan as well?)
>
> There is room to consider a better API design, but yes, the reason is that the calculation logic depends on table AM implementation. For example, I thought it might make sense to consider taking the number of all-visible pages into account for the calculation of the number of parallel workers as we don't want to launch many workers on the table where most pages are all-visible. Which might not work for other table AMs.

Okay, thanks for confirming. I wanted to ask others as well.

> I'm updating the patch to implement parallel heap vacuum and will share the updated patch. It might take time as it requires implementing shared iteration support in the radix tree.

Here are other preliminary comments for the v2 patch. This does not contain cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is that intentional because the fields are accessed by the leader only after the parallel workers exit?

2.
Per my understanding, the vacuuming goes like the steps below:

a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does the vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is this more painful engineering than the benefit?

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as "parallel worker ...". Can we extend this to improve the trackability in pg_stat_activity?

4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a bottleneck when TIDs are added to the shared TidStore. Another primitive idea of mine is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?) and gather them after the heap scanning. If you extend it so that parallel workers also do the vacuuming, the gathering may not be needed: they can access their own TidStore and clean up. One downside is that the memory consumption may be quite large. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
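Regarding comment 3, this is roughly what pg_stat_activity shows for vacuum's parallel workers today: a generic backend_type, the leader's query text, and leader_pid as the only link back to the leader. A more specific bgw_type would make these entries easier to attribute. A minimal sketch of the current view:

```
SELECT pid, leader_pid, backend_type, query
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
```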
Sorry for the very late reply. On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Thank you for testing! > > I tried to profile the vacuuming with the larger case (40 workers for the 20G table) > and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks. > > > > 2. > > > I compared parallel heap scan and found that it does not have compute_worker > > API. > > > Can you clarify the reason why there is an inconsistency? > > > (I feel it is intentional because the calculation logic seems to depend on the > > heap structure, > > > so should we add the API for table scan as well?) > > > > There is room to consider a better API design, but yes, the reason is > > that the calculation logic depends on table AM implementation. For > > example, I thought it might make sense to consider taking the number > > of all-visible pages into account for the calculation of the number of > > parallel workers as we don't want to launch many workers on the table > > where most pages are all-visible. Which might not work for other table > > AMs. > > > > Okay, thanks for confirming. I wanted to ask others as well. > > > > I'm updating the patch to implement parallel heap vacuum and will > > share the updated patch. It might take time as it requires to > > implement shared iteration support in radx tree. > > Here are other preliminary comments for v2 patch. This does not contain > cosmetic ones. > > 1. > Shared data structure PHVShared does not contain the mutex lock. Is it intentional > because they are accessed by leader only after parallel workers exit? > Yes, the fields in PHVShared are read-only for workers. Since no concurrent reads/writes happen on these fields we don't need to protect them. > 2. > Per my understanding, the vacuuming goes like below steps. > > a. paralell workers are launched for scanning pages > b. leader waits until scans are done > c. leader does vacuum alone (you may extend here...) > d. parallel workers are launched again to cleanup indeces > > If so, can we reuse parallel workers for the cleanup? Or, this is painful > engineering than the benefit? I've not thought of this idea but I think it's possible from a technical perspective. It saves overheads of relaunching workers but I'm not sure how much it would help for a better performance and I'm concerned it would make the code complex. For example, different numbers of workers might be required for table vacuuming and index vacuuming. So we would end up increasing or decreasing workers. > 3. > According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as > "parallel worker ..." Can we extend this to improve the trackability in the > pg_stat_activity? It would be a good improvement for better trackability. But I think we should do that in a separate patch as it's not just a problem for parallel heap vacuum. > > 4. > I'm not the expert TidStore, but as you said TidStoreLockExclusive() might be a > bottleneck when tid is added to the shared TidStore. My another primitive idea > is that to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?) > and gather after the heap scanning. If you extend like parallel workers do vacuuming, > the gathering may not be needed: they can access own TidStore and clean up. > One downside is that the memory consumption may be quite large. > Interesting idea. Suppose we support parallel heap vacuum as well, we wouldn't need locks and to support shared-iteration on TidStore. 
I think each worker should use a fraction of maintenance_work_mem. However, one downside would be that we need to check as many TidStores as there are workers during index vacuuming.

FYI I've implemented the parallel heap vacuum part and am doing some benchmark tests. I'll share the updated patches along with the test results this week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
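For a sense of scale: the budget that would be divided up in that scheme is the existing maintenance_work_mem setting, which already bounds the dead-TID storage for a manual VACUUM today. A hypothetical run against the 'test' table from the earlier benchmarks (the per-worker split described above is an idea under discussion, not current behaviour):

```
-- dead-tuple TID storage for this vacuum is limited by maintenance_work_mem;
-- with per-worker TidStores, each of the 4 workers would get only a share of it
SET maintenance_work_mem = '256MB';
VACUUM (VERBOSE, PARALLEL 4) test;
```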
Dear Sawada-san,

> I've attached new version patches that fix the failures reported by cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch, and sorry for the delayed reply. I confirmed that cfbot for Linux/Windows says ok. I'm still learning the feature so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() is needed. IIUC, the pre-existing API, TidStoreBeginIterate(), already accepts a shared TidStore. The only difference is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? Another approach is to restrict TidStoreBeginIterate() to support only the local one. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawda-san, > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > Thanks for updating the patch and sorry for delaying the reply. I confirmed cfbot > for Linux/Windows said ok. > I'm still learning the feature so I can post only one comment :-(. > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? TidStoreBeginIterateShared() is designed for multiple parallel workers to iterate a shared TidStore. During an iteration, parallel workers share the iteration state and iterate the underlying radix tree while taking appropriate locks. Therefore, it's available only for a shared TidStore. This is required to implement the parallel heap vacuum, where multiple parallel workers do the iteration on the shared TidStore. On the other hand, TidStoreBeginIterate() is designed for a single process to iterate a TidStore. It accepts even a shared TidStore as you mentioned, but during an iteration there is no inter-process coordination such as locking. When it comes to parallel vacuum, supporting TidStoreBeginIterate() on a shared TidStore is necessary to cover the case where we use only parallel index vacuum but not parallel heap scan/vacuum. In this case, we need to store dead tuple TIDs on the shared TidStore during heap scan so parallel workers can use it during index vacuum. But it's not necessary to use TidStoreBeginIterateShared() because only one (leader) process does heap vacuum. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached new version patches that fix the failures reported by cfbot. I hope these changes make cfbot happy.

I just started reviewing the patch and found the following comments while going through it:

1) I felt we should add some documentation for this at [1].

2) Can we add some tests in vacuum_parallel with tables having no indexes and having dead tuples?

3) These should be included in typedefs.list:

3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+    BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+    BlockNumber removed_pages; /* # pages removed by relation truncation */
+    BlockNumber frozen_pages;  /* # pages with newly frozen tuples */

3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+    bool aggressive;
+    bool skipwithvm;
+

3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+    bool initialized;

3.d) Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+    /* Parallel scan description shared among parallel workers */

4) Since we are initializing almost all the members of the structure, should we use palloc0 in this case:
+    scan_stats = palloc(sizeof(LVRelScanStats));
+    scan_stats->scanned_pages = 0;
+    scan_stats->removed_pages = 0;
+    scan_stats->frozen_pages = 0;
+    scan_stats->lpdead_item_pages = 0;
+    scan_stats->missed_dead_pages = 0;
+    scan_stats->nonempty_pages = 0;
+
+    /* Initialize remaining counters (be tidy) */
+    scan_stats->tuples_deleted = 0;
+    scan_stats->tuples_frozen = 0;
+    scan_stats->lpdead_items = 0;
+    scan_stats->live_tuples = 0;
+    scan_stats->recently_dead_tuples = 0;
+    scan_stats->missed_dead_tuples = 0;

5) Typo: "paralle" should be "parallel":
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+    return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

[1] - https://www.postgresql.org/docs/devel/sql-vacuum.html

Regards,
Vignesh
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers to iterate a shared TidStore. During an iteration, parallel workers share the iteration state and iterate the underlying radix tree while taking appropriate locks. Therefore, it's available only for a shared TidStore. This is required to implement the parallel heap vacuum, where multiple parallel workers do the iteration on the shared TidStore.
>
> On the other hand, TidStoreBeginIterate() is designed for a single process to iterate a TidStore. It accepts even a shared TidStore as you mentioned, but during an iteration there is no inter-process coordination such as locking. When it comes to parallel vacuum, supporting TidStoreBeginIterate() on a shared TidStore is necessary to cover the case where we use only parallel index vacuum but not parallel heap scan/vacuum. In this case, we need to store dead tuple TIDs on the shared TidStore during heap scan so parallel workers can use it during index vacuum. But it's not necessary to use TidStoreBeginIterateShared() because only one (leader) process does heap vacuum.

Okay, thanks for the description. I feel it is OK to keep it.

I read 0001 again and here are my comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED        0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC          0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER   0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names like PARALLEL_KEY_XXX. Can we follow that?

02. LVRelState
```
+    BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add something like "Next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined in 0004. Can you move it?

04. heap_parallel_vacuum_estimate
```
+
+    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+                                              &shared_len, &pscanwork_len);
+
+    /* space for PHVShared */
+    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for ParallelBlockTableScanDesc */
+    pscan_len = table_block_parallelscan_estimate(rel);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for per-worker scan state, PHVScanWorkerState */
+    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size(). Can we remove table_block_parallelscan_estimate() and mul_size() from here?

05. Idea
Can you update the documentation?

06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related to the parallel execution. What do you think about adding an attribute which shows a list of pids? I'm not sure it is helpful for users, but it can show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > Okay, thanks for the description. I felt it is OK to keep. > > I read 0001 again and here are comments. Thank you for the review comments! > > 01. vacuumlazy.c > ``` > +#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001 > +#define LV_PARALLEL_SCAN_DESC 0xFFFF0002 > +#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003 > ``` > > I checked other DMS keys used for parallel work, and they seems to have name > like PARALEL_KEY_XXX. Can we follow it? Yes. How about LV_PARALLEL_KEY_XXX? > > 02. LVRelState > ``` > + BlockNumber next_fsm_block_to_vacuum; > ``` > > Only the attribute does not have comments Can we add like: > "Next freespace map page to be checked"? Agreed. I'll add a comment "next block to check for FSM vacuum*. > > 03. parallel_heap_vacuum_gather_scan_stats > ``` > + vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages; > ``` > > Note that `scan_stats->vacuumed_pages` does not exist in 0001, it is defined > in 0004. Can you move it? Will remove. > > 04. heap_parallel_vacuum_estimate > ``` > + > + heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len, > + &shared_len, &pscanwork_len); > + > + /* space for PHVShared */ > + shm_toc_estimate_chunk(&pcxt->estimator, shared_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > + > + /* space for ParallelBlockTableScanDesc */ > + pscan_len = table_block_parallelscan_estimate(rel); > + shm_toc_estimate_chunk(&pcxt->estimator, pscan_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > + > + /* space for per-worker scan state, PHVScanWorkerState */ > + pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers); > + shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > ``` > > I feel pscan_len and pscanwork_len are calclated in heap_parallel_estimate_shared_memory_size(). > Can we remove table_block_parallelscan_estimate() and mul_size() from here? Yes, it's an oversight. Will remove. > > 05. Idea > > Can you update documentations? Will update the doc as well. > > 06. Idea > > AFAICS pg_stat_progress_vacuum does not contain information related with the > parallel execution. How do you think adding an attribute which shows a list of pids? > Not sure it is helpful for users but it can show the parallelism. 
I think it's possible to show the parallelism even today (for parallel index vacuuming):

=# select leader.pid, leader.query, array_agg(worker.pid)
     from pg_stat_activity as leader,
          pg_stat_activity as worker,
          pg_stat_progress_vacuum as v
     where leader.pid = worker.leader_pid and leader.pid = v.pid
     group by 1, 2;
   pid   |        query        |     array_agg
---------+---------------------+-------------------
 2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > > > I just started reviewing the patch and found the following comments > while going through the patch: > 1) I felt we should add some documentation for this at [1]. > > 2) Can we add some tests in vacuum_parallel with tables having no > indexes and having dead tuples. > > 3) This should be included in typedefs.list: > 3.a) > +/* > + * Relation statistics collected during heap scanning and need to be > shared among > + * parallel vacuum workers. > + */ > +typedef struct LVRelScanStats > +{ > + BlockNumber scanned_pages; /* # pages examined (not > skipped via VM) */ > + BlockNumber removed_pages; /* # pages removed by relation > truncation */ > + BlockNumber frozen_pages; /* # pages with newly frozen tuples */ > > 3.b) Similarly this too: > +/* > + * Struct for information that need to be shared among parallel vacuum workers > + */ > +typedef struct PHVShared > +{ > + bool aggressive; > + bool skipwithvm; > + > > 3.c) Similarly this too: > +/* Per-worker scan state for parallel heap vacuum scan */ > +typedef struct PHVScanWorkerState > +{ > + bool initialized; > > 3.d) Similarly this too: > +/* Struct for parallel heap vacuum */ > +typedef struct PHVState > +{ > + /* Parallel scan description shared among parallel workers */ > > 4) Since we are initializing almost all the members of structure, > should we use palloc0 in this case: > + scan_stats = palloc(sizeof(LVRelScanStats)); > + scan_stats->scanned_pages = 0; > + scan_stats->removed_pages = 0; > + scan_stats->frozen_pages = 0; > + scan_stats->lpdead_item_pages = 0; > + scan_stats->missed_dead_pages = 0; > + scan_stats->nonempty_pages = 0; > + > + /* Initialize remaining counters (be tidy) */ > + scan_stats->tuples_deleted = 0; > + scan_stats->tuples_frozen = 0; > + scan_stats->lpdead_items = 0; > + scan_stats->live_tuples = 0; > + scan_stats->recently_dead_tuples = 0; > + scan_stats->missed_dead_tuples = 0; > > 5) typo paralle should be parallel > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > +static inline int > +table_paralle_vacuum_compute_workers(Relation rel, int requested) > +{ > + return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested); > +} > Thank you for the comments! I'll address these comments in the next version patch. BTW while updating the patch, I found that we might want to launch different numbers of workers for scanning heap and vacuuming heap. The number of parallel workers is determined based on the number of blocks in the table. However, even if this number is high, it could happen that we want to launch fewer workers to vacuum heap pages when there are not many pages having garbage. And the number of workers for vacuuming heap could vary on each vacuum pass. I'm considering implementing it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
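On the point about launching fewer workers for the vacuum phase: the block count and the amount of garbage can diverge widely, so a table that justifies many scan workers may hold only a handful of dead tuples. A rough way to see both quantities for the benchmark table (these are planner/collector statistics, so approximate):

```
SELECT c.relpages   AS total_blocks,
       s.n_dead_tup AS approx_dead_tuples
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'test';
```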
Dear Sawada-san,

> BTW while updating the patch, I found that we might want to launch different numbers of workers for scanning heap and vacuuming heap. The number of parallel workers is determined based on the number of blocks in the table. However, even if this number is high, it could happen that we want to launch fewer workers to vacuum heap pages when there are not many pages having garbage. And the number of workers for vacuuming heap could vary on each vacuum pass. I'm considering implementing it.

Just to clarify - this idea looks good to me. I imagine you will add new table AM APIs like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming. If other table AM developers want to use the same number of workers as for scanning, they can pass the same function to the pointer. Is that right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Sawada-San,

FYI, the patch 0001 fails to build stand-alone:

vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                     ^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                                           ^
make[4]: *** [vacuumlazy.o] Error 1

It appears to be using a struct field which is not even introduced until patch 0004 of the patch set.

======
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Sawada-San,

I started to look at patch v4-0001 in this thread. It is quite a big patch, so this is a WIP and the comments below are just the ones I have so far.

======
src/backend/access/heap/vacuumlazy.c

1.1.
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats

The comment wording is not quite right.

/Relation statistics collected during heap scanning/Relation statistics that are collected during heap scanning/

~~~

1.2.
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared

The comment wording is not quite right.

/that need to be shared/that needs to be shared/

~~~

1.3.
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+    /* Parallel scan description shared among parallel workers */
+    ParallelBlockTableScanDesc pscandesc;
+
+    /* Shared information */
+    PHVShared *shared;

If 'pscandesc' is described as 'shared among parallel workers', should that field be within 'PHVShared' instead?

~~~

1.4.
     /* Initialize page counters explicitly (be tidy) */
-    vacrel->scanned_pages = 0;
-    vacrel->removed_pages = 0;
-    vacrel->frozen_pages = 0;
-    vacrel->lpdead_item_pages = 0;
-    vacrel->missed_dead_pages = 0;
-    vacrel->nonempty_pages = 0;
-    /* dead_items_alloc allocates vacrel->dead_items later on */
+    scan_stats = palloc(sizeof(LVRelScanStats));
+    scan_stats->scanned_pages = 0;
+    scan_stats->removed_pages = 0;
+    scan_stats->frozen_pages = 0;
+    scan_stats->lpdead_item_pages = 0;
+    scan_stats->missed_dead_pages = 0;
+    scan_stats->nonempty_pages = 0;
+
+    /* Initialize remaining counters (be tidy) */
+    scan_stats->tuples_deleted = 0;
+    scan_stats->tuples_frozen = 0;
+    scan_stats->lpdead_items = 0;
+    scan_stats->live_tuples = 0;
+    scan_stats->recently_dead_tuples = 0;
+    scan_stats->missed_dead_tuples = 0;
+
+    vacrel->scan_stats = scan_stats;

1.4a.
Or, maybe just palloc0 this and provide a comment to say all counters have been zapped to 0.

~

1.4b.
Maybe you don't need that 'scan_stats' variable; just assign the palloc0 directly to the field instead.

~~~

1.5.
-    vacrel->missed_dead_tuples = 0;
+    /* dead_items_alloc allocates vacrel->dead_items later on */

The patch seems to have moved this "dead_items_alloc" comment to now be below the "Allocate/initialize output statistics state" stuff (??).

======
src/backend/commands/vacuumparallel.c

parallel_vacuum_init:

1.6.
     int         parallel_workers = 0;
+    int         nworkers_table;
+    int         nworkers_index;

The local vars and function params are named like this (here and in other functions). But the struct field names say 'nworkers_for_XXX', e.g.

  shared->nworkers_for_table = nworkers_table;
  shared->nworkers_for_index = nworkers_index;

It may be better to use these 'nworkers_for_table' and 'nworkers_for_index' names consistently everywhere.

~~~

parallel_vacuum_compute_workers:

1.7.
-    int         parallel_workers;
+    int         parallel_workers_table = 0;
+    int         parallel_workers_index = 0;
+
+    *nworkers_table = 0;
+    *nworkers_index = 0;

The local variables 'parallel_workers_table' and 'parallel_workers_index' are hardly needed because AFAICT those results can be directly assigned to *nworkers_table and *nworkers_index.

~~~

parallel_vacuum_process_all_indexes:

1.8.
     /* Reinitialize parallel context to relaunch parallel workers */
-    if (num_index_scans > 0)
+    if (num_index_scans > 0 || pvs->num_table_scans > 0)
         ReinitializeParallelDSM(pvs->pcxt);

I don't know if it is feasible or even makes sense to change, but somehow it seemed strange that the 'num_index_scans' field is not co-located with the 'num_table_scans' in the ParallelVacuumState. If this is doable, then lots of functions also would no longer need to pass 'num_index_scans' since they are already passing 'pvs'.

~~~

parallel_vacuum_table_scan_begin:

1.9.
+            (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+                             "launched %d parallel vacuum workers for table processing (planned: %d)",
+                             pvs->pcxt->nworkers_launched),

Isn't this the same as errmsg_plural?

~~~

1.10.
+/* Return the array of indexes associated to the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)

Even though the function comment can fit on one line, it is nicer to use a block-style comment with a period, like below. It will then be consistent with other function comments (e.g. parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc.). There are multiple places that this review comment can apply to.

(also typo /associated to/associated with/)

SUGGESTION
/*
 * Return the array of indexes associated with the given table to be vacuumed.
 */

~~~

parallel_vacuum_get_nworkers_table:
parallel_vacuum_get_nworkers_index:

1.11.
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+    return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+    return pvs->shared->nworkers_for_index;
+}
+

Are these functions needed? AFAICT, they are called only from macros where it seems just as easy to reference the pvs fields directly.

~~~

parallel_vacuum_process_table:

1.12.
+/*
+ * A parallel worker invokes table-AM specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+    Assert(VacuumActiveNWorkers);

Maybe here we should also Assert(pvs.shared->do_vacuum_table_scan);

~~~

1.13.
-    /* Process indexes to perform vacuum/cleanup */
-    parallel_vacuum_process_safe_indexes(&pvs);
+    if (pvs.shared->do_vacuum_table_scan)
+    {
+        parallel_vacuum_process_table(&pvs);
+    }
+    else
+    {
+        ErrorContextCallback errcallback;
+
+        /* Setup error traceback support for ereport() */
+        errcallback.callback = parallel_vacuum_error_callback;
+        errcallback.arg = &pvs;
+        errcallback.previous = error_context_stack;
+        error_context_stack = &errcallback;
+
+        /* Process indexes to perform vacuum/cleanup */
+        parallel_vacuum_process_safe_indexes(&pvs);
+
+        /* Pop the error context stack */
+        error_context_stack = errcallback.previous;
+    }

There are still some functions following this code (like 'shm_toc_lookup') that could potentially raise ERRORs. But, now the error_context_stack is getting assigned/reset earlier than was previously the case. Is that going to be a potential problem?

======
src/include/access/tableam.h

1.14.
+    /*
+     * Compute the amount of DSM space AM need in the parallel table vacuum.
+     *

Maybe reword this comment to be more like table_parallelscan_estimate.

SUGGESTION
Estimate the size of shared memory that the parallel table vacuum needs for AM.

~~~

1.15.
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of this
+ * of this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+                               void *state)
+{
+    rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+                                 void *state)
+{
+    rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+                           ParallelWorkerContext *pwcxt)
+{
+    rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+

All of the "Callbacks for parallel table vacuum." had comments saying "Not called if parallel table vacuum is disabled.". So, IIUC that means all of these table_parallel_vacuum_XXX functions (other than the compute_workers one) could have Assert(nworkers > 0); just to double-check that is true.

~~~

table_paralle_vacuum_compute_workers:

1.16.
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+    return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

Typo in function name. /paralle/parallel/

======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Sawada-San. FWIW, here are the remainder of my review comments for the patch v4-0001 ====== src/backend/access/heap/vacuumlazy.c lazy_scan_heap: 2.1. + /* + * Do the actual work. If parallel heap vacuum is active, we scan and + * vacuum heap with parallel workers. + */ /with/using/ ~~~ 2.2. + if (ParallelHeapVacuumIsActive(vacrel)) + do_parallel_lazy_scan_heap(vacrel); + else + do_lazy_scan_heap(vacrel); The do_lazy_scan_heap() returns a boolean and according to that function comment it should always be true if it is not using the parallel heap scan. So should we get the function return value here and Assert that it is true? ~~~ 2.3. Start with uppercase even for all the single line comments for consistency with existing code. e.g. + /* report that everything is now scanned */ e.g. + /* now we can compute the new value for pg_class.reltuples */ e.g. + /* report all blocks vacuumed */ ~~~ heap_vac_scan_next_block_parallel: 2.4. +/* + * A parallel scan variant of heap_vac_scan_next_block. + * + * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization. + */ +static bool +heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno, + bool *all_visible_according_to_vm) The function comment should explain the return value. ~~~ 2.5. + if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0) + { + + if (vacrel->aggressive) + break; Unnecessary whitespace. ~~~ dead_items_alloc: 2.6. + /* + * We initialize parallel heap scan/vacuuming or index vacuuming + * or both based on the table size and the number of indexes. Note + * that only one worker can be used for an index, we invoke + * parallelism for index vacuuming only if there are at least two + * indexes on a table. + */ vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels, vacrel->nindexes, nworkers, vac_work_mem, vacrel->verbose ? INFO : DEBUG2, - vacrel->bstrategy); + vacrel->bstrategy, (void *) vacrel); Is this information misplaced? Why describe here "only one worker" and "at least two indexes on a table"? I don't see anything here checking those conditions. ~~~ heap_parallel_vacuum_compute_workers: 2.7. + /* + * Select the number of workers based on the log of the size of the + * relation. This probably needs to be a good deal more + * sophisticated, but we need something here for now. Note that the + * upper limit of the min_parallel_table_scan_size GUC is chosen to + * prevent overflow here. + */ The "This probably needs to..." part maybe should have an "XXX" marker in the comment which AFAIK is used to highlight current decisions and potential for future changes. ~~~ heap_parallel_vacuum_initialize: 2.8. There is inconsistent capitalization of the single-line comments in this function. The same occurs in many functions in this file, but it is just a bit more obvious in this one. Please see all the others too. ~~~ parallel_heap_complete_unfinised_scan: 2.9. +static void +parallel_heap_complete_unfinised_scan(LVRelState *vacrel) TYPO in function name /unfinised/unfinished/ ~~~ 2.10. + if (!wstate->maybe_have_blocks) + + continue; Unnecessary blank line. ~~~ 2.11. + + /* Attache the worker's scan state and do heap scan */ + vacrel->phvstate->myscanstate = wstate; + scan_done = do_lazy_scan_heap(vacrel); /Attache/Attach/ ~~~ 2.12. + /* + * We don't need to gather the scan statistics here because statistics + * have already been accumulated the leaders statistics directly. + */ "have already been accumulated the leaders" -- missing word there somewhere? ~~~ do_parallel_lazy_scan_heap: 2.13.
+ /* + * If the heap scan paused in the middle of the table due to full of + * dead_items TIDs, perform a round of index and heap vacuuming. + */ + if (!scan_done) + { + /* Perform a round of index and heap vacuuming */ + vacrel->consider_bypass_optimization = false; + lazy_vacuum(vacrel); + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. + */ + if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum) + { + /* + * min_blkno should have already been updated when gathering + * statistics + */ + FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum, + vacrel->phvstate->min_blkno + 1); + vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno; + } + + /* Report that we are once again scanning the heap */ + pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, + PROGRESS_VACUUM_PHASE_SCAN_HEAP); + + /* re-launcher workers */ + vacrel->phvstate->nworkers_launched = + parallel_vacuum_table_scan_begin(vacrel->pvs); + + continue; + } + + /* We reach the end of the table */ + break; Instead of: if (!scan_done) { <other code ...> continue; } break; Won't it be better to refactor like: SUGGESTION if (scan_done) break; <other code...> ~~~ 2.14. + /* + * The parallel heap vacuum finished, but it's possible that some workers + * have allocated blocks but not processed yet. This can happen for + * example when workers exit because of full of dead_items TIDs and the + * leader process could launch fewer workers in the next cycle. + */ There seem to be some missing words: e.g. /not processed yet./not processed them yet./ e.g. /because of full of dead_items/because they are full of dead_items/ ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, Thanks for working on this. I took a quick look at this today, to do some basic review. I plan to do a bunch of testing, but that's mostly to get a better idea of what kind of improvements to expect - the initial results look quite nice and sensible. A couple basic comments: 1) I really like the idea of introducing "compute_workers" callback to the heap AM interface. I faced a similar issue with calculating workers for index builds, because right now plan_create_index_workers is doing that; the logic works for btree, but really not for brin etc. It didn't occur to me we might make this part of the index AM ... 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h because it really has nothing to do with planning / paths. I realize it needs the min_parallel_table_scan_size, but it doesn't seem right. I guess it's a sign this bit of code (calculating parallel workers based on log of relation size) should be in some "shared" location. 3) The difference in naming ParallelVacuumState vs. PHVState is a bit weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to make it consistent and clear. 4) I think it would be good to have some sort of README explaining how the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a while to realize how the workers coordinate which blocks to scan. 5) Wouldn't it be better to introduce the scan_stats (grouping some of the fields) in a separate patch? Seems entirely independent from the parallel part, so doing it separately would make it easier to review. Also, maybe reference the fields through scan_stats->x, instead of through vacrel->scan_stats->x, when there's the pointer. 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats? AFAIK it's not a statistic, it's actually a parameter affecting the decisions, right? 7) I find it a bit strange that heap_vac_scan_next_block() needs to check if it's a parallel scan, and redirect to the parallel callback. I mean, shouldn't the caller know which callback to invoke? Why should the serial callback care about this? 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the current block. Can you explain? 9) I'm a bit confused why the code checks IsParallelWorker() in so many places. Doesn't that mean the leader can't participate in the vacuum like a regular worker? 10) I'm not quite sure I understand the comments at the end of do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that means "index vacuuming", right? And then it says "pause without invoking index and heap vacuuming" but isn't the whole point of this block to do that cleanup so that the TidStore can be discarded? Maybe I just don't understand how the work is divided between the leader and workers ... 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this the patch still builds fine for me. thanks -- Tomas Vondra
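To illustrate the shape of review point 7 above, a minimal sketch of letting the caller pick the next-block function once. The type and function names are taken from the quoted patch hunks; this is not the patch's actual code and assumes the patch's declarations are in scope.

/*
 * Sketch for review point 7: do_lazy_scan_heap() could select the
 * next-block function once, instead of having the serial
 * heap_vac_scan_next_block() check for parallelism internally.
 */
typedef bool (*vac_next_block_fn) (LVRelState *vacrel, BlockNumber *blkno,
                                   bool *all_visible_according_to_vm);

BlockNumber blkno;
bool        all_visible_according_to_vm;
vac_next_block_fn next_block =
    ParallelHeapVacuumIsActive(vacrel) ? heap_vac_scan_next_block_parallel
                                       : heap_vac_scan_next_block;

while (next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
    /* ... process the block as before ... */
}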
Dear Tomas, > 1) I really like the idea of introducing "compute_workers" callback to > the heap AM interface. I faced a similar issue with calculating workers > for index builds, because right now plan_create_index_workers is doing > that the logic works for btree, but really not brin etc. It didn't occur > to me we might make this part of the index AM ... +1, so let's keep the proposed style. Or, can we even propose the idea for the table/index access method APIs? I've considered it a bit, and the point seems to be which arguments should be required. > 4) I think it would be good to have some sort of README explaining how > the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a > while to realize how the workers coordinate which blocks to scan. I love the idea, it is quite helpful for reviewers like me. Best regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > Thanks for working on this. I took a quick look at this today, to do > some basic review. I plan to do a bunch of testing, but that's mostly to > get a better idea of what kind of improvements to expect - the initial > results look quite nice and sensible. Thank you for reviewing the patch! > A couple basic comments: > > 1) I really like the idea of introducing "compute_workers" callback to > the heap AM interface. I faced a similar issue with calculating workers > for index builds, because right now plan_create_index_workers is doing > that the logic works for btree, but really not brin etc. It didn't occur > to me we might make this part of the index AM ... Thanks. > > 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h > because it really has nothing to do with planning / paths. I realize it > needs the min_parallel_table_scan_size, but it doesn't seem right. I > guess it's a sign this bit of code (calculating parallel workers based > on log of relation size) should in some "shared" location. True. The same is actually true also for vacuumparallel.c. It includes optimizer/paths.h to use min_parallel_index_scan_size. > > 3) The difference in naming ParallelVacuumState vs. PHVState is a bit > weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to > make it consistent and clear. With the patch, since ParallelVacuumState is no longer dedicated for parallel index vacuuming we cannot rename them in this way. Both parallel table scanning/vacuuming and parallel index vacuuming can use the same ParallelVacuumState instance. The heap-specific necessary data for parallel heap scanning and vacuuming are stored in PHVState. > > 4) I think it would be good to have some sort of README explaining how > the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a > while to realize how the workers coordinate which blocks to scan. +1. I will add README in the next version patch. > > 5) Wouldn't it be better to introduce the scan_stats (grouping some of > the fields in a separate patch)? Seems entirely independent from the > parallel part, so doing it separately would make it easier to review. > Also, maybe reference the fields through scan_stats->x, instead of > through vacrel->scan_stats->x, when there's the pointer. Agreed. > 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats? > AFAIK it's not a statistic, it's actually a parameter affecting the > decisions, right? Right. It would be better to move them to a separate struct or somewhere. > > 7) I find it a bit strange that heap_vac_scan_next_block() needs to > check if it's a parallel scan, and redirect to the parallel callback. I > mean, shouldn't the caller know which callback to invoke? Why should the > serial callback care about this? do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is called in serial vacuum and parallel vacuum cases. I wanted to make heap_vac_scan_next_block() workable in both cases. I think it also makes sense to have do_lazy_scan_heap() calls either function depending on parallel scan being enabled. > > 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the > current block. Can you explain? The workers' current block numbers are used to calculate the minimum block number where we've scanned so far. In serial scan case, we vacuum FSM of the particular block range for every VACUUM_FSM_EVERY_PAGES pages . 
On the other hand, in parallel scan case, it doesn't make sense to vacuum FSM in that way because we might not have processed some blocks in the block range. So the idea is to calculate the minimum block number where we've scanned so far and vacuum FSM of the range of consecutive already-scanned blocks. > > 9) I'm a bit confused why the code checks IsParallelWorker() in so many > places. Doesn't that mean the leader can't participate in the vacuum > like a regular worker? I used '!isParallelWorker()' for some jobs that should be done only by the leader process. For example, checking failsafe mode, vacuuming FSM etc. > > 10) I'm not quite sure I understand the comments at the end of > do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that > means "index vacuuming", right? It means both index vacuuming and heap vacuuming. > And then it says "pause without invoking > index and heap vacuuming" but isn't the whole point of this block to do > that cleanup so that the TidStore can be discarded? Maybe I just don't > understand how the work is divided between the leader and workers ... The comment needs to be updated. But what the patch does is that when the memory usage of the shared TidStore reaches the limit, worker processes exit after updating the shared statistics, and then the leader invokes (parallel) index vacuuming and parallel heap vacuuming. Since the different number workers could be used for parallel heap scan, parallel index vacuuming, and parallel heap vacuuming, the leader process waits for all workers to finish at end of each phase. > 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this > the patch still builds fine for me. Oh, I might have missed something. I'll check if it's really necessary. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
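To make the FSM coordination described above concrete, here is a small standalone sketch (illustrative only; the names are made up and this is not the patch's code) of how the leader can derive a safe FSM-vacuum range from the block numbers the workers advertise:

/*
 * Standalone illustration: if each worker advertises the block it is
 * currently scanning, every block below the minimum of those values has
 * already been scanned by someone, so the FSM for the range
 * [next_fsm_block_to_vacuum, min) can be vacuumed safely.
 */
#include <stdio.h>

typedef unsigned int BlockNumber;

static BlockNumber
min_advertised_block(const BlockNumber *current_blocks, int nworkers)
{
    BlockNumber min = (BlockNumber) -1;

    for (int i = 0; i < nworkers; i++)
        if (current_blocks[i] < min)
            min = current_blocks[i];
    return min;
}

int
main(void)
{
    BlockNumber current_blocks[] = {1200, 950, 4000};  /* hypothetical workers */
    BlockNumber next_fsm_block_to_vacuum = 0;
    BlockNumber min = min_advertised_block(current_blocks, 3);

    if (min > next_fsm_block_to_vacuum)
        printf("vacuum FSM for blocks [%u, %u)\n",
               next_fsm_block_to_vacuum, min);
    return 0;
}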
On 12/13/24 00:04, Tomas Vondra wrote: > ... > > The main difference is here: > > > master / no parallel workers: > > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) > > 1 parallel worker: > > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) > > > Clearly, with parallel vacuum we scan only a tiny fraction of the pages, > essentially just those with deleted tuples, which is ~1/20 of pages. > That's close to the 15x speedup. > > This effect is clearest without indexes, but it does affect even runs > with indexes - having to scan the indexes makes it much less pronounced, > though. However, these indexes are pretty massive (about the same size > as the table) - multiple times larger than the table. Chances are it'd > be clearer on realistic data sets. > > So the question is - is this correct? And if yes, why doesn't the > regular (serial) vacuum do that? > > There's some more strange things, though. For example, how come the avg > read rate is 0.000 MB/s? > > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s > > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the > pages are in shared buffers (thanks to the DELETE earlier in that session). > OK, after looking into this a bit more I think the reason is rather simple - SKIP_PAGES_THRESHOLD. With serial runs, we end up scanning all pages, because even with an update every 5000 tuples, that's still only ~25 pages apart, well within the 32-page window. So we end up skipping no pages, scan and vacuum everything. But parallel runs have this skipping logic disabled, or rather the logic that switches to sequential scans if the gap is less than 32 pages. IMHO this raises two questions: 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to sequential scans if the pages are close enough. Maybe there is a reason for this difference? Workers can reduce the difference between random and sequential I/O, similarly to prefetching. But that just means the workers should use a lower threshold, e.g. as SKIP_PAGES_THRESHOLD / nworkers or something like that? I don't see this discussed in this thread. 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good storage. If I can get an order of magnitude improvement (or more than that) by disabling the threshold, and just doing random I/O, maybe there's time to adjust it a bit. regards -- Tomas Vondra
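As a quick sanity check of the arithmetic above, a tiny standalone sketch of the skip decision (32 matches SKIP_PAGES_THRESHOLD in vacuumlazy.c; the tuples-per-page figure is only a rough assumption consistent with the "~25 pages apart" estimate):

/*
 * Standalone illustration: serial vacuum only skips a run of all-visible
 * pages when the run is at least SKIP_PAGES_THRESHOLD pages long, so a
 * gap of ~25 pages between pages needing work never gets skipped.
 */
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32     /* same value as vacuumlazy.c */

int
main(void)
{
    int tuples_per_page = 200;      /* rough assumption for this narrow table */
    int update_every = 5000;        /* one modified tuple every 5000 tuples */
    int gap = update_every / tuples_per_page;

    printf("gap between pages needing scan: ~%d pages\n", gap);
    printf("skippable run? %s\n",
           gap >= SKIP_PAGES_THRESHOLD ? "yes" : "no - scanned sequentially");
    return 0;
}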
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 12/13/24 00:04, Tomas Vondra wrote: > > ... > > > > The main difference is here: > > > > > > master / no parallel workers: > > > > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) > > > > 1 parallel worker: > > > > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) > > > > > > Clearly, with parallel vacuum we scan only a tiny fraction of the pages, > > essentially just those with deleted tuples, which is ~1/20 of pages. > > That's close to the 15x speedup. > > > > This effect is clearest without indexes, but it does affect even runs > > with indexes - having to scan the indexes makes it much less pronounced, > > though. However, these indexes are pretty massive (about the same size > > as the table) - multiple times larger than the table. Chances are it'd > > be clearer on realistic data sets. > > > > So the question is - is this correct? And if yes, why doesn't the > > regular (serial) vacuum do that? > > > > There's some more strange things, though. For example, how come the avg > > read rate is 0.000 MB/s? > > > > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s > > > > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's > > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the > > pages are in shared buffers (thanks to the DELETE earlier in that session). > > > > OK, after looking into this a bit more I think the reason is rather > simple - SKIP_PAGES_THRESHOLD. > > With serial runs, we end up scanning all pages, because even with an > update every 5000 tuples, that's still only ~25 pages apart, well within > the 32-page window. So we end up skipping no pages, scan and vacuum all > everything. > > But parallel runs have this skipping logic disabled, or rather the logic > that switches to sequential scans if the gap is less than 32 pages. > > > IMHO this raises two questions: > > 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to > sequential scans is the pages are close enough. Maybe there is a reason > for this difference? Workers can reduce the difference between random > and sequential I/0, similarly to prefetching. But that just means the > workers should use a lower threshold, e.g. as > > SKIP_PAGES_THRESHOLD / nworkers > > or something like that? I don't see this discussed in this thread. Each parallel heap scan worker allocates a chunk of blocks which is 8192 blocks at maximum, so we would need to use the SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we need to evaluate the differences anyway. WIll do the benchmark test and share the results. > > 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good > storage. If I can get an order of magnitude improvement (or more than > that) by disabling the threshold, and just doing random I/O, maybe > there's time to adjust it a bit. Yeah, you've started a thread for this so let's discuss it there. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 12/19/24 23:05, Masahiko Sawada wrote: > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 12/13/24 00:04, Tomas Vondra wrote: >>> ... >>> >>> The main difference is here: >>> >>> >>> master / no parallel workers: >>> >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) >>> >>> 1 parallel worker: >>> >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) >>> >>> >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages, >>> essentially just those with deleted tuples, which is ~1/20 of pages. >>> That's close to the 15x speedup. >>> >>> This effect is clearest without indexes, but it does affect even runs >>> with indexes - having to scan the indexes makes it much less pronounced, >>> though. However, these indexes are pretty massive (about the same size >>> as the table) - multiple times larger than the table. Chances are it'd >>> be clearer on realistic data sets. >>> >>> So the question is - is this correct? And if yes, why doesn't the >>> regular (serial) vacuum do that? >>> >>> There's some more strange things, though. For example, how come the avg >>> read rate is 0.000 MB/s? >>> >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s >>> >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the >>> pages are in shared buffers (thanks to the DELETE earlier in that session). >>> >> >> OK, after looking into this a bit more I think the reason is rather >> simple - SKIP_PAGES_THRESHOLD. >> >> With serial runs, we end up scanning all pages, because even with an >> update every 5000 tuples, that's still only ~25 pages apart, well within >> the 32-page window. So we end up skipping no pages, scan and vacuum all >> everything. >> >> But parallel runs have this skipping logic disabled, or rather the logic >> that switches to sequential scans if the gap is less than 32 pages. >> >> >> IMHO this raises two questions: >> >> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to >> sequential scans is the pages are close enough. Maybe there is a reason >> for this difference? Workers can reduce the difference between random >> and sequential I/0, similarly to prefetching. But that just means the >> workers should use a lower threshold, e.g. as >> >> SKIP_PAGES_THRESHOLD / nworkers >> >> or something like that? I don't see this discussed in this thread. > > Each parallel heap scan worker allocates a chunk of blocks which is > 8192 blocks at maximum, so we would need to use the > SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we > need to evaluate the differences anyway. WIll do the benchmark test > and share the results. > Right. I don't think this really matters for small tables, and for large tables the chunks should be fairly large (possibly up to 8192 blocks), in which case we could apply SKIP_PAGE_THRESHOLD just like in the serial case. There might be differences at boundaries between chunks, but that seems like a minor / expected detail. I haven't checked know if the code would need to change / how much. >> >> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good >> storage. If I can get an order of magnitude improvement (or more than >> that) by disabling the threshold, and just doing random I/O, maybe >> there's time to adjust it a bit. > > Yeah, you've started a thread for this so let's discuss it there. > OK. 
FWIW as suggested in the other thread, it doesn't seem to be merely a question of VACUUM performance, as not skipping pages gives vacuum the opportunity to do cleanup that would otherwise need to happen later. If only for this reason, I think it would be good to keep the serial and parallel vacuum consistent. regards -- Tomas Vondra
On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >

IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set?
On Fri, Jan 17, 2025 at 1:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > > > > IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set? I recall we had some discussion on changes to group locking for implementing parallel heap vacuum, but I don't remember if we have a blocker now. One problem we previously had was that since the relation extension locks were not in conflict between parallel workers and the leader, multiple workers could extend the visibility map simultaneously. This problem was fixed by commit 85f6b49c2c. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Jan 17, 2025 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >> >
> >
> > IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set?
>
> I recall we had some discussion on changes to group locking for
> implementing parallel heap vacuum, but I don't remember if we have a
> blocker now.
>
> One problem we previously had was that since the relation extension
> locks were not in conflict between parallel workers and the leader,
> multiple workers could extend the visibility map simultaneously. This
> problem was fixed by commit 85f6b49c2c.

Yes, that's correct. As part of that commit, we made the relation extension lock conflict among group members, ensuring that multiple workers cannot acquire it simultaneously. Additionally, this cannot cause a deadlock because no other locks are held while the relation extension lock is being held.
On Sun, Jan 19, 2025 at 7:50 AM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > Thanks for the new patches. I've repeated my benchmarking on v8, and I > agree this looks fine - the speedups are reasonable and match what I'd > expect on this hardware. I don't see any suspicious results like with > the earlier patches, where it got much faster thanks to the absence of > SKIP_PAGE_THRESHOLD logic. > > Attached is the benchmarking script, CSV with raw results, and then also > two PDF reports comparing visualizing the impact of the patch by > comparing it to current master. > > * parallel-vacuum-duration.pdf - Duration of the vacuum, and duration > relative to master (green - faster, read - slower). The patch is clearly > an improvement, with speedup up to ~3x depending on the index count and > a fraction of updated rows. > > * parallel-vacuum-reads.pdf - Average read speed, as reported by VACUUM > VERBOSE. With the patch it can reach up to ~3GB/s, which is about the > max value possible on this hardware - so that's nice. I'll try to test > this on a better storage, to see how far it can go. Thank you for doing a performance benchmark. These results make sense to me. > I haven't done any actual code review on the new patches, I'll try to > find time for that sometime next week. Thank you! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 21, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Jan 19, 2025 at 7:50 AM Tomas Vondra <tomas@vondra.me> wrote: > > > > Hi, > > > > Thanks for the new patches. I've repeated my benchmarking on v8, and I > > agree this looks fine - the speedups are reasonable and match what I'd > > expect on this hardware. I don't see any suspicious results like with > > the earlier patches, where it got much faster thanks to the absence of > > SKIP_PAGE_THRESHOLD logic. > > > > Attached is the benchmarking script, CSV with raw results, and then also > > two PDF reports comparing visualizing the impact of the patch by > > comparing it to current master. > > > > * parallel-vacuum-duration.pdf - Duration of the vacuum, and duration > > relative to master (green - faster, read - slower). The patch is clearly > > an improvement, with speedup up to ~3x depending on the index count and > > a fraction of updated rows. > > > > * parallel-vacuum-reads.pdf - Average read speed, as reported by VACUUM > > VERBOSE. With the patch it can reach up to ~3GB/s, which is about the > > max value possible on this hardware - so that's nice. I'll try to test > > this on a better storage, to see how far it can go. > > Thank you for doing a performance benchmark. These results make sense to me. > > > I haven't done any actual code review on the new patches, I'll try to > > find time for that sometime next week. > > Thank you! Since we introduced the eager vacuum scan (052026c9b9), I need to update the parallel heap vacuum patch. After thinking of how to integrate these two features, I find some complexities. The region size used by eager vacuum scan and the chunk size used by parallel table scan are different. While the region is fixed size the chunk becomes smaller as we scan the table. A chunk of the table that a parallel vacuum worker took could be across different regions or be within one region, and different parallel heap vacuum workers might scan the same region. And parallel heap vacuum workers could be scanning different regions of the table simultaneously. During eager vacuum scan, we reset the eager_scan_remaining_fails counter when we start to scan the new region. So if we want to make parallel heap vacuum behaves exactly the same way as the single-process vacuum in terms of the eager vacuum scan, we would need to have the eager_scan_remaining_fails counters for each region so that the workers can decrement it corresponding to the region of the block that the worker is scanning. But I'm concerned that it makes the logic very complex. I'd like to avoid making newly introduced codes more complex by adding yet another new code on top of that. Another idea is to disable the eager vacuum scan when parallel heap vacuum is enabled. It might look like just avoiding difficult things but it could make sense in a sense. The eager vacuum scan is aimed to amortize the aggressive vacuum by incrementally freezing pages that are potentially frozen by the next aggressive vacuum. On the other hand, parallel heap vacuum is available only in manual VACUUM and would be used to remove garbage on a large table as soon as possible or to freeze the entire table to avoid reaching the XID limit. So I think it might make sense to disable the eager vacuum scan when parallel vacuum. Thoughts? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
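To illustrate the mismatch being described, here is a standalone sketch (numbers only; this is not patch code, and the 32MB region size and 8kB block size are assumptions taken from the follow-up discussion) mapping parallel-scan chunks onto fixed-size eager-scan regions:

/*
 * Standalone illustration: with 8kB blocks, a 32MB eager-scan region is
 * 4096 blocks, while a parallel-scan chunk can be up to 8192 blocks and
 * shrinks as the scan progresses.  A single chunk can therefore span
 * several regions, and chunks taken by different workers can land in the
 * same region.
 */
#include <stdio.h>

#define BLOCKS_PER_REGION 4096      /* 32MB region / 8kB block, assumed */

static void
report_chunk(const char *who, unsigned int start_block, unsigned int nblocks)
{
    unsigned int first_region = start_block / BLOCKS_PER_REGION;
    unsigned int last_region = (start_block + nblocks - 1) / BLOCKS_PER_REGION;

    printf("%s: chunk [%u, %u) covers region(s) %u..%u\n",
           who, start_block, start_block + nblocks, first_region, last_region);
}

int
main(void)
{
    report_chunk("worker 1", 0, 8192);      /* early, large chunk: regions 0-1 */
    report_chunk("worker 2", 10000, 512);   /* later, smaller chunk: region 2 */
    report_chunk("worker 1", 10512, 64);    /* same region as worker 2's chunk */
    return 0;
}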
On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. Would it be simpler to make only phase III parallel? In other words, how much of the infrastructure and complexity needed for parallel phase I is also needed for phase III? -- John Naylor Amazon Web Services
On Thu, Feb 13, 2025 at 8:16 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > During eager vacuum scan, we reset the eager_scan_remaining_fails > > counter when we start to scan the new region. So if we want to make > > parallel heap vacuum behaves exactly the same way as the > > single-progress vacuum in terms of the eager vacuum scan, we would > > need to have the eager_scan_remaining_fails counters for each region > > so that the workers can decrement it corresponding to the region of > > the block that the worker is scanning. But I'm concerned that it makes > > the logic very complex. I'd like to avoid making newly introduced > > codes more complex by adding yet another new code on top of that. > > Would it be simpler to make only phase III parallel? Yes, I think so. > In other words, > how much of the infrastructure and complexity needed for parallel > phase I is also needed for phase III? Both phases need some common changes to the parallel vacuum infrastructure so that we can launch parallel workers using ParallelVacuumContext for phase I and III and parallel vacuum workers can do its task on the heap table. Specifically, we need to change vacuumparallel.c to work also in parallel heap vacuum case, and add some table AM callbacks. Other than that, these phases have different complexity. As for supporting parallelism for phase III, changes to lazy vacuum would not be very complex, it needs to change both the radix tree and TidStore to support shared iteration through. Supporting parallelism of phase I is more complex since it integrates parallel table scan with lazy heap scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > update the parallel heap vacuum patch. After thinking of how to > integrate these two features, I find some complexities. The region > size used by eager vacuum scan and the chunk size used by parallel > table scan are different. While the region is fixed size the chunk > becomes smaller as we scan the table. A chunk of the table that a > parallel vacuum worker took could be across different regions or be > within one region, and different parallel heap vacuum workers might > scan the same region. And parallel heap vacuum workers could be > scanning different regions of the table simultaneously. Ah, I see. What are the chunk size ranges? I picked a 32 MB region size after a little testing and mostly because it seemed reasonable. I think it would be fine to use different region size. Parallel workers could just consider the chunk they get an eager scan region (unless it is too small or too large -- then it might not make sense). > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. I don't think it would have to behave exactly the same. I think we just don't want to add a lot of complexity or make it hard to reason about. Since the failure rate is defined as a percent, couldn't we just have parallel workers set eager_scan_remaining_fails when they get their chunk assignment (as a percentage of their chunk size)? (I haven't looked at the code, so maybe this doesn't make sense). For the success cap, we could have whoever hits it first disable eager scanning for all future assigned chunks. > Another idea is to disable the eager vacuum scan when parallel heap > vacuum is enabled. It might look like just avoiding difficult things > but it could make sense in a sense. The eager vacuum scan is aimed to > amortize the aggressive vacuum by incrementally freezing pages that > are potentially frozen by the next aggressive vacuum. On the other > hand, parallel heap vacuum is available only in manual VACUUM and > would be used to remove garbage on a large table as soon as possible > or to freeze the entire table to avoid reaching the XID limit. So I > think it might make sense to disable the eager vacuum scan when > parallel vacuum. Do we only do parallelism in manual vacuum because we don't want to use up too many parallel workers for a maintenance subsystem? I never really tried to find out why parallel index vacuuming is only in manual vacuum. I assume you made the same choice they did for the same reasons. If the idea is to never allow parallelism in vacuum, then I think disabling eager scanning during manual parallel vacuum seems reasonable. People could use vacuum freeze if they want more freezing. Also, if you start with only doing parallelism for the third phase of heap vacuuming (second pass over the heap), this wouldn't be a problem because eager scanning only impacts the first phase. - Melanie
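A bare-bones sketch of the per-chunk budget idea above (purely illustrative; the 3% figure is an assumption standing in for the eager-scan failure cap). It also hints at the difficulty raised in the replies once chunks ramp down to a single block:

/*
 * Standalone sketch of the suggestion above: give each worker an eager-scan
 * failure budget proportional to the chunk of blocks it was just assigned.
 * With the chunk size ramping down toward 1, the budget quickly rounds to
 * zero.
 */
#include <stdio.h>

int
main(void)
{
    double failure_rate = 0.03;             /* assumed eager-scan failure cap */
    int chunk_sizes[] = {8192, 1024, 64, 1};

    for (int i = 0; i < 4; i++)
    {
        int remaining_fails = (int) (chunk_sizes[i] * failure_rate);

        printf("chunk of %4d blocks -> eager_scan_remaining_fails = %d\n",
               chunk_sizes[i], remaining_fails);
    }
    return 0;
}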
On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > > update the parallel heap vacuum patch. After thinking of how to > > integrate these two features, I find some complexities. The region > > size used by eager vacuum scan and the chunk size used by parallel > > table scan are different. While the region is fixed size the chunk > > becomes smaller as we scan the table. A chunk of the table that a > > parallel vacuum worker took could be across different regions or be > > within one region, and different parallel heap vacuum workers might > > scan the same region. And parallel heap vacuum workers could be > > scanning different regions of the table simultaneously. Thank you for your feedback. > Ah, I see. What are the chunk size ranges? I picked a 32 MB region > size after a little testing and mostly because it seemed reasonable. I > think it would be fine to use different region size. Parallel workers > could just consider the chunk they get an eager scan region (unless it > is too small or too large -- then it might not make sense). The maximum chunk size is 8192 blocks, 64MB. As we scan the table, we ramp down the chunk size. It would eventually become 1. > > > During eager vacuum scan, we reset the eager_scan_remaining_fails > > counter when we start to scan the new region. So if we want to make > > parallel heap vacuum behaves exactly the same way as the > > single-progress vacuum in terms of the eager vacuum scan, we would > > need to have the eager_scan_remaining_fails counters for each region > > so that the workers can decrement it corresponding to the region of > > the block that the worker is scanning. But I'm concerned that it makes > > the logic very complex. I'd like to avoid making newly introduced > > codes more complex by adding yet another new code on top of that. > > I don't think it would have to behave exactly the same. I think we > just don't want to add a lot of complexity or make it hard to reason > about. > > Since the failure rate is defined as a percent, couldn't we just have > parallel workers set eager_scan_remaining_fails when they get their > chunk assignment (as a percentage of their chunk size)? (I haven't > looked at the code, so maybe this doesn't make sense). IIUC since the chunk size eventually becomes 1, we cannot simply just have parallel workers set the failure rate to its assigned chunk. > > For the success cap, we could have whoever hits it first disable eager > scanning for all future assigned chunks. Agreed. > > > Another idea is to disable the eager vacuum scan when parallel heap > > vacuum is enabled. It might look like just avoiding difficult things > > but it could make sense in a sense. The eager vacuum scan is aimed to > > amortize the aggressive vacuum by incrementally freezing pages that > > are potentially frozen by the next aggressive vacuum. On the other > > hand, parallel heap vacuum is available only in manual VACUUM and > > would be used to remove garbage on a large table as soon as possible > > or to freeze the entire table to avoid reaching the XID limit. So I > > think it might make sense to disable the eager vacuum scan when > > parallel vacuum. > > Do we only do parallelism in manual vacuum because we don't want to > use up too many parallel workers for a maintenance subsystem? 
I never > really tried to find out why parallel index vacuuming is only in > manual vacuum. I assume you made the same choice they did for the same > reasons. > > If the idea is to never allow parallelism in vacuum, then I think > disabling eager scanning during manual parallel vacuum seems > reasonable. People could use vacuum freeze if they want more freezing. IIUC the purpose of parallel vacuum is incompatible with the purpose of auto vacuum. The former is aimed to execute the vacuum as fast as possible using more resources, whereas the latter is aimed to execute the vacuum while not affecting foreground transaction processing. It's probably worth considering to enable parallel vacuum even for autovacuum in a wraparound situation, but the purpose would remain the same. > Also, if you start with only doing parallelism for the third phase of > heap vacuuming (second pass over the heap), this wouldn't be a problem > because eager scanning only impacts the first phase. Right. I'm inclined to support only the second heap pass as the first step. If we support parallelism only for the second pass, it cannot help speed up freezing the entire table in emergency situations, but it would be beneficial for cases where a big table have a large amount of spread garbage. At least, I'm going to reorganize the patch set to support parallelism for the second pass first and then the first heap pass. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Since the failure rate is defined as a percent, couldn't we just have > > parallel workers set eager_scan_remaining_fails when they get their > > chunk assignment (as a percentage of their chunk size)? (I haven't > > looked at the code, so maybe this doesn't make sense). > > IIUC since the chunk size eventually becomes 1, we cannot simply just > have parallel workers set the failure rate to its assigned chunk. Yep. The ranges are too big (1-8192). The behavior would be too different from serial. > > Also, if you start with only doing parallelism for the third phase of > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > because eager scanning only impacts the first phase. > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. > > At least, I'm going to reorganize the patch set to support parallelism > for the second pass first and then the first heap pass. Makes sense. - Melanie
On Tue, Feb 18, 2025 at 4:43 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > Since the failure rate is defined as a percent, couldn't we just have > > > parallel workers set eager_scan_remaining_fails when they get their > > > chunk assignment (as a percentage of their chunk size)? (I haven't > > > looked at the code, so maybe this doesn't make sense). > > > > IIUC since the chunk size eventually becomes 1, we cannot simply just > > have parallel workers set the failure rate to its assigned chunk. > > Yep. The ranges are too big (1-8192). The behavior would be too > different from serial. > > > > Also, if you start with only doing parallelism for the third phase of > > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > > because eager scanning only impacts the first phase. > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > > > At least, I'm going to reorganize the patch set to support parallelism > > for the second pass first and then the first heap pass. > > Makes sense. I've attached the updated patches. In this version, I focused on parallelizing only the second pass over the heap. It's more straightforward than supporting the first pass, it still requires many preliminary changes though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v9-0003-radixtree.h-support-shared-iteration.patch
- v9-0002-vacuumparallel.c-Support-parallel-table-vacuuming.patch
- v9-0004-tidstore.c-support-shared-iteration-on-TidStore.patch
- v9-0005-Move-some-fields-of-LVRelState-to-LVVacCounters-s.patch
- v9-0006-Support-parallelism-for-removing-dead-items-durin.patch
- v9-0001-Introduces-table-AM-APIs-for-parallel-table-vacuu.patch
On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. I started looking at the most recent patch set for this, and while looking back over the thread, a couple random thoughts occurred to me: On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > - Patched > parallel 0: 53468.734 (15990, 28296, 9465) > parallel 1: 35712.613 ( 8254, 23569, 3700) These measurements were done before phase I and III were stream-ified, but even here those phases were already the quickest ones for a modest number of indexes and a single worker. I think it would have an even bigger difference when only a small percentage of pages need scanning/vacuuming, because it still has to read the entire index. It's a bit unfortunate that the simplest heap phase to parallelize is also the quickest to begin with. Pre-existing issue: We don't get anywhere near 2x improvement in phase II for 2 parallel index scans. We've known for a while that the shared memory TID store has more overhead than private memory, and here that overhead is about the same as the entirety of phase III with a single worker! It may be time to look into mitigations there, independently of this thread. The same commit that made the parallel scanning patch more difficult should also reduce the risk of having a large amount of freezing work at once in the first place. (There are still plenty of systemic things that can go wrong, of course, and it's always good if unpleasant work finishes faster.) I seem to recall a proposal from David Rowley to (something like) batch gathering xids for visibility checks during executor scans, but if so I can't find it in the archives. It's possible some similar work might speed up heap scanning in a more localized fashion. On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > TidStoreBeginIterateShared() is designed for multiple parallel workers > to iterate a shared TidStore. During an iteration, parallel workers > share the iteration state and iterate the underlying radix tree while > taking appropriate locks. Therefore, it's available only for a shared > TidStore. This is required to implement the parallel heap vacuum, > where multiple parallel workers do the iteration on the shared > TidStore. > > On the other hand, TidStoreBeginIterate() is designed for a single > process to iterate a TidStore. It accepts even a shared TidStore as > you mentioned, but during an iteration there is no inter-process > coordination such as locking. When it comes to parallel vacuum, > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > cover the case where we use only parallel index vacuum but not > parallel heap scan/vacuum.
In this case, we need to store dead tuple > TIDs on the shared TidStore during heap scan so parallel workers can > use it during index vacuum. But it's not necessary to use > TidStoreBeginIterateShared() because only one (leader) process does > heap vacuum. The takeaway I got is that the word "shared" is used for two different concepts. Future readers are not going to think to look back at this email for explanation. At the least there needs to be a different word for the new concept. There are quite a lot of additions to radix_tree.h and tidstore.c to make it work this way, including a new "control object", new locks, new atomic variables. I'm not sure it's really necessary. Is it possible to just store the "most recent block heap-vacuumed" in the shared vacuum state? Then each backend would lock the TID store, iterate starting at the next block, and unlock the store when it has enough blocks. Sometimes my brainstorms are unworkable for some reason I failed to think about, but this way seems 95% simpler -- we would only need to teach the existing iteration machinery to take a "start key". -- John Naylor Amazon Web Services
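A rough standalone sketch of the shared-cursor shape being suggested here (this is not tidstore.c code; it only shows workers claiming "start at block X" positions from one shared location, which in the real thing would be the start key handed to the existing iteration machinery):

/*
 * Standalone sketch: workers claim the next batch of heap blocks to vacuum
 * by advancing a shared cursor, then would iterate the TID store starting
 * at that block.  A lock around the store, as described above, would work
 * just as well as the atomic used here.
 */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int next_start_block = 0;

static unsigned int
claim_batch(unsigned int batch_size, unsigned int nblocks_total,
            unsigned int *start)
{
    *start = atomic_fetch_add(&next_start_block, batch_size);
    if (*start >= nblocks_total)
        return 0;                   /* nothing left to claim */
    if (*start + batch_size > nblocks_total)
        batch_size = nblocks_total - *start;
    return batch_size;
}

int
main(void)
{
    unsigned int start;
    unsigned int got;

    /* a single process simulating workers taking 1000-block batches */
    while ((got = claim_batch(1000, 2500, &start)) != 0)
        printf("worker claims blocks [%u, %u)\n", start, start + got);
    return 0;
}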
On Mon, Feb 17, 2025 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > If the idea is to never allow parallelism in vacuum, then I think
> > disabling eager scanning during manual parallel vacuum seems
> > reasonable. People could use vacuum freeze if they want more freezing.
>
> IIUC the purpose of parallel vacuum is incompatible with the purpose
> of auto vacuum. The former is aimed to execute the vacuum as fast as
> possible using more resources, whereas the latter is aimed to execute
> the vacuum while not affecting foreground transaction processing. It's
> probably worth considering to enable parallel vacuum even for
> autovacuum in a wraparound situation, but the purpose would remain the
> same.

That's assuming that the database is running with autovacuum_cost_delay > 0. There's presumably some number of systems that intentionally do not throttle autovacuum. Another consideration is that people are free to set vacuum_cost_page_miss = 0; in those cases it would still be useful to parallelize at least phase 1 and 2. Of course, folks can also set vacuum_cost_page_dirty = 0, in which case parallelization would help phase 2 and 3.

In terms of less hypothetical scenarios, I've definitely run into cases where 1 table accounts for 90%+ of the space used by a database. Parallelism would help ensure the single table is still processed in a timely fashion. (Assuming non-default settings for vacuum throttling.)
On Sun, Feb 23, 2025 at 8:51 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > I started looking at the most recent patch set for this, and while > looking back over the thread, a couple random thoughts occurred to me: > > On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > - Patched > > parallel 0: 53468.734 (15990, 28296, 9465) > > parallel 1: 35712.613 ( 8254, 23569, 3700) > > These measurements were done before before phase I and III were > stream-ified, but even here those phases were already the quickest > ones for a modest number of indexes and a single worker. I think it > would have an even bigger difference when only a small percentage of > pages need scanning/vacuuming, because it still has to read the entire > index. It's a bit unfortunate that the simplest heap phase to > parallelize is also the quickest to begin with. > > Pre-existing issue: We don't get anywhere near 2x improvement in phase > II for 2 parallel index scans. We've known for a while that the shared > memory TID store has more overhead than private memory, and here that > overhead is about the same as the entirety of phase III with a single > worker! It may be time to look into mitigations there, independently > of this thread. Thank you for the comments. I've done simple performance benchmarks and attached the results. In this test, I prepared a table having one integer column and 3GB in size. The 'fraction' column means the fraction of pages with deleted rows; for example fraction=1000 means that the table has pages with deleted rows every 1000 pages. The results show the duration for each phase in addition to the total duration (for PATCHED only) and compares the total duration between the HEAD and the PATCHED. What I can see from these results was that we might not benefit much from parallelizing phase III, unfortunately. Although in the best case the phase III got about 2x speedup, as for the total duration it's about only 10% speedup. My analysis for these results matches what John mentioned; phase III is already the fastest phase and accounts only ~10% of the total execution time, and the overhead of shared TidStore offsets the speedup of phase III. > The same commit that made the parallel scanning patch more difficult > should also reduce the risk of having a large amount of freezing work > at once in the first place. (There are still plenty of systemic things > that can go wrong, of course, and it's always good if unpleasant work > finishes faster.) I think that vacuum would still need to scan a large amount of blocks of the table especially when it is very large and heavily modified. Parallel heap vacuum (only phase I) would be beneficial in case where autovacuum could not catch up. And we might want to consider using parallel heap vacuum also in autovacuum while integrating it with eagar freeze scan. > I seem to recall a proposal from David Rowley to (something like) > batch gathering xids for visibility checks during executor scans, but > if so I can't find it in the archives. It's possible some similar work > might speed up heap scanning in a more localized fashion. 
Interesting. > On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > The takeaway I got is that the word "shared" is used for two different > concepts. Future readers are not going to think to look back at this > email for explanation. At the least there needs to be a different word > for the new concept. > > There are quite a lot of additions to radix_tree.h and tidstore.c to > make it work this way, including a new "control object", new locks, > new atomic variables. I'm not sure it's really necessary. Is it > possible to just store the "most recent block heap-vacuumed" in the > shared vacuum state? Then each backend would lock the TID store, > iterate starting at the next block, and unlock the store when it has > enough blocks. Sometimes my brainstorms are unworkable for some reason > I failed to think about, but this way seems 95% simpler -- we would > only need to teach the existing iteration machinery to take a "start > key". Interesting idea. While it might be worth evaluating this idea, given that the phase III accounts only a small portion of the total vacuum execution time, it might be better to switch focusing on parallelizing phase I. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > What I can see from these results was that we might not benefit much > from parallelizing phase III, unfortunately. Although in the best case > the phase III got about 2x speedup, as for the total duration it's > about only 10% speedup. My analysis for these results matches what > John mentioned; phase III is already the fastest phase and accounts > only ~10% of the total execution time, and the overhead of shared > TidStore offsets the speedup of phase III. So, are you proposing to drop the patches for parallelizing phase III for now? If so, are you planning on posting a set of patches just to parallelize phase I? I haven't looked at the prelim refactoring patches to see if they have independent value. What do you think is reasonable for us to try and do in the next few weeks? > > The same commit that made the parallel scanning patch more difficult > > should also reduce the risk of having a large amount of freezing work > > at once in the first place. (There are still plenty of systemic things > > that can go wrong, of course, and it's always good if unpleasant work > > finishes faster.) > > I think that vacuum would still need to scan a large amount of blocks > of the table especially when it is very large and heavily modified. > Parallel heap vacuum (only phase I) would be beneficial in case where > autovacuum could not catch up. And we might want to consider using > parallel heap vacuum also in autovacuum while integrating it with > eagar freeze scan. I'd be interested to hear more about this. - Melanie
On Tue, Feb 25, 2025 at 9:59 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > What I can see from these results was that we might not benefit much > > from parallelizing phase III, unfortunately. Although in the best case > > the phase III got about 2x speedup, as for the total duration it's > > about only 10% speedup. My analysis for these results matches what > > John mentioned; phase III is already the fastest phase and accounts > > only ~10% of the total execution time, and the overhead of shared > > TidStore offsets the speedup of phase III. > > So, are you proposing to drop the patches for parallelizing phase III > for now? If so, are you planning on posting a set of patches just to > parallelize phase I? I haven't looked at the prelim refactoring > patches to see if they have independent value. What do you think is > reasonable for us to try and do in the next few weeks? Given that we have only about one month until the feature freeze, I find that it's realistic to introduce either one parallelism for PG18 and at least we might want to implement the one first that is more beneficial and helpful for users. Since we found that parallel phase III is not very efficient in many cases, I'm thinking that in terms of PG18 development, we might want to switch focus to parallel phase I, and then go for phase III if we have time. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Given that we have only about one month until the feature freeze, I > find that it's realistic to introduce either one parallelism for PG18 > and at least we might want to implement the one first that is more > beneficial and helpful for users. Since we found that parallel phase > III is not very efficient in many cases, I'm thinking that in terms of > PG18 development, we might want to switch focus to parallel phase I, > and then go for phase III if we have time. Okay, well let me know how I can be helpful. Should I be reviewing a version that is already posted? - Melanie
On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Given that we have only about one month until the feature freeze, I > > find that it's realistic to introduce either one parallelism for PG18 > > and at least we might want to implement the one first that is more > > beneficial and helpful for users. Since we found that parallel phase > > III is not very efficient in many cases, I'm thinking that in terms of > > PG18 development, we might want to switch focus to parallel phase I, > > and then go for phase III if we have time. > > Okay, well let me know how I can be helpful. Should I be reviewing a > version that is already posted? Thank you so much. I'm going to submit the latest patches in a few days for parallelizing the phase I. I would appreciate it if you could review that version. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Given that we have only about one month until the feature freeze, I > > > find that it's realistic to introduce either one parallelism for PG18 > > > and at least we might want to implement the one first that is more > > > beneficial and helpful for users. Since we found that parallel phase > > > III is not very efficient in many cases, I'm thinking that in terms of > > > PG18 development, we might want to switch focus to parallel phase I, > > > and then go for phase III if we have time. > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > version that is already posted? > > Thank you so much. I'm going to submit the latest patches in a few > days for parallelizing the phase I. I would appreciate it if you could > review that version. > I've attached the updated patches that make the phase I (heap scanning) parallel. I'll share the benchmark results soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v10-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v10-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v10-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
- v10-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v10-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > Given that we have only about one month until the feature freeze, I > > > > find that it's realistic to introduce either one parallelism for PG18 > > > > and at least we might want to implement the one first that is more > > > > beneficial and helpful for users. Since we found that parallel phase > > > > III is not very efficient in many cases, I'm thinking that in terms of > > > > PG18 development, we might want to switch focus to parallel phase I, > > > > and then go for phase III if we have time. > > > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > > version that is already posted? > > > > Thank you so much. I'm going to submit the latest patches in a few > > days for parallelizing the phase I. I would appreciate it if you could > > review that version. > > > > I've attached the updated patches that make the phase I (heap > scanning) parallel. I'll share the benchmark results soon. > I've attached the benchmark test results. Overall, with the parallel heap scan (phase I), vacuum got much faster. On the other hand, looking at each phase I can see performance regressions in some cases: First, we can see a regression on a table with one index due to the overhead of the shared TidStore. Currently, we disable parallel index vacuuming if the table has only one index, as the leader process always takes one index. With this patch, we enable parallel heap scan even if parallel index vacuuming is disabled, ending up using the shared TidStore. In the benchmark test, while the regression due to that overhead is about 25%, the speedup from the parallel heap scan is 50% or more, so the performance numbers are good overall. I think we can improve the shared TidStore in the future. Another performance regression I can see in the results is that the heap vacuum phase (phase III) got slower with the patch. It's weird to me since I don't touch the code of the heap vacuum phase. I'm still investigating the cause. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Another performance regression I can see in the results is that heap > vacuum phase (phase III) got slower with the patch. It's weired to me > since I don't touch the code of heap vacuum phase. I'm still > investigating the cause. I have investigated this regression. I've confirmed that In both scenarios (patched and unpatched), the entire table and its associated indexes were loaded into the shared buffer before the vacuum. Then, the 'perf record' analysis, focused specifically on the heap vacuum phase of the patched code, revealed numerous soft page faults occurring: 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel | |--52.44%--lazy_vacuum_heap_rel | | | |--46.33%--lazy_vacuum_heap_page (inlined) | | | | | |--32.42%--heap_page_is_all_visible (inlined) | | | | | | | |--26.46%--HeapTupleSatisfiesVacuum | | | | HeapTupleSatisfiesVacuumHorizon | | | | HeapTupleHeaderXminCommitted (inlined) | | | | | | | | | --18.52%--page_fault | | | | do_page_fault | | | | __do_page_fault | | | | handle_mm_fault | | | | __handle_mm_fault | | | | handle_pte_fault | | | | | | | | | |--16.53%--filemap_map_pages | | | | | | | | | | | --2.63%--alloc_set_pte | | | | | pfn_pte | | | | | | | | | --1.99%--pmd_page_vaddr | | | | | | | --1.99%--TransactionIdPrecedes I did not observe these page faults in the 'perf record' results for the HEAD version. Furthermore, when I disabled parallel heap vacuum while keeping parallel index vacuuming enabled, the regression disappeared. Based on these findings, the likely cause of the regression appears to be that during parallel heap vacuum operations, table blocks were loaded into the shared buffer by parallel vacuum workers. However, in the heap vacuum phase, the leader process needed to process all blocks, resulting in soft page faults while creating Page Table Entries (PTEs). Without the patch, the backend process had already created PTEs during the heap scan, thus preventing these faults from occurring during the heap vacuum phase. It appears to be an inherent side effect of utilizing parallel queries. Given this understanding, it's likely an acceptable trade-off that we can accommodate. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Some minor review comments for patch v10-0001. ====== src/include/access/tableam.h 1. struct IndexInfo; +struct ParallelVacuumState; +struct ParallelContext; +struct ParallelWorkerContext; struct SampleScanState; Use alphabetical order for consistency with existing code. ~~~ 2. + /* + * Estimate the size of shared memory that the parallel table vacuum needs + * for AM + * 2a. Missing period (.) 2b. Change the wording to be like below, for consistency with the other 'estimate' function comments, and for consistency with the comment where this function is implemented. Estimate the size of shared memory needed for a parallel table vacuum of this relation. ~~~ 3. + * The state_out is the output parameter so that an arbitrary data can be + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. Typo? "an arbitrary data" ~~~ 4. General/Asserts All the below functions have a comment saying "Not called if parallel table vacuum is disabled." - parallel_vacuum_estimate - parallel_vacuum_initialize - parallel_vacuum_initialize_worker - parallel_vacuum_collect_dead_items But, it's only a comment. I wondered if they should all have some Assert as an integrity check on that. ~~~ 5. +/* + * Return the number of parallel workers for a parallel vacuum scan of this + * relation. + */ "Return the number" or "Compute the number"? The comment should match the comment in the fwd declaration of this function. ~~~ 6. +/* + * Perform a parallel vacuums scan to collect dead items. + */ 6a. "Perform" or "Execute"? The comment should match the one in the fwd declaration of this function. 6b. Typo "vacuums" ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. FYI. I am observing the following test behaviour: I apply patch v10-0001, do a clean rebuild and run 'make check', and all tests are OK. Then, after I apply just patch v10-0002 on top of 0001, do a clean rebuild and run 'make check' there are many test fails. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. Here are some review comments for patch v10-0002. ====== src/backend/access/heap/heapam_handler.c 1. .scan_bitmap_next_block = heapam_scan_bitmap_next_block, .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, .scan_sample_next_block = heapam_scan_sample_next_block, - .scan_sample_next_tuple = heapam_scan_sample_next_tuple + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, + + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, Pretty much every other heam AM method here has a 'heapam_' prefix. Is it better to be consistent and call the new function 'heapam_parallel_vacuum_compute_workers' instead of 'heap_parallel_vacuum_compute_workers'? ====== src/backend/access/heap/vacuumlazy.c 2. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Disabled so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) +{ + return 0; +} + Instead of saying "Disabled so far", the function comment maybe should say: "Return 0 means parallel heap vacuum is disabled." Then the comment doesn't need to churn later when the function gets implemented in later patches. ====== src/backend/commands/vacuumparallel.c 3. +/* The kind of parallel vacuum work */ +typedef enum +{ + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkItem; + Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for the field name: 'work_phase' seems more appropriate. ~~~ 4. + /* + * Processing indexes or removing dead tuples from the table. + */ + PVWorkItem work_item; Missing question mark for this comment? ~~~ 5. + /* + * The number of workers for parallel table vacuuming. If > 0, the + * parallel table vacuum is enabled. + */ + int nworkers_for_table; + I guess this field will never be negative. So is it simpler to modify the comment to say: "If 0, parallel table vacuum is disabled." ~~~ parallel_vacuum_init: 6. + /* A parallel vacuum must be requested */ Assert(nrequested_workers >= 0); It's not very intuitive to say the user requested a parallel vacuum when the 'nrequested_workers' is 0. I felt some more commentary is needed here or in the function header; it seems nrequested_workers == 0 has a special meaning of having the system decide the number of workers. ~~~ parallel_vacuum_compute_workers: 7. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, bool *will_parallel_vacuum) { int nindexes_parallel = 0; int nindexes_parallel_bulkdel = 0; int nindexes_parallel_cleanup = 0; - int parallel_workers; + int nworkers_table = 0; + int nworkers_index = 0; + + *nworkers_table_p = 0; 7a. AFAICT you can just remove the variable 'nworkers_table', and instead call the 'nworkers_table_p' parameter as 'nworkers_table' ~ 7b. IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now that the patch introduces parallel table vacuuming it makes this existing 'will_parallel_vacuum' name generic/misleading. Maybe it now needs to be called 'will_parallel_index_vacuum' or similar (in all places). ~~~ 8. + if (nrequested > 0) + { + /* + * If the parallel degree is specified, accept that as the number of + * parallel workers for table vacuum (though still cap at + * max_parallel_maintenance_workers). 
+ */ + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); + } + else + { + /* Compute the number of workers for parallel table scan */ + nworkers_table = table_parallel_vacuum_compute_workers(rel, + max_parallel_maintenance_workers); + + Assert(nworkers_table <= max_parallel_maintenance_workers); + } + The Assert can be outside of the if/else because it is the same for both. ~~~ 9. /* No index supports parallel vacuum */ - if (nindexes_parallel <= 0) - return 0; - - /* Compute the parallel degree */ - parallel_workers = (nrequested > 0) ? - Min(nrequested, nindexes_parallel) : nindexes_parallel; + if (nindexes_parallel > 0) + { + /* Take into account the requested number of workers */ + nworkers_index = (nrequested > 0) ? + Min(nrequested, nindexes_parallel) : nindexes_parallel; - /* Cap by max_parallel_maintenance_workers */ - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); + /* Cap by max_parallel_maintenance_workers */ + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); + } "No index supports..." seems to be an old comment that is not correct for this new code block. ~~~ 10. + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); +} + + Double blank line after function 'parallel_vacuum_process_table' ~~~ main: 11. + /* Initialize AM-specific vacuum state for parallel table vacuuming */ + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + ParallelWorkerContext pwcxt; + + pwcxt.toc = toc; + pwcxt.seg = seg; + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, + &state); + } + Wondering if this code can be done in the same if block later: "if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" ~~~ 12. + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + /* Scan the table to collect dead items */ + parallel_vacuum_process_table(&pvs, state); + } + else + { + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); + + /* Process indexes to perform vacuum/cleanup */ + parallel_vacuum_process_safe_indexes(&pvs); + } Would this if/else be better implemented as a 'switch' for the possible phases? ====== Kind Regards, Peter Smith. Fujitsu Australia
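For what it's worth, the switch suggested in comment 12 could look like the short sketch below. The enum values and the two helper functions are the ones quoted from the v10 patch, so this is only a restatement of the quoted if/else, not new behavior.

/* Sketch of review comment 12: dispatch on the work phase with a switch */
switch (pvs.shared->work_item)
{
    case PV_WORK_ITEM_COLLECT_DEAD_ITEMS:
        /* Scan the table to collect dead items */
        parallel_vacuum_process_table(&pvs, state);
        break;

    case PV_WORK_ITEM_PROCESS_INDEXES:
        /* Process indexes to perform vacuum/cleanup */
        parallel_vacuum_process_safe_indexes(&pvs);
        break;

    default:
        elog(ERROR, "unexpected parallel vacuum work phase %d",
             (int) pvs.shared->work_item);
}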
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > Given that we have only about one month until the feature freeze, I > > > > > find that it's realistic to introduce either one parallelism for PG18 > > > > > and at least we might want to implement the one first that is more > > > > > beneficial and helpful for users. Since we found that parallel phase > > > > > III is not very efficient in many cases, I'm thinking that in terms of > > > > > PG18 development, we might want to switch focus to parallel phase I, > > > > > and then go for phase III if we have time. > > > > > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > > > version that is already posted? > > > > > > Thank you so much. I'm going to submit the latest patches in a few > > > days for parallelizing the phase I. I would appreciate it if you could > > > review that version. > > > > > > > I've attached the updated patches that make the phase I (heap > > scanning) parallel. I'll share the benchmark results soon. > > > > I've attached the benchmark test results. > > Overall, with the parallel heap scan (phase I), the vacuum got speedup > much. On the other hand, looking at each phase I can see performance > regressions in some cases: > > First, we can see the regression on a table with one index due to > overhead of the shared TidStore. Currently, we disable parallel index > vacuuming if the table has only one index as the leader process always > takes one index. With this patch, we enable parallel heap scan even if > the parallel index vacuuming is disabled, ending up using the shared > TidStore. In the benchmark test, while the regression due to that > overhead is about ~25% the speedup by parallel heap scan is 50%~, so > the performance number is good overall. I think we can improve the > shared TidStore in the future. > > Another performance regression I can see in the results is that heap > vacuum phase (phase III) got slower with the patch. It's weired to me > since I don't touch the code of heap vacuum phase. I'm still > investigating the cause. > Discussing with Amit offlist, I've run another benchmark test where no data is loaded on the shared buffer. In the previous test, I loaded all table blocks before running vacuum, so it was the best case. The attached test results showed the worst case. Overall, while the numbers seem not stable, the phase I got sped up a bit, but not as scalable as expected, which is not surprising. Please note that the test results shows that the phase III also got sped up but this is because in parallel vacuum we use more ring buffers than the single process vacuum. So we need to compare the only phase I time in terms of the benefit of the parallelism. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Thu, Mar 6, 2025 at 5:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Some minor review comments for patch v10-0001. > > ====== > src/include/access/tableam.h > > 1. > struct IndexInfo; > +struct ParallelVacuumState; > +struct ParallelContext; > +struct ParallelWorkerContext; > struct SampleScanState; > > Use alphabetical order for consistency with existing code. > > ~~~ > > 2. > + /* > + * Estimate the size of shared memory that the parallel table vacuum needs > + * for AM > + * > > 2a. > Missing period (.) > > 2b. > Change the wording to be like below, for consistency with the other > 'estimate' function comments, and for consistency with the comment > where this function is implemented. > > Estimate the size of shared memory needed for a parallel table vacuum > of this relation. > > ~~~ > > 3. > + * The state_out is the output parameter so that an arbitrary data can be > + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. > > Typo? "an arbitrary data" > > ~~~ > > 4. General/Asserts > > All the below functions have a comment saying "Not called if parallel > table vacuum is disabled." > - parallel_vacuum_estimate > - parallel_vacuum_initialize > - parallel_vacuum_initialize_worker > - parallel_vacuum_collect_dead_items > > But, it's only a comment. I wondered if they should all have some > Assert as an integrity check on that. > > ~~~ > > 5. > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > > "Return the number" or "Compute the number"? > The comment should match the comment in the fwd declaration of this function. > > ~~~ > > 6. > +/* > + * Perform a parallel vacuums scan to collect dead items. > + */ > > 6a. > "Perform" or "Execute"? > The comment should match the one in the fwd declaration of this function. > > 6b. > Typo "vacuums" > Thank you for reviewing the patch. I'll address these comments and submit the updated version patches soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 6, 2025 at 10:56 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-San. Here are some review comments for patch v10-0002. Thank you for reviewing the patch. > > ====== > src/backend/access/heap/heapam_handler.c > > 1. > .scan_bitmap_next_block = heapam_scan_bitmap_next_block, > .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, > .scan_sample_next_block = heapam_scan_sample_next_block, > - .scan_sample_next_tuple = heapam_scan_sample_next_tuple > + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, > + > + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, > > Pretty much every other heam AM method here has a 'heapam_' prefix. Is > it better to be consistent and call the new function > 'heapam_parallel_vacuum_compute_workers' instead of > 'heap_parallel_vacuum_compute_workers'? Hmm, given that the existing vacuum-related callback, heap_vacuum_rel, uses 'heap' I guess it's okay with 'heap_' prefix for parallel vacuum callbacks. > > ====== > src/backend/access/heap/vacuumlazy.c > > 2. > +/* > + * Compute the number of workers for parallel heap vacuum. > + * > + * Disabled so far. > + */ > +int > +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) > +{ > + return 0; > +} > + > > Instead of saying "Disabled so far", the function comment maybe should say: > "Return 0 means parallel heap vacuum is disabled." > > Then the comment doesn't need to churn later when the function gets > implemented in later patches. Updated. > > ====== > src/backend/commands/vacuumparallel.c > > 3. > +/* The kind of parallel vacuum work */ > +typedef enum > +{ > + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ > + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ > +} PVWorkItem; > + > > Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for > the field name: 'work_phase' seems more appropriate. Agreed. > > ~~~ > > 4. > + /* > + * Processing indexes or removing dead tuples from the table. > + */ > + PVWorkItem work_item; > > Missing question mark for this comment? Updated. > > ~~~ > > 5. > + /* > + * The number of workers for parallel table vacuuming. If > 0, the > + * parallel table vacuum is enabled. > + */ > + int nworkers_for_table; > + > > I guess this field will never be negative. So is it simpler to modify > the comment to say: > "If 0, parallel table vacuum is disabled." Agreed. > > ~~~ > > parallel_vacuum_init: > > 6. > + /* A parallel vacuum must be requested */ > Assert(nrequested_workers >= 0); > > It's not very intuitive to say the user requested a parallel vacuum > when the 'nrequested_workers' is 0. I felt some more commentary is > needed here or in the function header; it seems nrequested_workers == > 0 has a special meaning of having the system decide the number of > workers. Added some comments. > > ~~~ > > parallel_vacuum_compute_workers: > > 7. > static int > -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int > nrequested, > +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, > + int nrequested, int *nworkers_table_p, > bool *will_parallel_vacuum) > { > int nindexes_parallel = 0; > int nindexes_parallel_bulkdel = 0; > int nindexes_parallel_cleanup = 0; > - int parallel_workers; > + int nworkers_table = 0; > + int nworkers_index = 0; > + > + *nworkers_table_p = 0; > > 7a. 
> AFAICT you can just remove the variable 'nworkers_table', and instead > call the 'nworkers_table_p' parameter as 'nworkers_table' Yes, but I wanted to not modify the output parameter many times. > > ~ > > 7b. > IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now > that the patch introduces parallel table vacuuming it makes this > existing 'will_parallel_vacuum' name generic/misleading. Maybe it now > needs to be called 'will_parallel_index_vacuum' or similar (in all > places). Agreed. > > ~~~ > > 8. > + if (nrequested > 0) > + { > + /* > + * If the parallel degree is specified, accept that as the number of > + * parallel workers for table vacuum (though still cap at > + * max_parallel_maintenance_workers). > + */ > + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); > + } > + else > + { > + /* Compute the number of workers for parallel table scan */ > + nworkers_table = table_parallel_vacuum_compute_workers(rel, > + max_parallel_maintenance_workers); > + > + Assert(nworkers_table <= max_parallel_maintenance_workers); > + } > + > > The Assert can be outside of the if/else because it is the same for both. This part has been removed by modifying the parallel degree computation logic. > > ~~~ > > 9. > /* No index supports parallel vacuum */ > - if (nindexes_parallel <= 0) > - return 0; > - > - /* Compute the parallel degree */ > - parallel_workers = (nrequested > 0) ? > - Min(nrequested, nindexes_parallel) : nindexes_parallel; > + if (nindexes_parallel > 0) > + { > + /* Take into account the requested number of workers */ > + nworkers_index = (nrequested > 0) ? > + Min(nrequested, nindexes_parallel) : nindexes_parallel; > > - /* Cap by max_parallel_maintenance_workers */ > - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); > + /* Cap by max_parallel_maintenance_workers */ > + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); > + } > > "No index supports..." seems to be an old comment that is not correct > for this new code block. Removed. > > ~~~ > > 10. > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > +} > + > + > > Double blank line after function 'parallel_vacuum_process_table' Removed. > > ~~~ > > main: > > 11. > + /* Initialize AM-specific vacuum state for parallel table vacuuming */ > + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + ParallelWorkerContext pwcxt; > + > + pwcxt.toc = toc; > + pwcxt.seg = seg; > + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, > + &state); > + } > + > > Wondering if this code can be done in the same if block later: "if > (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" If we do that, we would end up calling InstrStartParallelQuery() before the worker initialization. I guess we want to avoid counting these usage during the initialization. > > ~~~ > > 12. > + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + /* Scan the table to collect dead items */ > + parallel_vacuum_process_table(&pvs, state); > + } > + else > + { > + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); > + > + /* Process indexes to perform vacuum/cleanup */ > + parallel_vacuum_process_safe_indexes(&pvs); > + } > > Would this if/else be better implemented as a 'switch' for the possible phases? > Okay, changed. I've attached the updated version patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v11-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v11-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v11-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v11-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v11-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
Hi Sawada-San, Here are some review comments for patch v11-0001. ====== src/backend/access/heap/vacuumlazy.c 1. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Return 0 to disable parallel vacuum so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested) You don't need to say "so far". ====== src/backend/access/table/tableamapi.c 2. Assert(routine->relation_vacuum != NULL); + Assert(routine->parallel_vacuum_compute_workers != NULL); Assert(routine->scan_analyze_next_block != NULL); Is it better to keep these Asserts in the same order that the TableAmRoutine fields are assigned (in heapam_handler.c)? ~~~ 3. + /* + * Callbacks for parallel vacuum are also optional (except for + * parallel_vacuum_compute_workers). But one callback implies presence of + * the others. + */ + Assert(((((routine->parallel_vacuum_estimate == NULL) == + (routine->parallel_vacuum_initialize == NULL)) == + (routine->parallel_vacuum_initialize_worker == NULL)) == + (routine->parallel_vacuum_collect_dead_items == NULL))); /also optional/optional/ ====== src/include/access/heapam.h +extern int heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested); 4. wrong tab/space after 'int'. ====== src/include/access/tableam.h 5. + /* + * Compute the number of parallel workers for parallel table vacuum. The + * parallel degree for parallel vacuum is further limited by + * max_parallel_maintenance_workers. The function must return 0 to disable + * parallel table vacuum. + * + * 'nworkers_requested' is a >=0 number and the requested number of + * workers. This comes from the PARALLEL option. 0 means to choose the + * parallel degree based on the table AM specific factors such as table + * size. + */ + int (*parallel_vacuum_compute_workers) (Relation rel, + int nworkers_requested); The comment here says "This comes from the PARALLEL option." and "0 means to choose the parallel degree...". But, the PG docs [1] says "To disable this feature, one can use PARALLEL option and specify parallel workers as zero.". These two descriptions "disable this feature" (PG docs) and letting the system "choose the parallel degree" (code comment) don't sound the same. Should this 0001 patch update the PG documentation for the effect of setting PARALLEL value zero? ~~~ 6. + /* + * Initialize DSM space for parallel table vacuum. + * + * Not called if parallel table vacuum is disabled. + * + * Optional callback, but either all other parallel vacuum callbacks need + * to exist, or neither. + */ "or neither"? Also, saying "all other" seems incorrect because parallel_vacuum_compute_workers callback must exist event if parallel_vacuum_initialize does not exist. IMO you meant to say "all optional", and "or none". SUGGESTION: Optional callback. Either all optional parallel vacuum callbacks need to exist, or none. (this same issue is repeated in multiple places). ====== [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
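One possible way to express the "all optional callbacks or none" rule from comment 3 without the chained equalities is to count the callbacks that were provided; this is only an illustrative alternative using the callback names quoted above, not code from the patch.

/* Sketch: the optional parallel-vacuum callbacks must be given as a group */
int nprovided = (routine->parallel_vacuum_estimate != NULL) +
                (routine->parallel_vacuum_initialize != NULL) +
                (routine->parallel_vacuum_initialize_worker != NULL) +
                (routine->parallel_vacuum_collect_dead_items != NULL);

Assert(nprovided == 0 || nprovided == 4);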
Hi Sawada-San. Here are some review comments for patch v11-0002 ====== Commit message. 1. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. This function implementation was moved into patch 0001, so probably this part of the commit message comment also belongs now in patch 0001. ====== src/backend/commands/vacuumparallel.c 2. + * For processing indexes in parallel, individual indexes are processed by one + * vacuum process. We launch parallel worker processes at the start of parallel index + * bulk-deletion and index cleanup and once all indexes are processed, the parallel + * worker processes exit. + * "are processed by one vacuum process." -- Did you mean "are processed by separate vacuum processes." ? ~~~ 3. + * + * Each time we process table or indexes in parallel, the parallel context is + * re-initialized so that the same DSM can be used for multiple passes of table vacuum + * or index bulk-deletion and index cleanup. Maybe I am mistaken, but it seems like the logic is almost always re-initializing this. I wonder if it might be simpler to just remove the 'need_reinitialize_dsm' field and initialize unconditionally. ~~~ 4. + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. + * * On success, return parallel vacuum state. Otherwise return NULL. */ SUGGESTION nrequested_workers is the requested parallel degree (>=0). 0 means that... ~~~ 5. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum) IIUC the returns for this function seem inconsistent. AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. Why not make this a void function, but introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'? ====== Kind Regards, Peter Smith. Fujitsu Australia
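Regarding comment 5 above, one possible shape for the signature is sketched below: make the function void and report both parallel degrees through output parameters. The table-side computation mirrors the hunks quoted earlier in the thread; the index-counting part is elided, and the exact parameter names are only assumptions, not taken from the patch.

/* Sketch of review comment 5: report both degrees via output parameters */
static void
parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
                                int nrequested,
                                int *nworkers_table_p, int *nworkers_index_p,
                                bool *idx_will_parallel_vacuum)
{
    int         nindexes_parallel = 0;  /* filled by the elided index checks */
    int         nworkers_table;
    int         nworkers_index = 0;

    /* table side: honor the user's request, otherwise ask the table AM */
    if (nrequested > 0)
        nworkers_table = Min(nrequested, max_parallel_maintenance_workers);
    else
        nworkers_table = table_parallel_vacuum_compute_workers(rel,
                                        max_parallel_maintenance_workers);

    /*
     * ... count indexes supporting parallel vacuum into nindexes_parallel,
     * setting idx_will_parallel_vacuum[] as in the existing code ...
     */

    if (nindexes_parallel > 0)
    {
        nworkers_index = (nrequested > 0) ?
            Min(nrequested, nindexes_parallel) : nindexes_parallel;
        nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers);
    }

    *nworkers_table_p = nworkers_table;
    *nworkers_index_p = nworkers_index;
}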
On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Another performance regression I can see in the results is that heap > > vacuum phase (phase III) got slower with the patch. It's weired to me > > since I don't touch the code of heap vacuum phase. I'm still > > investigating the cause. > > I have investigated this regression. I've confirmed that In both > scenarios (patched and unpatched), the entire table and its associated > indexes were loaded into the shared buffer before the vacuum. Then, > the 'perf record' analysis, focused specifically on the heap vacuum > phase of the patched code, revealed numerous soft page faults > occurring: > > 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel > | > |--52.44%--lazy_vacuum_heap_rel > | | > | |--46.33%--lazy_vacuum_heap_page (inlined) > | | | > | | |--32.42%--heap_page_is_all_visible (inlined) > | | | | > | | | |--26.46%--HeapTupleSatisfiesVacuum > | | | | > HeapTupleSatisfiesVacuumHorizon > | | | | > HeapTupleHeaderXminCommitted (inlined) > | | | | | > | | | | --18.52%--page_fault > | | | | do_page_fault > | | | | > __do_page_fault > | | | | > handle_mm_fault > | | | | > __handle_mm_fault > | | | | > handle_pte_fault > | | | | | > | | | | > |--16.53%--filemap_map_pages > | | | | | | > | | | | | > --2.63%--alloc_set_pte > | | | | | > pfn_pte > | | | | | > | | | | > --1.99%--pmd_page_vaddr > | | | | > | | | --1.99%--TransactionIdPrecedes > > I did not observe these page faults in the 'perf record' results for > the HEAD version. Furthermore, when I disabled parallel heap vacuum > while keeping parallel index vacuuming enabled, the regression > disappeared. Based on these findings, the likely cause of the > regression appears to be that during parallel heap vacuum operations, > table blocks were loaded into the shared buffer by parallel vacuum > workers. > In the previous paragraph, you mentioned that the entire table and its associated indexes were loaded into the shared buffer before the vacuum. If that is true, then why does the parallel vacuum need to reload the table blocks into shared buffers? > However, in the heap vacuum phase, the leader process needed > to process all blocks, resulting in soft page faults while creating > Page Table Entries (PTEs). Without the patch, the backend process had > already created PTEs during the heap scan, thus preventing these > faults from occurring during the heap vacuum phase. > This part is again not clear to me because I am assuming all the data exists in shared buffers before the vacuum, so why the page faults will occur in the first place. -- With Regards, Amit Kapila.
On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Discussing with Amit offlist, I've run another benchmark test where no > data is loaded on the shared buffer. In the previous test, I loaded > all table blocks before running vacuum, so it was the best case. The > attached test results showed the worst case. > > Overall, while the numbers seem not stable, the phase I got sped up a > bit, but not as scalable as expected, which is not surprising. > Sorry, but it is difficult for me to understand this data because it doesn't contain the schema or details like what exactly is a fraction. It is also not clear how the workers are divided among heap and indexes, like do we use parallelism for both phases of heap or only first phase and do we reuse those workers for index vacuuming. These tests were probably discussed earlier, but it would be better to either add a summary of the required information to understand the results or at least a link to a previous email that has such details. Please > note that the test results shows that the phase III also got sped up > but this is because in parallel vacuum we use more ring buffers than > the single process vacuum. So we need to compare the only phase I time > in terms of the benefit of the parallelism. > Does phase 3 also use parallelism? If so, can we try to divide the ring buffers among workers or at least try vacuum with an increased number of ring buffers. This would be good to do for both the phases, if they both use parallelism. -- With Regards, Amit Kapila.
Hi Sawada-San. Some review comments for patch v11-0003. ====== src/backend/access/heap/vacuumlazy.c 1. +typedef struct LVScanData +{ + BlockNumber rel_pages; /* total number of pages */ + + BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */ + + /* + * Count of all-visible blocks eagerly scanned (for logging only). This + * does not include skippable blocks scanned due to SKIP_PAGES_THRESHOLD. + */ + BlockNumber eager_scanned_pages; + + BlockNumber removed_pages; /* # pages removed by relation truncation */ + BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */ + + + /* # pages newly set all-visible in the VM */ + BlockNumber vm_new_visible_pages; + + /* + * # pages newly set all-visible and all-frozen in the VM. This is a + * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes + * all pages set all-visible, but vm_new_visible_frozen_pages includes + * only those which were also set all-frozen. + */ + BlockNumber vm_new_visible_frozen_pages; + + /* # all-visible pages newly set all-frozen in the VM */ + BlockNumber vm_new_frozen_pages; + + BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */ + BlockNumber missed_dead_pages; /* # pages with missed dead tuples */ + BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ + + /* Counters that follow are only for scanned_pages */ + int64 tuples_deleted; /* # deleted from table */ + int64 tuples_frozen; /* # newly frozen */ + int64 lpdead_items; /* # deleted from indexes */ + int64 live_tuples; /* # live tuples remaining */ + int64 recently_dead_tuples; /* # dead, but not yet removable */ + int64 missed_dead_tuples; /* # removable, but not removed */ + + /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */ + TransactionId NewRelfrozenXid; + MultiXactId NewRelminMxid; + bool skippedallvis; +} LVScanData; + 1a. Double blank line after 'new_frozen_tuple_pages' field. ~ 1b. I know you have only refactored existing code but IMO, the use of "#" meaning "Number of" or "Count of" doesn't seem nice. It *may* be justifiable to prevent wrapping of the short comments on the same line as the fields, but for all the other larger comments it looks strange not to be spelled out properly as words. ~~~ heap_vacuum_rel: 2. /* Initialize page counters explicitly (be tidy) */ - vacrel->scanned_pages = 0; - vacrel->eager_scanned_pages = 0; - vacrel->removed_pages = 0; - vacrel->new_frozen_tuple_pages = 0; - vacrel->lpdead_item_pages = 0; - vacrel->missed_dead_pages = 0; - vacrel->nonempty_pages = 0; + scan_data = palloc(sizeof(LVScanData)); + scan_data->scanned_pages = 0; + scan_data->eager_scanned_pages = 0; + scan_data->removed_pages = 0; + scan_data->new_frozen_tuple_pages = 0; + scan_data->lpdead_item_pages = 0; + scan_data->missed_dead_pages = 0; + scan_data->nonempty_pages = 0; + scan_data->tuples_deleted = 0; + scan_data->tuples_frozen = 0; + scan_data->lpdead_items = 0; + scan_data->live_tuples = 0; + scan_data->recently_dead_tuples = 0; + scan_data->missed_dead_tuples = 0; + scan_data->vm_new_visible_pages = 0; + scan_data->vm_new_visible_frozen_pages = 0; + scan_data->vm_new_frozen_pages = 0; + scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel); + vacrel->scan_data = scan_data; Or, you could replace most of these assignments by using palloc0, and write a comment to say all the counters were zapped to 0. ~~~ 3. 
+ scan_data->vm_new_frozen_pages = 0; + scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel); + vacrel->scan_data = scan_data; /* dead_items_alloc allocates vacrel->dead_items later on */ That comment about dead_items_alloc seems misplaced now because it is grouped with all the scan_data stuff. ====== Kind Regards, Peter Smith. Fujitsu Australia
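A small sketch of the palloc0 suggestion in comment 2: zero the whole counter struct in one allocation, and keep explicit assignments only for fields whose initial values are not zero (exactly which fields those are is per the patch, not shown here).

/* Sketch of review comment 2: zero the counter struct in one go */
LVScanData *scan_data;

scan_data = (LVScanData *) palloc0(sizeof(LVScanData));

/* all block/tuple counters start at zero thanks to palloc0 */
scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
/* NewRelfrozenXid, NewRelminMxid, etc. still get their explicit values here */

vacrel->scan_data = scan_data;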
On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Another performance regression I can see in the results is that heap > > > vacuum phase (phase III) got slower with the patch. It's weired to me > > > since I don't touch the code of heap vacuum phase. I'm still > > > investigating the cause. > > > > I have investigated this regression. I've confirmed that In both > > scenarios (patched and unpatched), the entire table and its associated > > indexes were loaded into the shared buffer before the vacuum. Then, > > the 'perf record' analysis, focused specifically on the heap vacuum > > phase of the patched code, revealed numerous soft page faults > > occurring: > > > > 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel > > | > > |--52.44%--lazy_vacuum_heap_rel > > | | > > | |--46.33%--lazy_vacuum_heap_page (inlined) > > | | | > > | | |--32.42%--heap_page_is_all_visible (inlined) > > | | | | > > | | | |--26.46%--HeapTupleSatisfiesVacuum > > | | | | > > HeapTupleSatisfiesVacuumHorizon > > | | | | > > HeapTupleHeaderXminCommitted (inlined) > > | | | | | > > | | | | --18.52%--page_fault > > | | | | do_page_fault > > | | | | > > __do_page_fault > > | | | | > > handle_mm_fault > > | | | | > > __handle_mm_fault > > | | | | > > handle_pte_fault > > | | | | | > > | | | | > > |--16.53%--filemap_map_pages > > | | | | | | > > | | | | | > > --2.63%--alloc_set_pte > > | | | | | > > pfn_pte > > | | | | | > > | | | | > > --1.99%--pmd_page_vaddr > > | | | | > > | | | --1.99%--TransactionIdPrecedes > > > > I did not observe these page faults in the 'perf record' results for > > the HEAD version. Furthermore, when I disabled parallel heap vacuum > > while keeping parallel index vacuuming enabled, the regression > > disappeared. Based on these findings, the likely cause of the > > regression appears to be that during parallel heap vacuum operations, > > table blocks were loaded into the shared buffer by parallel vacuum > > workers. > > > > In the previous paragraph, you mentioned that the entire table and its > associated indexes were loaded into the shared buffer before the > vacuum. If that is true, then why does the parallel vacuum need to > reload the table blocks into shared buffers? Hmm, my above sentences are wrong. All pages are loaded into the shared buffer before the vacuum test. In parallel heap vacuum cases, the leader process reads a part of the table during phase I whereas in single-process vacuum case, the process reads the entire table. So there will be differences in the phase III in terms of PTEs as described below. > > > However, in the heap vacuum phase, the leader process needed > > to process all blocks, resulting in soft page faults while creating > > Page Table Entries (PTEs). Without the patch, the backend process had > > already created PTEs during the heap scan, thus preventing these > > faults from occurring during the heap vacuum phase. > > > > This part is again not clear to me because I am assuming all the data > exists in shared buffers before the vacuum, so why the page faults > will occur in the first place. IIUC PTEs are process-local data. So even if physical pages are loaded to PostgreSQL's shared buffer (and paga caches), soft page faults (or minor page faults)[1] can occur if these pages are not yet mapped in its page table. 
Regards, [1] https://en.wikipedia.org/wiki/Page_fault#Minor -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Discussing with Amit offlist, I've run another benchmark test where no > > data is loaded on the shared buffer. In the previous test, I loaded > > all table blocks before running vacuum, so it was the best case. The > > attached test results showed the worst case. > > > > Overall, while the numbers seem not stable, the phase I got sped up a > > bit, but not as scalable as expected, which is not surprising. > > > > Sorry, but it is difficult for me to understand this data because it > doesn't contain the schema or details like what exactly is a fraction. > It is also not clear how the workers are divided among heap and > indexes, like do we use parallelism for both phases of heap or only > first phase and do we reuse those workers for index vacuuming. These > tests were probably discussed earlier, but it would be better to > either add a summary of the required information to understand the > results or at least a link to a previous email that has such details. The testing configurations are: max_wal_size = 50GB shared_buffers = 25GB max_parallel_maintenance_workers = 10 max_parallel_workers = 20 max_worker_processes = 30 The test scripts are: ($m and $p are a fraction and a parallel degree, respectively) create unlogged table test_vacuum (a bigint) with (autovacuum_enabled=off); insert into test_vacuum select i from generate_series(1,200000000) s(i); create index idx_0 on test_vacuum (a); create index idx_1 on test_vacuum (a); create index idx_2 on test_vacuum (a); create index idx_3 on test_vacuum (a); create index idx_4 on test_vacuum (a); delete from test_vacuum where mod(a, $m) = 0; vacuum (verbose, parallel $p) test_vacuum; -- measured the execution time > > Please > > note that the test results shows that the phase III also got sped up > > but this is because in parallel vacuum we use more ring buffers than > > the single process vacuum. So we need to compare the only phase I time > > in terms of the benefit of the parallelism. > > > > Does phase 3 also use parallelism? If so, can we try to divide the > ring buffers among workers or at least try vacuum with an increased > number of ring buffers. This would be good to do for both the phases, > if they both use parallelism. No, only phase 1 was parallelized in this test. In parallel vacuum, since it uses (ring_buffer_size * parallel_degree) memory, more pages are loaded during phase 1, increasing cache hits during phase 3. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the updated version patches. I've started trying to review this and realized that, while I'm familiar with heap vacuuming code, I'm not familiar enough with the vacuumparallel.c machinery to be of help without much additional study. As such, I have mainly focused on reading the comments in your code. I think your comment in vacuumlazy.c describing the design could use more detail and a bit of massaging. For example, I don't know what you mean when you say: * We could require different number of parallel vacuum workers for each phase * for various factors such as table size and number of indexes. Does that refer to something you did implement or you are saying we could do that in the future? - Melanie
On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > ring buffers among workers or at least try vacuum with an increased > > number of ring buffers. This would be good to do for both the phases, > > if they both use parallelism. > > No, only phase 1 was parallelized in this test. In parallel vacuum, > since it uses (ring_buffer_size * parallel_degree) memory, more pages > are loaded during phase 1, increasing cache hits during phase 3. > Shouldn't we ideally try with a vacuum without parallelism with ring_buffer_size * parallel_degree to make the comparison better? Also, what could be the reason for the variation in data of phase-I? Do you restart the system after each run to ensure there is nothing in the memory? If not, then shouldn't we try at least a few runs by restarting the system before each run to ensure there is nothing leftover in memory? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > However, in the heap vacuum phase, the leader process needed > > > to process all blocks, resulting in soft page faults while creating > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > already created PTEs during the heap scan, thus preventing these > > > faults from occurring during the heap vacuum phase. > > > > > > > This part is again not clear to me because I am assuming all the data > > exists in shared buffers before the vacuum, so why the page faults > > will occur in the first place. > > IIUC PTEs are process-local data. So even if physical pages are loaded > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > minor page faults)[1] can occur if these pages are not yet mapped in > its page table. > Okay, I got your point. BTW, I noticed that even for the case where all the data is in shared_buffers, the performance improvement for workers greater than two does decrease marginally. Am I reading the data correctly? If so, what is the theory, and do we have recommendations for a parallel degree? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 5:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the updated version patches. > > I've started trying to review this and realized that, while I'm > familiar with heap vacuuming code, I'm not familiar enough with the > vacuumparallel.c machinery to be of help without much additional > study. As such, I have mainly focused on reading the comments in your > code. Thank you for looking at the patch. > > I think your comment in vacuumlazy.c describing the design could use > more detail and a bit of massaging. > > For example, I don't know what you mean when you say: > > * We could require different number of parallel vacuum workers for each phase > * for various factors such as table size and number of indexes. > > Does that refer to something you did implement or you are saying we > could do that in the future? It referred to the parallel heap vacuum implementation that I wrote. Since the parallel degrees for parallel heap scan and parallel index vacuuming are chosen separately based on different factors, we launch a different number of workers for each phase and they exit at the end of each phase. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > to process all blocks, resulting in soft page faults while creating > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > already created PTEs during the heap scan, thus preventing these > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > exists in shared buffers before the vacuum, so why the page faults > > > will occur in the first place. > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > minor page faults)[1] can occur if these pages are not yet mapped in > > its page table. > > > > Okay, I got your point. BTW, I noticed that even for the case where > all the data is in shared_buffers, the performance improvement for > workers greater than two does decrease marginally. Am I reading the > data correctly? If so, what is the theory, and do we have > recommendations for a parallel degree? Is the decrease you're referring to in the total vacuum execution time? When it comes to the execution time of phase 1, it seems we have good scalability. For example, with 2 workers (i.e., 3 processes working in total, including the leader) it got about a 3x speedup, and with 4 workers about a 5x speedup. Regarding the other phases, phase 3 got slower probably because of the PTE behavior described above, but I haven't investigated why phase 2 also got slightly slower with more than 2 workers. In the current patch, the parallel degree for phase 1 is chosen based on the table size, which is almost the same as the calculation of the degree for a parallel seq scan. But thinking further, we might want to account for the number of all-visible pages and all-frozen pages here so that we can avoid launching many workers for mostly-frozen big tables. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
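As a concrete illustration of the last point (accounting for all-visible and all-frozen pages when choosing the degree for phase 1), a heuristic along these lines could size the worker count on the pages the vacuum will actually scan. This is a hypothetical helper, not code from the patch: the one-worker-per-1GB threshold is arbitrary, and only visibilitymap_count(), RelationGetNumberOfBlocks(), and Min() are existing APIs.

/* Hypothetical sketch: base the phase-1 parallel degree on scannable pages */
static int
parallel_heap_scan_workers_hint(Relation rel, bool aggressive, int max_workers)
{
    BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);
    BlockNumber all_visible;
    BlockNumber all_frozen;
    BlockNumber scannable;
    int         nworkers;

    visibilitymap_count(rel, &all_visible, &all_frozen);

    /* a non-aggressive vacuum skips all-visible pages, an aggressive one only all-frozen */
    scannable = rel_pages - Min(rel_pages, aggressive ? all_frozen : all_visible);

    /* roughly one worker per 1GB of pages to scan (8kB blocks); arbitrary */
    nworkers = scannable / (BlockNumber) (128 * 1024);

    return Min(nworkers, max_workers);
}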
On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > > ring buffers among workers or at least try vacuum with an increased > > > number of ring buffers. This would be good to do for both the phases, > > > if they both use parallelism. > > > > No, only phase 1 was parallelized in this test. In parallel vacuum, > > since it uses (ring_buffer_size * parallel_degree) memory, more pages > > are loaded during phase 1, increasing cache hits during phase 3. > > > > Shouldn't we ideally try with a vacuum without parallelism with > ring_buffer_size * parallel_degree to make the comparison better? Right. I'll share the benchmark test results with such a configuration. > Also, what could be the reason for the variation in data of phase-I? > Do you restart the system after each run to ensure there is nothing in > the memory? If not, then shouldn't we try at least a few runs by > restarting the system before each run to ensure there is nothing > leftover in memory? I dropped all page caches by executing 'echo 3 > /proc/sys/vm/drop_caches' before each run, and these results are the median of 3 runs. I'll investigate it further. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Some thoughts/questions on the idea: I notice that we are always considering block-level parallelism for heaps and object-level parallelism for indexes. I'm wondering, when multiple tables are being vacuumed together—either because the user has provided a list of tables or has specified a partitioned table with multiple children—does it still make sense to default to block-level parallelism? Or could we consider table-level parallelism in such cases? For example, if there are 4 tables and 6 workers, with 2 tables being small and the other 2 being large, perhaps we could allocate 4 workers to vacuum all 4 tables in parallel. For the larger tables, we could apply block-level parallelism, using more workers for internal parallelism. On the other hand, if all tables are small, we could just apply table-level parallelism without needing block-level parallelism at all. This approach could offer more flexibility, couldn't it? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Some more specific comments on the patch set: -- v11-0001 1. This introduces functions like parallel_vacuum_estimate(), parallel_vacuum_initialize(), etc., but these functions haven't been assigned to the respective members in TableAMRoutine. As shown in the hunk below, only parallel_vacuum_compute_workers is initialized, while parallel_vacuum_estimate and the others are not. These are assigned in later patches, which makes the patch division appear a bit unclean. +++ b/src/backend/access/heap/heapam_handler.c @@ -2688,7 +2688,9 @@ static const TableAmRoutine heapam_methods = { .scan_bitmap_next_block = heapam_scan_bitmap_next_block, .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, .scan_sample_next_block = heapam_scan_sample_next_block, - .scan_sample_next_tuple = heapam_scan_sample_next_tuple + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, + + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, }; -- v11-0002 2. In the commit message, you mentioned a function named "parallel_vacuum_remove_dead_items_begin()", but no such function is introduced in the patch; I think you meant 'parallel_vacuum_collect_dead_items_begin()'? 3. I would suggest a rewrite for this: /launched for the table vacuuming and index processing./launched for the table pruning phase and index vacuuming. 4. These comments talk about DSA and DSM, but we had better clarify what we are using DSM and DSA for, or give a reference to the location where that is explained. + * In a parallel vacuum, we perform table scan or both index bulk deletion and + * index cleanup or all of them with parallel worker processes. Different + * numbers of workers are launched for the table vacuuming and index processing. + * ParallelVacuumState contains shared information as well as the memory space + * for storing dead items allocated in the DSA area. + * + * When initializing parallel table vacuum scan, we invoke table AM routines for + * estimating DSM sizes and initializing DSM memory. Parallel table vacuum + * workers invoke the table AM routine for vacuuming the table. 5. Is there any particular reason why marking the dead TID as reusable is not done in parallel? Is it because parallelizing that phase wouldn't make sense due to the work involved, or is there another reason? +/* The kind of parallel vacuum phases */ +typedef enum +{ + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkPhase; 6. In the hunk below, the sentence "nrequested_workers is >= 0 number and the requested parallel degree." doesn't make sense to me; can you rephrase it? + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. That's what I got so far; I will continue reviewing 0002 and the remaining patches from the thread. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
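For point 1, once the later patches in the series are applied, the heapam_methods initializer presumably ends up with the whole callback family assigned together, roughly like this (the names beyond heap_parallel_vacuum_compute_workers are inferred from the callback names and may not match the patches exactly):

    .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
    .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
    .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
    .parallel_vacuum_initialize_worker = heap_parallel_vacuum_initialize_worker,
    .parallel_vacuum_collect_dead_items = heap_parallel_vacuum_collect_dead_items,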
On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > to process all blocks, resulting in soft page faults while creating > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > already created PTEs during the heap scan, thus preventing these > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > exists in shared buffers before the vacuum, so why the page faults > > > > will occur in the first place. > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > its page table. > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > all the data is in shared_buffers, the performance improvement for > > workers greater than two does decrease marginally. Am I reading the > > data correctly? If so, what is the theory, and do we have > > recommendations for a parallel degree? > > The decrease you referred to is that the total vacuum execution time? > Right. > When it comes to the execution time of phase 1, it seems we have good > scalability. For example, with 2 workers (i.e.3 workers working > including the leader in total) it got about 3x speed up, and with 4 > workers it got about 5x speed up. Regarding other phases, the phase 3 > got slower probably because of PTEs stuff, but I don't investigate why > the phase 2 also slightly got slower with more than 2 workers. > Could it be possible that now phase-2 needs to access the shared area for TIDs, and some locking/unlocking causes such slowdown? -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Some thoughts/questions on the idea > > I notice that we are always considering block-level parallelism for > heaps and object-level parallelism for indexes. I'm wondering, when > multiple tables are being vacuumed together—either because the user > has provided a list of tables or has specified a partitioned table > with multiple children—does it still make sense to default to > block-level parallelism? Or could we consider table-level parallelism > in such cases? For example, if there are 4 tables and 6 workers, with > 2 tables being small and the other 2 being large, perhaps we could > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > tables, we could apply block-level parallelism, using more workers for > internal parallelism. On the other hand, if all tables are small, we > could just apply table-level parallelism without needing block-level > parallelism at all. This approach could offer more flexibility, isn't > it? > I have not thought from this angle, but it seems we can build this even on top of block-level vacuum for large tables. -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Some thoughts/questions on the idea > > > > I notice that we are always considering block-level parallelism for > > heaps and object-level parallelism for indexes. I'm wondering, when > > multiple tables are being vacuumed together—either because the user > > has provided a list of tables or has specified a partitioned table > > with multiple children—does it still make sense to default to > > block-level parallelism? Or could we consider table-level parallelism > > in such cases? For example, if there are 4 tables and 6 workers, with > > 2 tables being small and the other 2 being large, perhaps we could > > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > > tables, we could apply block-level parallelism, using more workers for > > internal parallelism. On the other hand, if all tables are small, we > > could just apply table-level parallelism without needing block-level > > parallelism at all. This approach could offer more flexibility, isn't > > it? > > > > I have not thought from this angle, but it seems we can build this > even on top of block-level vacuum for large tables. Yes, that can be built on top of block-level vacuum. In that case, we can utilize the workers more efficiently, depending on how many workers we have and how many tables need to be vacuumed. And yes, that could also be discussed separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 3:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > > to process all blocks, resulting in soft page faults while creating > > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > > already created PTEs during the heap scan, thus preventing these > > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > > exists in shared buffers before the vacuum, so why the page faults > > > > > will occur in the first place. > > > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > > its page table. > > > > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > > all the data is in shared_buffers, the performance improvement for > > > workers greater than two does decrease marginally. Am I reading the > > > data correctly? If so, what is the theory, and do we have > > > recommendations for a parallel degree? > > > > The decrease you referred to is that the total vacuum execution time? > > > > Right. > > > When it comes to the execution time of phase 1, it seems we have good > > scalability. For example, with 2 workers (i.e.3 workers working > > including the leader in total) it got about 3x speed up, and with 4 > > workers it got about 5x speed up. Regarding other phases, the phase 3 > > got slower probably because of PTEs stuff, but I don't investigate why > > the phase 2 also slightly got slower with more than 2 workers. > > > > Could it be possible that now phase-2 needs to access the shared area > for TIDs, and some locking/unlocking causes such slowdown? No, TidStore is shared in this case but we don't take a lock on it during phase 2. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
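For reference, the locking asymmetry being described is roughly the following, in terms of the existing TidStore API (a sketch of how the thread describes the patch; variable names are illustrative, and whether the lock calls appear explicitly like this depends on the patch):

    /* Phase 1: a worker holds the exclusive lock only while adding dead TIDs. */
    TidStoreLockExclusive(vacrel->dead_items);
    TidStoreSetBlockOffsets(vacrel->dead_items, blkno, deadoffsets, ndeadoffsets);
    TidStoreUnlock(vacrel->dead_items);

    /* Phase 2: the shared store is only probed, so no lock is taken here. */
    if (TidStoreIsMember(vacrel->dead_items, &tid))
    {
        /* mark the item pointer LP_UNUSED, as lazy_vacuum_heap_page() does */
    }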
On Fri, Mar 7, 2025 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the updated version patches. When testing multiple passes of table vacuuming, I found an issue. With the current patch, both the leader and parallel worker processes stop phase 1 as soon as the shared TidStore size reaches the limit, and then the leader resumes the parallel heap scan after heap vacuuming and index vacuuming. Therefore, as I described in the patch, one tricky part of this patch is that it's possible that we launch fewer workers than the previous time when resuming phase 1 after phases 2 and 3. In this case, since the previous parallel workers might have already allocated some blocks in their chunk, newly launched workers need to take over their parallel scan state. That's why in the patch we store workers' ParallelBlockTableScanWorkerData in shared memory. However, I found my assumption is wrong; in order to take over the previous scan state properly we need to take over not only ParallelBlockTableScanWorkerData but also ReadStream data, as parallel workers might have already queued some blocks for look-ahead in their ReadStream. Looking at the ReadStream code, I find that it's not realistic to store it in shared memory. One plausible solution would be to not use ReadStream in the parallel heap vacuum case but directly use table_block_parallelscan_xxx() instead. It works, but we end up having two different scan methods for the parallel and non-parallel lazy heap scan. I've implemented this idea in the attached v12 patches. Other than the above change, I've fixed some bugs and addressed the review comments I got so far. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v12-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v12-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v12-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v12-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v12-0005-Support-parallelism-for-collecting-dead-items-du.patch
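The direct table_block_parallelscan_xxx() flow described above amounts to a per-worker loop roughly like the following (a sketch only; the memory-limit check stands in for the patch's real dead-items accounting, and pbscan/pbscanwork are the scan descriptors kept in DSM):

    BlockNumber blkno;

    table_block_parallelscan_startblock_init(rel, pbscanwork, pbscan);

    while ((blkno = table_block_parallelscan_nextpage(rel, pbscanwork, pbscan)) !=
           InvalidBlockNumber)
    {
        Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                             RBM_NORMAL, vacrel->bstrategy);

        /* HOT-prune the page and record its dead item pointers ... */
        ReleaseBuffer(buf);

        /*
         * Stop once the shared TidStore is full; because pbscanwork lives in
         * DSM, whoever resumes phase 1 later can pick this chunk back up.
         */
        if (dead_items_check_memory_limit(vacrel))  /* the patch's limit check */
            break;
    }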
On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > When testing the multi passes of table vacuuming, I found an issue. > With the current patch, both leader and parallel workers process stop > the phase 1 as soon as the shared TidStore size reaches to the limit, > and then the leader resumes the parallel heap scan after heap > vacuuming and index vacuuming. Therefore, as I described in the patch, > one tricky part of this patch is that it's possible that we launch > fewer workers than the previous time when resuming phase 1 after phase > 2 and 3. In this case, since the previous parallel workers might have > already allocated some blocks in their chunk, newly launched workers > need to take over their parallel scan state. That's why in the patch > we store workers' ParallelBlockTableScanWorkerData in shared memory. > However, I found my assumption is wrong; in order to take over the > previous scan state properly we need to take over not only > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > workers might have already queued some blocks for look-ahead in their > ReadStream. Looking at ReadStream codes, I find that it's not > realistic to store it into the shared memory. It seems like one way to solve this would be to add functionality to the read stream to unpin the buffers it has in the buffers queue without trying to continue calling the callback until the stream is exhausted. We have read_stream_reset(), but that is to restart streams that have already been exhausted. Exhausted streams are where the callback has returned InvalidBlockNumber. In the read_stream_reset() cases, the read stream user knows there are more blocks it would like to scan or that it would like to restart the scan from the beginning. Your case is you want to stop trying to exhaust the read stream and just unpin the remaining buffers. As long as the worker which paused phase I knows exactly the last block it processed and can communicate this to whatever worker resumes phase I later, it can initialize vacrel->current_block to the last block processed. > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches. One question is which scenarios will parallel vacuum phase I without AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, I suppose it would be trivial for phase I parallel vacuum to be faster without using read AIO. But it's worth thinking about the tradeoff. - Melanie
Hi, On 2025-03-20 01:35:42 -0700, Masahiko Sawada wrote: > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches. I think that's a bad idea - this means we'll never be able to use direct IO for parallel VACUUMs, despite a) The CPU overhead of buffered reads being a problem for VACUUM b) Data ending up in the kernel page cache is rather wasteful for VACUUM, as that's often data that won't otherwise be used again soon. I.e. these reads would particularly benefit from using direct IO. c) Even disregarding DIO, losing the ability to do larger reads, as provided by read streams, loses a fair bit of efficiency (just try doing a pg_prewarm of a large relation with io_combine_limit=1 vs a larger io_combine_limit). Greetings, Andres Freund
On Sat, Mar 22, 2025 at 7:16 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > When testing the multi passes of table vacuuming, I found an issue. > > With the current patch, both leader and parallel workers process stop > > the phase 1 as soon as the shared TidStore size reaches to the limit, > > and then the leader resumes the parallel heap scan after heap > > vacuuming and index vacuuming. Therefore, as I described in the patch, > > one tricky part of this patch is that it's possible that we launch > > fewer workers than the previous time when resuming phase 1 after phase > > 2 and 3. In this case, since the previous parallel workers might have > > already allocated some blocks in their chunk, newly launched workers > > need to take over their parallel scan state. That's why in the patch > > we store workers' ParallelBlockTableScanWorkerData in shared memory. > > However, I found my assumption is wrong; in order to take over the > > previous scan state properly we need to take over not only > > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > > workers might have already queued some blocks for look-ahead in their > > ReadStream. Looking at ReadStream codes, I find that it's not > > realistic to store it into the shared memory. > > It seems like one way to solve this would be to add functionality to > the read stream to unpin the buffers it has in the buffers queue > without trying to continue calling the callback until the stream is > exhausted. > > We have read_stream_reset(), but that is to restart streams that have > already been exhausted. Exhausted streams are where the callback has > returned InvalidBlockNumber. In the read_stream_reset() cases, the > read stream user knows there are more blocks it would like to scan or > that it would like to restart the scan from the beginning. > > Your case is you want to stop trying to exhaust the read stream and > just unpin the remaining buffers. As long as the worker which paused > phase I knows exactly the last block it processed and can communicate > this to whatever worker resumes phase I later, it can initialize > vacrel->current_block to the last block processed. If we use ParallelBlockTableScanDesc with streaming read like the patch did, we would also need to somehow rewind the number of blocks allocated to workers. The problem I had with such usage was that a parallel vacuum worker allocated a new chunk of blocks when doing look-ahead reading and therefore advanced ParallelBlockTableScanDescData.phs_nallocated. In this case, even if we unpin the remaining buffers in the queue by a new functionality and a parallel worker resumes the phase 1 from the last processed block, we would lose some blocks in already allocated chunks unless we rewind ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData data. However, since a worker might have already allocated multiple chunks it would not be easy to rewind these scan state data. Another idea is that parallel workers don't exit phase 1 until it consumes all pinned buffers in the queue, even if the memory usage of TidStore exceeds the limit. It would need to add new functionality to the read stream to disable the look-ahead reading. Since we could use much memory while processing these buffers, exceeding the memory limit, we can trigger this mode when the memory usage of TidStore reaches 70% of the limit or so. 
On the other hand, it means that we would not use the streaming read for the blocks in this mode, which is not efficient. > > > One plausible solution would be that we don't use ReadStream in > > parallel heap vacuum cases but directly use > > table_block_parallelscan_xxx() instead. It works but we end up having > > two different scan methods for parallel and non-parallel lazy heap > > scan. I've implemented this idea in the attached v12 patches. > > One question is which scenarios will parallel vacuum phase I without > AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, > I suppose it would be trivial for phase I parallel vacuum to be faster > without using read AIO. But it's worth thinking about the tradeoff. As Andres pointed out, there are major downsides. So we would need to invent a way to stop and resume the read stream in the middle during parallel scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
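A sketch of the kind of trigger being floated here, in terms of the existing TidStore API (the helper name and the 0.7 factor are placeholders, not code from the patch):

    /* Hypothetical helper: decide when to stop queueing further look-ahead reads. */
    static bool
    dead_items_nearly_full(LVRelState *vacrel)
    {
        return TidStoreMemoryUsage(vacrel->dead_items) >
            vacrel->dead_items_info->max_bytes * 0.7;
    }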
On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > If we use ParallelBlockTableScanDesc with streaming read like the > patch did, we would also need to somehow rewind the number of blocks > allocated to workers. The problem I had with such usage was that a > parallel vacuum worker allocated a new chunk of blocks when doing > look-ahead reading and therefore advanced > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > we unpin the remaining buffers in the queue by a new functionality and > a parallel worker resumes the phase 1 from the last processed block, > we would lose some blocks in already allocated chunks unless we rewind > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > data. However, since a worker might have already allocated multiple > chunks it would not be easy to rewind these scan state data. Ah I didn't realize rewinding the state would be difficult. It seems like the easiest way to make sure those blocks are done is to add them back to the counter somehow. And I don't suppose there is some way to save these not yet done block assignments somewhere and give them to the workers who restart phase I to process on the second pass? > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. It would need to add new functionality to > the read stream to disable the look-ahead reading. Since we could use > much memory while processing these buffers, exceeding the memory > limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. On the other hand, it means that we > would not use the streaming read for the blocks in this mode, which is > not efficient. That might work. And/or maybe you could start decreasing the size of block assignment chunks when the memory usage of TidStore reaches a certain level. I don't know how much that would help or how fiddly it would be. > So we would need to > invent a way to stop and resume the read stream in the middle during > parallel scan. As for needing to add new read stream functionality, we actually probably don't have to. If you use read_stream_end() -> read_stream_reset(), it resets the distance to 0, so then read_stream_next_buffer() should just end up unpinning the buffers and freeing the per buffer data. I think the easiest way to implement this is to think about it as ending a read stream and starting a new one next time you start phase I and not as pausing and resuming the read stream. And anyway, maybe it's better not to keep a bunch of pinned buffers and allocated memory hanging around while doing what could be very long index scans. - Melanie
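Concretely, the end-and-restart pattern suggested here could sit around the phase boundary roughly like this (a sketch that mirrors the existing non-parallel call; the exact arguments, in particular the per-buffer data size, would follow whatever the callback records):

    /* Leaving phase 1 early (memory limit reached): drop the stream entirely. */
    read_stream_end(stream);    /* unpins queued buffers, frees per-buffer data */

    /* ... phase 2 (index vacuum) and phase 3 (heap vacuum) run here ... */

    /* Resuming phase 1: build a fresh stream from the saved scan position. */
    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                        vacrel->bstrategy,
                                        vacrel->rel,
                                        MAIN_FORKNUM,
                                        heap_vac_scan_next_block,
                                        vacrel,
                                        0);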
Hi, On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. Yes, that seems a quite reasonable approach to me. > It would need to add new functionality to the read stream to disable the > look-ahead reading. Couldn't your next block callback simply return InvalidBlockNumber once close to the memory limit? > Since we could use much memory while processing these buffers, exceeding the > memory limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. It wouldn't be that much memory, would it? A few 10s-100s of buffers don't increase the size of a TidStore that much? Using 10 parallel vacuum with a m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > On the other hand, it means that we would not use the streaming read for the > blocks in this mode, which is not efficient. I don't follow - why wouldn't you be using streaming read? Greetings, Andres Freund
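In other words, the change can stay inside the block-number callback itself; a sketch of the shape (choose_next_block() is a placeholder for the existing skip-page logic):

    static BlockNumber
    heap_vac_scan_next_block(ReadStream *stream, void *callback_private_data,
                             void *per_buffer_data)
    {
        LVRelState *vacrel = callback_private_data;

        /*
         * Stop feeding the stream once dead_items is (nearly) full; buffers
         * already queued are still processed before phase 1 pauses.
         */
        if (dead_items_check_memory_limit(vacrel))  /* the patch's limit check */
            return InvalidBlockNumber;

        return choose_next_block(vacrel, per_buffer_data);  /* placeholder */
    }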
On Sun, Mar 23, 2025 at 10:01 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > If we use ParallelBlockTableScanDesc with streaming read like the > > patch did, we would also need to somehow rewind the number of blocks > > allocated to workers. The problem I had with such usage was that a > > parallel vacuum worker allocated a new chunk of blocks when doing > > look-ahead reading and therefore advanced > > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > > we unpin the remaining buffers in the queue by a new functionality and > > a parallel worker resumes the phase 1 from the last processed block, > > we would lose some blocks in already allocated chunks unless we rewind > > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > > data. However, since a worker might have already allocated multiple > > chunks it would not be easy to rewind these scan state data. > > Ah I didn't realize rewinding the state would be difficult. It seems > like the easiest way to make sure those blocks are done is to add them > back to the counter somehow. And I don't suppose there is some way to > save these not yet done block assignments somewhere and give them to > the workers who restart phase I to process on the second pass? It might be possible to store the not-yet-done-blocks in DSA to pass them to the next workers. But it would make the codes more complex. > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. It would need to add new functionality to > > the read stream to disable the look-ahead reading. Since we could use > > much memory while processing these buffers, exceeding the memory > > limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. On the other hand, it means that we > > would not use the streaming read for the blocks in this mode, which is > > not efficient. > > That might work. And/or maybe you could start decreasing the size of > block assignment chunks when the memory usage of TidStore reaches a > certain level. I don't know how much that would help or how fiddly it > would be. I've tried this idea in the attached version patch. I've started with a simple approach; once the TidStore reaches the limit, heap_vac_scan_next_block(), a callback for the read stream, begins to return InvalidBlockNumber. We continue phase 1 until the read stream is exhausted. > > > So we would need to > > invent a way to stop and resume the read stream in the middle during > > parallel scan. > > As for needing to add new read stream functionality, we actually > probably don't have to. If you use read_stream_end() -> > read_stream_reset(), it resets the distance to 0, so then > read_stream_next_buffer() should just end up unpinning the buffers and > freeing the per buffer data. I think the easiest way to implement this > is to think about it as ending a read stream and starting a new one > next time you start phase I and not as pausing and resuming the read > stream. And anyway, maybe it's better not to keep a bunch of pinned > buffers and allocated memory hanging around while doing what could be > very long index scans. You're right. I've studied the read stream code and figured out how to use it. In the attached patch, we end the read stream at the end of phase 1 and start a new read stream, as you suggested. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v13-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v13-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v13-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v13-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v13-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Sun, Mar 23, 2025 at 10:13 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. > > Yes, that seems a quite reasonable approach to me. > > > > It would need to add new functionality to the read stream to disable the > > look-ahead reading. > > Couldn't your next block callback simply return InvalidBlockNumber once close > to the memory limit? Good idea. > > > > Since we could use much memory while processing these buffers, exceeding the > > memory limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. > > It wouldn't be that much memory, would it? A few 10s-100s of buffers don't > increase the size of a TidStore that much? Using 10 parallel vacuum with a > m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > It ultimately depends on how big the next size class of DSA segment is, but typically yes. We configure the max DSA segment size based on the maintenance_work_mem. > > > On the other hand, it means that we would not use the streaming read for the > > blocks in this mode, which is not efficient. > > I don't follow - why wouldn't you be using streaming read? I was thinking of an idea like stopping the look-ahead reading at some point and processing pages without the read stream until the TidStore reaches the limit. But it seems simpler if we could stop retrieving the next block for the read stream once the TidStore reaches the limit and then continue phase 1 until the read stream is exhausted. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > You're right. I've studied the read stream code and figured out how to > use it. In the attached patch, we end the read stream at the end of > phase 1 and start a new read stream, as you suggested. I've started looking at this patch set some more. In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD codepath and run out of unskippable blocks in the current chunk and then go back to get another chunk (goto retry) but we are near the memory limit so we can't get another block (!dead_items_check_memory_limit()), could we get an infinite loop? Or even incorrectly encroach on another worker's block? Asking that because of this math end_block = next_block + vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 and we are in the goto retry loop, it seems like we could keep incrementing next_block even when we shouldn't be. I just want to make sure that the skip pages optimization works with the parallel block assignment and the low memory read stream wind-down. I also think you do not need to expose table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is only called in heap-specific code and the logic seems very heap-related. If table AMs want something to skip some blocks, they could easily implement it. On another topic, I think it would be good to have a comment above this code in parallel_lazy_scan_gather_scan_results(), stating why we are very sure it is correct. Assert(TransactionIdIsValid(data->NewRelfrozenXid)); Assert(MultiXactIdIsValid(data->NewRelminMxid)); if (TransactionIdPrecedes(data->NewRelfrozenXid, vacrel->scan_data->NewRelfrozenXid)) vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, vacrel->scan_data->NewRelminMxid)) vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) vacrel->scan_data->nonempty_pages = data->nonempty_pages; vacrel->scan_data->skippedallvis |= data->skippedallvis; Parallel relfrozenxid advancement sounds scary, and scary things are best with comments. Even though the way this works is intuitive, I think it is worth pointing out that this part is important to get right so future programmers know how important it is. One thing I was wondering about is if there are any implications of different workers having different values in their GlobalVisState. GlobalVisState can be updated during vacuum, so even if they start out with the same values, that could diverge. It is probably okay since it just controls what tuples are removable. Some workers may remove fewer tuples than they absolutely could, and this is probably okay. And if it is okay for each worker to have different GlobalVisState then maybe you shouldn't have a GlobalVisState in shared memory. If you look at GlobalVisTestFor() it just returns the memory address of that global variable in the backend. So, it seems like it might be better to just let each parallel worker use their own backend local GlobalVisState and not try to put it in shared memory and copy it from one worker to the other workers when initializing them. I'm not sure. At the very least, there should be a comment explaining why you've done it the way you have done it. - Melanie
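Something along these lines above the gather code might do (a suggested comment, not wording from the patch):

    /*
     * Merge each worker's freeze cutoffs into the leader's.  Every worker
     * tracked the oldest XID/MXID that must survive freezing of the blocks it
     * scanned, so taking the minimum across the leader and all workers yields
     * values that are safe to set relfrozenxid/relminmxid to.  Likewise, if
     * any worker skipped an all-visible (not all-frozen) page, the
     * relation-wide skippedallvis must reflect that, since it forbids
     * advancing relfrozenxid later.
     */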
On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > You're right. I've studied the read stream code and figured out how to > > use it. In the attached patch, we end the read stream at the end of > > phase 1 and start a new read stream, as you suggested. > > I've started looking at this patch set some more. Thank you for reviewing the patch! > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > codepath and run out of unskippable blocks in the current chunk and > then go back to get another chunk (goto retry) but we are near the > memory limit so we can't get another block > (!dead_items_check_memory_limit()), could we get an infinite loop? > > Or even incorrectly encroach on another worker's block? Asking that > because of this math > end_block = next_block + > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; You're right. We should make sure that reset next_block is reset to InvalidBlockNumber at the beginning of the retry loop. > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > and we are in the goto retry loop, it seems like we could keep > incrementing next_block even when we shouldn't be. Right. Will fix. > > I just want to make sure that the skip pages optimization works with > the parallel block assignment and the low memory read stream > wind-down. > > I also think you do not need to expose > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > only called in heap-specific code and the logic seems very > heap-related. If table AMs want something to skip some blocks, they > could easily implement it. Agreed. Will remove it. > > On another topic, I think it would be good to have a comment above > this code in parallel_lazy_scan_gather_scan_results(), stating why we > are very sure it is correct. > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > vacrel->scan_data->NewRelfrozenXid)) > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > vacrel->scan_data->NewRelminMxid)) > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > Parallel relfrozenxid advancement sounds scary, and scary things are > best with comments. Even though the way this works is intuitive, I > think it is worth pointing out that this part is important to get > right so future programmers know how important it is. > > One thing I was wondering about is if there are any implications of > different workers having different values in their GlobalVisState. > GlobalVisState can be updated during vacuum, so even if they start out > with the same values, that could diverge. It is probably okay since it > just controls what tuples are removable. Some workers may remove fewer > tuples than they absolutely could, and this is probably okay. > Good point. > And if it is okay for each worker to have different GlobalVisState > then maybe you shouldn't have a GlobalVisState in shared memory. If > you look at GlobalVisTestFor() it just returns the memory address of > that global variable in the backend. 
So, it seems like it might be > better to just let each parallel worker use their own backend local > GlobalVisState and not try to put it in shared memory and copy it from > one worker to the other workers when initializing them. I'm not sure. > At the very least, there should be a comment explaining why you've > done it the way you have done it. Agreed. IIUC it's not a problem even if parallel workers use their own GlobalVisState. I'll make that change and remove the 0004 patch which exposes GlobalVisState. I'll send the updated patch soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
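With that change, each worker simply does what the non-parallel path already does at startup (sketch):

    /*
     * Each parallel vacuum worker computes its own visibility state.
     * GlobalVisTestFor() just hands back a pointer to backend-local state, so
     * there is nothing meaningful to pass through shared memory; workers may
     * see slightly different horizons, which only means some could remove
     * fewer tuples than strictly possible.
     */
    vacrel->vistest = GlobalVisTestFor(rel);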
On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > You're right. I've studied the read stream code and figured out how to > > > use it. In the attached patch, we end the read stream at the end of > > > phase 1 and start a new read stream, as you suggested. > > > > I've started looking at this patch set some more. > > Thank you for reviewing the patch! > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > codepath and run out of unskippable blocks in the current chunk and > > then go back to get another chunk (goto retry) but we are near the > > memory limit so we can't get another block > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > Or even incorrectly encroach on another worker's block? Asking that > > because of this math > > end_block = next_block + > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > You're right. We should make sure that reset next_block is reset to > InvalidBlockNumber at the beginning of the retry loop. > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > and we are in the goto retry loop, it seems like we could keep > > incrementing next_block even when we shouldn't be. > > Right. Will fix. > > > > > I just want to make sure that the skip pages optimization works with > > the parallel block assignment and the low memory read stream > > wind-down. > > > > I also think you do not need to expose > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > only called in heap-specific code and the logic seems very > > heap-related. If table AMs want something to skip some blocks, they > > could easily implement it. > > Agreed. Will remove it. > > > > > On another topic, I think it would be good to have a comment above > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > are very sure it is correct. > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > vacrel->scan_data->NewRelfrozenXid)) > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > vacrel->scan_data->NewRelminMxid)) > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > best with comments. Even though the way this works is intuitive, I > > think it is worth pointing out that this part is important to get > > right so future programmers know how important it is. > > > > One thing I was wondering about is if there are any implications of > > different workers having different values in their GlobalVisState. > > GlobalVisState can be updated during vacuum, so even if they start out > > with the same values, that could diverge. It is probably okay since it > > just controls what tuples are removable. Some workers may remove fewer > > tuples than they absolutely could, and this is probably okay. > > > > Good point. 
> > > And if it is okay for each worker to have different GlobalVisState > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > you look at GlobalVisTestFor() it just returns the memory address of > > that global variable in the backend. So, it seems like it might be > > better to just let each parallel worker use their own backend local > > GlobalVisState and not try to put it in shared memory and copy it from > > one worker to the other workers when initializing them. I'm not sure. > > At the very least, there should be a comment explaining why you've > > done it the way you have done it. > > Agreed. IIUC it's not a problem even if parallel workers use their own > GlobalVisState. I'll make that change and remove the 0004 patch which > exposes GlobalVisState. > > I'll send the updated patch soon. I've attached the updated patches. This version includes the comments from Melanie, some bug fixes, and comment updates. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Thu, Mar 27, 2025 at 11:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > You're right. I've studied the read stream code and figured out how to > > > > use it. In the attached patch, we end the read stream at the end of > > > > phase 1 and start a new read stream, as you suggested. > > > > > > I've started looking at this patch set some more. > > > > Thank you for reviewing the patch! > > > > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > > codepath and run out of unskippable blocks in the current chunk and > > > then go back to get another chunk (goto retry) but we are near the > > > memory limit so we can't get another block > > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > > > Or even incorrectly encroach on another worker's block? Asking that > > > because of this math > > > end_block = next_block + > > > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > > > You're right. We should make sure that reset next_block is reset to > > InvalidBlockNumber at the beginning of the retry loop. > > > > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > > and we are in the goto retry loop, it seems like we could keep > > > incrementing next_block even when we shouldn't be. > > > > Right. Will fix. > > > > > > > > I just want to make sure that the skip pages optimization works with > > > the parallel block assignment and the low memory read stream > > > wind-down. > > > > > > I also think you do not need to expose > > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > > only called in heap-specific code and the logic seems very > > > heap-related. If table AMs want something to skip some blocks, they > > > could easily implement it. > > > > Agreed. Will remove it. > > > > > > > > On another topic, I think it would be good to have a comment above > > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > > are very sure it is correct. > > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > > vacrel->scan_data->NewRelfrozenXid)) > > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > > vacrel->scan_data->NewRelminMxid)) > > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > > best with comments. Even though the way this works is intuitive, I > > > think it is worth pointing out that this part is important to get > > > right so future programmers know how important it is. > > > > > > One thing I was wondering about is if there are any implications of > > > different workers having different values in their GlobalVisState. > > > GlobalVisState can be updated during vacuum, so even if they start out > > > with the same values, that could diverge. 
It is probably okay since it > > > just controls what tuples are removable. Some workers may remove fewer > > > tuples than they absolutely could, and this is probably okay. > > > > > > > Good point. > > > > > And if it is okay for each worker to have different GlobalVisState > > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > > you look at GlobalVisTestFor() it just returns the memory address of > > > that global variable in the backend. So, it seems like it might be > > > better to just let each parallel worker use their own backend local > > > GlobalVisState and not try to put it in shared memory and copy it from > > > one worker to the other workers when initializing them. I'm not sure. > > > At the very least, there should be a comment explaining why you've > > > done it the way you have done it. > > > > Agreed. IIUC it's not a problem even if parallel workers use their own > > GlobalVisState. I'll make that change and remove the 0004 patch which > > exposes GlobalVisState. > > > > I'll send the updated patch soon. > > I've attached the updated patches. This version includes the comments > from Melanie, some bug fixes, and comment updates. Rebased the patches as they conflicted with recent commits. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Mar 31, 2025 at 2:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 27, 2025 at 11:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > You're right. I've studied the read stream code and figured out how to > > > > > use it. In the attached patch, we end the read stream at the end of > > > > > phase 1 and start a new read stream, as you suggested. > > > > > > > > I've started looking at this patch set some more. > > > > > > Thank you for reviewing the patch! > > > > > > > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > > > codepath and run out of unskippable blocks in the current chunk and > > > > then go back to get another chunk (goto retry) but we are near the > > > > memory limit so we can't get another block > > > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > > > > > Or even incorrectly encroach on another worker's block? Asking that > > > > because of this math > > > > end_block = next_block + > > > > > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > > > > > You're right. We should make sure that reset next_block is reset to > > > InvalidBlockNumber at the beginning of the retry loop. > > > > > > > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > > > and we are in the goto retry loop, it seems like we could keep > > > > incrementing next_block even when we shouldn't be. > > > > > > Right. Will fix. > > > > > > > > > > > I just want to make sure that the skip pages optimization works with > > > > the parallel block assignment and the low memory read stream > > > > wind-down. > > > > > > > > I also think you do not need to expose > > > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > > > only called in heap-specific code and the logic seems very > > > > heap-related. If table AMs want something to skip some blocks, they > > > > could easily implement it. > > > > > > Agreed. Will remove it. > > > > > > > > > > > On another topic, I think it would be good to have a comment above > > > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > > > are very sure it is correct. > > > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > > > vacrel->scan_data->NewRelfrozenXid)) > > > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > > > vacrel->scan_data->NewRelminMxid)) > > > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > > > best with comments. Even though the way this works is intuitive, I > > > > think it is worth pointing out that this part is important to get > > > > right so future programmers know how important it is. 
> > > > > > > > One thing I was wondering about is if there are any implications of > > > > different workers having different values in their GlobalVisState. > > > > GlobalVisState can be updated during vacuum, so even if they start out > > > > with the same values, that could diverge. It is probably okay since it > > > > just controls what tuples are removable. Some workers may remove fewer > > > > tuples than they absolutely could, and this is probably okay. > > > > > > > > > > Good point. > > > > > > > And if it is okay for each worker to have different GlobalVisState > > > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > > > you look at GlobalVisTestFor() it just returns the memory address of > > > > that global variable in the backend. So, it seems like it might be > > > > better to just let each parallel worker use their own backend local > > > > GlobalVisState and not try to put it in shared memory and copy it from > > > > one worker to the other workers when initializing them. I'm not sure. > > > > At the very least, there should be a comment explaining why you've > > > > done it the way you have done it. > > > > > > Agreed. IIUC it's not a problem even if parallel workers use their own > > > GlobalVisState. I'll make that change and remove the 0004 patch which > > > exposes GlobalVisState. > > > > > > I'll send the updated patch soon. > > > > I've attached the updated patches. This version includes the comments > > from Melanie, some bug fixes, and comment updates. > > Rebased the patches as they conflicted with recent commits. > I've attached the new version patch. There are no major changes; I fixed some typos, improved the comment, and removed duplicated codes. Also, I've updated the commit messages. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the new version patch. There are no major changes; I > fixed some typos, improved the comment, and removed duplicated codes. > Also, I've updated the commit messages. I haven't looked closely at this version but I did notice that you do not document that parallel vacuum disables eager scanning. Imagine you are a user who has set the eager freeze related table storage option (vacuum_max_eager_freeze_failure_rate) and you schedule a regular parallel vacuum. Now that table storage option does nothing. - Melanie
On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the new version patch. There are no major changes; I > > fixed some typos, improved the comment, and removed duplicated codes. > > Also, I've updated the commit messages. > > I haven't looked closely at this version but I did notice that you do > not document that parallel vacuum disables eager scanning. Imagine you > are a user who has set the eager freeze related table storage option > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > parallel vacuum. Now that table storage option does nothing. Good point. That restriction should be mentioned in the documentation. I'll update the patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 4, 2025 at 5:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I haven't looked closely at this version but I did notice that you do > > not document that parallel vacuum disables eager scanning. Imagine you > > are a user who has set the eager freeze related table storage option > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > parallel vacuum. Now that table storage option does nothing. > > Good point. That restriction should be mentioned in the documentation. > I'll update the patch. Yea, I mean, to be honest, when I initially replied to the thread saying I thought temporarily disabling eager scanning for parallel heap vacuuming was viable, I hadn't looked at the patch yet and thought that there was a separate way to enable the new parallel heap vacuum (separate from the parallel option for the existing parallel index vacuuming). I don't like that this disables functionality that worked when I pushed the eager scanning feature. - Melanie
Hi, On 2025-04-04 14:34:53 -0700, Masahiko Sawada wrote: > On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > I've attached the new version patch. There are no major changes; I > > > fixed some typos, improved the comment, and removed duplicated codes. > > > Also, I've updated the commit messages. > > > > I haven't looked closely at this version but I did notice that you do > > not document that parallel vacuum disables eager scanning. Imagine you > > are a user who has set the eager freeze related table storage option > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > parallel vacuum. Now that table storage option does nothing. > > Good point. That restriction should be mentioned in the documentation. > I'll update the patch. I don't think we commonly accept that a new feature B regresses a pre-existing feature A, particularly not if feature B is enabled by default. Why would that be OK here? The justification in the code: + * One might think that it would make sense to use the eager scanning even + * during parallel lazy vacuum, but parallel vacuum is available only in + * VACUUM command and would not be something that happens frequently, + * which seems not fit to the purpose of the eager scanning. Also, it + * would require making the code complex. So it would make sense to + * disable it for now. feels not at all convincing to me. There e.g. are lots of places that run nightly vacuums. I don't think it's ok to just disable eager scanning in such a case, as it would mean that the "freeze cliff" would end up being *higher* because of the nightly vacuums than if just plain autovacuum would have been used. I think it was already a mistake to allow the existing vacuum parallelism to be introduced without integrating it with autovacuum. I don't think we should go further down that road. Greetings, Andres Freund
On Sat, Apr 5, 2025 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-04-04 14:34:53 -0700, Masahiko Sawada wrote: > > On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > I've attached the new version patch. There are no major changes; I > > > > fixed some typos, improved the comment, and removed duplicated codes. > > > > Also, I've updated the commit messages. > > > > > > I haven't looked closely at this version but I did notice that you do > > > not document that parallel vacuum disables eager scanning. Imagine you > > > are a user who has set the eager freeze related table storage option > > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > > parallel vacuum. Now that table storage option does nothing. > > > > Good point. That restriction should be mentioned in the documentation. > > I'll update the patch. > > I don't think we commonly accept that a new feature B regresses a pre-existing > feature A, particularly not if feature B is enabled by default. Why would that > be OK here? The eager freeze scan is the pre-existing feature but it's pretty new code that was pushed just a couple months ago. I didn't want to make the newly introduced code complex further in one major release especially if it's in a vacuum area. While I agree that disabling eager freeze scans during parallel heap vacuum is not very user-friendly, there are still many cases where parallel heap vacuum helps even without the eager freeze scan. FYI the parallel heap scan can be disabled by setting min_parallel_table_scan_size. So I thought we can incrementally improve this part. > > > The justification in the code: > + * One might think that it would make sense to use the eager scanning even > + * during parallel lazy vacuum, but parallel vacuum is available only in > + * VACUUM command and would not be something that happens frequently, > + * which seems not fit to the purpose of the eager scanning. Also, it > + * would require making the code complex. So it would make sense to > + * disable it for now. > > feels not at all convincing to me. There e.g. are lots of places that run > nightly vacuums. I don't think it's ok to just disable eager scanning in such > a case, as it would mean that the "freeze cliff" would end up being *higher* > because of the nightly vacuums than if just plain autovacuum would have been > used. That's a fair argument. > I think it was already a mistake to allow the existing vacuum parallelism to > be introduced without integrating it with autovacuum. I don't think we should > go further down that road. Okay, I think we can consider how to proceed with this patch including the above point in the v19 development. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Apr 6, 2025 at 1:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The eager freeze scan is the pre-existing feature but it's pretty new > code that was pushed just a couple months ago. I didn't want to make > the newly introduced code complex further in one major release > especially if it's in a vacuum area. While I agree that disabling > eager freeze scans during parallel heap vacuum is not very > user-friendly, there are still many cases where parallel heap vacuum > helps even without the eager freeze scan. FYI the parallel heap scan > can be disabled by setting min_parallel_table_scan_size. So I thought > we can incrementally improve this part. min_parallel_table_scan_size also affects parallel sequential scans, though. And AFAIK, that can only be configured globally (i.e. not for a specific table). I wonder if there is a clear way to make the use of parallelism configurable for specific phases of vacuum. - Melanie
Hi Sawada-san. I was revisiting this thread after a long time. I found most of my previous review comments from v11-0001 were not yet addressed. I can't tell if they are deliberately left out, or if they are accidentally overlooked. Please see the details below. On Mon, Mar 10, 2025 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote: > > ====== > src/backend/access/table/tableamapi.c > > 2. > Assert(routine->relation_vacuum != NULL); > + Assert(routine->parallel_vacuum_compute_workers != NULL); > Assert(routine->scan_analyze_next_block != NULL); > > Is it better to keep these Asserts in the same order that the > TableAmRoutine fields are assigned (in heapam_handler.c)? > Not yet addressed? > ~~~ > > 3. > + /* > + * Callbacks for parallel vacuum are also optional (except for > + * parallel_vacuum_compute_workers). But one callback implies presence of > + * the others. > + */ > + Assert(((((routine->parallel_vacuum_estimate == NULL) == > + (routine->parallel_vacuum_initialize == NULL)) == > + (routine->parallel_vacuum_initialize_worker == NULL)) == > + (routine->parallel_vacuum_collect_dead_items == NULL))); > > /also optional/optional/ > Not yet addressed? > ====== > src/include/access/heapam.h > > +extern int heap_parallel_vacuum_compute_workers(Relation rel, int > nworkers_requested); > > 4. > wrong tab/space after 'int'. Not yet addressed? > > ====== > src/include/access/tableam.h > > 5. > + /* > + * Compute the number of parallel workers for parallel table vacuum. The > + * parallel degree for parallel vacuum is further limited by > + * max_parallel_maintenance_workers. The function must return 0 to disable > + * parallel table vacuum. > + * > + * 'nworkers_requested' is a >=0 number and the requested number of > + * workers. This comes from the PARALLEL option. 0 means to choose the > + * parallel degree based on the table AM specific factors such as table > + * size. > + */ > + int (*parallel_vacuum_compute_workers) (Relation rel, > + int nworkers_requested); > > The comment here says "This comes from the PARALLEL option." and "0 > means to choose the parallel degree...". But, the PG docs [1] says "To > disable this feature, one can use PARALLEL option and specify parallel > workers as zero.". > > These two descriptions "disable this feature" (PG docs) and letting > the system "choose the parallel degree" (code comment) don't sound the > same. Should this 0001 patch update the PG documentation for the > effect of setting PARALLEL value zero? Not yet addressed? > > ~~~ > > 6. > + /* > + * Initialize DSM space for parallel table vacuum. > + * > + * Not called if parallel table vacuum is disabled. > + * > + * Optional callback, but either all other parallel vacuum callbacks need > + * to exist, or neither. > + */ > > "or neither"? > > Also, saying "all other" seems incorrect because > parallel_vacuum_compute_workers callback must exist event if > parallel_vacuum_initialize does not exist. > > IMO you meant to say "all optional", and "or none". > > SUGGESTION: > Optional callback. Either all optional parallel vacuum callbacks need > to exist, or none. > > (this same issue is repeated in multiple places). Not yet addressed? ====== > [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
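For reference, the "one callback implies presence of the others" condition that the Assert under comment #3 documents can also be spelled as pairwise checks, which state the "all or none" rule directly. This is only an illustrative sketch; it assumes the parallel_vacuum_* fields added by the patch under review, which do not exist in a stock TableAmRoutine:

```
#include "postgres.h"

#include "access/tableam.h"

/*
 * Sketch only: either all of the optional parallel vacuum callbacks are
 * provided, or none of them are.  The parallel_vacuum_* fields come from
 * the patch, not from the released table AM API.
 */
static inline void
check_parallel_vacuum_callbacks(const TableAmRoutine *routine)
{
	bool		has_estimate = (routine->parallel_vacuum_estimate != NULL);

	Assert(has_estimate == (routine->parallel_vacuum_initialize != NULL));
	Assert(has_estimate == (routine->parallel_vacuum_initialize_worker != NULL));
	Assert(has_estimate == (routine->parallel_vacuum_collect_dead_items != NULL));
}
```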
Hi Sawada-san, Here are some review comments for the patch v16-0002 ====== Commit message 1. Missing period. /cleanup, ParallelVacuumState/cleanup. ParallelVacuumState/ ~~~ 2. Typo /paralel/parallel/ ~~~ 3. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. IIUC, this "disabling" was implemented in patch 0001, so maybe this comment also belongs in patch 0001. ====== src/backend/commands/vacuumparallel.c 4. * This file contains routines that are intended to support setting up, using, - * and tearing down a ParallelVacuumState. + * and tearing down a ParallelVacuumState. ParallelVacuumState contains shared + * information as well as the memory space for storing dead items allocated in + * the DSA area. We launch * What's the "We launch" incomplete sentence meant to say? ~~~ 5. +typedef enum +{ + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkPhase; + AFAIK, these different "phases" are the ones already described in the vacuumlazy.c header comment -- there they are named as I, II, III. IMO it might be better to try to keep these phase enums together with the phase descriptions, as well as giving the values all names similar to those existing descriptions. Maybe something like: PHASE_I_VACUUM_RELATION_COLLECT_DEAD PHASE_II_VACUUM_INDEX_COLLECT_DEAD PHASE_III_VACUUM_REAP_DEAD ~~~ 6. -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum, void *state) I had asked this once before ([1] comment #5) IIUC the returns for this function seem inconsistent. AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. It would seem tidier to make this a void function, and introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'. The caller can do the parallel_workers = Max(nworkers_table_p, nworkers_index_p); ~~~ parallel_vacuum_process_all_indexes 7. * to finish, or we might get incomplete data.) */ if (nworkers > 0) - { - /* Wait for all vacuum workers to finish */ - WaitForParallelWorkersToFinish(pvs->pcxt); - - for (int i = 0; i < pvs->pcxt->nworkers_launched; i++) - InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]); - } + parallel_vacuum_end_worke_phase(pvs); 7a. The code comment that precedes this was describing code that is now all removed (moved into the parallel_vacuum_end_worke_phase), so probably that comment needs to be changed. ~ 7b. Typo: parallel_vacuum_end_worke_phase (worke?) ~~~ parallel_vacuum_collect_dead_items_end: 8. + /* Wait for parallel workers to finish */ + parallel_vacuum_end_worke_phase(pvs); typo (worke) ~~~ 9. + /* Decrement the worker count for the leader itself */ + if (VacuumActiveNWorkers) + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); Is this OK? Or did the prior parallel_vacuum_end_worke_phase call already set VacuumActiveNWorkers to NULL. ~~~ parallel_vacuum_process_table: 10. + /* + * We have completed the table vacuum so decrement the active worker + * count. 
+ */ + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); Is this OK? (Similar question to the previous review comment #9) Can the table_parallel_vacuum_collect_dead_items end up calling parallel_vacuum_end_worke_phase, so by time we get here the VacuumActiveNWorkers might be NULL? ~~~ parallel_vacuum_begin_work_phase: 11. +/* + * Launch parallel vacuum workers for the given phase. If at least one + * worker launched, enable the shared vacuum delay costing. + */ Maybe this comment should also say a call to this needs to be balanced by a call to the other function: parallel_vacuum_end_work_phase. ~~~ 12. +static void +parallel_vacuum_end_worke_phase(ParallelVacuumState *pvs) Typo (worke) ====== [1] https://www.postgresql.org/message-id/CAHut%2BPs9yTGk2TWdjgLQEzGfPTWafKT0H-Q6WvYh5UKNW0pCvQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Dear Sawada-san, Thanks for updating the patch. I have been reviewing it, and below are my comments so far. 01. Sorry if I forgot something, but is there a reason that parallel_vacuum_compute_workers() is mandatory? My fresh eye felt that the caller could check for NULL and treat it as zero workers if the API is not provided. 02. table_parallel_vacuum_estimate We can assert that state is not NULL. 03. table_parallel_vacuum_initialize Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate() has already been called before, and the current assertion might not be enough because others can set a random value here. Other functions like table_rescan_tidrange() check an internal flag of the data structure, which is good for me. How do you feel about checking that pcxt->estimator has a non-zero value? 04. table_parallel_vacuum_initialize_worker Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been done? 05. table_parallel_vacuum_collect_dead_items Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker() has been done? 06. table_parallel_vacuum_initialize_worker The comment atop the function needs to be updated. 07. While playing with the patch, I found that more parallel vacuum workers can be launched than there are pages: ``` postgres=# CREATE TABLE var (id int); CREATE TABLE postgres=# INSERT INTO var VALUES (generate_series(1, 10000)); INSERT 0 10000 postgres=# VACUUM (parallel 80, verbose) var ; INFO: vacuuming "postgres.public.var" INFO: launched 80 parallel vacuum workers for collecting dead tuples (planned: 80) INFO: finished vacuuming "postgres.public.var": index scans: 0 pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned ... ``` I suppose the minimum chunk size is one page, so this situation could hurt performance. How do you feel about capping the value with rel->rd_rel->relpages in heap_parallel_vacuum_compute_workers()? This value is not always up-to-date, but it seems like a good candidate. Best regards, Hayato Kuroda FUJITSU LIMITED
On Sun, Apr 6, 2025 at 10:27 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-san. > > I was revisiting this thread after a long time. I found most of my > previous review comments from v11-0001 were not yet addressed. I can't > tell if they are deliberately left out, or if they are accidentally > overlooked. Please see the details below. Sorry I missed to address or reply these comments. > > On Mon, Mar 10, 2025 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > ====== > > src/backend/access/table/tableamapi.c > > > > 2. > > Assert(routine->relation_vacuum != NULL); > > + Assert(routine->parallel_vacuum_compute_workers != NULL); > > Assert(routine->scan_analyze_next_block != NULL); > > > > Is it better to keep these Asserts in the same order that the > > TableAmRoutine fields are assigned (in heapam_handler.c)? > > > > Not yet addressed? Will fix. > > > ~~~ > > > > 3. > > + /* > > + * Callbacks for parallel vacuum are also optional (except for > > + * parallel_vacuum_compute_workers). But one callback implies presence of > > + * the others. > > + */ > > + Assert(((((routine->parallel_vacuum_estimate == NULL) == > > + (routine->parallel_vacuum_initialize == NULL)) == > > + (routine->parallel_vacuum_initialize_worker == NULL)) == > > + (routine->parallel_vacuum_collect_dead_items == NULL))); > > > > /also optional/optional/ > > > > Not yet addressed? Will fix. > > > ====== > > src/include/access/heapam.h > > > > +extern int heap_parallel_vacuum_compute_workers(Relation rel, int > > nworkers_requested); > > > > 4. > > wrong tab/space after 'int'. > > Not yet addressed? Will fix. > > > > > ====== > > src/include/access/tableam.h > > > > 5. > > + /* > > + * Compute the number of parallel workers for parallel table vacuum. The > > + * parallel degree for parallel vacuum is further limited by > > + * max_parallel_maintenance_workers. The function must return 0 to disable > > + * parallel table vacuum. > > + * > > + * 'nworkers_requested' is a >=0 number and the requested number of > > + * workers. This comes from the PARALLEL option. 0 means to choose the > > + * parallel degree based on the table AM specific factors such as table > > + * size. > > + */ > > + int (*parallel_vacuum_compute_workers) (Relation rel, > > + int nworkers_requested); > > > > The comment here says "This comes from the PARALLEL option." and "0 > > means to choose the parallel degree...". But, the PG docs [1] says "To > > disable this feature, one can use PARALLEL option and specify parallel > > workers as zero.". > > > > These two descriptions "disable this feature" (PG docs) and letting > > the system "choose the parallel degree" (code comment) don't sound the > > same. Should this 0001 patch update the PG documentation for the > > effect of setting PARALLEL value zero? > > Not yet addressed? It often happens that we treat a value differently when the user inputs the value and when we use it internally. I think the comment should follow how to use VacuumParams.nworkers internally, which is described in the comment as follow: /* * The number of parallel vacuum workers. 0 by default which means choose * based on the number of indexes. -1 indicates parallel vacuum is * disabled. */ int nworkers; So it seems no problem to me. > > > > > ~~~ > > > > 6. > > + /* > > + * Initialize DSM space for parallel table vacuum. > > + * > > + * Not called if parallel table vacuum is disabled. > > + * > > + * Optional callback, but either all other parallel vacuum callbacks need > > + * to exist, or neither. 
> > + */ > > > > "or neither"? > > > > Also, saying "all other" seems incorrect because > > parallel_vacuum_compute_workers callback must exist event if > > parallel_vacuum_initialize does not exist. > > > > IMO you meant to say "all optional", and "or none". > > > > SUGGESTION: > > Optional callback. Either all optional parallel vacuum callbacks need > > to exist, or none. > > > > (this same issue is repeated in multiple places). > > Not yet addressed? Will fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Apr 7, 2025 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-san, > > Here are some review comments for the patch v16-0002 > > ====== > Commit message > > 1. > Missing period. > /cleanup, ParallelVacuumState/cleanup. ParallelVacuumState/ > ~~~ > > 2. > Typo /paralel/parallel/ > > ~~~ > > 3. > Heap table AM disables the parallel heap vacuuming for now, but an > upcoming patch uses it. > > IIUC, this "disabling" was implemented in patch 0001, so maybe this > comment also belongs in patch 0001. > > ====== > src/backend/commands/vacuumparallel.c > > 4. > * This file contains routines that are intended to support setting up, using, > - * and tearing down a ParallelVacuumState. > + * and tearing down a ParallelVacuumState. ParallelVacuumState contains shared > + * information as well as the memory space for storing dead items allocated in > + * the DSA area. We launch > * > > What's the "We launch" incomplete sentence meant to say? Will fix the above points.. > 5. > +typedef enum > +{ > + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ > + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ > +} PVWorkPhase; > + > > AFAIK, these different "phases" are the ones already described in the > vacuumlazy.c header comment -- there they are named as I, II, III. > > IMO it might be better to try to keep these phase enums together with > the phase descriptions, as well as giving the values all names similar > to those existing descriptions. > > Maybe something like: > PHASE_I_VACUUM_RELATION_COLLECT_DEAD > PHASE_II_VACUUM_INDEX_COLLECT_DEAD > PHASE_III_VACUUM_REAP_DEAD I'm not sure it's better. vacuumparallel.c is independent of heap AM so it's not necessarily true for every table AM that phase I, II, and III correspond to lazy vacuum's phases. For example, some table AM might want to have parallelism for its own phase that cannot correspond to none of lazy vacuum's phases. > > ~~~ > > 6. > -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int > nrequested, > - bool *will_parallel_vacuum) > +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, > + int nrequested, int *nworkers_table_p, > + bool *idx_will_parallel_vacuum, void *state) > > I had asked this once before ([1] comment #5) > > IIUC the returns for this function seem inconsistent. AFAIK, it > previously would return the number of workers for parallel index > vacuuming. But now (after this patch) the return value is returned > Max(nworkers_table, nworkers_index). Meanwhile, the number of workers > for parallel table vacuuming is returned as a by-reference parameter > 'nworkers_table_p'. In other words, it is returning the number of > workers in 2 different ways. > > It would seem tidier to make this a void function, and introduce > another parameter 'nworkers_index_p', similar to 'nworkers_table_p'. > The caller can do the parallel_workers = Max(nworkers_table_p, > nworkers_index_p); Hmm, parallel_vacuum_compute_workers() is a non-exposed function and its sole caller parallel_vacuum_init() would not use nworkers_index_p. I think we can do that change you suggested when the caller needs to use the number of parallel workers for index vacuuming. > > ~~~ > > parallel_vacuum_process_all_indexes > > 7. > * to finish, or we might get incomplete data.) 
> */ > if (nworkers > 0) > - { > - /* Wait for all vacuum workers to finish */ > - WaitForParallelWorkersToFinish(pvs->pcxt); > - > - for (int i = 0; i < pvs->pcxt->nworkers_launched; i++) > - InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]); > - } > + parallel_vacuum_end_worke_phase(pvs); > > 7a. > The code comment that precedes this was describing code that is now > all removed (moved into the parallel_vacuum_end_worke_phase), so > probably that comment needs to be changed. > > ~ > > 7b. > Typo: parallel_vacuum_end_worke_phase (worke?) > > ~~~ > > parallel_vacuum_collect_dead_items_end: > > 8. > + /* Wait for parallel workers to finish */ > + parallel_vacuum_end_worke_phase(pvs); > > typo (worke) > > ~~~ Will fix the above comments. > > 9. > + /* Decrement the worker count for the leader itself */ > + if (VacuumActiveNWorkers) > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > > Is this OK? Or did the prior parallel_vacuum_end_worke_phase call > already set VacuumActiveNWorkers to NULL. True. Will remove this part. > > ~~~ > > parallel_vacuum_process_table: > > 10. > + /* > + * We have completed the table vacuum so decrement the active worker > + * count. > + */ > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > > Is this OK? (Similar question to the previous review comment #9) Can > the table_parallel_vacuum_collect_dead_items end up calling > parallel_vacuum_end_worke_phase, so by time we get here the > VacuumActiveNWorkers might be NULL? True. Will remove this part. > > ~~~ > > parallel_vacuum_begin_work_phase: > > 11. > +/* > + * Launch parallel vacuum workers for the given phase. If at least one > + * worker launched, enable the shared vacuum delay costing. > + */ > > Maybe this comment should also say a call to this needs to be balanced > by a call to the other function: parallel_vacuum_end_work_phase. Will fix. > > ~~~ > > 12. > +static void > +parallel_vacuum_end_worke_phase(ParallelVacuumState *pvs) > > Typo (worke) > Will fix. Thank you for the comments! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 18, 2025 at 2:49 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > Thanks for updating the patch. I have been reviewing it, and below are my comments so far. Thank you for reviewing the patch! > > 01. > Sorry if I forgot something, but is there a reason that > parallel_vacuum_compute_workers() is mandatory? My fresh eye felt that the caller > could check for NULL and treat it as zero workers if the API is not provided. I thought that making this function mandatory would be a good way for table AM authors to indicate their intentions in terms of parallel vacuum. On the other hand, I see your point; it would not bother table AMs that are not going to support parallel vacuum. I'd like to hear further opinions on this. > > 02. table_parallel_vacuum_estimate > We can assert that state is not NULL. I think the state can be NULL depending on the caller. It should not be mandatory. > > 03. table_parallel_vacuum_initialize > > Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate() > has already been called before, and the current assertion might not be enough because > others can set a random value here. Other functions like table_rescan_tidrange() > check an internal flag of the data structure, which is good for me. How do you > feel about checking that pcxt->estimator has a non-zero value? > > 04. table_parallel_vacuum_initialize_worker > Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been > done? > > 05. table_parallel_vacuum_collect_dead_items > Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker() > has been done? IIUC table_parallel_vacuum_initialize() is called only by vacuumparallel.c, which is not controllable from outside of it. How can others set a random value to nworkers? > > 06. table_parallel_vacuum_initialize_worker > The comment atop the function needs to be updated. Did you refer to the following comment? /* * Initialize AM-specific vacuum state for worker processes. */ static inline void table_parallel_vacuum_initialize_worker(Relation rel, struct ParallelVacuumState *pvs, struct ParallelWorkerContext *pwcxt, void **state_out) Which part of the comment do you think we need to update? > > 07. > While playing with the patch, I found that more parallel vacuum workers can be launched than > there are pages: > > ``` > postgres=# CREATE TABLE var (id int); > CREATE TABLE > postgres=# INSERT INTO var VALUES (generate_series(1, 10000)); > INSERT 0 10000 > postgres=# VACUUM (parallel 80, verbose) var ; > INFO: vacuuming "postgres.public.var" > INFO: launched 80 parallel vacuum workers for collecting dead tuples (planned: 80) > INFO: finished vacuuming "postgres.public.var": index scans: 0 > pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned > ... > ``` > > I suppose the minimum chunk size is one page, so this situation could hurt performance. > How do you feel about capping the value with rel->rd_rel->relpages in heap_parallel_vacuum_compute_workers()? > This value is not always up-to-date, but it seems like a good candidate. Thank you for testing the patch. Yes, we should avoid that. I think it would be better to cap the number of workers based on the number of chunks of blocks. I'm going to propose a new parallel scan method for parallel lazy scan, so it would make sense to choose a saner value based on it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
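To make the cap being discussed here concrete, below is a minimal sketch (not the patch's actual code) of clamping the planned worker count by the number of chunks the scan can hand out; since the page count bounds the chunk count, this also covers the relpages-based cap suggested above. The function name and chunk size follow the patch, the size-based default is elided, and the further clamp by max_parallel_maintenance_workers is assumed to happen in vacuumparallel.c:

```
#include "postgres.h"

#include "utils/rel.h"

#define PARALLEL_LV_CHUNK_SIZE	4096	/* chunk size used by the patch */

/* Sketch only: never plan more workers than there are chunks to scan. */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested)
{
	int32		relpages = rel->rd_rel->relpages;	/* may be slightly stale */
	int			nchunks;
	int			nworkers;

	/* Requested degree (PARALLEL n), or a size-based default (elided here). */
	nworkers = (nworkers_requested > 0) ? nworkers_requested : 2;

	/* One worker per chunk is the most that can ever be useful. */
	nchunks = (relpages > 0) ? (relpages - 1) / PARALLEL_LV_CHUNK_SIZE + 1 : 1;
	if (nworkers > nchunks)
		nworkers = nchunks;

	/* The caller further clamps this by max_parallel_maintenance_workers. */
	return nworkers;
}
```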
On Sat, Apr 5, 2025 at 1:17 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Apr 4, 2025 at 5:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I haven't looked closely at this version but I did notice that you do > > > not document that parallel vacuum disables eager scanning. Imagine you > > > are a user who has set the eager freeze related table storage option > > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > > parallel vacuum. Now that table storage option does nothing. > > > > Good point. That restriction should be mentioned in the documentation. > > I'll update the patch. > > Yea, I mean, to be honest, when I initially replied to the thread > saying I thought temporarily disabling eager scanning for parallel > heap vacuuming was viable, I hadn't looked at the patch yet and > thought that there was a separate way to enable the new parallel heap > vacuum (separate from the parallel option for the existing parallel > index vacuuming). I don't like that this disables functionality that > worked when I pushed the eager scanning feature. Thank you for sharing your opinion. I think this is one of the main points that we need to deal with to complete this patch. After considering various approaches to integrating parallel heap vacuum and eager freeze scanning, one viable solution would be to implement a dedicated parallel scan mechanism for parallel lazy scan, rather than relying on the table_block_parallelscan_xxx() facility. This approach would involve dividing the table into chunks of 4,096 blocks, same as eager freeze scanning, where each parallel worker would perform eager freeze scanning while maintaining its own local failure count and a shared success count. This straightforward approach offers an additional advantage: since the chunk size remains constant, we can implement the SKIP_PAGES_THRESHOLD optimization consistently throughout the table, including its final sections. However, this approach does present certain challenges. First, we would need to maintain a separate implementation of lazy vacuum's parallel scan alongside the table_block_parallelscan_XXX() facility, potentially increasing maintenance overhead. Additionally, the fixed chunk size across the entire table might prove less efficient when processing blocks near the table's end compared to the dynamic approach used by table_block_parallelscan_nextpage(). Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
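As a rough illustration of the bookkeeping sketched above (all names here are hypothetical, not taken from the patch): each worker keeps a private failure counter that it resets at every 4,096-block region boundary, while the table-wide success count lives in shared memory and is bumped atomically, so any worker can notice when the overall success budget has been used up.

```
#include "postgres.h"

#include "port/atomics.h"

#define EAGER_SCAN_REGION_SIZE	4096	/* same region size as eager freeze scanning */

/* Hypothetical shared state: one table-wide success budget. */
typedef struct ParallelEagerScanShared
{
	uint32		success_cap;	/* read-only after initialization */
	pg_atomic_uint32 successes; /* all-visible pages successfully frozen */
} ParallelEagerScanShared;

/* Hypothetical per-worker state: failures are tracked locally, per region. */
typedef struct EagerScanWorkerState
{
	uint32		failures_in_region; /* reset whenever a new region starts */
} EagerScanWorkerState;

/* Record the outcome of one eagerly scanned all-visible page. */
static void
eager_scan_record_outcome(ParallelEagerScanShared *shared,
						  EagerScanWorkerState *ws, bool frozen)
{
	if (frozen)
		pg_atomic_add_fetch_u32(&shared->successes, 1);
	else
		ws->failures_in_region++;
}

/* Should this worker keep eagerly scanning all-visible pages? */
static bool
eager_scan_keep_going(ParallelEagerScanShared *shared,
					  EagerScanWorkerState *ws, uint32 region_failure_cap)
{
	return pg_atomic_read_u32(&shared->successes) < shared->success_cap &&
		ws->failures_in_region < region_failure_cap;
}
```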
On Mon, Jun 30, 2025 at 9:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've rebased the patches to the current HEAD. Hi Sawada-san, I've started reviewing this. I initially meant to review the eager scan part, but I think it would be easier to do that if it were in a separate patch on top of the rest of the parallel heap vacuuming patches -- just during review. I had a hard time telling which parts of the design (and code) were necessitated by the eager scanning feature. I started reviewing the rest of the code and this is some initial feedback on it: LV --- I don't think we should use "LV" and "lazy vacuum" anymore. Let's call it what it is: "heap vacuuming". We don't need to rename vacuumlazy.c right now, but if you are making new structs or substantially altering old ones, I wouldn't use "LV". table AM callbacks ------------------------- I think there is a fundamental tension here related to whether or not vacuumparallel.c should be table-AM agnostic. All of its code is invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c is index AM-agnostic but heap-specific. I think this is what makes your table AM callbacks feel odd. Other table AM callbacks are invoked from general code -- for example from executor code that is agnostic to the underlying table AMs. But, all of the functions you added and the associated callbacks are invoked below heap_vacuum_rel() (as is all of vacuumparallel.c). duplicated information across structs ----------------------------------------------- I think a useful preliminary patch to this set would be to refactor LVRelState into multiple structs that each represent the static input, working state, and output of vacuum. You start to do this in one of your patches with LVScanData but it doesn't go far enough. In fact, I think there are members of all the parallel data structures you introduced and of those that are already present that overlap with members of the LVRelState and we could start putting these smaller structs. For my examples, I am referring both to existing code and code you added. For example, relnamespace, relname, and indname in LVRelState and ParallelVacuumState and ParallelVacuumState->heaprel and LVRelState->rel. Seems like you could have a smaller struct that is accessible to the vacuum error callback and also to the users in LVRelState. (Side note: In your patch, I find it confusing that these members of LVRelState are initialized in heap_parallel_vacuum_collect_dead_items() instead of heap_parallel_vacuum_initialize_worker().) And even for cases where the information has to be passed from the leader to the worker, there is no reason you can't have the same struct but in shared memory for the parallel case and local memory for the serial case. For example with the struct members "aggressive", "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or ParallelLVScanDesc->nblocks and LVRelState->rel_pages. Ideally the entire part of the LVRelState that is an unchanging input/reference data is in a struct which is in local memory for the serial and local parallel case and a single read-only location in the shared parallel case. Or, for the shared case, if there is a reason not to read from the shared memory, we copy them over to a local instance of the same struct. Maybe it is called HeapVacInput or similar. There are a bunch of other instances like this. For example, the leader initializes LVScanData->NewRelfrozenXid from vacrel->cutoffs.OldestXmin, then uses this to set ParallelLVShared->NewRelfrozenXid. 
Then the worker copies it from ParallelLVShared->NewRelfrozenXid back to LVScanData->NewRelfrozenXid. But the worker has access to ParallelLVShared->cutoffs. So you could either get it from there or allow the workers to access the first LVScanData and have that always belong to the leader. Either way, I don't think you should need NewRelfrozenXid in ParallelLVShared. Another related concern: I don't understand why ParallelLVScanWorkerData->nallocated has to be in shared memory. I get that each worker has to keep track for itself how many blocks it has "processed" so far (either skipped or scanned) and I get that there has to be some coordination variable that keeps track of whether or not all blocks have been processed and where the next chunk is. But why does the worker's nallocated have to be in shared memory? It doesn't seem like the leader accesses it. (Side note: I'm rather confused by why ParallelLVScanDesc is its own thing [instead of part of ParallelLVShared] -- not to mention its chunk_size member appears to be unused.) Separately, I think the fact that PVShared and ParallelVacuumState stayed almost untouched (and continue to have general, parallel-sounding names) with your patch but now mostly deal with stuff about indexes while most other parallel vacuuming content is in other structs is confusing. I think we need to consider which members in PVShared and ParallelVacuumState are phase II only and put that in an appropriately named structure or get more of the heap vacuuming specific members in those generally named structures. - Melanie
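To make the suggested split slightly more concrete, one possible shape (struct and field names are hypothetical; the members are the ones discussed above) is a read-only input struct that can live in backend-local memory for serial vacuum or in DSM for parallel vacuum, next to a separate per-scan result struct that the leader later merges:

```
#include "postgres.h"

#include "commands/vacuum.h"
#include "storage/block.h"

/* Unchanging inputs: local in the serial case, a single copy in DSM when parallel. */
typedef struct HeapVacInput
{
	bool		aggressive;
	bool		skipwithvm;
	BlockNumber rel_pages;
	struct VacuumCutoffs cutoffs;
} HeapVacInput;

/* Per-scan outputs: one per worker in the parallel case, merged by the leader. */
typedef struct HeapVacScanResult
{
	TransactionId NewRelfrozenXid;
	MultiXactId NewRelminMxid;
	BlockNumber nonempty_pages;
	bool		skippedallvis;
} HeapVacScanResult;
```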
On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jun 30, 2025 at 9:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've rebased the patches to the current HEAD. > > Hi Sawada-san, > > I've started reviewing this. I initially meant to review the eager > scan part, but I think it would be easier to do that if it were in a > separate patch on top of the rest of the parallel heap vacuuming > patches -- just during review. I had a hard time telling which parts > of the design (and code) were necessitated by the eager scanning > feature. Thank you for reviewing the patch! I'll separate the changes of the eager scanning part from the main patch in the next version patch. > LV > --- > I don't think we should use "LV" and "lazy vacuum" anymore. Let's call > it what it is: "heap vacuuming". We don't need to rename vacuumlazy.c > right now, but if you are making new structs or substantially altering > old ones, I wouldn't use "LV". Agreed. I'll change accordingly in the next version patch. > > table AM callbacks > ------------------------- > I think there is a fundamental tension here related to whether or not > vacuumparallel.c should be table-AM agnostic. All of its code is > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > is index AM-agnostic but heap-specific. > > I think this is what makes your table AM callbacks feel odd. Other > table AM callbacks are invoked from general code -- for example from > executor code that is agnostic to the underlying table AMs. But, all > of the functions you added and the associated callbacks are invoked > below heap_vacuum_rel() (as is all of vacuumparallel.c). What does it exactly mean to make vacuumparallel.c index AM-agnostic but heap-specific? I imagine we change the vacuumparallel.c so that it calls heap's functions for parallel vacuuming such as initializing DSM and parallel vacuum worker's entry function etc. But I'm not sure how it works with non-heap table AMs that uses vacuumparallel.c. I've tried to implement parallel heap scanning in vacuumlazy.c before but I realized that there could be a lot of duplications between vacuumlazy.c and vacuumparallel.c. One possible change would be to have lazyvacuum.c pass the set of callbacks to parallel_vacuum_init() instead of having them as table AM callbacks. This removes the weirdness of the associated table AM callbacks being invoked below heap_vacuum_rel() but it doesn't address your point "vacuumparallel.c is index AM-agnostic but heap-specific". > duplicated information across structs > ----------------------------------------------- > I think a useful preliminary patch to this set would be to refactor > LVRelState into multiple structs that each represent the static input, > working state, and output of vacuum. You start to do this in one of > your patches with LVScanData but it doesn't go far enough. > > In fact, I think there are members of all the parallel data structures > you introduced and of those that are already present that overlap with > members of the LVRelState and we could start putting these smaller > structs. For my examples, I am referring both to existing code and > code you added. > > For example, relnamespace, relname, and indname in LVRelState and > ParallelVacuumState and ParallelVacuumState->heaprel and > LVRelState->rel. Seems like you could have a smaller struct that is > accessible to the vacuum error callback and also to the users in > LVRelState. 
I guess it's also related to the point of whether we should make vacuumparallel.c heap-specific. Currently, since vacuumparallel.c is independent from heap AM and has its own error callback parallel_vacuum_error_callback(), it has duplicated information like heaprel and indrels. Given vacuumparallel.c should not rely on the heap-specific struct, I imagine that we store that information in vacuumparallel.c side and have heap access to it. But which means it's workable only during parallel vacuum. > (Side note: In your patch, I find it confusing that these members of > LVRelState are initialized in > heap_parallel_vacuum_collect_dead_items() instead of > heap_parallel_vacuum_initialize_worker().) Yeah, I think at least some fields such as dbname, relnamespace, and relname should be initialized in heap_parallel_vacuum_initialize_worker(). > > And even for cases where the information has to be passed from the > leader to the worker, there is no reason you can't have the same > struct but in shared memory for the parallel case and local memory for > the serial case. For example with the struct members "aggressive", > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > Ideally the entire part of the LVRelState that is an unchanging > input/reference data is in a struct which is in local memory for the > serial and local parallel case and a single read-only location in the > shared parallel case. Or, for the shared case, if there is a reason > not to read from the shared memory, we copy them over to a local > instance of the same struct. Maybe it is called HeapVacInput or > similar. ParallelLVShared is created to pass information to parallel vacuum workers while keeping LVRelStates able to work locally. Suppose that we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", and "rel_pages", LVRelState would have to have a pointer to a HeapVacInput instance that is either on local memory or shared memory. Since we also need to pass other information such as initial_chunk_size and eager scanning related information to the parallel vacuum worker, we would have to create something like ParallelLVShared as the patch does. As a result, we would have two structs that need to be shared on the shared buffer. Is that kind of what you meant? or did you mean that we include parallel vacuum related fields too to HeapVacInput struct? > > There are a bunch of other instances like this. For example, the > leader initializes LVScanData->NewRelfrozenXid from > vacrel->cutoffs.OldestXmin, then uses this to set > ParallelLVShared->NewRelfrozenXid. Then the worker copies it from > ParallelLVShared->NewRelfrozenXid back to LVScanData->NewRelfrozenXid. > But the worker has access to ParallelLVShared->cutoffs. So you could > either get it from there or allow the workers to access the first > LVScanData and have that always belong to the leader. Either way, I > don't think you should need NewRelfrozenXid in ParallelLVShared. You're right. Both NewRelFrozenXid and NewRelminMxid don't need to be shared and can be removed from ParallelLVShared. > > Another related concern: I don't understand why > ParallelLVScanWorkerData->nallocated has to be in shared memory. 
I get > that each worker has to keep track for itself how many blocks it has > "processed" so far (either skipped or scanned) and I get that there > has to be some coordination variable that keeps track of whether or > not all blocks have been processed and where the next chunk is. But > why does the worker's nallocated have to be in shared memory? It > doesn't seem like the leader accesses it. ParallelLVScanWorkerData->nallocated works in the same way as ParallelBlockTableScanDescData.phs_nallocated. That is, it controls how many blocks has been allocated to any of the workers so far. When allocating a new chunk to scan, each worker gets ParallelLVScanWorkerData->nallocated value and adds the chunk size to it in an atomic operation (see parallel_lazy_scan_get_nextpage()). > > (Side note: I'm rather confused by why ParallelLVScanDesc is its own > thing [instead of part of ParallelLVShared] -- not to mention its > chunk_size member appears to be unused.) I wanted to keep ParallelLVShared() mostly read-only and aimed to share the information from the leader to the workers, whereas ParallelLVScanDesc() is a pure scan state maintained by each worker (and the leader). chunk_size is used when allocating the new chunk (see parallel_lazy_scan_get_nextpage()). It's initialized by vacrel->plvstate->shared->initial_chunk_size since the first chunk size could vary depending on eager scanning state, and is updated to the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. > > Separately, I think the fact that PVShared and ParallelVacuumState > stayed almost untouched (and continue to have general, > parallel-sounding names) with your patch but now mostly deal with > stuff about indexes while most other parallel vacuuming content is in > other structs is confusing. I think we need to consider which members > in PVShared and ParallelVacuumState are phase II only and put that in > an appropriately named structure or get more of the heap vacuuming > specific members in those generally named structures. Yeah, fields such as reltuples, estimated_count, and maintenance_work_mem_worker are used only for phase II while some fields such as relid, TidStore related fields, and vacuum delays related fields are commonly used in phase I and II. I'll consider creating a new struct to store phase II only information. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
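For readers following along, the allocation scheme described above can be pictured like this (assumed names; the real logic is in the patch's parallel_lazy_scan_get_nextpage()): a shared atomic counter records how many blocks have been handed out so far, and each worker claims its next chunk with a fetch-and-add, in the spirit of phs_nallocated for parallel block table scans.

```
#include "postgres.h"

#include "port/atomics.h"
#include "storage/block.h"

#define PARALLEL_LV_CHUNK_SIZE	4096

/* Shared by all workers; analogous to phs_nallocated in a parallel block scan. */
typedef struct ParallelScanAllocator
{
	pg_atomic_uint64 nallocated;	/* number of blocks handed out so far */
} ParallelScanAllocator;

/*
 * Claim the next chunk.  Returns the first block of the chunk, or
 * InvalidBlockNumber once the whole relation has been allocated;
 * *chunk_end_p is set to one past the chunk's last block.
 */
static BlockNumber
claim_next_chunk(ParallelScanAllocator *alloc, BlockNumber rel_pages,
				 BlockNumber *chunk_end_p)
{
	uint64		start;

	start = pg_atomic_fetch_add_u64(&alloc->nallocated, PARALLEL_LV_CHUNK_SIZE);
	if (start >= rel_pages)
		return InvalidBlockNumber;

	*chunk_end_p = (BlockNumber) Min(start + PARALLEL_LV_CHUNK_SIZE, rel_pages);
	return (BlockNumber) start;
}
```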
On Fri, Jul 18, 2025 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > I think there is a fundamental tension here related to whether or not > > vacuumparallel.c should be table-AM agnostic. All of its code is > > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > > is index AM-agnostic but heap-specific. > > > > I think this is what makes your table AM callbacks feel odd. Other > > table AM callbacks are invoked from general code -- for example from > > executor code that is agnostic to the underlying table AMs. But, all > > of the functions you added and the associated callbacks are invoked > > below heap_vacuum_rel() (as is all of vacuumparallel.c). > > What does it exactly mean to make vacuumparallel.c index AM-agnostic > but heap-specific? I imagine we change the vacuumparallel.c so that it > calls heap's functions for parallel vacuuming such as initializing DSM > and parallel vacuum worker's entry function etc. But I'm not sure how > it works with non-heap table AMs that uses vacuumparallel.c. I don't understand what you mean by non-heap table AMs using vacuumparallel.c. Do you mean that they are using it like a library? I don't understand why the existing parallel index vacuumming code added table AM callbacks either. They are only invoked from below heap_vacuum_rel(). Unless I am missing something? From looking at tableam.h, the parallel vacuum callbacks seem to be the only table AM callbacks that are not invoked from table AM agnostic code. What was the point of adding them? I can see how other table AMs implementing vacuum may want to use some of the vacuumparallel.c functions exposed in the vacuum.h header file to save time and code for the index vacuuming portion of vacuuming. But why have table AM callbacks that aren't invoked unless you are using heap? More generally, vacuuming is a very table AM-specific operation. You often wouldn't need any of the same data structures. Think about an undo-based table AM -- you don't need dead items or anything like that. Or the phase ordering -- other table AMs may not need to vacuum the table then the indexes then the table. Something like a scan, it just has to yield tuples to the executor. But vacuum seems inextricably linked with the on-disk layout. Most of the vacuum.c code is just concerned with parsing options, checking permissions, determining precedence with GUCs/table options, and doing catalog operations. > One possible change would be to have lazyvacuum.c pass the set of > callbacks to parallel_vacuum_init() instead of having them as table AM > callbacks. This removes the weirdness of the associated table AM > callbacks being invoked below heap_vacuum_rel() but it doesn't address > your point "vacuumparallel.c is index AM-agnostic but heap-specific". I do agree you are in a tough spot if we keep the index parallel vacuuming callbacks in tableam.h. It is weird to call a heap-specific function from within a table AM callback. I guess my first question is which table AMs are using the existing callbacks and how are they doing that since I don't see places where they are called from non heap code. If they use the vacuum.h exposed vacuumparallel.c functions as a library, they could do that without the callbacks in tableam.h. But I think I must just be missing something. 
> > And even for cases where the information has to be passed from the > > leader to the worker, there is no reason you can't have the same > > struct but in shared memory for the parallel case and local memory for > > the serial case. For example with the struct members "aggressive", > > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > > > Ideally the entire part of the LVRelState that is an unchanging > > input/reference data is in a struct which is in local memory for the > > serial and local parallel case and a single read-only location in the > > shared parallel case. Or, for the shared case, if there is a reason > > not to read from the shared memory, we copy them over to a local > > instance of the same struct. Maybe it is called HeapVacInput or > > similar. > > ParallelLVShared is created to pass information to parallel vacuum > workers while keeping LVRelStates able to work locally. Suppose that > we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", > and "rel_pages", LVRelState would have to have a pointer to a > HeapVacInput instance that is either on local memory or shared memory. > Since we also need to pass other information such as > initial_chunk_size and eager scanning related information to the > parallel vacuum worker, we would have to create something like > ParallelLVShared as the patch does. As a result, we would have two > structs that need to be shared on the shared buffer. Is that kind of > what you meant? or did you mean that we include parallel vacuum > related fields too to HeapVacInput struct? Yes, I think generally this is what I am saying. Honestly, I think the two biggest problems right now are that 1) the existing LVRelState mixes too many different types of data (e.g. output stats and input data) and that 2) the structs your patch introduces don't have names making it clear enough what they are. What is ParallelLVScanDesc? We don't have the convention of a scan descriptor for vacuum, so is the point here that it is only used for a single "scan" of the heap table? That sounds like all of phase I to me. Or why is the static input data structure just called ParallelLVShared? Nothing about that indicates read-only or static. When I was imagining how to make it more clear, I was thinking something like what you are saying above. There are two types of input read-only data passed from the leader to the workers. One is parallel-only and one is needed in both the parallel and serial cases. I was suggesting having the common input read-only fields in the same type of struct. That would mean having two types of input data structures in the parallel-case -- one parallel only and one common. It's possible that that is more confusing. I think it also depends on how the workers end up accessing the data. For input read-only data you can either have workers access a single copy in shared memory (and have a pointer that points to shared memory in the parallel case and local memory in the serial case) or you can copy the data over to local memory so that the parallel and serial cases use it in the same data structure. You are doing mostly the latter but the setup is spread around enough that it isn't immediately clear that is what is happening. One thing that might be worth trying is moving the setup steps in heap_vacuum_rel() into some kind of helper function that can be called both in heap_vacuum_rel() by the leader and also in the worker setup. 
Then it might make it clear what only the leader does (e.g. call vacuum_get_cutoffs()) and what everyone does. I'm just brainstorming here, though. I think if you just make a new version of the patch with the eager scanning separated out and the names of the structs clarified, I could try to be more specific. I found the spread of members across the different structs hard to parse, but I'm not 100% sure yet what would make it easier. > I wanted to keep ParallelLVShared() mostly read-only and aimed to > share the information from the leader to the workers, whereas > ParallelLVScanDesc() is a pure scan state maintained by each worker > (and the leader). I'm not sure what scan state means here. nallocated seems to be a coordination member. But nblocks seems to be used in places that you have access to the LVRelState, which has rel_pages, so why wouldn't you just use that? > chunk_size is used when allocating the new chunk (see > parallel_lazy_scan_get_nextpage()). It's initialized by > vacrel->plvstate->shared->initial_chunk_size since the first chunk > size could vary depending on eager scanning state, and is updated to > the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. I see that ParallelLVScanWorkerData->chunk_size is used in that way, but I don't see where ParallelLVScanDesc->chunk_size is used. Also, why does ParallelLVScanWorkerData->chunk_size need to be in shared memory? Isn't the worker just using it locally? - Melanie
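One way to picture the helper suggested at the end of the message above (the function name is hypothetical, and the sketch assumes it lives in vacuumlazy.c where LVRelState is defined): both the leader in heap_vacuum_rel() and each worker during its initialization run the same setup of the fields everyone needs, while leader-only steps such as calling vacuum_get_cutoffs() remain in heap_vacuum_rel().

```
#include "postgres.h"

#include "utils/lsyscache.h"
#include "utils/rel.h"

/* Hypothetical: the part of LVRelState setup common to leader and workers. */
static void
heap_vacuum_init_common(LVRelState *vacrel, Relation rel)
{
	vacrel->rel = rel;
	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
	vacrel->relname = pstrdup(RelationGetRelationName(rel));
	vacrel->indname = NULL;

	/* ... zero the per-scan counters, set up the error callback info, etc. ... */
}
```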
On Mon, Jul 21, 2025 at 7:28 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Jul 18, 2025 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > I think there is a fundamental tension here related to whether or not > > > vacuumparallel.c should be table-AM agnostic. All of its code is > > > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > > > is index AM-agnostic but heap-specific. > > > > > > I think this is what makes your table AM callbacks feel odd. Other > > > table AM callbacks are invoked from general code -- for example from > > > executor code that is agnostic to the underlying table AMs. But, all > > > of the functions you added and the associated callbacks are invoked > > > below heap_vacuum_rel() (as is all of vacuumparallel.c). > > > > What does it exactly mean to make vacuumparallel.c index AM-agnostic > > but heap-specific? I imagine we change the vacuumparallel.c so that it > > calls heap's functions for parallel vacuuming such as initializing DSM > > and parallel vacuum worker's entry function etc. But I'm not sure how > > it works with non-heap table AMs that uses vacuumparallel.c. > > I don't understand what you mean by non-heap table AMs using > vacuumparallel.c. Do you mean that they are using it like a library? I > don't understand why the existing parallel index vacuumming code added > table AM callbacks either. They are only invoked from below > heap_vacuum_rel(). Unless I am missing something? From looking at > tableam.h, the parallel vacuum callbacks seem to be the only table AM > callbacks that are not invoked from table AM agnostic code. What was > the point of adding them? Given your suggestion that we should make vacuumparallel.c index AM-agnostic but heap-specific, I imagined that parallel_vacuum_init() calls a function defined in vacuumlazy.c that initializes the DSM space for parallel heap vacuum. However, since parallel_vacuum_init() is exposed, IIUC every table AM can use it. Therefore, for example a non-heap table AM calls parallel_vacuum_init() in its own relation_vacuum callback in order to vacuum its indexes in parallel, it means eventually to call heap's DSM-initialization function, which won't work. > I can see how other table AMs implementing vacuum may want to use some > of the vacuumparallel.c functions exposed in the vacuum.h header file > to save time and code for the index vacuuming portion of vacuuming. > But why have table AM callbacks that aren't invoked unless you are > using heap? > > More generally, vacuuming is a very table AM-specific operation. You > often wouldn't need any of the same data structures. Think about an > undo-based table AM -- you don't need dead items or anything like > that. Or the phase ordering -- other table AMs may not need to vacuum > the table then the indexes then the table. Something like a scan, it > just has to yield tuples to the executor. But vacuum seems > inextricably linked with the on-disk layout. Most of the vacuum.c code > is just concerned with parsing options, checking permissions, > determining precedence with GUCs/table options, and doing catalog > operations. I guess it ultimately depends on how other table AMs define what to do in the VACUUM command (or autovacuum). For instance, the columnar table AM simply updates the relation statistics during the vacuum command[1]. 
Since it can efficiently collect statistics such as the number of live tuples, it completes this process quickly. If an AM (not limited to the columnar table AM) needs to gather more comprehensive statistics that might require scanning the entire table, the feature introduced by this patch -- multiple workers scanning the table while collecting data -- could prove beneficial and it might want to use it in relation_vacuum callback. The patch introduces the parallel_vacuum_collect_dead_items callback, but it's not restricted to collecting only dead tuples. It can be used for gathering various types of data. Regarding phase ordering, both the existing vacuumparallel.c features and the new parallel table vacuuming feature are flexible in their execution sequence. If a table AM needs to invoke parallel index vacuuming before parallel table vacuuming, it can call the corresponding functions in that order. > > > One possible change would be to have lazyvacuum.c pass the set of > > callbacks to parallel_vacuum_init() instead of having them as table AM > > callbacks. This removes the weirdness of the associated table AM > > callbacks being invoked below heap_vacuum_rel() but it doesn't address > > your point "vacuumparallel.c is index AM-agnostic but heap-specific". > > I do agree you are in a tough spot if we keep the index parallel > vacuuming callbacks in tableam.h. It is weird to call a heap-specific > function from within a table AM callback. I guess my first question is > which table AMs are using the existing callbacks and how are they > doing that since I don't see places where they are called from non > heap code. If they use the vacuum.h exposed vacuumparallel.c functions > as a library, they could do that without the callbacks in tableam.h. > But I think I must just be missing something. Citus columnar table and orioledb are open-source table AM implementations I know, and it seems that orioledb's vacuum codes are somewhat similar to heap's one[2] (I don't know the details of orioledb though); it initializes parallel vacuum by calling vacuum_parallel_init() in its relation_vacuum callback, orioledb_vacuum_rel(). The reason why I added some callbacks as table AM callbacks in the patch is that I could not find other better places. Currently, vacuumparallel.c handles several critical operations for parallel vacuuming: allocating and initializing DSM space for parallel index vacuuming, initializing and launching parallel workers, and managing various auxiliary tasks such as configuring vacuum delays and setting maintenance_work_mem for workers. Given these existing functionalities, I aimed to implement the parallel heap vacuum worker launch through the same vacuumparallel.c codebase, maintaining a consistent interface. To achieve this integration, vacuumparallel.c requires access to heap-specific functions, and defining them as table AM callbacks emerged as the cleanest solution. My personal vision is to maintain vacuumparallel.c's neutrality regarding both index AM and table AM implementations, while providing a unified interface for table AMs that wish to leverage parallel operations-- whether for table vacuuming, index vacuuming, or both. > > > > And even for cases where the information has to be passed from the > > > leader to the worker, there is no reason you can't have the same > > > struct but in shared memory for the parallel case and local memory for > > > the serial case. 
For example with the struct members "aggressive", > > > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > > > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > > > > > Ideally the entire part of the LVRelState that is an unchanging > > > input/reference data is in a struct which is in local memory for the > > > serial and local parallel case and a single read-only location in the > > > shared parallel case. Or, for the shared case, if there is a reason > > > not to read from the shared memory, we copy them over to a local > > > instance of the same struct. Maybe it is called HeapVacInput or > > > similar. > > > > ParallelLVShared is created to pass information to parallel vacuum > > workers while keeping LVRelStates able to work locally. Suppose that > > we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", > > and "rel_pages", LVRelState would have to have a pointer to a > > HeapVacInput instance that is either on local memory or shared memory. > > Since we also need to pass other information such as > > initial_chunk_size and eager scanning related information to the > > parallel vacuum worker, we would have to create something like > > ParallelLVShared as the patch does. As a result, we would have two > > structs that need to be shared on the shared buffer. Is that kind of > > what you meant? or did you mean that we include parallel vacuum > > related fields too to HeapVacInput struct? > > Yes, I think generally this is what I am saying. > > Honestly, I think the two biggest problems right now are that 1) the > existing LVRelState mixes too many different types of data (e.g. > output stats and input data) and that 2) the structs your patch > introduces don't have names making it clear enough what they are. > > What is ParallelLVScanDesc? We don't have the convention of a scan > descriptor for vacuum, so is the point here that it is only used for a > single "scan" of the heap table? That sounds like all of phase I to > me. Yes, ParallelLVScanDesc is added to control the working state during do_parallel_lazy_scan_heap(), a parallel variant of phase I. > Or why is the static input data structure just called > ParallelLVShared? Nothing about that indicates read-only or static. > > When I was imagining how to make it more clear, I was thinking > something like what you are saying above. There are two types of input > read-only data passed from the leader to the workers. One is > parallel-only and one is needed in both the parallel and serial cases. > I was suggesting having the common input read-only fields in the same > type of struct. That would mean having two types of input data > structures in the parallel-case -- one parallel only and one common. > It's possible that that is more confusing. > > I think it also depends on how the workers end up accessing the data. > For input read-only data you can either have workers access a single > copy in shared memory (and have a pointer that points to shared memory > in the parallel case and local memory in the serial case) or you can > copy the data over to local memory so that the parallel and serial > cases use it in the same data structure. You are doing mostly the > latter but the setup is spread around enough that it isn't immediately > clear that is what is happening. > > One thing that might be worth trying is moving the setup steps in > heap_vacuum_rel() into some kind of helper function that can be called > both in heap_vacuum_rel() by the leader and also in the worker setup. 
> Then it might make it clear what only the leader does (e.g. call > vacuum_get_cutoffs()) and what everyone does. I'm just brainstorming > here, though. > > I think if you just make a new version of the patch with the eager > scanning separated out and the names of the structs clarified, I could > try to be more specific. I found the spread of members across the > different structs hard to parse, but I'm not 100% sure yet what would > make it easier. Okay, I'll try to clarify the names and the above idea in the next version patch. > > > I wanted to keep ParallelLVShared() mostly read-only and aimed to > > share the information from the leader to the workers, whereas > > ParallelLVScanDesc() is a pure scan state maintained by each worker > > (and the leader). > > I'm not sure what scan state means here. nallocated seems to be a > coordination member. But nblocks seems to be used in places that you > have access to the LVRelState, which has rel_pages, so why wouldn't > you just use that? Right, we can use rel_pages instead. > > > chunk_size is used when allocating the new chunk (see > > parallel_lazy_scan_get_nextpage()). It's initialized by > > vacrel->plvstate->shared->initial_chunk_size since the first chunk > > size could vary depending on eager scanning state, and is updated to > > the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. > > I see that ParallelLVScanWorkerData->chunk_size is used in that way, > but I don't see where ParallelLVScanDesc->chunk_size is used. It seems that ParallelLVScanDesc->chunk_size is not used at all, so I'll remove it. > Also, > why does ParallelLVScanWorkerData->chunk_size need to be in shared > memory? Isn't the worker just using it locally? I put it in shared memory for the case where the ParallelLVScanWorkerData is re-used by another worker after resuming phase I. But the worker can simply use PARALLEL_LV_CHUNK_SIZE in that case as the first chunk should have already been allocated. I'll remove it. Regards, [1] https://github.com/citusdata/citus/blob/main/src/backend/columnar/columnar_tableam.c#L1058 [2] https://github.com/orioledb/orioledb/blob/main/src/tableam/vacuum.c#L353 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
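To make the chunk-based block allocation being discussed easier to follow, here is a minimal, self-contained sketch of the scheme as described in this thread. All of the names below (SharedScanState, WorkerScanState, worker_next_block, FIXED_CHUNK_SIZE) are hypothetical stand-ins, not the patch's actual code; in the patch, PARALLEL_LV_CHUNK_SIZE, initial_chunk_size and the shared nallocated counter play the corresponding roles. Each worker claims a contiguous chunk of blocks from a shared atomic counter; the first chunk it claims may have a special size, after which it falls back to the fixed size.

    /* Hypothetical sketch only; not the patch's actual code. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define FIXED_CHUNK_SIZE 2048          /* stand-in for PARALLEL_LV_CHUNK_SIZE */
    #define INVALID_BLOCK    UINT32_MAX    /* returned when the scan is finished */

    typedef struct SharedScanState
    {
        _Atomic uint32_t nallocated;       /* next block not yet handed out */
        uint32_t nblocks;                  /* relation size in blocks */
        uint32_t initial_chunk_size;       /* first chunk may differ in size */
    } SharedScanState;

    typedef struct WorkerScanState
    {
        uint32_t cur_block;                /* next block to return from the chunk */
        uint32_t last_block;               /* one past the end of the chunk */
        uint32_t chunk_size;               /* size to request for the next chunk */
    } WorkerScanState;

    static void
    worker_scan_init(const SharedScanState *shared, WorkerScanState *ws)
    {
        ws->cur_block = 0;
        ws->last_block = 0;                            /* forces a chunk claim */
        ws->chunk_size = shared->initial_chunk_size;   /* first chunk may be special */
    }

    /* Return the next block this worker should process, or INVALID_BLOCK. */
    static uint32_t
    worker_next_block(SharedScanState *shared, WorkerScanState *ws)
    {
        if (ws->cur_block >= ws->last_block)
        {
            /* Claim the next chunk of blocks from the shared counter. */
            uint32_t start = atomic_fetch_add(&shared->nallocated, ws->chunk_size);

            if (start >= shared->nblocks)
                return INVALID_BLOCK;                  /* no blocks left */

            ws->cur_block = start;
            ws->last_block = start + ws->chunk_size;
            if (ws->last_block > shared->nblocks)
                ws->last_block = shared->nblocks;

            /* After the first chunk, fall back to the fixed chunk size. */
            ws->chunk_size = FIXED_CHUNK_SIZE;
        }

        return ws->cur_block++;
    }

Under this shape, only the shared counter needs coordination; the per-worker cursor and the chunk size it will request next are purely local, which is why the question of keeping chunk_size in shared memory comes up above.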
Hi, On 2025-07-21 12:41:49 -0700, Masahiko Sawada wrote: > The reason why I added some callbacks as table AM callbacks in the > patch is that I could not find other better places. Currently, > vacuumparallel.c handles several critical operations for parallel > vacuuming: allocating and initializing DSM space for parallel index > vacuuming, initializing and launching parallel workers, and managing > various auxiliary tasks such as configuring vacuum delays and setting > maintenance_work_mem for workers. Given these existing > functionalities, I aimed to implement the parallel heap vacuum worker > launch through the same vacuumparallel.c codebase, maintaining a > consistent interface. To achieve this integration, vacuumparallel.c > requires access to heap-specific functions, and defining them as table > AM callbacks emerged as the cleanest solution. I don't think that can be the right solution. Vacuuming is a table-am specific operation and thus already happens within a table-am's own code. It would be rather wrong to have tableam.h indirected calls below heapam specific code. That is not to say you can't have callbacks or such, it just doesn't make sense for those callbacks to be at the level of tableam. If you want to make vacuumparallel support parallel table vacuuming for multiple table AMs (I'm somewhat doubtful that's a good idea), you could do that by having a vacuumparallel.c specific callback struct. Greetings, Andres Freund
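For concreteness, here is a rough sketch of what a vacuumparallel.c-specific callback struct along these lines could look like. Everything in it is hypothetical (the names, members, and signatures are simplified to plain C types for illustration); it is only meant to show the shape of such an interface, not to propose the actual one.

    #include <stddef.h>

    /* Hypothetical sketch only; none of these names exist in the tree. */
    typedef struct ParallelTableVacuumCallbacks
    {
        /* How much DSM space the table AM wants for its parallel scan state. */
        size_t (*estimate_dsm)(void *rel, int nworkers);

        /* Initialize the table AM's portion of the DSM segment. */
        void (*initialize_dsm)(void *rel, void *table_dsm, int nworkers);

        /* Per-worker routine for the "collect dead items" (phase I) work. */
        void (*collect_dead_items)(void *rel, void *table_dsm, void *dead_items);

        /* Optional per-worker routine for the second table pass (phase III). */
        void (*vacuum_table)(void *rel, void *table_dsm, void *dead_items);
    } ParallelTableVacuumCallbacks;

A table AM would hand such a struct to parallel_vacuum_init() (or a new variant of it), and vacuumparallel.c would invoke the members from its worker entry point, without tableam.h having to know about any of it.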
On Mon, Jul 21, 2025 at 4:49 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-07-21 12:41:49 -0700, Masahiko Sawada wrote: > > The reason why I added some callbacks as table AM callbacks in the > > patch is that I could not find other better places. Currently, > > vacuumparallel.c handles several critical operations for parallel > > vacuuming: allocating and initializing DSM space for parallel index > > vacuuming, initializing and launching parallel workers, and managing > > various auxiliary tasks such as configuring vacuum delays and setting > > maintenance_work_mem for workers. Given these existing > > functionalities, I aimed to implement the parallel heap vacuum worker > > launch through the same vacuumparallel.c codebase, maintaining a > > consistent interface. To achieve this integration, vacuumparallel.c > > requires access to heap-specific functions, and defining them as table > > AM callbacks emerged as the cleanest solution. > > I don't think that can be the right solution. Vacuuming is a table-am specific > operation and thus already happens within a table-am's own code. It would be > rather wrong to have tableam.h indirected calls below heapam specific code. > > That is not to say you can't have callbacks or such, it just doesn't make > sense for those callbacks to be at the level of tableam. If you want to make > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > somewhat doubtful that's a good idea), you could do that by having a > vacuumparallel.c specific callback struct. Thank you for the comments! Do you think it makes sense to implement the above idea that we launch parallel vacuum workers for heap through the same vacuumparallel.c codebase and maintain the consistent interface with parallel index vacuuming APIs? I've considered an idea of implementing parallel heap vacuum independent of vacuumparallel.c, but I find some difficulties and duplications. For instance, vacuumparallel.c already does many operations like initializing DSM space, setting vacuum delays, and launching parallel vacuum workers. Probably we can expose ParallelVacuumContext so that vacuumlazy.c can use it to launch parallel workers for heap vacuuming, but I'm not sure it's a good idea. We also need to expose other works that vacuumparallel.c does such as setting ring_buffer size and cost-based vacuum delay too and vacuumlazy.c needs to use them appropriately. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > Do you think it makes sense to implement the above idea that we launch > parallel vacuum workers for heap through the same vacuumparallel.c > codebase and maintain the consistent interface with parallel index > vacuuming APIs? Yes, that might make sense. But wiring it up via tableam doesn't make sense. Greetings, Andres Freund
On Mon, Jul 21, 2025 at 7:49 PM Andres Freund <andres@anarazel.de> wrote: > That is not to say you can't have callbacks or such, it just doesn't make > sense for those callbacks to be at the level of tableam. If you want to make > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > somewhat doubtful that's a good idea), you could do that by having a > vacuumparallel.c specific callback struct. I'm not doubtful that this is a good idea. There are a number of tableams around these days that are "heap except whatever", where (I suspect) the size of "whatever" ranges from quite small to moderately large. I imagine that such efforts end up duplicating a lot of heapam code and probably always will; but if we can avoid increasing that amount, I think it's a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jul 25, 2025 at 10:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 21, 2025 at 7:49 PM Andres Freund <andres@anarazel.de> wrote: > > That is not to say you can't have callbacks or such, it just doesn't make > > sense for those callbacks to be at the level of tableam. If you want to make > > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > > somewhat doubtful that's a good idea), you could do that by having a > > vacuumparallel.c specific callback struct. > > I'm not doubtful that this is a good idea. There are a number of > tableams around these days that are "heap except whatever", where (I > suspect) the size of "whatever" ranges from quite small to moderately > large. I imagine that such efforts end up duplicating a lot of heapam > code and probably always will; but if we can avoid increasing that > amount, I think it's a good idea. Based on our previous discussions, I have contemplated modifying the patch to define the callback functions for parallel table vacuum within a structure in vacuumparallel.c instead of using table AM's callbacks. This approach would require the leader to pass both the library name and handler function name to vacuumparallel.c, enabling workers to locate the handler function via load_external_function() and access the callback functions. Although it's technically feasible, I'm not sure that the design is elegant; while table AM seeking to use parallel index vacuuming can simply invoke parallel_vacuum_init(), those requiring parallel table vacuum would need to both provide the handler function and supply the library and handler function names. An alternative, more straightforward approach would be to implement a dedicated ParallelContext in vacuumlazy.c specifically for parallel heap vacuum, distinct from the ParallelVacuumContext in vacuumparallel.c. Under this design, vacuumparallel.c would be exclusively dedicated to parallel index vacuuming, while the parallel heap vacuum implementation would be contained within vacuumlazy.c, eliminating the need for a handler function. While this solution seems more elegant, it would result in code duplication between vacuumlazy.c and vacuumparallel.c, particularly in areas concerning worker initialization and cost-based delays. At present, I am inclined toward the latter solution, though I have yet to implement it. I welcome any feedback on these approaches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
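To illustrate the first approach mentioned above (the leader passes a library name and handler function name so that workers can look up the callbacks themselves), here is a self-contained sketch. In PostgreSQL the lookup would go through load_external_function(); plain dlopen()/dlsym() is used below only so the example compiles on its own, and all other names are hypothetical.

    #include <dlfcn.h>
    #include <stddef.h>

    /* Opaque callback struct provided by the table AM (see the earlier sketch). */
    typedef struct ParallelTableVacuumCallbacks ParallelTableVacuumCallbacks;

    /* The table AM exports one "handler" symbol that returns its callbacks. */
    typedef const ParallelTableVacuumCallbacks *(*ParallelTableVacuumHandler)(void);

    /*
     * Worker side: the leader put the library and handler names into the DSM;
     * the worker resolves the handler and asks it for the callback struct.
     * The handle is intentionally kept open for the life of the worker.
     */
    static const ParallelTableVacuumCallbacks *
    resolve_table_vacuum_callbacks(const char *library, const char *funcname)
    {
        void *handle = dlopen(library, RTLD_NOW);

        if (handle == NULL)
            return NULL;

        /* Casting an object pointer to a function pointer is the usual POSIX idiom. */
        ParallelTableVacuumHandler handler =
            (ParallelTableVacuumHandler) dlsym(handle, funcname);

        return handler ? handler() : NULL;
    }

This is the part that makes the first approach feel heavier for callers: a table AM that only wants parallel index vacuuming never needs such a handler, while one that wants parallel table vacuum has to provide both the handler and the names used to find it.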
On Wed, Jul 23, 2025 at 12:06 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > > Do you think it makes sense to implement the above idea that we launch > > parallel vacuum workers for heap through the same vacuumparallel.c > > codebase and maintain the consistent interface with parallel index > > vacuuming APIs? > > Yes, that might make sense. But wiring it up via tableam doesn't make sense. If you do parallel worker setup below heap_vacuum_rel(), then how are you supposed to use those workers to do non-heap table vacuuming? All the code in vacuumparallel.c is invoked from below lazy_scan_heap(), so I don't see how having a vacuumparallel.c-specific callback struct solves the layering violation. It seems like parallel index vacuuming setup would have to be done in vacuum_rel() if we want to reuse the same parallel workers for the table vacuuming and index vacuuming phases and allow for different table AMs to vacuum the tables in their own way using these parallel workers. - Melanie
On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jul 23, 2025 at 12:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > > > Do you think it makes sense to implement the above idea that we launch > > > parallel vacuum workers for heap through the same vacuumparallel.c > > > codebase and maintain the consistent interface with parallel index > > > vacuuming APIs? > > > > Yes, that might make sense. But wiring it up via tableam doesn't make sense. > > If you do parallel worker setup below heap_vacuum_rel(), then how are > you supposed to use those workers to do non-heap table vacuuming? IIUC non-heap tables can call parallel_vacuum_init() in its relation_vacuum table AM callback implementation in order to initialize parallel table vacuum, parallel index vacuum, or both. > > All the code in vacuumparallel.c is invoked from below > lazy_scan_heap(), so I don't see how having a > vacuumparallel.c-specific callback struct solves the layering > violation. I think the layering problem we discussed is about where the callbacks are declared; it seems odd that we add new table AM callbacks that are invoked only by another table AM callback. IIUC we invoke all codes in vacuumparallel.c in vacuumlazy.c even today. If we think we think this design has a problem in terms of layering of functions, we can refactor it as a separate patch. > It seems like parallel index vacuuming setup would have to be done in > vacuum_rel() if we want to reuse the same parallel workers for the > table vacuuming and index vacuuming phases and allow for different > table AMs to vacuum the tables in their own way using these parallel > workers. Hmm, let me clarify your idea as I'm confused. If the parallel context used for both table vacuuming and index vacuuming is set up in vacuum_rel(), its DSM would need to have some data too required by table AM to do parallel table vacuuming. In order to do that, table AMs somehow need to tell the necessary DSM size at least. How do table AMs tell that to parallel vacuum initialization function (e.g., parallel_vacuum_init()) in vacuum_rel()? Also, if we set up the parallel context in vacuum_rel(), we would somehow need to pass it to the relation_vacuum table AM callback so that they can use it during their own vacuum operation. Do you mean to pass it via table_relation_vacuum()? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > If you do parallel worker setup below heap_vacuum_rel(), then how are > > you supposed to use those workers to do non-heap table vacuuming? > > IIUC non-heap tables can call parallel_vacuum_init() in its > relation_vacuum table AM callback implementation in order to > initialize parallel table vacuum, parallel index vacuum, or both. Yep, it's just that that is more like a library helper function. Currently, in master, the flow is: leader does: heap_vacuum_rel() dead_items_alloc() parallel_vacuum_init() passes parallel_vacuum_main() to parallel context lazy_scan_heap() ... parallel_vacuum_process_all_indexes() LaunchParallelWorkers() ... parallel_vacuum_process_[un]safe_indexes() For the leader, I don't remember where your patch moves LaunchParallelWorkers() -- since it would have to move to before the first heap vacuuming phase. When I was imagining what made the most sense to make parallelism agnostic of heap, it seemed like it would be moving LaunchParallelWorkers() above heap_vacuum_rel(). But I recognize that is a pretty big change, and I haven't tried it, so I don't know all the complications. (sounds like you mention some below). Basically, right now in master, we expose some functions for parallel index vacuuming that people can call in their own table implementation of table vacuuming, but we don't invoke any of them in table-AM agnostic code. That may work fine for just index vacuuming, but now that we are trying to make parts of the table vacuuming itself extensible, the interface feels more awkward. I won't push the idea further because I'm not even sure if it works, but the thing that felt the most natural to me was pushing parallelism above the table AM. > > All the code in vacuumparallel.c is invoked from below > > lazy_scan_heap(), so I don't see how having a > > vacuumparallel.c-specific callback struct solves the layering > > violation. > > I think the layering problem we discussed is about where the callbacks > are declared; it seems odd that we add new table AM callbacks that are > invoked only by another table AM callback. IIUC we invoke all codes in > vacuumparallel.c in vacuumlazy.c even today. If we think we think this > design has a problem in terms of layering of functions, we can > refactor it as a separate patch. Sure, but we are doubling down on it. I think parallel heap vacuuming would have a completely different interface if that layering worked differently. But, again, I don't want to put this additional, extremely complex barrier in the way. I brought it up because it's what occurred to me when reviewing this. But, I think it's fine to look for an achievable compromise. > > It seems like parallel index vacuuming setup would have to be done in > > vacuum_rel() if we want to reuse the same parallel workers for the > > table vacuuming and index vacuuming phases and allow for different > > table AMs to vacuum the tables in their own way using these parallel > > workers. > > Hmm, let me clarify your idea as I'm confused. If the parallel > context used for both table vacuuming and index vacuuming is set up in > vacuum_rel(), its DSM would need to have some data too required by > table AM to do parallel table vacuuming. In order to do that, table > AMs somehow need to tell the necessary DSM size at least. 
How do table > AMs tell that to parallel vacuum initialization function (e.g., > parallel_vacuum_init()) in vacuum_rel()? Right, you'd have to gather more information sooner (number of indexes, etc). > Also, if we set up the parallel context in vacuum_rel(), we would > somehow need to pass it to the relation_vacuum table AM callback so > that they can use it during their own vacuum operation. Do you mean to > pass it via table_relation_vacuum()? Yes, something like that. But, that would require introducing more knowledge of index vacuuming into that layer. I chatted with Andres about this a bit off-list and he suggested passing a callback for vacuuming the table to parallel_vacuum_init(). I don't know exactly how this would play with parallel_vacuum_main()'s current functionality, though. I think this is a similar approach to what you mentioned in your earlier comment > Based on our previous discussions, I have contemplated modifying the > patch to define the callback functions for parallel table vacuum > within a structure in vacuumparallel.c instead of using table AM's > callbacks. This approach would require the leader to pass both the > library name and handler function name to vacuumparallel.c, enabling > workers to locate the handler function via load_external_function() > and access the callback functions. Although it's technically feasible, > I'm not sure that the design is elegant; while table AM seeking to use > parallel index vacuuming can simply invoke parallel_vacuum_init(), > those requiring parallel table vacuum would need to both provide the > handler function and supply the library and handler function names. Are there other drawbacks to requiring the library and handler function names? I'd find it a bit awkward to pass ParallelWorkerMain as the bgw_libary/function name in LaunchParallelWorkers and then have another point where you pass other library and function names, but I'm not really sure how to resolve that. In your patch, for the workers, I know you added a state machine to parallel_vacuum_main() so that it would do table and index vacuuming. That makes sense to me. But doing it this way seems like it is limiting extensibility. But finding somewhere to pass the callback that doesn't feel like an awkward addition to the existing function passed to parallel workers is also hard. - Melanie
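As a side note on the state-machine point, here is a minimal, self-contained sketch of the general shape being described: workers sit in a loop and dispatch on a phase value that the leader advances. The phase names, the polling loop, and the stub work functions are hypothetical simplifications (a real implementation would block on a barrier or condition variable and call into the table/index AM code for each phase).

    #include <stdatomic.h>

    /* Hypothetical phases; the real patch's naming and set of phases may differ. */
    typedef enum
    {
        PV_PHASE_IDLE,
        PV_PHASE_COLLECT_DEAD_ITEMS,   /* scan the table, remember dead TIDs */
        PV_PHASE_VACUUM_INDEXES,       /* bulk-delete the collected TIDs */
        PV_PHASE_VACUUM_TABLE,         /* second table pass */
        PV_PHASE_DONE
    } ParallelVacuumPhase;

    typedef struct
    {
        _Atomic int phase;             /* current phase, advanced by the leader */
    } ParallelVacuumShared;

    /* Stub work routines; a real worker would call the AM callbacks here. */
    static void collect_dead_items(void) { }
    static void vacuum_indexes(void) { }
    static void vacuum_table(void) { }

    static void
    parallel_vacuum_worker_loop(ParallelVacuumShared *shared)
    {
        int last_seen = PV_PHASE_IDLE;

        for (;;)
        {
            int phase = atomic_load(&shared->phase);

            if (phase == last_seen)
                continue;              /* placeholder for waiting on the leader */
            last_seen = phase;

            switch (phase)
            {
                case PV_PHASE_COLLECT_DEAD_ITEMS:
                    collect_dead_items();
                    break;
                case PV_PHASE_VACUUM_INDEXES:
                    vacuum_indexes();
                    break;
                case PV_PHASE_VACUUM_TABLE:
                    vacuum_table();
                    break;
                case PV_PHASE_DONE:
                    return;            /* all phases finished, worker exits */
                default:
                    break;
            }
        }
    }

The extensibility question above is essentially whether the bodies of those case arms are hard-coded for heap or looked up through callbacks supplied by the table AM.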
On 9/8/25 17:40, Melanie Plageman wrote: > On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman >> <melanieplageman@gmail.com> wrote: >> >>> If you do parallel worker setup below heap_vacuum_rel(), then how are >>> you supposed to use those workers to do non-heap table vacuuming? >> >> IIUC non-heap tables can call parallel_vacuum_init() in its >> relation_vacuum table AM callback implementation in order to >> initialize parallel table vacuum, parallel index vacuum, or both. > > Yep, it's just that that is more like a library helper function. > Currently, in master, the flow is: > > leader does: > heap_vacuum_rel() > dead_items_alloc() > parallel_vacuum_init() > passes parallel_vacuum_main() to parallel context > lazy_scan_heap() > ... > parallel_vacuum_process_all_indexes() > LaunchParallelWorkers() > ... > parallel_vacuum_process_[un]safe_indexes() > > For the leader, I don't remember where your patch moves > LaunchParallelWorkers() -- since it would have to move to before the > first heap vacuuming phase. When I was imagining what made the most > sense to make parallelism agnostic of heap, it seemed like it would be > moving LaunchParallelWorkers() above heap_vacuum_rel(). But I > recognize that is a pretty big change, and I haven't tried it, so I > don't know all the complications. (sounds like you mention some > below). > > Basically, right now in master, we expose some functions for parallel > index vacuuming that people can call in their own table implementation > of table vacuuming, but we don't invoke any of them in table-AM > agnostic code. That may work fine for just index vacuuming, but now > that we are trying to make parts of the table vacuuming itself > extensible, the interface feels more awkward. > > I won't push the idea further because I'm not even sure if it works, > but the thing that felt the most natural to me was pushing parallelism > above the table AM. > > ... I took a quick look at the patch this week. I don't have a very strong opinion on the changes to table AM API, and I somewhat agree with this impression. It's not clear to me why we should be adding callbacks that are AM-specific (and only ever called from that particular AM) to the common AM interface. I keep thinking about how we handle parallelism in index builds. The index AM API did not get a bunch of new callbacks, it's all handled within the existing ambuild() callback. Shouldn't we be doing something like that for relation_vacuum()? That is, each AM-specific relation_vacuum() would be responsible for starting / stopping workers and all that. relation_vacuum would need some flag indicating it's OK to use parallelism, of course. We might add some shared infrastructure to make that more convenient, of course. This is roughly how index builds handle parallelism. It's not perfect (e.g. the plan_create_index_workers can do unexpected stuff for non-btree indexes). Interestingly enough, this seems to be the exact opposite of what Melanie proposed above, i.e. moving the parallelism "above" the table AM. Which AFAICS means we'd keep the new table AM callbacks, but move the calls "up" above the AM-specific parts. That should be possible, and maybe even cleaner. But ISTM it'd require a much more extensive refactoring of how vacuum works. Haven't tried, ofc. I also repeated the stress test / benchmark, measuring impact with different number of indexes, amount of deleted tuples, etc. 
Attached is a PDF summarizing the results for each part of the patch series (with different numbers of workers). For these tests, the improvements are significant only with no indexes - then the 0004 patch saves 30-50%. But as soon as indexes are added, the index cleanup starts to dominate. It's just an assessment, I'm not saying we shouldn't parallelize the heap cleanup. I assume there are workloads for which the patches will make much more of a difference. But, what would such cases look like? If I want to maximize the impact of this patch series, what should I do? FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very extensive, so I did that (in order to run the tests); attached is the result as v19. This also reminds me that I've heard a couple of complaints that we don't allow parallelism in autovacuum. We have parallel index vacuum, but it's disabled in autovacuum (and users really don't want to run manual vacuum). That question is almost entirely orthogonal to this patch, of course. I'm not suggesting this patch has to (or even should) do anything about it. But I wonder if you have any thoughts on that? I believe the reason why parallelism is disabled in autovacuum is that we want autovacuum to be a background process, with minimal disruption to user workload. It probably wouldn't be that hard to allow autovacuum to do parallel stuff, but it feels similar to adding autovacuum workers. That's rarely the solution, without increasing the cost limit. regards -- Tomas Vondra
Attachments
- v19-0005-Add-more-parallel-vacuum-tests.patch
- v19-0004-Support-parallelism-for-collecting-dead-items-du.patch
- v19-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v19-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v19-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- create.sql
- parallel-vacuum-test.sh
- parallel-heap-vacuum.pdf
On Wed, Sep 17, 2025 at 7:25 AM Tomas Vondra <tomas@vondra.me> wrote: > I took a quick look at the patch this week. I don't have a very strong > opinion on the changes to table AM API, and I somewhat agree with this > impression. It's not clear to me why we should be adding callbacks that > are AM-specific (and only ever called from that particular AM) to the > common AM interface. We clearly should not do that. > I keep thinking about how we handle parallelism in index builds. The > index AM API did not get a bunch of new callbacks, it's all handled > within the existing ambuild() callback. Shouldn't we be doing something > like that for relation_vacuum()? I have a feeling that we might have made the wrong decision there. That approach will probably require a good deal of code to be duplicated for each AM. I'm not sure what the final solution should look like here, but we want the common parts like worker setup to use common code, while allowing each AM to insert its own logic in the places where that is needed. The challenge in my view is to figure out how best to arrange things so as to make that possible. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/17/25 18:01, Robert Haas wrote: > On Wed, Sep 17, 2025 at 7:25 AM Tomas Vondra <tomas@vondra.me> wrote: >> I took a quick look at the patch this week. I don't have a very strong >> opinion on the changes to table AM API, and I somewhat agree with this >> impression. It's not clear to me why we should be adding callbacks that >> are AM-specific (and only ever called from that particular AM) to the >> common AM interface. > > We clearly should not do that. > >> I keep thinking about how we handle parallelism in index builds. The >> index AM API did not get a bunch of new callbacks, it's all handled >> within the existing ambuild() callback. Shouldn't we be doing something >> like that for relation_vacuum()? > > I have a feeling that we might have made the wrong decision there. > That approach will probably require a good deal of code to be > duplicated for each AM. I'm not sure what the final solution should > look like here, but we want the common parts like worker setup to use > common code, while allowing each AM to insert its own logic in the > places where that is needed. The challenge in my view is to figure out > how best to arrange things so as to make that possible. > But a lot of the parallel-mode setup is already wrapped in some API (for example LaunchParallelWorkers, WaitForParallelWorkersToAttach, CreateParallelContext, ...). I guess we might "invert" how the parallel builds work - invent a set of callbacks / API an index AM would need to implement to support parallel builds. And then those callbacks would be called from a single "parallel index build" routine. But I don't think there's a lot of duplicated code, at least based on my experience with implementing parallel builds for BRIN and GIN. Look at the BRIN code, for example. Most of the parallel stuff happens in _brin_begin_parallel. Maybe more of it could be generalized a bit more (some of the shmem setup?). But most of it is tied to the AM-specific state / how parallel builds work for that particular AM. regards -- Tomas Vondra
On Wed, Sep 17, 2025 at 12:23 PM Tomas Vondra <tomas@vondra.me> wrote: > Look at the BRIN code, for example. Most of the parallel stuff happens > in _brin_begin_parallel. Maybe more of it could be generalized a bit > more (some of the shmem setup?). But most of it is tied to the > AM-specific state / how parallel builds work for that particular AM. Well, the code for PARALLEL_KEY_WAL_USAGE, PARALLEL_KEY_BUFFER_USAGE, and PARALLEL_KEY_QUERY_TEXT is duplicated, for instance. That's not a ton of code, perhaps, but it may evolve over time, and having to keep copies for a bunch of different AMs in sync is not ideal. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/17/25 18:32, Robert Haas wrote: > On Wed, Sep 17, 2025 at 12:23 PM Tomas Vondra <tomas@vondra.me> wrote: >> Look at the BRIN code, for example. Most of the parallel stuff happens >> in _brin_begin_parallel. Maybe more of it could be generalized a bit >> more (some of the shmem setup?). But most of it is tied to the >> AM-specific state / how parallel builds work for that particular AM. > > Well, the code for PARALLEL_KEY_WAL_USAGE, PARALLEL_KEY_BUFFER_USAGE, > and PARALLEL_KEY_QUERY_TEXT is duplicated, for instance. That's not a > ton of code, perhaps, but it may evolve over time, and having to keep > copies for a bunch of different AMs in sync is not ideal. > True. And I agree it's not great it might break if we need to setup the wal/buffer usage tracking a bit differently (and forget to update all the places, or even can't do that in custom AMs). I suppose we could wrap that in a helper, and call that? That's what I meant by "maybe we could generalize the shmem setup". The alternative would be to have a single AM-agnostic place doing parallel builds with any index AM, and calls "AM callbacks" instead. AFAICS that's pretty much how Melanie imagines the parallel vacuum to work (at least that's how I understand it). I'm not sure which way is better. I'm terrible in designing APIs. For the parallel heap vacuum, the patches seem to be a bit 50:50. The per-AM callbacks are there, but each AM still has to do the custom code anyway. -- Tomas Vondra
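To make the "wrap it in a helper" idea a bit more concrete, here is a self-contained sketch of the pattern that the duplicated PARALLEL_KEY_WAL_USAGE / PARALLEL_KEY_BUFFER_USAGE code follows: one counter slot per worker in shared memory, each worker adds its local counts into its own slot on exit, and the leader folds all slots into its totals afterwards. The struct and function names below are hypothetical stand-ins for the real BufferUsage/WalUsage fields and the InstrAccumParallelQuery()-style accumulation, and plain calloc() stands in for the DSM/shm_toc allocation.

    #include <stdlib.h>

    /* Hypothetical stand-in for the per-worker BufferUsage/WalUsage counters. */
    typedef struct UsageCounters
    {
        long blocks_hit;
        long blocks_read;
        long wal_records;
        long wal_bytes;
    } UsageCounters;

    /* Leader: reserve one slot per worker (in the real code, DSM via shm_toc). */
    static UsageCounters *
    usage_array_create(int nworkers)
    {
        return calloc((size_t) nworkers, sizeof(UsageCounters));
    }

    /* Worker: fold its local counters into its own slot before exiting. */
    static void
    usage_report(UsageCounters *slots, int worker_number, const UsageCounters *local)
    {
        slots[worker_number].blocks_hit  += local->blocks_hit;
        slots[worker_number].blocks_read += local->blocks_read;
        slots[worker_number].wal_records += local->wal_records;
        slots[worker_number].wal_bytes   += local->wal_bytes;
    }

    /* Leader: accumulate every worker's slot once they have all exited. */
    static void
    usage_accumulate(UsageCounters *total, const UsageCounters *slots, int nworkers)
    {
        for (int i = 0; i < nworkers; i++)
        {
            total->blocks_hit  += slots[i].blocks_hit;
            total->blocks_read += slots[i].blocks_read;
            total->wal_records += slots[i].wal_records;
            total->wal_bytes   += slots[i].wal_bytes;
        }
    }

A helper with roughly this shape (estimate and initialize on the leader side, report on the worker side, accumulate on the leader side) could in principle be shared by parallel index builds, parallel vacuum, and future callers, instead of each AM repeating the toc bookkeeping.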
On Wed, Sep 17, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote: > True. And I agree it's not great it might break if we need to setup the > wal/buffer usage tracking a bit differently (and forget to update all > the places, or even can't do that in custom AMs). Exactly. > I suppose we could wrap that in a helper, and call that? That's what I > meant by "maybe we could generalize the shmem setup". Yep, that's one possibility. > The alternative would be to have a single AM-agnostic place doing > parallel builds with any index AM, and calls "AM callbacks" instead. > AFAICS that's pretty much how Melanie imagines the parallel vacuum to > work (at least that's how I understand it). And that's the other possibility. > I'm not sure which way is better. I'm terrible in designing APIs. I'm not sure either. I don't think I'm terrible at designing APIs (and you probably aren't either) but I don't know enough about this particular problem space to be certain what's best. > For the parallel heap vacuum, the patches seem to be a bit 50:50. The > per-AM callbacks are there, but each AM still has to do the custom code > anyway. Unless I misunderstand, that seems like the worst of all possible situations. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 17, 2025 at 4:25 AM Tomas Vondra <tomas@vondra.me> wrote: > > On 9/8/25 17:40, Melanie Plageman wrote: > > On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> > >> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman > >> <melanieplageman@gmail.com> wrote: > >> > >>> If you do parallel worker setup below heap_vacuum_rel(), then how are > >>> you supposed to use those workers to do non-heap table vacuuming? > >> > >> IIUC non-heap tables can call parallel_vacuum_init() in its > >> relation_vacuum table AM callback implementation in order to > >> initialize parallel table vacuum, parallel index vacuum, or both. > > > > Yep, it's just that that is more like a library helper function. > > Currently, in master, the flow is: > > > > leader does: > > heap_vacuum_rel() > > dead_items_alloc() > > parallel_vacuum_init() > > passes parallel_vacuum_main() to parallel context > > lazy_scan_heap() > > ... > > parallel_vacuum_process_all_indexes() > > LaunchParallelWorkers() > > ... > > parallel_vacuum_process_[un]safe_indexes() > > > > For the leader, I don't remember where your patch moves > > LaunchParallelWorkers() -- since it would have to move to before the > > first heap vacuuming phase. When I was imagining what made the most > > sense to make parallelism agnostic of heap, it seemed like it would be > > moving LaunchParallelWorkers() above heap_vacuum_rel(). But I > > recognize that is a pretty big change, and I haven't tried it, so I > > don't know all the complications. (sounds like you mention some > > below). > > > > Basically, right now in master, we expose some functions for parallel > > index vacuuming that people can call in their own table implementation > > of table vacuuming, but we don't invoke any of them in table-AM > > agnostic code. That may work fine for just index vacuuming, but now > > that we are trying to make parts of the table vacuuming itself > > extensible, the interface feels more awkward. > > > > I won't push the idea further because I'm not even sure if it works, > > but the thing that felt the most natural to me was pushing parallelism > > above the table AM. > > > > ... > > I took a quick look at the patch this week. I don't have a very strong > opinion on the changes to table AM API, and I somewhat agree with this > impression. It's not clear to me why we should be adding callbacks that > are AM-specific (and only ever called from that particular AM) to the > common AM interface. I agree that we should not add AM-specific callbacks to the AM interface. > I keep thinking about how we handle parallelism in index builds. The > index AM API did not get a bunch of new callbacks, it's all handled > within the existing ambuild() callback. Shouldn't we be doing something > like that for relation_vacuum()? > > That is, each AM-specific relation_vacuum() would be responsible for > starting / stopping workers and all that. relation_vacuum would need > some flag indicating it's OK to use parallelism, of course. We might add > some shared infrastructure to make that more convenient, of course. > > This is roughly how index builds handle parallelism. It's not perfect > (e.g. the plan_create_index_workers can do unexpected stuff for > non-btree indexes). In parallel vacuum cases, I believe we need to consider that vacuumparallel.c (particularly ParallelVacuumContext) effectively serves as a shared infrastructure. 
It handles several core functions, including the initialization of shared dead items storage (shared TidStore), worker management, buffer/WAL usage, vacuum-delay settings, and provides APIs for parallel index vacuuming/cleanup. If we were to implement parallel heap vacuum naively in vacuumlazy.c, we would likely end up with significant code duplication. To avoid this redundancy, my idea is to extend vacuumparallel.c to support parallel table vacuum. This approach would allow table AMs to use vacuumparallel.c for parallel table vacuuming, parallel index vacuuming, or both. While I agree that adding new table AM callbacks for parallel table vacuum to the table AM interface isn't a good idea, I think passing the necessary callbacks for parallel heap vacuum to vacuumparallel.c through parallel_vacuum_init() follows a similar principle. This means that all information needed for parallel table/index vacuuming would remain within vacuumparallel.c, and table AMs could trigger parallel table vacuum, parallel index vacuuming, and parallel index cleanup through the APIs provided by vacuumparallel.c. The alternative approach, as you suggested, would be to have the table AM's relation_vacuum() handle the parallel processing implementation while utilizing the infrastructure and APIs provided by vacuumparallel.c. Based on the feedback received, I'm now evaluating what shared information and helper functions would be necessary for this latter approach. > I also repeated the stress test / benchmark, measuring impact with > different number of indexes, amount of deleted tuples, etc. Attached is > a PDF summarizing the results for each part of the patch series (with > different number of workers). For this tests, the improvements are > significant only with no indexes - then the 0004 patch saves 30-50%. But > as soon as indexes are added, the index cleanup starts to dominate. > > It's just an assessment, I'm not saying we shouldn't parallelize the > heap cleanup. I assume there are workloads for which patches will make > much more difference. But, what would such cases look like? If I want to > maximize the impact of this patch series, what should I do? Another use case is executing vacuum without index cleanup for faster freezing XIDs on a large table. > > FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very > extensive, so I did that (in order to run the tests), attached is the > result as v19. Thank you. > This also reminds I heard a couple complaints we don't allow parallelism > in autovacuum. We have parallel index vacuum, but it's disabled in > autovacuum (and users really don't want to run manual vacuum). > > That question is almost entirely orthogonal to this patch, of course. > I'm not suggesting this patch has (or even should) to do anything about > it. But I wonder if you have any thoughts on that? > > I believe the reason why parallelism is disabled in autovacuum is that > we want autovacuum to be a background process, with minimal disruption > to user workload. It probably wouldn't be that hard to allow autovacuum > to do parallel stuff, but it feels similar to adding autovacuum workers. > That's rarely the solution, without increasing the cost limit. A patch enabling autovacuum to use parallel vacuum capabilities has already been proposed[1], and I've been reviewing it. There seems to be consensus that users would benefit from parallel index vacuuming during autovacuum in certain scenarios. 
As an initial implementation, parallel vacuum in autovacuum will be disabled by default, but users can enable it by specifying parallel degrees through a new storage option for specific tables where they want parallel vacuuming. For your information, while the implementation itself is relatively straightforward, we're still facing one unresolved issue; the system doesn't support reloading the configuration file during parallel mode, but it is necessary for autovacuum to update vacuum cost delay parameters[2]. Regards, [1] https://www.postgresql.org/message-id/CACG%3DezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw%3Dcw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAD21AoD6HhraqhOgkQJOrr0ixZkAZuqJRpzGv-B%2B_-ad6d5aPw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: > I believe the reason why parallelism is disabled in autovacuum is that > we want autovacuum to be a background process, with minimal disruption > to user workload. It probably wouldn't be that hard to allow autovacuum > to do parallel stuff, but it feels similar to adding autovacuum workers. > That's rarely the solution, without increasing the cost limit. I continue to find this argument extremely unconvincing. It's very common for autovacuum to be continuously busy with the one large table that has a bunch of indexes. Vacuuming that one table is what prevents the freeze horizon from moving forward / prevents getting out of anti-wraparound territory in time. Greetings, Andres Freund
Hi, On 2025-09-17 12:01:47 -0400, Robert Haas wrote: > > I keep thinking about how we handle parallelism in index builds. The > > index AM API did not get a bunch of new callbacks, it's all handled > > within the existing ambuild() callback. Shouldn't we be doing something > > like that for relation_vacuum()? > > I have a feeling that we might have made the wrong decision there. You might be right. > That approach will probably require a good deal of code to be > duplicated for each AM. I'm not sure what the final solution should > look like here, but we want the common parts like worker setup to use > common code, while allowing each AM to insert its own logic in the > places where that is needed. The challenge in my view is to figure out > how best to arrange things so as to make that possible. I've actually been thinking about the structure of the index build code recently. There have been a bunch of requests, one recently on the list, to build multiple indexes with one table scan - our current code structure would make that a pretty hard feature to develop. Greetings, Andres Freund
On 9/18/25 01:22, Andres Freund wrote: > Hi, > > On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: >> I believe the reason why parallelism is disabled in autovacuum is that >> we want autovacuum to be a background process, with minimal disruption >> to user workload. It probably wouldn't be that hard to allow autovacuum >> to do parallel stuff, but it feels similar to adding autovacuum workers. >> That's rarely the solution, without increasing the cost limit. > > I continue to find this argument extremely unconvincing. It's very common for > autovacuum to be continuously be busy with the one large table that has a > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > to move forward / prevents getting out of anti-wraparound territory in time. > OK. I'm not claiming the argument is correct, I was mostly asking if this was the argument for not allowing parallelism in autovacuum. I don't doubt an autovacuum worker can get "stuck" on a huge table, holding back the freeze horizon. But does it happen even with an increased cost limit? And is the bottleneck I/O or CPU? If it's vacuum_cost_limit, then the right "fix" is increasing the limit. Just adding workers improves nothing. If it's waiting on I/O, then adding workers is not going to help much. With a CPU bottleneck it might, though. Does that match the cases you saw? regards -- Tomas Vondra
On 9/18/25 01:18, Masahiko Sawada wrote: > On Wed, Sep 17, 2025 at 4:25 AM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 9/8/25 17:40, Melanie Plageman wrote: >>> On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> >>>> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman >>>> <melanieplageman@gmail.com> wrote: >>>> >>>>> If you do parallel worker setup below heap_vacuum_rel(), then how are >>>>> you supposed to use those workers to do non-heap table vacuuming? >>>> >>>> IIUC non-heap tables can call parallel_vacuum_init() in its >>>> relation_vacuum table AM callback implementation in order to >>>> initialize parallel table vacuum, parallel index vacuum, or both. >>> >>> Yep, it's just that that is more like a library helper function. >>> Currently, in master, the flow is: >>> >>> leader does: >>> heap_vacuum_rel() >>> dead_items_alloc() >>> parallel_vacuum_init() >>> passes parallel_vacuum_main() to parallel context >>> lazy_scan_heap() >>> ... >>> parallel_vacuum_process_all_indexes() >>> LaunchParallelWorkers() >>> ... >>> parallel_vacuum_process_[un]safe_indexes() >>> >>> For the leader, I don't remember where your patch moves >>> LaunchParallelWorkers() -- since it would have to move to before the >>> first heap vacuuming phase. When I was imagining what made the most >>> sense to make parallelism agnostic of heap, it seemed like it would be >>> moving LaunchParallelWorkers() above heap_vacuum_rel(). But I >>> recognize that is a pretty big change, and I haven't tried it, so I >>> don't know all the complications. (sounds like you mention some >>> below). >>> >>> Basically, right now in master, we expose some functions for parallel >>> index vacuuming that people can call in their own table implementation >>> of table vacuuming, but we don't invoke any of them in table-AM >>> agnostic code. That may work fine for just index vacuuming, but now >>> that we are trying to make parts of the table vacuuming itself >>> extensible, the interface feels more awkward. >>> >>> I won't push the idea further because I'm not even sure if it works, >>> but the thing that felt the most natural to me was pushing parallelism >>> above the table AM. >>> >>> ... >> >> I took a quick look at the patch this week. I don't have a very strong >> opinion on the changes to table AM API, and I somewhat agree with this >> impression. It's not clear to me why we should be adding callbacks that >> are AM-specific (and only ever called from that particular AM) to the >> common AM interface. > > I agree that we should not add AM-specific callbacks to the AM interface. > >> I keep thinking about how we handle parallelism in index builds. The >> index AM API did not get a bunch of new callbacks, it's all handled >> within the existing ambuild() callback. Shouldn't we be doing something >> like that for relation_vacuum()? >> >> That is, each AM-specific relation_vacuum() would be responsible for >> starting / stopping workers and all that. relation_vacuum would need >> some flag indicating it's OK to use parallelism, of course. We might add >> some shared infrastructure to make that more convenient, of course. >> >> This is roughly how index builds handle parallelism. It's not perfect >> (e.g. the plan_create_index_workers can do unexpected stuff for >> non-btree indexes). > > In parallel vacuum cases, I believe we need to consider that > vacuumparallel.c (particularly ParallelVacuumContext) effectively > serves as a shared infrastructure. 
It handles several core functions, > including the initialization of shared dead items storage (shared > TidStore), worker management, buffer/WAL usage, vacuum-delay settings, > and provides APIs for parallel index vacuuming/cleanup. If we were to > implement parallel heap vacuum naively in vacuumlazy.c, we would > likely end up with significant code duplication. > > To avoid this redundancy, my idea is to extend vacuumparallel.c to > support parallel table vacuum. This approach would allow table AMs to > use vacuumparallel.c for parallel table vacuuming, parallel index > vacuuming, or both. While I agree that adding new table AM callbacks > for parallel table vacuum to the table AM interface isn't a good idea, > I think passing the necessary callbacks for parallel heap vacuum to > vacuumparallel.c through parallel_vacuum_init() follows a similar > principle. This means that all information needed for parallel > table/index vacuuming would remain within vacuumparallel.c, and table > AMs could trigger parallel table vacuum, parallel index vacuuming, and > parallel index cleanup through the APIs provided by vacuumparallel.c. > > The alternative approach, as you suggested, would be to have the table > AM's relation_vacuum() handle the parallel processing implementation > while utilizing the infrastructure and APIs provided by > vacuumparallel.c. Based on the feedback received, I'm now evaluating > what shared information and helper functions would be necessary for > this latter approach. > OK >> I also repeated the stress test / benchmark, measuring impact with >> different number of indexes, amount of deleted tuples, etc. Attached is >> a PDF summarizing the results for each part of the patch series (with >> different number of workers). For this tests, the improvements are >> significant only with no indexes - then the 0004 patch saves 30-50%. But >> as soon as indexes are added, the index cleanup starts to dominate. >> >> It's just an assessment, I'm not saying we shouldn't parallelize the >> heap cleanup. I assume there are workloads for which patches will make >> much more difference. But, what would such cases look like? If I want to >> maximize the impact of this patch series, what should I do? > > Another use case is executing vacuum without index cleanup for faster > freezing XIDs on a large table. > Good point. I forgot about this case, where we skip index cleanup. >> >> FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very >> extensive, so I did that (in order to run the tests), attached is the >> result as v19. > > Thank you. > >> This also reminds I heard a couple complaints we don't allow parallelism >> in autovacuum. We have parallel index vacuum, but it's disabled in >> autovacuum (and users really don't want to run manual vacuum). >> >> That question is almost entirely orthogonal to this patch, of course. >> I'm not suggesting this patch has (or even should) to do anything about >> it. But I wonder if you have any thoughts on that? >> >> I believe the reason why parallelism is disabled in autovacuum is that >> we want autovacuum to be a background process, with minimal disruption >> to user workload. It probably wouldn't be that hard to allow autovacuum >> to do parallel stuff, but it feels similar to adding autovacuum workers. >> That's rarely the solution, without increasing the cost limit. > > A patch enabling autovacuum to use parallel vacuum capabilities has > already been proposed[1], and I've been reviewing it. 
There seems to > be consensus that users would benefit from parallel index vacuuming > during autovacuum in certain scenarios. As an initial implementation, > parallel vacuum in autovacuum will be disabled by default, but users > can enable it by specifying parallel degrees through a new storage > option for specific tables where they want parallel vacuuming. > Thanks, the v18 patch seems generally sensible, but I only skimmed it. I'll try to take a closer look ... I agree it probably makes sense to disable that by default. One thing that bothers me is that this makes predicting autovacuum behavior even more difficult (but that's not the fault of the patch). > For your information, while the implementation itself is relatively > straightforward, we're still facing one unresolved issue; the system > doesn't support reloading the configuration file during parallel mode, > but it is necessary for autovacuum to update vacuum cost delay > parameters[2]. > Hmmm, that's annoying :-( If it happens only for max_stack_depth (when setting it based on environment) maybe we could simply skip that when in parallel mode? But it's ugly, and we probably will have the same issue for any other GUC_ALLOW_IN_PARALLEL option in a file ... regards -- Tomas Vondra
Hi, On 2025-09-18 02:00:50 +0200, Tomas Vondra wrote: > On 9/18/25 01:22, Andres Freund wrote: > > Hi, > > > > On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: > >> I believe the reason why parallelism is disabled in autovacuum is that > >> we want autovacuum to be a background process, with minimal disruption > >> to user workload. It probably wouldn't be that hard to allow autovacuum > >> to do parallel stuff, but it feels similar to adding autovacuum workers. > >> That's rarely the solution, without increasing the cost limit. > > > > I continue to find this argument extremely unconvincing. It's very common for > > autovacuum to be continuously be busy with the one large table that has a > > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > > to move forward / prevents getting out of anti-wraparound territory in time. > > > > OK. I'm not claiming the argument is correct, I mostly asking if this > was the argument for not allowing parallelism in autovacuum. > I don't doubt an autovacuum worker can get "stuck" on a huge table, > holding back the freeze horizon. But does it happen even with an > increased cost limit? And is the bottleneck I/O or CPU? > If it's vacuum_cost_limit, then the right "fix" is increasing the limit. > Just adding workers improves nothing. If it's waiting on I/O, then > adding workers is not going to help much. With CPU bottleneck it might, > though. Does that match the cases you saw? Cost limit can definitely sometimes be the issue and indeed in that case increasing parallelism is the wrong answer. But I've repeatedly seen autovac being bottlenecked by CPU and/or I/O. Greetings, Andres Freund
On Wed, Sep 17, 2025 at 7:22 PM Andres Freund <andres@anarazel.de> wrote: > I continue to find this argument extremely unconvincing. It's very common for > autovacuum to be continuously be busy with the one large table that has a > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > to move forward / prevents getting out of anti-wraparound territory in time. The problem is that we're not smart enough to know whether this is the case or not. It's also fairly common for tables to get starved because all of the autovacuum workers are busy. Until we get a real scheduling system for autovacuum, I don't see this area improving very much. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 17, 2025 at 5:24 PM Tomas Vondra <tomas@vondra.me> wrote: > > > > On 9/18/25 01:18, Masahiko Sawada wrote: > > For your information, while the implementation itself is relatively > > straightforward, we're still facing one unresolved issue; the system > > doesn't support reloading the configuration file during parallel mode, > > but it is necessary for autovacuum to update vacuum cost delay > > parameters[2]. > > > > Hmmm, that's annoying :-( > > If it happens only for max_stack_depth (when setting it based on > environment) maybe we could simply skip that when in parallel mode? But > it's ugly, and we probably will have the same issue for any other > GUC_ALLOW_IN_PARALLEL option in a file ... We have the same issue for all non-GUC_ALLOW_IN_PARALLEL parameters. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com