Discussion: Parallel heap vacuum
Hi all,

The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would help speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

I've attached a PoC patch for this feature. It implements only parallel heap scans in lazy vacuum. We can extend this feature to support parallel heap vacuum as well, either in the future or in the same patch.

# Overall idea (for parallel heap scan in lazy vacuum)

At the beginning of vacuum, we determine how many workers to launch based on the table size, like other parallel query operations. The number of workers is capped by max_parallel_maintenance_workers. Once we decide to use a parallel heap scan, we prepare DSM to share data among the parallel workers and the leader. The shared information includes at least the vacuum options such as aggressive, the counters collected during lazy vacuum such as scanned_pages, the vacuum cutoffs such as VacuumCutoffs and GlobalVisState, and the parallel scan description.

Before starting the heap scan in lazy vacuum, we launch parallel workers, and then each worker (and the leader) processes different blocks. Each worker does HOT-pruning on pages and collects dead tuple TIDs. When adding dead tuple TIDs, workers need to hold an exclusive lock on the TidStore. At the end of the heap scan phase, workers exit and the leader waits for all workers to exit. After that, the leader process gathers the counters collected by the parallel workers and computes the oldest relfrozenxid (and relminmxid). Then, if parallel index vacuum is also enabled, we launch other parallel workers for parallel index vacuuming.

When it comes to parallel heap scan in lazy vacuum, I think we can use the table_block_parallelscan_XXX() family. One tricky thing we need to deal with is that if the TidStore memory usage reaches the limit, we stop the parallel scan, do index vacuum and table vacuum, and then resume the parallel scan from the previous state. In order to do that, in the patch, we store ParallelBlockTableScanWorker, the per-worker parallel scan state, in DSM so that different parallel workers can resume the scan using the same parallel scan state.

In addition to that, since we could end up launching fewer workers than requested, it could happen that some ParallelBlockTableScanWorker data is used once and never used again while blocks remain unprocessed. To handle this case, in the patch, the leader process checks at the end of the parallel scan whether there is an uncompleted parallel scan. If so, the leader process does the scan using the workers' ParallelBlockTableScanWorker data on behalf of the workers.

# Discussions

I'm fairly convinced of the basic design of this feature, but there are some points regarding the implementation we need to discuss.

In the patch, I extended vacuumparallel.c to support parallel table scan (and vacuum in the future). So I needed to add some table AM callbacks such as DSM size estimation, DSM initialization, and the actual table scans, etc. We need to verify that these APIs are appropriate. Specifically, if we want to support both parallel heap scan and parallel heap vacuum, do we want to add separate callbacks for them? It could be overkill since such a 2-pass vacuum strategy is specific to the heap AM.

As another implementation idea, we might want to implement parallel heap scan/vacuum in lazyvacuum.c while minimizing changes to vacuumparallel.c. That way, we would not need to add table AM callbacks. However, we would end up having duplicate code related to parallel operation in vacuum, such as vacuum delays. Also, we might need to add some functions to share GlobalVisState among parallel workers, since GlobalVisState is a private struct.

Other points I'm somewhat uncomfortable with, or that need to be discussed, remain in the code as XXX comments.

# Benchmark results

* Test-1: parallel heap scan on the table without indexes

I created a 20GB table, made garbage on the table, and ran vacuum while changing the parallel degree:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

Here are the results (total time and heap scan time):

PARALLEL 0: 21.99 s (single process)
PARALLEL 1: 11.39 s
PARALLEL 2: 8.36 s
PARALLEL 3: 6.14 s
PARALLEL 4: 5.08 s

* Test-2: parallel heap scan on the table with one index

I used a similar table to test case 1 but created one btree index on it:

create unlogged table test (a int) with (autovacuum_enabled = off);
insert into test select generate_series(1, 600000000); --- 20GB table
create index on test (a);
delete from test where a % 5 = 0;
vacuum (verbose, parallel 0) test;

I've measured the total execution time as well as the time of each vacuum phase (from left: heap scan time, index vacuum time, and heap vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent scalability; in both test-1 and test-2, the execution time of the heap scan got ~4x faster with 4 parallel workers. On the other hand, when it comes to the total vacuum execution time, I could not see much performance improvement in test-2 (45.11 vs. 35.84). Looking at the results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), and the heap scan in test-2 was not as fast as in test-1 with 1 parallel worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable since we have a single lock on it. In all cases in test-1, we don't use the shared TidStore since all dead tuples are removed during heap pruning. So the scalability was better overall than in test-2. In the parallel 0 case in test-2, we use the local TidStore, and from a parallel degree of 1 in test-2, we use the shared TidStore and parallel workers concurrently update it. Also, I guess that the lookup performance of the local TidStore is better than the shared TidStore's lookup performance because of the differences between a bump context and a DSA area. I think this difference contributed to the fact that index vacuuming got slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution time: (1) improve the shared TidStore scalability and (2) support parallel heap vacuum. For (1), several ideas are proposed by the ART authors[1]. I've not tried these ideas but they might be applicable to our ART implementation. But I prefer to start with (2) since it would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
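For reference, a quick way to confirm how much garbage the DELETE in the benchmark scripts leaves behind before timing the vacuum runs is to look at the cumulative statistics. A minimal sketch, assuming the 'test' table from the scripts above and that the statistics have been flushed (default track_counts):

```
-- roughly 20% of the rows should show up as dead after the DELETE
SELECT n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'test';
```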
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > # Benchmark results > > * Test-1: parallel heap scan on the table without indexes > > I created 20GB table, made garbage on the table, and run vacuum while > changing parallel degree: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 600000000); --- 20GB table > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > Here are the results (total time and heap scan time): > > PARALLEL 0: 21.99 s (single process) > PARALLEL 1: 11.39 s > PARALLEL 2: 8.36 s > PARALLEL 3: 6.14 s > PARALLEL 4: 5.08 s > > * Test-2: parallel heap scan on the table with one index > > I used a similar table to the test case 1 but created one btree index on it: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 600000000); --- 20GB table > create index on test (a); > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > I've measured the total execution time as well as the time of each > vacuum phase (from left heap scan time, index vacuum time, and heap > vacuum time): > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > Overall, I can see the parallel heap scan in lazy vacuum has a decent > scalability; In both test-1 and test-2, the execution time of heap > scan got ~4x faster with 4 parallel workers. On the other hand, when > it comes to the total vacuum execution time, I could not see much > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > and heap scan in case 2 was not as fast as in case 1 with 1 parallel > worker (12.75 vs. 11.39). > > I think the reason is the shared TidStore is not very scalable since > we have a single lock on it. In all cases in the test-1, we don't use > the shared TidStore since all dead tuples are removed during heap > pruning. So the scalability was better overall than in test-2. In > parallel 0 case in test-2, we use the local TidStore, and from > parallel degree of 1 in test-2, we use the shared TidStore and > parallel worker concurrently update it. Also, I guess that the lookup > performance of the local TidStore is better than the shared TidStore's > lookup performance because of the differences between a bump context > and an DSA area. I think that this difference contributed the fact > that index vacuuming got slower (16.74 vs. 22.04). > > There are two obvious improvement ideas to improve overall vacuum > execution time: (1) improve the shared TidStore scalability and (2) > support parallel heap vacuum. For (1), several ideas are proposed by > the ART authors[1]. I've not tried these ideas but it might be > applicable to our ART implementation. But I prefer to start with (2) > since it would be easier. Feedback is very welcome. > Starting with (2) sounds like a reasonable approach. We should study a few more things like (a) the performance results where there are 3-4 indexes, (b) What is the reason for performance improvement seen with only heap scans. We normally get benefits of parallelism because of using multiple CPUs but parallelizing scans (I/O) shouldn't give much benefits. 
Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables by restarting the nodes to ensure the data is not in memory.

--
With Regards,
Amit Kapila.
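One way to check that hypothesis before and after a restart is to count how many of the table's blocks are currently sitting in shared_buffers. A minimal sketch using the pg_buffercache extension, assuming the 'test' table and the default 8kB block size; note this only covers shared_buffers, so the OS page cache still needs a restart (or dropping the kernel cache) to rule out:

```
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- how many 8kB blocks of 'test' are cached in shared_buffers right now
SELECT count(*) AS cached_blocks,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache
WHERE reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND relfilenode = pg_relation_filenode('test');
```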
Dear Sawada-san,

> The parallel vacuum we have today supports only index vacuuming. Therefore, while multiple workers can work on different indexes in parallel, the heap table is always processed by a single process. I'd like to propose $subject, which enables us to have multiple workers running on a single heap table. This would help speed up vacuuming for tables without indexes or tables with INDEX_CLEANUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of postgres.

> I've attached a PoC patch for this feature. It implements only parallel heap scans in lazy vacuum. We can extend this feature to support parallel heap vacuum as well, either in the future or in the same patch.

Before diving in deep, I tested your PoC but found an unclear point. When vacuuming is requested with parallel > 0 and almost the same workload as yours, only the first page was scanned and cleaned up.

When parallel was set to zero, I got:

```
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were surely scanned and dead tuples were removed. However, when parallel was set to one, I got another result:

```
INFO:  vacuuming "postgres.public.test"
INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and 12 tuples were removed. It looks very strange to me...

The attached script emulates my test. IIUC it is almost the same as yours, but the instance was restarted before vacuuming.

Can you reproduce this and see the reason? Based on the requirement I can provide further information.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
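Besides the VERBOSE summary, the visibility map is another way to see whether a vacuum really processed the whole table. A minimal sketch using the pg_visibility extension (assuming the same 'test' table): with this workload one would expect most pages to be reported all-visible after the complete parallel-0 run, and far fewer after the truncated parallel-1 run.

```
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- number of pages marked all-visible / all-frozen in the visibility map
SELECT * FROM pg_visibility_map_summary('test');
```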
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > The parallel vacuum we have today supports only for index vacuuming. > > Therefore, while multiple workers can work on different indexes in > > parallel, the heap table is always processed by the single process. > > I'd like to propose $subject, which enables us to have multiple > > workers running on the single heap table. This would be helpful to > > speedup vacuuming for tables without indexes or tables with > > INDEX_CLENAUP = off. > > Sounds great. IIUC, vacuuming is still one of the main weak point of postgres. > > > I've attached a PoC patch for this feature. It implements only > > parallel heap scans in lazyvacum. We can extend this feature to > > support parallel heap vacuum as well in the future or in the same > > patch. > > Before diving into deep, I tested your PoC but found unclear point. > When the vacuuming is requested with parallel > 0 with almost the same workload > as yours, only the first page was scanned and cleaned up. > > When parallel was set to zero, I got: > ``` > INFO: vacuuming "postgres.public.test" > INFO: finished vacuuming "postgres.public.test": index scans: 0 > pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total) > tuples: 120000000 removed, 480000000 remain, 0 are dead but not yet removable > removable cutoff: 752, which was 0 XIDs old when operation ended > new relfrozenxid: 739, which is 1 XIDs ahead of previous value > frozen: 0 pages from table (0.00% of total) had 0 tuples frozen > index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed > avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s > buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied > WAL usage: 1 records, 1 full page images, 937 bytes > system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s > ``` > > This meant that all pages were surely scanned and dead tuples were removed. > However, when parallel was set to one, I got another result: > > ``` > INFO: vacuuming "postgres.public.test" > INFO: launched 1 parallel vacuum worker for table scanning (planned: 1) > INFO: finished vacuuming "postgres.public.test": index scans: 0 > pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total) > tuples: 12 removed, 0 remain, 0 are dead but not yet removable > removable cutoff: 752, which was 0 XIDs old when operation ended > frozen: 0 pages from table (0.00% of total) had 0 tuples frozen > index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed > avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s > buffer usage: 96 hits, 660 misses, 6 dirtied > WAL usage: 1 records, 1 full page images, 937 bytes > system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s > ``` > > It looked like that only a page was scanned and 12 tuples were removed. > It looks very strange for me... > > Attached script emulate my test. IIUC it was almost the same as yours, but > the instance was restarted before vacuuming. > > Can you reproduce and see the reason? Based on the requirement I can > provide further information. Thank you for the test! I could reproduce this issue and it's a bug; it skipped even non-all-visible pages. I've attached the new version patch. BTW since we compute the number of parallel workers for the heap scan based on the table size, it's possible that we launch multiple workers even if most blocks are all-visible. 
It seems to be better if we calculate it based on (relpages - relallvisible).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
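For context, relpages and relallvisible are the pg_class statistics such a calculation could draw on; they are refreshed by VACUUM/ANALYZE, so they are estimates rather than exact counts. A quick way to see how many blocks would actually need scanning on the benchmark table:

```
SELECT relpages,
       relallvisible,
       relpages - relallvisible AS blocks_not_all_visible
FROM pg_class
WHERE relname = 'test';
```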
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > # Benchmark results > > > > * Test-1: parallel heap scan on the table without indexes > > > > I created 20GB table, made garbage on the table, and run vacuum while > > changing parallel degree: > > > > create unlogged table test (a int) with (autovacuum_enabled = off); > > insert into test select generate_series(1, 600000000); --- 20GB table > > delete from test where a % 5 = 0; > > vacuum (verbose, parallel 0) test; > > > > Here are the results (total time and heap scan time): > > > > PARALLEL 0: 21.99 s (single process) > > PARALLEL 1: 11.39 s > > PARALLEL 2: 8.36 s > > PARALLEL 3: 6.14 s > > PARALLEL 4: 5.08 s > > > > * Test-2: parallel heap scan on the table with one index > > > > I used a similar table to the test case 1 but created one btree index on it: > > > > create unlogged table test (a int) with (autovacuum_enabled = off); > > insert into test select generate_series(1, 600000000); --- 20GB table > > create index on test (a); > > delete from test where a % 5 = 0; > > vacuum (verbose, parallel 0) test; > > > > I've measured the total execution time as well as the time of each > > vacuum phase (from left heap scan time, index vacuum time, and heap > > vacuum time): > > > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > > > Overall, I can see the parallel heap scan in lazy vacuum has a decent > > scalability; In both test-1 and test-2, the execution time of heap > > scan got ~4x faster with 4 parallel workers. On the other hand, when > > it comes to the total vacuum execution time, I could not see much > > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > > and heap scan in case 2 was not as fast as in case 1 with 1 parallel > > worker (12.75 vs. 11.39). > > > > I think the reason is the shared TidStore is not very scalable since > > we have a single lock on it. In all cases in the test-1, we don't use > > the shared TidStore since all dead tuples are removed during heap > > pruning. So the scalability was better overall than in test-2. In > > parallel 0 case in test-2, we use the local TidStore, and from > > parallel degree of 1 in test-2, we use the shared TidStore and > > parallel worker concurrently update it. Also, I guess that the lookup > > performance of the local TidStore is better than the shared TidStore's > > lookup performance because of the differences between a bump context > > and an DSA area. I think that this difference contributed the fact > > that index vacuuming got slower (16.74 vs. 22.04). > > Thank you for the comments! > > There are two obvious improvement ideas to improve overall vacuum > > execution time: (1) improve the shared TidStore scalability and (2) > > support parallel heap vacuum. For (1), several ideas are proposed by > > the ART authors[1]. I've not tried these ideas but it might be > > applicable to our ART implementation. But I prefer to start with (2) > > since it would be easier. Feedback is very welcome. > > > > Starting with (2) sounds like a reasonable approach. 
> We should study a few more things like (a) the performance results where there are 3-4 indexes,

Here are the results with 4 indexes (and restarting the server before the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39)

(launched 4 workers for heap scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with only heap scans. We normally get benefits of parallelism because of using multiple CPUs but parallelizing scans (I/O) shouldn't give much benefits. Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables by restarting the nodes to ensure the data is not in memory.

I think it depends on the storage performance. FYI I use an EC2 instance (m6id.metal). I've run the same benchmark script (table with no index) while restarting the server before executing the vacuum, and here are the results:

PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s

For both of the above tests, I used the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
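To see which phase a given run is spending its time in without timing phases by hand, the built-in progress view can be polled from another session while the vacuum is running. A minimal sketch:

```
SELECT phase,
       heap_blks_total,
       heap_blks_scanned,
       heap_blks_vacuumed,
       index_vacuum_count
FROM pg_stat_progress_vacuum;
```

Repeated snapshots of heap_blks_scanned also give a rough idea of how evenly the leader and workers are progressing through the table.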
Dear Sawada-san,

> Thank you for the test!
>
> I could reproduce this issue and it's a bug; it skipped even non-all-visible pages. I've attached the new version patch.
>
> BTW since we compute the number of parallel workers for the heap scan based on the table size, it's possible that we launch multiple workers even if most blocks are all-visible. It seems to be better if we calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all pages are scanned. I used almost the same script (just changed max_parallel_maintenance_workers) and got the results below. I think the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```

I started to read your code, but it takes much time because I've never seen it before... The part below contains initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1]. IIUC, this is because NewRelminMxid was moved from struct LVRelState to PHVShared. So you should update it like "vacrel->counters->NewRelminMxid".

2.
I compared this with parallel heap scan and found that it does not have a compute_worker API. Can you clarify the reason why there is an inconsistency? (I feel it is intentional because the calculation logic seems to depend on the heap structure, so should we add the API for table scan as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
                                  ^~
....
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 25, 2024 at 2:58 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Thank you for the test! > > > > I could reproduce this issue and it's a bug; it skipped even > > non-all-visible pages. I've attached the new version patch. > > > > BTW since we compute the number of parallel workers for the heap scan > > based on the table size, it's possible that we launch multiple workers > > even if most blocks are all-visible. It seems to be better if we > > calculate it based on (relpages - relallvisible). > > Thanks for updating the patch. I applied and confirmed all pages are scanned. > I used almost the same script (just changed max_parallel_maintenance_workers) > and got below result. I think the tendency was the same as yours. > > ``` > parallel 0: 61114.369 ms > parallel 1: 34870.316 ms > parallel 2: 23598.115 ms > parallel 3: 17732.363 ms > parallel 4: 15203.271 ms > parallel 5: 13836.025 ms > ``` Thank you for testing! > > I started to read your codes but takes much time because I've never seen before... > Below part contains initial comments. > > 1. > This patch cannot be built when debug mode is enabled. See [1]. > IIUC, this was because NewRelminMxid was moved from struct LVRelState to PHVShared. > So you should update like " vacrel->counters->NewRelminMxid". Right, will fix. > 2. > I compared parallel heap scan and found that it does not have compute_worker API. > Can you clarify the reason why there is an inconsistency? > (I feel it is intentional because the calculation logic seems to depend on the heap structure, > so should we add the API for table scan as well?) There is room to consider a better API design, but yes, the reason is that the calculation logic depends on table AM implementation. For example, I thought it might make sense to consider taking the number of all-visible pages into account for the calculation of the number of parallel workers as we don't want to launch many workers on the table where most pages are all-visible. Which might not work for other table AMs. I'm updating the patch to implement parallel heap vacuum and will share the updated patch. It might take time as it requires to implement shared iteration support in radx tree. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Thank you for testing!

I tried to profile the vacuuming with a larger case (40 workers for the 20GB table), and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.

> > 2.
> > I compared this with parallel heap scan and found that it does not have a compute_worker API. Can you clarify the reason why there is an inconsistency? (I feel it is intentional because the calculation logic seems to depend on the heap structure, so should we add the API for table scan as well?)
>
> There is room to consider a better API design, but yes, the reason is that the calculation logic depends on table AM implementation. For example, I thought it might make sense to consider taking the number of all-visible pages into account for the calculation of the number of parallel workers as we don't want to launch many workers on the table where most pages are all-visible. Which might not work for other table AMs.

Okay, thanks for confirming. I wanted to ask others as well.

> I'm updating the patch to implement parallel heap vacuum and will share the updated patch. It might take time as it requires implementing shared iteration support in the radix tree.

Here are other preliminary comments for the v2 patch. This does not contain cosmetic ones.

1.
The shared data structure PHVShared does not contain a mutex lock. Is that intentional because the fields are accessed by the leader only after the parallel workers exit?

2.
Per my understanding, the vacuuming goes like the steps below:

a. parallel workers are launched for scanning pages
b. the leader waits until the scans are done
c. the leader does the vacuum alone (you may extend here...)
d. parallel workers are launched again to clean up indexes

If so, can we reuse the parallel workers for the cleanup? Or is this more painful engineering than the benefit?

3.
According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as "parallel worker ...". Can we extend this to improve the trackability in pg_stat_activity?

4.
I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a bottleneck when TIDs are added to the shared TidStore. Another primitive idea of mine is to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?) and gather them after the heap scanning. If you extend it so that parallel workers also do the vacuuming, the gathering may not be needed: they can access their own TidStore and clean up. One downside is that the memory consumption may be quite large. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
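Regarding comment 3, this is roughly what pg_stat_activity shows for vacuum's parallel workers today: a generic backend_type, the leader's query text, and leader_pid as the only link back to the leader. A more specific bgw_type would make these entries easier to attribute. A minimal sketch of the current view:

```
SELECT pid, leader_pid, backend_type, query
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
```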
Sorry for the very late reply. On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Thank you for testing! > > I tried to profile the vacuuming with the larger case (40 workers for the 20G table) > and attached FlameGraph showed the result. IIUC, I cannot find bottlenecks. > > > > 2. > > > I compared parallel heap scan and found that it does not have compute_worker > > API. > > > Can you clarify the reason why there is an inconsistency? > > > (I feel it is intentional because the calculation logic seems to depend on the > > heap structure, > > > so should we add the API for table scan as well?) > > > > There is room to consider a better API design, but yes, the reason is > > that the calculation logic depends on table AM implementation. For > > example, I thought it might make sense to consider taking the number > > of all-visible pages into account for the calculation of the number of > > parallel workers as we don't want to launch many workers on the table > > where most pages are all-visible. Which might not work for other table > > AMs. > > > > Okay, thanks for confirming. I wanted to ask others as well. > > > > I'm updating the patch to implement parallel heap vacuum and will > > share the updated patch. It might take time as it requires to > > implement shared iteration support in radx tree. > > Here are other preliminary comments for v2 patch. This does not contain > cosmetic ones. > > 1. > Shared data structure PHVShared does not contain the mutex lock. Is it intentional > because they are accessed by leader only after parallel workers exit? > Yes, the fields in PHVShared are read-only for workers. Since no concurrent reads/writes happen on these fields we don't need to protect them. > 2. > Per my understanding, the vacuuming goes like below steps. > > a. paralell workers are launched for scanning pages > b. leader waits until scans are done > c. leader does vacuum alone (you may extend here...) > d. parallel workers are launched again to cleanup indeces > > If so, can we reuse parallel workers for the cleanup? Or, this is painful > engineering than the benefit? I've not thought of this idea but I think it's possible from a technical perspective. It saves overheads of relaunching workers but I'm not sure how much it would help for a better performance and I'm concerned it would make the code complex. For example, different numbers of workers might be required for table vacuuming and index vacuuming. So we would end up increasing or decreasing workers. > 3. > According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as > "parallel worker ..." Can we extend this to improve the trackability in the > pg_stat_activity? It would be a good improvement for better trackability. But I think we should do that in a separate patch as it's not just a problem for parallel heap vacuum. > > 4. > I'm not the expert TidStore, but as you said TidStoreLockExclusive() might be a > bottleneck when tid is added to the shared TidStore. My another primitive idea > is that to prepare per-worker TidStores (in the PHVScanWorkerState or LVRelCounters?) > and gather after the heap scanning. If you extend like parallel workers do vacuuming, > the gathering may not be needed: they can access own TidStore and clean up. > One downside is that the memory consumption may be quite large. > Interesting idea. Suppose we support parallel heap vacuum as well, we wouldn't need locks and to support shared-iteration on TidStore. 
I think each worker should use a fraction of maintenance_work_mem. However, one downside would be that we need to check as many TidStores as there are workers during index vacuuming.

FYI I've implemented the parallel heap vacuum part and am doing some benchmark tests. I'll share the updated patches along with the test results this week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
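For a sense of scale: the budget that would be divided up in that scheme is the existing maintenance_work_mem setting, which already bounds the dead-TID storage for a manual VACUUM today. A hypothetical run against the 'test' table from the earlier benchmarks (the per-worker split described above is an idea under discussion, not current behaviour):

```
-- dead-tuple TID storage for this vacuum is limited by maintenance_work_mem;
-- with per-worker TidStores, each of the 4 workers would get only a share of it
SET maintenance_work_mem = '256MB';
VACUUM (VERBOSE, PARALLEL 4) test;
```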
Dear Sawada-san,

> I've attached new version patches that fix the failures reported by cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch, and sorry for the delayed reply. I confirmed that cfbot for Linux/Windows says ok. I'm still learning the feature so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() is needed. IIUC, the pre-existing API, TidStoreBeginIterate(), already accepts a shared TidStore. The only difference is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? Another approach is to restrict TidStoreBeginIterate() to support only the local one. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawda-san, > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > Thanks for updating the patch and sorry for delaying the reply. I confirmed cfbot > for Linux/Windows said ok. > I'm still learning the feature so I can post only one comment :-(. > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? TidStoreBeginIterateShared() is designed for multiple parallel workers to iterate a shared TidStore. During an iteration, parallel workers share the iteration state and iterate the underlying radix tree while taking appropriate locks. Therefore, it's available only for a shared TidStore. This is required to implement the parallel heap vacuum, where multiple parallel workers do the iteration on the shared TidStore. On the other hand, TidStoreBeginIterate() is designed for a single process to iterate a TidStore. It accepts even a shared TidStore as you mentioned, but during an iteration there is no inter-process coordination such as locking. When it comes to parallel vacuum, supporting TidStoreBeginIterate() on a shared TidStore is necessary to cover the case where we use only parallel index vacuum but not parallel heap scan/vacuum. In this case, we need to store dead tuple TIDs on the shared TidStore during heap scan so parallel workers can use it during index vacuum. But it's not necessary to use TidStoreBeginIterateShared() because only one (leader) process does heap vacuum. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached new version patches that fix the failures reported by cfbot. I hope these changes make cfbot happy.

I just started reviewing the patch and found the following comments while going through it:

1) I felt we should add some documentation for this at [1].

2) Can we add some tests in vacuum_parallel with tables having no indexes and having dead tuples?

3) These should be included in typedefs.list:

3.a)
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats
+{
+    BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+    BlockNumber removed_pages; /* # pages removed by relation truncation */
+    BlockNumber frozen_pages;  /* # pages with newly frozen tuples */

3.b) Similarly this too:
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared
+{
+    bool aggressive;
+    bool skipwithvm;
+

3.c) Similarly this too:
+/* Per-worker scan state for parallel heap vacuum scan */
+typedef struct PHVScanWorkerState
+{
+    bool initialized;

3.d) Similarly this too:
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+    /* Parallel scan description shared among parallel workers */

4) Since we are initializing almost all the members of the structure, should we use palloc0 in this case:
+    scan_stats = palloc(sizeof(LVRelScanStats));
+    scan_stats->scanned_pages = 0;
+    scan_stats->removed_pages = 0;
+    scan_stats->frozen_pages = 0;
+    scan_stats->lpdead_item_pages = 0;
+    scan_stats->missed_dead_pages = 0;
+    scan_stats->nonempty_pages = 0;
+
+    /* Initialize remaining counters (be tidy) */
+    scan_stats->tuples_deleted = 0;
+    scan_stats->tuples_frozen = 0;
+    scan_stats->lpdead_items = 0;
+    scan_stats->live_tuples = 0;
+    scan_stats->recently_dead_tuples = 0;
+    scan_stats->missed_dead_tuples = 0;

5) Typo: "paralle" should be "parallel":
+/*
+ * Return the number of parallel workers for a parallel vacuum scan of this
+ * relation.
+ */
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+    return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

[1] - https://www.postgresql.org/docs/devel/sql-vacuum.html

Regards,
Vignesh
Dear Sawada-san,

> TidStoreBeginIterateShared() is designed for multiple parallel workers to iterate a shared TidStore. During an iteration, parallel workers share the iteration state and iterate the underlying radix tree while taking appropriate locks. Therefore, it's available only for a shared TidStore. This is required to implement the parallel heap vacuum, where multiple parallel workers do the iteration on the shared TidStore.
>
> On the other hand, TidStoreBeginIterate() is designed for a single process to iterate a TidStore. It accepts even a shared TidStore as you mentioned, but during an iteration there is no inter-process coordination such as locking. When it comes to parallel vacuum, supporting TidStoreBeginIterate() on a shared TidStore is necessary to cover the case where we use only parallel index vacuum but not parallel heap scan/vacuum. In this case, we need to store dead tuple TIDs on the shared TidStore during heap scan so parallel workers can use it during index vacuum. But it's not necessary to use TidStoreBeginIterateShared() because only one (leader) process does heap vacuum.

Okay, thanks for the description. I feel it is OK to keep it.

I read 0001 again and here are my comments.

01. vacuumlazy.c
```
+#define LV_PARALLEL_SCAN_SHARED        0xFFFF0001
+#define LV_PARALLEL_SCAN_DESC          0xFFFF0002
+#define LV_PARALLEL_SCAN_DESC_WORKER   0xFFFF0003
```
I checked other DSM keys used for parallel work, and they seem to have names like PARALLEL_KEY_XXX. Can we follow that?

02. LVRelState
```
+    BlockNumber next_fsm_block_to_vacuum;
```
Only this attribute does not have a comment. Can we add something like "Next freespace map page to be checked"?

03. parallel_heap_vacuum_gather_scan_stats
```
+        vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
```
Note that `scan_stats->vacuumed_pages` does not exist in 0001; it is defined in 0004. Can you move it?

04. heap_parallel_vacuum_estimate
```
+
+    heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len,
+                                              &shared_len, &pscanwork_len);
+
+    /* space for PHVShared */
+    shm_toc_estimate_chunk(&pcxt->estimator, shared_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for ParallelBlockTableScanDesc */
+    pscan_len = table_block_parallelscan_estimate(rel);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+    /* space for per-worker scan state, PHVScanWorkerState */
+    pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers);
+    shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len);
+    shm_toc_estimate_keys(&pcxt->estimator, 1);
```
I feel pscan_len and pscanwork_len are already calculated in heap_parallel_estimate_shared_memory_size(). Can we remove table_block_parallelscan_estimate() and mul_size() from here?

05. Idea
Can you update the documentation?

06. Idea
AFAICS pg_stat_progress_vacuum does not contain information related to the parallel execution. What do you think about adding an attribute which shows a list of pids? I'm not sure it is helpful for users, but it can show the parallelism.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Nov 13, 2024 at 3:10 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > Okay, thanks for the description. I felt it is OK to keep. > > I read 0001 again and here are comments. Thank you for the review comments! > > 01. vacuumlazy.c > ``` > +#define LV_PARALLEL_SCAN_SHARED 0xFFFF0001 > +#define LV_PARALLEL_SCAN_DESC 0xFFFF0002 > +#define LV_PARALLEL_SCAN_DESC_WORKER 0xFFFF0003 > ``` > > I checked other DMS keys used for parallel work, and they seems to have name > like PARALEL_KEY_XXX. Can we follow it? Yes. How about LV_PARALLEL_KEY_XXX? > > 02. LVRelState > ``` > + BlockNumber next_fsm_block_to_vacuum; > ``` > > Only the attribute does not have comments Can we add like: > "Next freespace map page to be checked"? Agreed. I'll add a comment "next block to check for FSM vacuum*. > > 03. parallel_heap_vacuum_gather_scan_stats > ``` > + vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages; > ``` > > Note that `scan_stats->vacuumed_pages` does not exist in 0001, it is defined > in 0004. Can you move it? Will remove. > > 04. heap_parallel_vacuum_estimate > ``` > + > + heap_parallel_estimate_shared_memory_size(rel, nworkers, &pscan_len, > + &shared_len, &pscanwork_len); > + > + /* space for PHVShared */ > + shm_toc_estimate_chunk(&pcxt->estimator, shared_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > + > + /* space for ParallelBlockTableScanDesc */ > + pscan_len = table_block_parallelscan_estimate(rel); > + shm_toc_estimate_chunk(&pcxt->estimator, pscan_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > + > + /* space for per-worker scan state, PHVScanWorkerState */ > + pscanwork_len = mul_size(sizeof(PHVScanWorkerState), nworkers); > + shm_toc_estimate_chunk(&pcxt->estimator, pscanwork_len); > + shm_toc_estimate_keys(&pcxt->estimator, 1); > ``` > > I feel pscan_len and pscanwork_len are calclated in heap_parallel_estimate_shared_memory_size(). > Can we remove table_block_parallelscan_estimate() and mul_size() from here? Yes, it's an oversight. Will remove. > > 05. Idea > > Can you update documentations? Will update the doc as well. > > 06. Idea > > AFAICS pg_stat_progress_vacuum does not contain information related with the > parallel execution. How do you think adding an attribute which shows a list of pids? > Not sure it is helpful for users but it can show the parallelism. 
I think it's possible to show the parallelism even today (for parallel index vacuuming):

=# select leader.pid, leader.query, array_agg(worker.pid)
     from pg_stat_activity as leader,
          pg_stat_activity as worker,
          pg_stat_progress_vacuum as v
     where leader.pid = worker.leader_pid and leader.pid = v.pid
     group by 1, 2;
   pid   |        query        |     array_agg
---------+---------------------+-------------------
 2952103 | vacuum (verbose) t; | {2952257,2952258}
(1 row)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 3:21 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 30 Oct 2024 at 22:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached new version patches that fixes failures reported by > > cfbot. I hope these changes make cfbot happy. > > > > I just started reviewing the patch and found the following comments > while going through the patch: > 1) I felt we should add some documentation for this at [1]. > > 2) Can we add some tests in vacuum_parallel with tables having no > indexes and having dead tuples. > > 3) This should be included in typedefs.list: > 3.a) > +/* > + * Relation statistics collected during heap scanning and need to be > shared among > + * parallel vacuum workers. > + */ > +typedef struct LVRelScanStats > +{ > + BlockNumber scanned_pages; /* # pages examined (not > skipped via VM) */ > + BlockNumber removed_pages; /* # pages removed by relation > truncation */ > + BlockNumber frozen_pages; /* # pages with newly frozen tuples */ > > 3.b) Similarly this too: > +/* > + * Struct for information that need to be shared among parallel vacuum workers > + */ > +typedef struct PHVShared > +{ > + bool aggressive; > + bool skipwithvm; > + > > 3.c) Similarly this too: > +/* Per-worker scan state for parallel heap vacuum scan */ > +typedef struct PHVScanWorkerState > +{ > + bool initialized; > > 3.d) Similarly this too: > +/* Struct for parallel heap vacuum */ > +typedef struct PHVState > +{ > + /* Parallel scan description shared among parallel workers */ > > 4) Since we are initializing almost all the members of structure, > should we use palloc0 in this case: > + scan_stats = palloc(sizeof(LVRelScanStats)); > + scan_stats->scanned_pages = 0; > + scan_stats->removed_pages = 0; > + scan_stats->frozen_pages = 0; > + scan_stats->lpdead_item_pages = 0; > + scan_stats->missed_dead_pages = 0; > + scan_stats->nonempty_pages = 0; > + > + /* Initialize remaining counters (be tidy) */ > + scan_stats->tuples_deleted = 0; > + scan_stats->tuples_frozen = 0; > + scan_stats->lpdead_items = 0; > + scan_stats->live_tuples = 0; > + scan_stats->recently_dead_tuples = 0; > + scan_stats->missed_dead_tuples = 0; > > 5) typo paralle should be parallel > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > +static inline int > +table_paralle_vacuum_compute_workers(Relation rel, int requested) > +{ > + return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested); > +} > Thank you for the comments! I'll address these comments in the next version patch. BTW while updating the patch, I found that we might want to launch different numbers of workers for scanning heap and vacuuming heap. The number of parallel workers is determined based on the number of blocks in the table. However, even if this number is high, it could happen that we want to launch fewer workers to vacuum heap pages when there are not many pages having garbage. And the number of workers for vacuuming heap could vary on each vacuum pass. I'm considering implementing it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
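On the point about launching fewer workers for the vacuum phase: the block count and the amount of garbage can diverge widely, so a table that justifies many scan workers may hold only a handful of dead tuples. A rough way to see both quantities for the benchmark table (these are planner/collector statistics, so approximate):

```
SELECT c.relpages   AS total_blocks,
       s.n_dead_tup AS approx_dead_tuples
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relname = 'test';
```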
Dear Sawada-san,

> BTW while updating the patch, I found that we might want to launch different numbers of workers for scanning heap and vacuuming heap. The number of parallel workers is determined based on the number of blocks in the table. However, even if this number is high, it could happen that we want to launch fewer workers to vacuum heap pages when there are not many pages having garbage. And the number of workers for vacuuming heap could vary on each vacuum pass. I'm considering implementing it.

Just to clarify - this idea looks good to me. I imagine you will add new table AM APIs like parallel_vacuum_compute_workers_for_scanning and parallel_vacuum_compute_workers_for_vacuuming. If other table AM developers want to use the same number of workers as for scanning, they can pass the same function to the pointer. Is that right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Sawada-San,

FYI, the patch 0001 fails to build stand-alone:

vacuumlazy.c: In function ‘parallel_heap_vacuum_gather_scan_stats’:
vacuumlazy.c:3739:21: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                     ^
vacuumlazy.c:3739:43: error: ‘LVRelScanStats’ has no member named ‘vacuumed_pages’
   vacrel->scan_stats->vacuumed_pages += ss->vacuumed_pages;
                                           ^
make[4]: *** [vacuumlazy.o] Error 1

It appears to be using a struct field which is not even introduced until patch 0004 of the patch set.

======
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Sawada-San,

I started to look at patch v4-0001 in this thread. It is quite a big patch, so this is a WIP and the comments below are just the ones I have so far.

======
src/backend/access/heap/vacuumlazy.c

1.1.
+/*
+ * Relation statistics collected during heap scanning and need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelScanStats

The comment wording is not quite right.

/Relation statistics collected during heap scanning/Relation statistics that are collected during heap scanning/

~~~

1.2.
+/*
+ * Struct for information that need to be shared among parallel vacuum workers
+ */
+typedef struct PHVShared

The comment wording is not quite right.

/that need to be shared/that needs to be shared/

~~~

1.3.
+/* Struct for parallel heap vacuum */
+typedef struct PHVState
+{
+    /* Parallel scan description shared among parallel workers */
+    ParallelBlockTableScanDesc pscandesc;
+
+    /* Shared information */
+    PHVShared *shared;

If 'pscandesc' is described as 'shared among parallel workers', should that field be within 'PHVShared' instead?

~~~

1.4.
     /* Initialize page counters explicitly (be tidy) */
-    vacrel->scanned_pages = 0;
-    vacrel->removed_pages = 0;
-    vacrel->frozen_pages = 0;
-    vacrel->lpdead_item_pages = 0;
-    vacrel->missed_dead_pages = 0;
-    vacrel->nonempty_pages = 0;
-    /* dead_items_alloc allocates vacrel->dead_items later on */
+    scan_stats = palloc(sizeof(LVRelScanStats));
+    scan_stats->scanned_pages = 0;
+    scan_stats->removed_pages = 0;
+    scan_stats->frozen_pages = 0;
+    scan_stats->lpdead_item_pages = 0;
+    scan_stats->missed_dead_pages = 0;
+    scan_stats->nonempty_pages = 0;
+
+    /* Initialize remaining counters (be tidy) */
+    scan_stats->tuples_deleted = 0;
+    scan_stats->tuples_frozen = 0;
+    scan_stats->lpdead_items = 0;
+    scan_stats->live_tuples = 0;
+    scan_stats->recently_dead_tuples = 0;
+    scan_stats->missed_dead_tuples = 0;
+
+    vacrel->scan_stats = scan_stats;

1.4a.
Or, maybe just palloc0 this and provide a comment to say all counters have been zapped to 0.

~

1.4b.
Maybe you don't need that 'scan_stats' variable; just assign the palloc0 directly to the field instead.

~~~

1.5.
-    vacrel->missed_dead_tuples = 0;
+    /* dead_items_alloc allocates vacrel->dead_items later on */

The patch seems to have moved this "dead_items_alloc" comment to now be below the "Allocate/initialize output statistics state" stuff (??).

======
src/backend/commands/vacuumparallel.c

parallel_vacuum_init:

1.6.
     int         parallel_workers = 0;
+    int         nworkers_table;
+    int         nworkers_index;

The local vars and function params are named like this (here and in other functions). But the struct field names say 'nworkers_for_XXX', e.g.

  shared->nworkers_for_table = nworkers_table;
  shared->nworkers_for_index = nworkers_index;

It may be better to use these 'nworkers_for_table' and 'nworkers_for_index' names consistently everywhere.

~~~

parallel_vacuum_compute_workers:

1.7.
-    int         parallel_workers;
+    int         parallel_workers_table = 0;
+    int         parallel_workers_index = 0;
+
+    *nworkers_table = 0;
+    *nworkers_index = 0;

The local variables 'parallel_workers_table' and 'parallel_workers_index' are hardly needed because AFAICT those results can be directly assigned to *nworkers_table and *nworkers_index.

~~~

parallel_vacuum_process_all_indexes:

1.8.
     /* Reinitialize parallel context to relaunch parallel workers */
-    if (num_index_scans > 0)
+    if (num_index_scans > 0 || pvs->num_table_scans > 0)
         ReinitializeParallelDSM(pvs->pcxt);

I don't know if it is feasible or even makes sense to change, but somehow it seemed strange that the 'num_index_scans' field is not co-located with the 'num_table_scans' in the ParallelVacuumState. If this is doable, then lots of functions also would no longer need to pass 'num_index_scans' since they are already passing 'pvs'.

~~~

parallel_vacuum_table_scan_begin:

1.9.
+            (errmsg(ngettext("launched %d parallel vacuum worker for table processing (planned: %d)",
+                             "launched %d parallel vacuum workers for table processing (planned: %d)",
+                             pvs->pcxt->nworkers_launched),

Isn't this the same as errmsg_plural?

~~~

1.10.
+/* Return the array of indexes associated to the given table to be vacuumed */
+Relation *
+parallel_vacuum_get_table_indexes(ParallelVacuumState *pvs, int *nindexes)

Even though the function comment can fit on one line, it is nicer to use a block-style comment with a period, like below. It will then be consistent with other function comments (e.g. parallel_vacuum_table_scan_end, parallel_vacuum_process_table, etc.). There are multiple places that this review comment can apply to.

(also typo /associated to/associated with/)

SUGGESTION
/*
 * Return the array of indexes associated with the given table to be vacuumed.
 */

~~~

parallel_vacuum_get_nworkers_table:
parallel_vacuum_get_nworkers_index:

1.11.
+/* Return the number of workers for parallel table vacuum */
+int
+parallel_vacuum_get_nworkers_table(ParallelVacuumState *pvs)
+{
+    return pvs->shared->nworkers_for_table;
+}
+
+/* Return the number of workers for parallel index processing */
+int
+parallel_vacuum_get_nworkers_index(ParallelVacuumState *pvs)
+{
+    return pvs->shared->nworkers_for_index;
+}
+

Are these functions needed? AFAICT, they are called only from macros where it seems just as easy to reference the pvs fields directly.

~~~

parallel_vacuum_process_table:

1.12.
+/*
+ * A parallel worker invokes table-AM specified vacuum scan callback.
+ */
+static void
+parallel_vacuum_process_table(ParallelVacuumState *pvs)
+{
+    Assert(VacuumActiveNWorkers);

Maybe here we should also Assert(pvs.shared->do_vacuum_table_scan);

~~~

1.13.
-    /* Process indexes to perform vacuum/cleanup */
-    parallel_vacuum_process_safe_indexes(&pvs);
+    if (pvs.shared->do_vacuum_table_scan)
+    {
+        parallel_vacuum_process_table(&pvs);
+    }
+    else
+    {
+        ErrorContextCallback errcallback;
+
+        /* Setup error traceback support for ereport() */
+        errcallback.callback = parallel_vacuum_error_callback;
+        errcallback.arg = &pvs;
+        errcallback.previous = error_context_stack;
+        error_context_stack = &errcallback;
+
+        /* Process indexes to perform vacuum/cleanup */
+        parallel_vacuum_process_safe_indexes(&pvs);
+
+        /* Pop the error context stack */
+        error_context_stack = errcallback.previous;
+    }

There are still some functions following this code (like 'shm_toc_lookup') that could potentially raise ERRORs. But, now the error_context_stack is getting assigned/reset earlier than was previously the case. Is that going to be a potential problem?

======
src/include/access/tableam.h

1.14.
+    /*
+     * Compute the amount of DSM space AM need in the parallel table vacuum.
+     *

Maybe reword this comment to be more like table_parallelscan_estimate.

SUGGESTION
Estimate the size of shared memory that the parallel table vacuum needs for AM.

~~~

1.15.
+/*
+ * Estimate the size of shared memory needed for a parallel vacuum scan of this
+ * of this relation.
+ */
+static inline void
+table_parallel_vacuum_estimate(Relation rel, ParallelContext *pcxt, int nworkers,
+                               void *state)
+{
+    rel->rd_tableam->parallel_vacuum_estimate(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Initialize shared memory area for a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_initialize(Relation rel, ParallelContext *pcxt, int nworkers,
+                                 void *state)
+{
+    rel->rd_tableam->parallel_vacuum_initialize(rel, pcxt, nworkers, state);
+}
+
+/*
+ * Start a parallel vacuum scan of this relation.
+ */
+static inline void
+table_parallel_vacuum_scan(Relation rel, ParallelVacuumState *pvs,
+                           ParallelWorkerContext *pwcxt)
+{
+    rel->rd_tableam->parallel_vacuum_scan_worker(rel, pvs, pwcxt);
+}
+

All of the "Callbacks for parallel table vacuum." had comments saying "Not called if parallel table vacuum is disabled.". So, IIUC that means all of these table_parallel_vacuum_XXX functions (other than the compute_workers one) could have Assert(nworkers > 0); just to double-check that is true.

~~~

table_paralle_vacuum_compute_workers:

1.16.
+static inline int
+table_paralle_vacuum_compute_workers(Relation rel, int requested)
+{
+    return rel->rd_tableam->parallel_vacuum_compute_workers(rel, requested);
+}

Typo in function name. /paralle/parallel/

======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Sawada-San. FWIW, here are the remainder of my review comments for the patch v4-0001 ====== src/backend/access/heap/vacuumlazy.c lazy_scan_heap: 2.1. + /* + * Do the actual work. If parallel heap vacuum is active, we scan and + * vacuum heap with parallel workers. + */ /with/using/ ~~~ 2.2. + if (ParallelHeapVacuumIsActive(vacrel)) + do_parallel_lazy_scan_heap(vacrel); + else + do_lazy_scan_heap(vacrel); The do_lazy_scan_heap() returns a boolean and according to that function comment it should always be true if it is not using the parallel heap scan. So should we get the function return value here and Assert that it is true? ~~~ 2.3. Start with uppercase even for all the single line comments for consistency with existing code. e.g. + /* report that everything is now scanned */ e.g. + /* now we can compute the new value for pg_class.reltuples */ e.g. + /* report all blocks vacuumed */ ~~~ heap_vac_scan_next_block_parallel: 2.4. +/* + * A parallel scan variant of heap_vac_scan_next_block. + * + * In parallel vacuum scan, we don't use the SKIP_PAGES_THRESHOLD optimization. + */ +static bool +heap_vac_scan_next_block_parallel(LVRelState *vacrel, BlockNumber *blkno, + bool *all_visible_according_to_vm) The function comment should explain the return value. ~~~ 2.5. + if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0) + { + + if (vacrel->aggressive) + break; Unnecessary whitespace. ~~~ dead_items_alloc: 2.6. + /* + * We initialize parallel heap scan/vacuuming or index vacuuming + * or both based on the table size and the number of indexes. Note + * that only one worker can be used for an index, we invoke + * parallelism for index vacuuming only if there are at least two + * indexes on a table. + */ vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels, vacrel->nindexes, nworkers, vac_work_mem, vacrel->verbose ? INFO : DEBUG2, - vacrel->bstrategy); + vacrel->bstrategy, (void *) vacrel); Is this information misplaced? Why describe here "only one worker" and "at least two indexes on a table"? I don't see anything here checking those conditions. ~~~ heap_parallel_vacuum_compute_workers: 2.7. + /* + * Select the number of workers based on the log of the size of the + * relation. This probably needs to be a good deal more + * sophisticated, but we need something here for now. Note that the + * upper limit of the min_parallel_table_scan_size GUC is chosen to + * prevent overflow here. + */ The "This probably needs to..." part maybe should have an "XXX" marker in the comment which AFAIK is used to highlight current decisions and potential for future changes. ~~~ heap_parallel_vacuum_initialize: 2.8. There is inconsistent capitalization of the single-line comments in this function. The same occurs in many functions in this file, but it is just a bit more obvious in this one. Please see all the others too. ~~~ parallel_heap_complete_unfinised_scan: 2.9. +static void +parallel_heap_complete_unfinised_scan(LVRelState *vacrel) TYPO in function name /unfinised/unfinished/ ~~~ 2.10. + if (!wstate->maybe_have_blocks) + + continue; Unnecessary blank line. ~~~ 2.11. + + /* Attache the worker's scan state and do heap scan */ + vacrel->phvstate->myscanstate = wstate; + scan_done = do_lazy_scan_heap(vacrel); /Attache/Attach/ ~~~ 2.12. + /* + * We don't need to gather the scan statistics here because statistics + * have already been accumulated the leaders statistics directly. + */ "have already been accumulated the leaders" -- missing word there somewhere? ~~~ do_parallel_lazy_scan_heap: 2.13.
+ /* + * If the heap scan paused in the middle of the table due to full of + * dead_items TIDs, perform a round of index and heap vacuuming. + */ + if (!scan_done) + { + /* Perform a round of index and heap vacuuming */ + vacrel->consider_bypass_optimization = false; + lazy_vacuum(vacrel); + + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. + */ + if (vacrel->phvstate->min_blkno > vacrel->next_fsm_block_to_vacuum) + { + /* + * min_blkno should have already been updated when gathering + * statistics + */ + FreeSpaceMapVacuumRange(vacrel->rel, vacrel->next_fsm_block_to_vacuum, + vacrel->phvstate->min_blkno + 1); + vacrel->next_fsm_block_to_vacuum = vacrel->phvstate->min_blkno; + } + + /* Report that we are once again scanning the heap */ + pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, + PROGRESS_VACUUM_PHASE_SCAN_HEAP); + + /* re-launcher workers */ + vacrel->phvstate->nworkers_launched = + parallel_vacuum_table_scan_begin(vacrel->pvs); + + continue; + } + + /* We reach the end of the table */ + break; Instead of: if (!scan_done) { <other code ...> continue; } break; Won't it be better to refactor like: SUGGESTION if (scan_done) break; <other code...> ~~~ 2.14. + /* + * The parallel heap vacuum finished, but it's possible that some workers + * have allocated blocks but not processed yet. This can happen for + * example when workers exit because of full of dead_items TIDs and the + * leader process could launch fewer workers in the next cycle. + */ There seem to be some missing words: e.g. /not processed yet./not processed them yet./ e.g. /because of full of dead_items/because they are full of dead_items/ ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, Thanks for working on this. I took a quick look at this today, to do some basic review. I plan to do a bunch of testing, but that's mostly to get a better idea of what kind of improvements to expect - the initial results look quite nice and sensible. A couple basic comments: 1) I really like the idea of introducing "compute_workers" callback to the heap AM interface. I faced a similar issue with calculating workers for index builds, because right now plan_create_index_workers is doing that; the logic works for btree, but really not for brin etc. It didn't occur to me we might make this part of the index AM ... 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h because it really has nothing to do with planning / paths. I realize it needs the min_parallel_table_scan_size, but it doesn't seem right. I guess it's a sign this bit of code (calculating parallel workers based on log of relation size) should be in some "shared" location. 3) The difference in naming ParallelVacuumState vs. PHVState is a bit weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to make it consistent and clear. 4) I think it would be good to have some sort of README explaining how the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a while to realize how the workers coordinate which blocks to scan. 5) Wouldn't it be better to introduce the scan_stats (grouping some of the fields) in a separate patch? Seems entirely independent from the parallel part, so doing it separately would make it easier to review. Also, maybe reference the fields through scan_stats->x, instead of through vacrel->scan_stats->x, when there's the pointer. 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats? AFAIK it's not a statistic, it's actually a parameter affecting the decisions, right? 7) I find it a bit strange that heap_vac_scan_next_block() needs to check if it's a parallel scan, and redirect to the parallel callback. I mean, shouldn't the caller know which callback to invoke? Why should the serial callback care about this? 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the current block. Can you explain? 9) I'm a bit confused why the code checks IsParallelWorker() in so many places. Doesn't that mean the leader can't participate in the vacuum like a regular worker? 10) I'm not quite sure I understand the comments at the end of do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that means "index vacuuming", right? And then it says "pause without invoking index and heap vacuuming" but isn't the whole point of this block to do that cleanup so that the TidStore can be discarded? Maybe I just don't understand how the work is divided between the leader and workers ... 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this the patch still builds fine for me. thanks -- Tomas Vondra
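To illustrate the shape of review point 7 above, a minimal sketch of letting the caller pick the next-block function once. The type and function names are taken from the quoted patch hunks; this is not the patch's actual code and assumes the patch's declarations are in scope.

/*
 * Sketch for review point 7: do_lazy_scan_heap() could select the
 * next-block function once, instead of having the serial
 * heap_vac_scan_next_block() check for parallelism internally.
 */
typedef bool (*vac_next_block_fn) (LVRelState *vacrel, BlockNumber *blkno,
                                   bool *all_visible_according_to_vm);

BlockNumber blkno;
bool        all_visible_according_to_vm;
vac_next_block_fn next_block =
    ParallelHeapVacuumIsActive(vacrel) ? heap_vac_scan_next_block_parallel
                                       : heap_vac_scan_next_block;

while (next_block(vacrel, &blkno, &all_visible_according_to_vm))
{
    /* ... process the block as before ... */
}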
Dear Tomas, > 1) I really like the idea of introducing "compute_workers" callback to > the heap AM interface. I faced a similar issue with calculating workers > for index builds, because right now plan_create_index_workers is doing > that the logic works for btree, but really not brin etc. It didn't occur > to me we might make this part of the index AM ... +1, so let's keep the proposed style. Or, can we even propose the idea for the table/index access method APIs? I've considered it a bit, and the point seems to be which arguments should be required. > 4) I think it would be good to have some sort of README explaining how > the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a > while to realize how the workers coordinate which blocks to scan. I love the idea, it is quite helpful for reviewers like me. Best regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > Thanks for working on this. I took a quick look at this today, to do > some basic review. I plan to do a bunch of testing, but that's mostly to > get a better idea of what kind of improvements to expect - the initial > results look quite nice and sensible. Thank you for reviewing the patch! > A couple basic comments: > > 1) I really like the idea of introducing "compute_workers" callback to > the heap AM interface. I faced a similar issue with calculating workers > for index builds, because right now plan_create_index_workers is doing > that the logic works for btree, but really not brin etc. It didn't occur > to me we might make this part of the index AM ... Thanks. > > 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h > because it really has nothing to do with planning / paths. I realize it > needs the min_parallel_table_scan_size, but it doesn't seem right. I > guess it's a sign this bit of code (calculating parallel workers based > on log of relation size) should in some "shared" location. True. The same is actually true also for vacuumparallel.c. It includes optimizer/paths.h to use min_parallel_index_scan_size. > > 3) The difference in naming ParallelVacuumState vs. PHVState is a bit > weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to > make it consistent and clear. With the patch, since ParallelVacuumState is no longer dedicated for parallel index vacuuming we cannot rename them in this way. Both parallel table scanning/vacuuming and parallel index vacuuming can use the same ParallelVacuumState instance. The heap-specific necessary data for parallel heap scanning and vacuuming are stored in PHVState. > > 4) I think it would be good to have some sort of README explaining how > the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a > while to realize how the workers coordinate which blocks to scan. +1. I will add README in the next version patch. > > 5) Wouldn't it be better to introduce the scan_stats (grouping some of > the fields in a separate patch)? Seems entirely independent from the > parallel part, so doing it separately would make it easier to review. > Also, maybe reference the fields through scan_stats->x, instead of > through vacrel->scan_stats->x, when there's the pointer. Agreed. > 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats? > AFAIK it's not a statistic, it's actually a parameter affecting the > decisions, right? Right. It would be better to move them to a separate struct or somewhere. > > 7) I find it a bit strange that heap_vac_scan_next_block() needs to > check if it's a parallel scan, and redirect to the parallel callback. I > mean, shouldn't the caller know which callback to invoke? Why should the > serial callback care about this? do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is called in serial vacuum and parallel vacuum cases. I wanted to make heap_vac_scan_next_block() workable in both cases. I think it also makes sense to have do_lazy_scan_heap() calls either function depending on parallel scan being enabled. > > 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the > current block. Can you explain? The workers' current block numbers are used to calculate the minimum block number where we've scanned so far. In serial scan case, we vacuum FSM of the particular block range for every VACUUM_FSM_EVERY_PAGES pages . 
On the other hand, in parallel scan case, it doesn't make sense to vacuum FSM in that way because we might not have processed some blocks in the block range. So the idea is to calculate the minimum block number where we've scanned so far and vacuum FSM of the range of consecutive already-scanned blocks. > > 9) I'm a bit confused why the code checks IsParallelWorker() in so many > places. Doesn't that mean the leader can't participate in the vacuum > like a regular worker? I used '!isParallelWorker()' for some jobs that should be done only by the leader process. For example, checking failsafe mode, vacuuming FSM etc. > > 10) I'm not quite sure I understand the comments at the end of > do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that > means "index vacuuming", right? It means both index vacuuming and heap vacuuming. > And then it says "pause without invoking > index and heap vacuuming" but isn't the whole point of this block to do > that cleanup so that the TidStore can be discarded? Maybe I just don't > understand how the work is divided between the leader and workers ... The comment needs to be updated. But what the patch does is that when the memory usage of the shared TidStore reaches the limit, worker processes exit after updating the shared statistics, and then the leader invokes (parallel) index vacuuming and parallel heap vacuuming. Since the different number workers could be used for parallel heap scan, parallel index vacuuming, and parallel heap vacuuming, the leader process waits for all workers to finish at end of each phase. > 11) Why does GlobalVisState need to move to snapmgr.h? If I undo this > the patch still builds fine for me. Oh, I might have missed something. I'll check if it's really necessary. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
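To make the FSM coordination described above concrete, here is a small standalone sketch (illustrative only; the names are made up and this is not the patch's code) of how the leader can derive a safe FSM-vacuum range from the block numbers the workers advertise:

/*
 * Standalone illustration: if each worker advertises the block it is
 * currently scanning, every block below the minimum of those values has
 * already been scanned by someone, so the FSM for the range
 * [next_fsm_block_to_vacuum, min) can be vacuumed safely.
 */
#include <stdio.h>

typedef unsigned int BlockNumber;

static BlockNumber
min_advertised_block(const BlockNumber *current_blocks, int nworkers)
{
    BlockNumber min = (BlockNumber) -1;

    for (int i = 0; i < nworkers; i++)
        if (current_blocks[i] < min)
            min = current_blocks[i];
    return min;
}

int
main(void)
{
    BlockNumber current_blocks[] = {1200, 950, 4000};  /* hypothetical workers */
    BlockNumber next_fsm_block_to_vacuum = 0;
    BlockNumber min = min_advertised_block(current_blocks, 3);

    if (min > next_fsm_block_to_vacuum)
        printf("vacuum FSM for blocks [%u, %u)\n",
               next_fsm_block_to_vacuum, min);
    return 0;
}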
On 12/13/24 00:04, Tomas Vondra wrote: > ... > > The main difference is here: > > > master / no parallel workers: > > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) > > 1 parallel worker: > > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) > > > Clearly, with parallel vacuum we scan only a tiny fraction of the pages, > essentially just those with deleted tuples, which is ~1/20 of pages. > That's close to the 15x speedup. > > This effect is clearest without indexes, but it does affect even runs > with indexes - having to scan the indexes makes it much less pronounced, > though. However, these indexes are pretty massive (about the same size > as the table) - multiple times larger than the table. Chances are it'd > be clearer on realistic data sets. > > So the question is - is this correct? And if yes, why doesn't the > regular (serial) vacuum do that? > > There's some more strange things, though. For example, how come the avg > read rate is 0.000 MB/s? > > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s > > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the > pages are in shared buffers (thanks to the DELETE earlier in that session). > OK, after looking into this a bit more I think the reason is rather simple - SKIP_PAGES_THRESHOLD. With serial runs, we end up scanning all pages, because even with an update every 5000 tuples, that's still only ~25 pages apart, well within the 32-page window. So we end up skipping no pages, scan and vacuum everything. But parallel runs have this skipping logic disabled, or rather the logic that switches to sequential scans if the gap is less than 32 pages. IMHO this raises two questions: 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to sequential scans if the pages are close enough. Maybe there is a reason for this difference? Workers can reduce the difference between random and sequential I/O, similarly to prefetching. But that just means the workers should use a lower threshold, e.g. as SKIP_PAGES_THRESHOLD / nworkers or something like that? I don't see this discussed in this thread. 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good storage. If I can get an order of magnitude improvement (or more than that) by disabling the threshold, and just doing random I/O, maybe there's time to adjust it a bit. regards -- Tomas Vondra
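As a quick sanity check of the arithmetic above, a tiny standalone sketch of the skip decision (32 matches SKIP_PAGES_THRESHOLD in vacuumlazy.c; the tuples-per-page figure is only a rough assumption consistent with the "~25 pages apart" estimate):

/*
 * Standalone illustration: serial vacuum only skips a run of all-visible
 * pages when the run is at least SKIP_PAGES_THRESHOLD pages long, so a
 * gap of ~25 pages between pages needing work never gets skipped.
 */
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32     /* same value as vacuumlazy.c */

int
main(void)
{
    int tuples_per_page = 200;      /* rough assumption for this narrow table */
    int update_every = 5000;        /* one modified tuple every 5000 tuples */
    int gap = update_every / tuples_per_page;

    printf("gap between pages needing scan: ~%d pages\n", gap);
    printf("skippable run? %s\n",
           gap >= SKIP_PAGES_THRESHOLD ? "yes" : "no - scanned sequentially");
    return 0;
}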
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 12/13/24 00:04, Tomas Vondra wrote: > > ... > > > > The main difference is here: > > > > > > master / no parallel workers: > > > > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) > > > > 1 parallel worker: > > > > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) > > > > > > Clearly, with parallel vacuum we scan only a tiny fraction of the pages, > > essentially just those with deleted tuples, which is ~1/20 of pages. > > That's close to the 15x speedup. > > > > This effect is clearest without indexes, but it does affect even runs > > with indexes - having to scan the indexes makes it much less pronounced, > > though. However, these indexes are pretty massive (about the same size > > as the table) - multiple times larger than the table. Chances are it'd > > be clearer on realistic data sets. > > > > So the question is - is this correct? And if yes, why doesn't the > > regular (serial) vacuum do that? > > > > There's some more strange things, though. For example, how come the avg > > read rate is 0.000 MB/s? > > > > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s > > > > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's > > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the > > pages are in shared buffers (thanks to the DELETE earlier in that session). > > > > OK, after looking into this a bit more I think the reason is rather > simple - SKIP_PAGES_THRESHOLD. > > With serial runs, we end up scanning all pages, because even with an > update every 5000 tuples, that's still only ~25 pages apart, well within > the 32-page window. So we end up skipping no pages, scan and vacuum all > everything. > > But parallel runs have this skipping logic disabled, or rather the logic > that switches to sequential scans if the gap is less than 32 pages. > > > IMHO this raises two questions: > > 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to > sequential scans is the pages are close enough. Maybe there is a reason > for this difference? Workers can reduce the difference between random > and sequential I/0, similarly to prefetching. But that just means the > workers should use a lower threshold, e.g. as > > SKIP_PAGES_THRESHOLD / nworkers > > or something like that? I don't see this discussed in this thread. Each parallel heap scan worker allocates a chunk of blocks which is 8192 blocks at maximum, so we would need to use the SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we need to evaluate the differences anyway. WIll do the benchmark test and share the results. > > 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good > storage. If I can get an order of magnitude improvement (or more than > that) by disabling the threshold, and just doing random I/O, maybe > there's time to adjust it a bit. Yeah, you've started a thread for this so let's discuss it there. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 12/19/24 23:05, Masahiko Sawada wrote: > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 12/13/24 00:04, Tomas Vondra wrote: >>> ... >>> >>> The main difference is here: >>> >>> >>> master / no parallel workers: >>> >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total) >>> >>> 1 parallel worker: >>> >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total) >>> >>> >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages, >>> essentially just those with deleted tuples, which is ~1/20 of pages. >>> That's close to the 15x speedup. >>> >>> This effect is clearest without indexes, but it does affect even runs >>> with indexes - having to scan the indexes makes it much less pronounced, >>> though. However, these indexes are pretty massive (about the same size >>> as the table) - multiple times larger than the table. Chances are it'd >>> be clearer on realistic data sets. >>> >>> So the question is - is this correct? And if yes, why doesn't the >>> regular (serial) vacuum do that? >>> >>> There's some more strange things, though. For example, how come the avg >>> read rate is 0.000 MB/s? >>> >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s >>> >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the >>> pages are in shared buffers (thanks to the DELETE earlier in that session). >>> >> >> OK, after looking into this a bit more I think the reason is rather >> simple - SKIP_PAGES_THRESHOLD. >> >> With serial runs, we end up scanning all pages, because even with an >> update every 5000 tuples, that's still only ~25 pages apart, well within >> the 32-page window. So we end up skipping no pages, scan and vacuum all >> everything. >> >> But parallel runs have this skipping logic disabled, or rather the logic >> that switches to sequential scans if the gap is less than 32 pages. >> >> >> IMHO this raises two questions: >> >> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to >> sequential scans is the pages are close enough. Maybe there is a reason >> for this difference? Workers can reduce the difference between random >> and sequential I/0, similarly to prefetching. But that just means the >> workers should use a lower threshold, e.g. as >> >> SKIP_PAGES_THRESHOLD / nworkers >> >> or something like that? I don't see this discussed in this thread. > > Each parallel heap scan worker allocates a chunk of blocks which is > 8192 blocks at maximum, so we would need to use the > SKIP_PAGE_THRESHOLD optimization within the chunk. I agree that we > need to evaluate the differences anyway. WIll do the benchmark test > and share the results. > Right. I don't think this really matters for small tables, and for large tables the chunks should be fairly large (possibly up to 8192 blocks), in which case we could apply SKIP_PAGE_THRESHOLD just like in the serial case. There might be differences at boundaries between chunks, but that seems like a minor / expected detail. I haven't checked know if the code would need to change / how much. >> >> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good >> storage. If I can get an order of magnitude improvement (or more than >> that) by disabling the threshold, and just doing random I/O, maybe >> there's time to adjust it a bit. > > Yeah, you've started a thread for this so let's discuss it there. > OK. 
FWIW as suggested in the other thread, it doesn't seem to be merely a question of VACUUM performance, as not skipping pages gives vacuum the opportunity to do cleanup that would otherwise need to happen later. If only for this reason, I think it would be good to keep the serial and parallel vacuum consistent. regards -- Tomas Vondra
On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >

IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set?
On Fri, Jan 17, 2025 at 1:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > > > > IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set? I recall we had some discussion on changes to group locking for implementing parallel heap vacuum, but I don't remember if we have a blocker now. One problem we previously had was that since the relation extension locks were not in conflict between parallel workers and the leader, multiple workers could extend the visibility map simultaneously. This problem was fixed by commit 85f6b49c2c. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Jan 17, 2025 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jan 17, 2025 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >> >
> >
> > IIRC, there was one of the blocker for implementing parallel heap vacuum was group locking, have we already resolved that issue or its being included in this patch set?
>
> I recall we had some discussion on changes to group locking for
> implementing parallel heap vacuum, but I don't remember if we have a
> blocker now.
>
> One problem we previously had was that since the relation extension
> locks were not in conflict between parallel workers and the leader,
> multiple workers could extend the visibility map simultaneously. This
> problem was fixed by commit 85f6b49c2c.

Yes, that's correct. As part of that commit, we made the relation extension lock conflict among group members, ensuring that multiple workers cannot acquire it simultaneously. Additionally, this cannot cause a deadlock because no other locks are held while the relation extension lock is being held.
On Sun, Jan 19, 2025 at 7:50 AM Tomas Vondra <tomas@vondra.me> wrote: > > Hi, > > Thanks for the new patches. I've repeated my benchmarking on v8, and I > agree this looks fine - the speedups are reasonable and match what I'd > expect on this hardware. I don't see any suspicious results like with > the earlier patches, where it got much faster thanks to the absence of > SKIP_PAGE_THRESHOLD logic. > > Attached is the benchmarking script, CSV with raw results, and then also > two PDF reports comparing visualizing the impact of the patch by > comparing it to current master. > > * parallel-vacuum-duration.pdf - Duration of the vacuum, and duration > relative to master (green - faster, read - slower). The patch is clearly > an improvement, with speedup up to ~3x depending on the index count and > a fraction of updated rows. > > * parallel-vacuum-reads.pdf - Average read speed, as reported by VACUUM > VERBOSE. With the patch it can reach up to ~3GB/s, which is about the > max value possible on this hardware - so that's nice. I'll try to test > this on a better storage, to see how far it can go. Thank you for doing a performance benchmark. These results make sense to me. > I haven't done any actual code review on the new patches, I'll try to > find time for that sometime next week. Thank you! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Jan 21, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Jan 19, 2025 at 7:50 AM Tomas Vondra <tomas@vondra.me> wrote: > > > > Hi, > > > > Thanks for the new patches. I've repeated my benchmarking on v8, and I > > agree this looks fine - the speedups are reasonable and match what I'd > > expect on this hardware. I don't see any suspicious results like with > > the earlier patches, where it got much faster thanks to the absence of > > SKIP_PAGE_THRESHOLD logic. > > > > Attached is the benchmarking script, CSV with raw results, and then also > > two PDF reports comparing visualizing the impact of the patch by > > comparing it to current master. > > > > * parallel-vacuum-duration.pdf - Duration of the vacuum, and duration > > relative to master (green - faster, read - slower). The patch is clearly > > an improvement, with speedup up to ~3x depending on the index count and > > a fraction of updated rows. > > > > * parallel-vacuum-reads.pdf - Average read speed, as reported by VACUUM > > VERBOSE. With the patch it can reach up to ~3GB/s, which is about the > > max value possible on this hardware - so that's nice. I'll try to test > > this on a better storage, to see how far it can go. > > Thank you for doing a performance benchmark. These results make sense to me. > > > I haven't done any actual code review on the new patches, I'll try to > > find time for that sometime next week. > > Thank you! Since we introduced the eager vacuum scan (052026c9b9), I need to update the parallel heap vacuum patch. After thinking of how to integrate these two features, I find some complexities. The region size used by eager vacuum scan and the chunk size used by parallel table scan are different. While the region is fixed size the chunk becomes smaller as we scan the table. A chunk of the table that a parallel vacuum worker took could be across different regions or be within one region, and different parallel heap vacuum workers might scan the same region. And parallel heap vacuum workers could be scanning different regions of the table simultaneously. During eager vacuum scan, we reset the eager_scan_remaining_fails counter when we start to scan the new region. So if we want to make parallel heap vacuum behaves exactly the same way as the single-process vacuum in terms of the eager vacuum scan, we would need to have the eager_scan_remaining_fails counters for each region so that the workers can decrement it corresponding to the region of the block that the worker is scanning. But I'm concerned that it makes the logic very complex. I'd like to avoid making newly introduced codes more complex by adding yet another new code on top of that. Another idea is to disable the eager vacuum scan when parallel heap vacuum is enabled. It might look like just avoiding difficult things but it could make sense in a sense. The eager vacuum scan is aimed to amortize the aggressive vacuum by incrementally freezing pages that are potentially frozen by the next aggressive vacuum. On the other hand, parallel heap vacuum is available only in manual VACUUM and would be used to remove garbage on a large table as soon as possible or to freeze the entire table to avoid reaching the XID limit. So I think it might make sense to disable the eager vacuum scan when parallel vacuum. Thoughts? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
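To illustrate the mismatch being described, here is a standalone sketch (numbers only; this is not patch code, and the 32MB region size and 8kB block size are assumptions taken from the follow-up discussion) mapping parallel-scan chunks onto fixed-size eager-scan regions:

/*
 * Standalone illustration: with 8kB blocks, a 32MB eager-scan region is
 * 4096 blocks, while a parallel-scan chunk can be up to 8192 blocks and
 * shrinks as the scan progresses.  A single chunk can therefore span
 * several regions, and chunks taken by different workers can land in the
 * same region.
 */
#include <stdio.h>

#define BLOCKS_PER_REGION 4096      /* 32MB region / 8kB block, assumed */

static void
report_chunk(const char *who, unsigned int start_block, unsigned int nblocks)
{
    unsigned int first_region = start_block / BLOCKS_PER_REGION;
    unsigned int last_region = (start_block + nblocks - 1) / BLOCKS_PER_REGION;

    printf("%s: chunk [%u, %u) covers region(s) %u..%u\n",
           who, start_block, start_block + nblocks, first_region, last_region);
}

int
main(void)
{
    report_chunk("worker 1", 0, 8192);      /* early, large chunk: regions 0-1 */
    report_chunk("worker 2", 10000, 512);   /* later, smaller chunk: region 2 */
    report_chunk("worker 1", 10512, 64);    /* same region as worker 2's chunk */
    return 0;
}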
On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. Would it be simpler to make only phase III parallel? In other words, how much of the infrastructure and complexity needed for parallel phase I is also needed for phase III? -- John Naylor Amazon Web Services
On Thu, Feb 13, 2025 at 8:16 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > During eager vacuum scan, we reset the eager_scan_remaining_fails > > counter when we start to scan the new region. So if we want to make > > parallel heap vacuum behaves exactly the same way as the > > single-progress vacuum in terms of the eager vacuum scan, we would > > need to have the eager_scan_remaining_fails counters for each region > > so that the workers can decrement it corresponding to the region of > > the block that the worker is scanning. But I'm concerned that it makes > > the logic very complex. I'd like to avoid making newly introduced > > codes more complex by adding yet another new code on top of that. > > Would it be simpler to make only phase III parallel? Yes, I think so. > In other words, > how much of the infrastructure and complexity needed for parallel > phase I is also needed for phase III? Both phases need some common changes to the parallel vacuum infrastructure so that we can launch parallel workers using ParallelVacuumContext for phase I and III and parallel vacuum workers can do its task on the heap table. Specifically, we need to change vacuumparallel.c to work also in parallel heap vacuum case, and add some table AM callbacks. Other than that, these phases have different complexity. As for supporting parallelism for phase III, changes to lazy vacuum would not be very complex, it needs to change both the radix tree and TidStore to support shared iteration through. Supporting parallelism of phase I is more complex since it integrates parallel table scan with lazy heap scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > update the parallel heap vacuum patch. After thinking of how to > integrate these two features, I find some complexities. The region > size used by eager vacuum scan and the chunk size used by parallel > table scan are different. While the region is fixed size the chunk > becomes smaller as we scan the table. A chunk of the table that a > parallel vacuum worker took could be across different regions or be > within one region, and different parallel heap vacuum workers might > scan the same region. And parallel heap vacuum workers could be > scanning different regions of the table simultaneously. Ah, I see. What are the chunk size ranges? I picked a 32 MB region size after a little testing and mostly because it seemed reasonable. I think it would be fine to use different region size. Parallel workers could just consider the chunk they get an eager scan region (unless it is too small or too large -- then it might not make sense). > During eager vacuum scan, we reset the eager_scan_remaining_fails > counter when we start to scan the new region. So if we want to make > parallel heap vacuum behaves exactly the same way as the > single-progress vacuum in terms of the eager vacuum scan, we would > need to have the eager_scan_remaining_fails counters for each region > so that the workers can decrement it corresponding to the region of > the block that the worker is scanning. But I'm concerned that it makes > the logic very complex. I'd like to avoid making newly introduced > codes more complex by adding yet another new code on top of that. I don't think it would have to behave exactly the same. I think we just don't want to add a lot of complexity or make it hard to reason about. Since the failure rate is defined as a percent, couldn't we just have parallel workers set eager_scan_remaining_fails when they get their chunk assignment (as a percentage of their chunk size)? (I haven't looked at the code, so maybe this doesn't make sense). For the success cap, we could have whoever hits it first disable eager scanning for all future assigned chunks. > Another idea is to disable the eager vacuum scan when parallel heap > vacuum is enabled. It might look like just avoiding difficult things > but it could make sense in a sense. The eager vacuum scan is aimed to > amortize the aggressive vacuum by incrementally freezing pages that > are potentially frozen by the next aggressive vacuum. On the other > hand, parallel heap vacuum is available only in manual VACUUM and > would be used to remove garbage on a large table as soon as possible > or to freeze the entire table to avoid reaching the XID limit. So I > think it might make sense to disable the eager vacuum scan when > parallel vacuum. Do we only do parallelism in manual vacuum because we don't want to use up too many parallel workers for a maintenance subsystem? I never really tried to find out why parallel index vacuuming is only in manual vacuum. I assume you made the same choice they did for the same reasons. If the idea is to never allow parallelism in vacuum, then I think disabling eager scanning during manual parallel vacuum seems reasonable. People could use vacuum freeze if they want more freezing. Also, if you start with only doing parallelism for the third phase of heap vacuuming (second pass over the heap), this wouldn't be a problem because eager scanning only impacts the first phase. - Melanie
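A bare-bones sketch of the per-chunk budget idea above (purely illustrative; the 3% figure is an assumption standing in for the eager-scan failure cap). It also hints at the difficulty raised in the replies once chunks ramp down to a single block:

/*
 * Standalone sketch of the suggestion above: give each worker an eager-scan
 * failure budget proportional to the chunk of blocks it was just assigned.
 * With the chunk size ramping down toward 1, the budget quickly rounds to
 * zero.
 */
#include <stdio.h>

int
main(void)
{
    double failure_rate = 0.03;             /* assumed eager-scan failure cap */
    int chunk_sizes[] = {8192, 1024, 64, 1};

    for (int i = 0; i < 4; i++)
    {
        int remaining_fails = (int) (chunk_sizes[i] * failure_rate);

        printf("chunk of %4d blocks -> eager_scan_remaining_fails = %d\n",
               chunk_sizes[i], remaining_fails);
    }
    return 0;
}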
On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Feb 12, 2025 at 5:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Since we introduced the eagar vacuum scan (052026c9b9), I need to > > update the parallel heap vacuum patch. After thinking of how to > > integrate these two features, I find some complexities. The region > > size used by eager vacuum scan and the chunk size used by parallel > > table scan are different. While the region is fixed size the chunk > > becomes smaller as we scan the table. A chunk of the table that a > > parallel vacuum worker took could be across different regions or be > > within one region, and different parallel heap vacuum workers might > > scan the same region. And parallel heap vacuum workers could be > > scanning different regions of the table simultaneously. Thank you for your feedback. > Ah, I see. What are the chunk size ranges? I picked a 32 MB region > size after a little testing and mostly because it seemed reasonable. I > think it would be fine to use different region size. Parallel workers > could just consider the chunk they get an eager scan region (unless it > is too small or too large -- then it might not make sense). The maximum chunk size is 8192 blocks, 64MB. As we scan the table, we ramp down the chunk size. It would eventually become 1. > > > During eager vacuum scan, we reset the eager_scan_remaining_fails > > counter when we start to scan the new region. So if we want to make > > parallel heap vacuum behaves exactly the same way as the > > single-progress vacuum in terms of the eager vacuum scan, we would > > need to have the eager_scan_remaining_fails counters for each region > > so that the workers can decrement it corresponding to the region of > > the block that the worker is scanning. But I'm concerned that it makes > > the logic very complex. I'd like to avoid making newly introduced > > codes more complex by adding yet another new code on top of that. > > I don't think it would have to behave exactly the same. I think we > just don't want to add a lot of complexity or make it hard to reason > about. > > Since the failure rate is defined as a percent, couldn't we just have > parallel workers set eager_scan_remaining_fails when they get their > chunk assignment (as a percentage of their chunk size)? (I haven't > looked at the code, so maybe this doesn't make sense). IIUC since the chunk size eventually becomes 1, we cannot simply just have parallel workers set the failure rate to its assigned chunk. > > For the success cap, we could have whoever hits it first disable eager > scanning for all future assigned chunks. Agreed. > > > Another idea is to disable the eager vacuum scan when parallel heap > > vacuum is enabled. It might look like just avoiding difficult things > > but it could make sense in a sense. The eager vacuum scan is aimed to > > amortize the aggressive vacuum by incrementally freezing pages that > > are potentially frozen by the next aggressive vacuum. On the other > > hand, parallel heap vacuum is available only in manual VACUUM and > > would be used to remove garbage on a large table as soon as possible > > or to freeze the entire table to avoid reaching the XID limit. So I > > think it might make sense to disable the eager vacuum scan when > > parallel vacuum. > > Do we only do parallelism in manual vacuum because we don't want to > use up too many parallel workers for a maintenance subsystem? 
I never > really tried to find out why parallel index vacuuming is only in > manual vacuum. I assume you made the same choice they did for the same > reasons. > > If the idea is to never allow parallelism in vacuum, then I think > disabling eager scanning during manual parallel vacuum seems > reasonable. People could use vacuum freeze if they want more freezing. IIUC the purpose of parallel vacuum is incompatible with the purpose of auto vacuum. The former is aimed to execute the vacuum as fast as possible using more resources, whereas the latter is aimed to execute the vacuum while not affecting foreground transaction processing. It's probably worth considering to enable parallel vacuum even for autovacuum in a wraparound situation, but the purpose would remain the same. > Also, if you start with only doing parallelism for the third phase of > heap vacuuming (second pass over the heap), this wouldn't be a problem > because eager scanning only impacts the first phase. Right. I'm inclined to support only the second heap pass as the first step. If we support parallelism only for the second pass, it cannot help speed up freezing the entire table in emergency situations, but it would be beneficial for cases where a big table have a large amount of spread garbage. At least, I'm going to reorganize the patch set to support parallelism for the second pass first and then the first heap pass. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Since the failure rate is defined as a percent, couldn't we just have > > parallel workers set eager_scan_remaining_fails when they get their > > chunk assignment (as a percentage of their chunk size)? (I haven't > > looked at the code, so maybe this doesn't make sense). > > IIUC since the chunk size eventually becomes 1, we cannot simply just > have parallel workers set the failure rate to its assigned chunk. Yep. The ranges are too big (1-8192). The behavior would be too different from serial. > > Also, if you start with only doing parallelism for the third phase of > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > because eager scanning only impacts the first phase. > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. > > At least, I'm going to reorganize the patch set to support parallelism > for the second pass first and then the first heap pass. Makes sense. - Melanie
On Tue, Feb 18, 2025 at 4:43 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 17, 2025 at 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Feb 14, 2025 at 2:21 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > Since the failure rate is defined as a percent, couldn't we just have > > > parallel workers set eager_scan_remaining_fails when they get their > > > chunk assignment (as a percentage of their chunk size)? (I haven't > > > looked at the code, so maybe this doesn't make sense). > > > > IIUC since the chunk size eventually becomes 1, we cannot simply just > > have parallel workers set the failure rate to its assigned chunk. > > Yep. The ranges are too big (1-8192). The behavior would be too > different from serial. > > > > Also, if you start with only doing parallelism for the third phase of > > > heap vacuuming (second pass over the heap), this wouldn't be a problem > > > because eager scanning only impacts the first phase. > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > > > At least, I'm going to reorganize the patch set to support parallelism > > for the second pass first and then the first heap pass. > > Makes sense. I've attached the updated patches. In this version, I focused on parallelizing only the second pass over the heap. It's more straightforward than supporting the first pass, it still requires many preliminary changes though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v9-0003-radixtree.h-support-shared-iteration.patch
- v9-0002-vacuumparallel.c-Support-parallel-table-vacuuming.patch
- v9-0004-tidstore.c-support-shared-iteration-on-TidStore.patch
- v9-0005-Move-some-fields-of-LVRelState-to-LVVacCounters-s.patch
- v9-0006-Support-parallelism-for-removing-dead-items-durin.patch
- v9-0001-Introduces-table-AM-APIs-for-parallel-table-vacuu.patch
On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. I'm inclined to support only the second heap pass as the first > step. If we support parallelism only for the second pass, it cannot > help speed up freezing the entire table in emergency situations, but > it would be beneficial for cases where a big table have a large amount > of spread garbage. I started looking at the most recent patch set for this, and while looking back over the thread, a couple random thoughts occurred to me: On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > - Patched > parallel 0: 53468.734 (15990, 28296, 9465) > parallel 1: 35712.613 ( 8254, 23569, 3700) These measurements were done before phase I and III were stream-ified, but even here those phases were already the quickest ones for a modest number of indexes and a single worker. I think it would have an even bigger difference when only a small percentage of pages need scanning/vacuuming, because it still has to read the entire index. It's a bit unfortunate that the simplest heap phase to parallelize is also the quickest to begin with. Pre-existing issue: We don't get anywhere near 2x improvement in phase II for 2 parallel index scans. We've known for a while that the shared memory TID store has more overhead than private memory, and here that overhead is about the same as the entirety of phase III with a single worker! It may be time to look into mitigations there, independently of this thread. The same commit that made the parallel scanning patch more difficult should also reduce the risk of having a large amount of freezing work at once in the first place. (There are still plenty of systemic things that can go wrong, of course, and it's always good if unpleasant work finishes faster.) I seem to recall a proposal from David Rowley to (something like) batch gathering xids for visibility checks during executor scans, but if so I can't find it in the archives. It's possible some similar work might speed up heap scanning in a more localized fashion. On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > TidStoreBeginIterateShared() is designed for multiple parallel workers > to iterate a shared TidStore. During an iteration, parallel workers > share the iteration state and iterate the underlying radix tree while > taking appropriate locks. Therefore, it's available only for a shared > TidStore. This is required to implement the parallel heap vacuum, > where multiple parallel workers do the iteration on the shared > TidStore. > > On the other hand, TidStoreBeginIterate() is designed for a single > process to iterate a TidStore. It accepts even a shared TidStore as > you mentioned, but during an iteration there is no inter-process > coordination such as locking. When it comes to parallel vacuum, > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > cover the case where we use only parallel index vacuum but not > parallel heap scan/vacuum.
In this case, we need to store dead tuple > TIDs on the shared TidStore during heap scan so parallel workers can > use it during index vacuum. But it's not necessary to use > TidStoreBeginIterateShared() because only one (leader) process does > heap vacuum. The takeaway I got is that the word "shared" is used for two different concepts. Future readers are not going to think to look back at this email for explanation. At the least there needs to be a different word for the new concept. There are quite a lot of additions to radix_tree.h and tidstore.c to make it work this way, including a new "control object", new locks, new atomic variables. I'm not sure it's really necessary. Is it possible to just store the "most recent block heap-vacuumed" in the shared vacuum state? Then each backend would lock the TID store, iterate starting at the next block, and unlock the store when it has enough blocks. Sometimes my brainstorms are unworkable for some reason I failed to think about, but this way seems 95% simpler -- we would only need to teach the existing iteration machinery to take a "start key". -- John Naylor Amazon Web Services
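A rough standalone sketch of the shared-cursor shape being suggested here (this is not tidstore.c code; it only shows workers claiming "start at block X" positions from one shared location, which in the real thing would be the start key handed to the existing iteration machinery):

/*
 * Standalone sketch: workers claim the next batch of heap blocks to vacuum
 * by advancing a shared cursor, then would iterate the TID store starting
 * at that block.  A lock around the store, as described above, would work
 * just as well as the atomic used here.
 */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int next_start_block = 0;

static unsigned int
claim_batch(unsigned int batch_size, unsigned int nblocks_total,
            unsigned int *start)
{
    *start = atomic_fetch_add(&next_start_block, batch_size);
    if (*start >= nblocks_total)
        return 0;                   /* nothing left to claim */
    if (*start + batch_size > nblocks_total)
        batch_size = nblocks_total - *start;
    return batch_size;
}

int
main(void)
{
    unsigned int start;
    unsigned int got;

    /* a single process simulating workers taking 1000-block batches */
    while ((got = claim_batch(1000, 2500, &start)) != 0)
        printf("worker claims blocks [%u, %u)\n", start, start + got);
    return 0;
}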
On Mon, Feb 17, 2025 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > If the idea is to never allow parallelism in vacuum, then I think
> > disabling eager scanning during manual parallel vacuum seems
> > reasonable. People could use vacuum freeze if they want more freezing.
>
> IIUC the purpose of parallel vacuum is incompatible with the purpose
> of auto vacuum. The former is aimed to execute the vacuum as fast as
> possible using more resources, whereas the latter is aimed to execute
> the vacuum while not affecting foreground transaction processing. It's
> probably worth considering to enable parallel vacuum even for
> autovacuum in a wraparound situation, but the purpose would remain the
> same.

That's assuming that the database is running with autovacuum_cost_delay > 0. There's presumably some number of systems that intentionally do not throttle autovacuum. Another consideration is that people are free to set vacuum_cost_page_miss = 0; in those cases it would still be useful to parallelize at least phase 1 and 2. Of course, folks can also set vacuum_cost_page_dirty = 0, in which case parallelization would help phase 2 and 3.

In terms of less hypothetical scenarios, I've definitely run into cases where 1 table accounts for 90%+ of the space used by a database. Parallelism would help ensure the single table is still processed in a timely fashion. (Assuming non-default settings for vacuum throttling.)
On Sun, Feb 23, 2025 at 8:51 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Right. I'm inclined to support only the second heap pass as the first > > step. If we support parallelism only for the second pass, it cannot > > help speed up freezing the entire table in emergency situations, but > > it would be beneficial for cases where a big table have a large amount > > of spread garbage. > > I started looking at the most recent patch set for this, and while > looking back over the thread, a couple random thoughts occurred to me: > > On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > - Patched > > parallel 0: 53468.734 (15990, 28296, 9465) > > parallel 1: 35712.613 ( 8254, 23569, 3700) > > These measurements were done before before phase I and III were > stream-ified, but even here those phases were already the quickest > ones for a modest number of indexes and a single worker. I think it > would have an even bigger difference when only a small percentage of > pages need scanning/vacuuming, because it still has to read the entire > index. It's a bit unfortunate that the simplest heap phase to > parallelize is also the quickest to begin with. > > Pre-existing issue: We don't get anywhere near 2x improvement in phase > II for 2 parallel index scans. We've known for a while that the shared > memory TID store has more overhead than private memory, and here that > overhead is about the same as the entirety of phase III with a single > worker! It may be time to look into mitigations there, independently > of this thread. Thank you for the comments. I've done simple performance benchmarks and attached the results. In this test, I prepared a table having one integer column and 3GB in size. The 'fraction' column means the fraction of pages with deleted rows; for example fraction=1000 means that the table has pages with deleted rows every 1000 pages. The results show the duration for each phase in addition to the total duration (for PATCHED only) and compares the total duration between the HEAD and the PATCHED. What I can see from these results was that we might not benefit much from parallelizing phase III, unfortunately. Although in the best case the phase III got about 2x speedup, as for the total duration it's about only 10% speedup. My analysis for these results matches what John mentioned; phase III is already the fastest phase and accounts only ~10% of the total execution time, and the overhead of shared TidStore offsets the speedup of phase III. > The same commit that made the parallel scanning patch more difficult > should also reduce the risk of having a large amount of freezing work > at once in the first place. (There are still plenty of systemic things > that can go wrong, of course, and it's always good if unpleasant work > finishes faster.) I think that vacuum would still need to scan a large amount of blocks of the table especially when it is very large and heavily modified. Parallel heap vacuum (only phase I) would be beneficial in case where autovacuum could not catch up. And we might want to consider using parallel heap vacuum also in autovacuum while integrating it with eagar freeze scan. > I seem to recall a proposal from David Rowley to (something like) > batch gathering xids for visibility checks during executor scans, but > if so I can't find it in the archives. It's possible some similar work > might speed up heap scanning in a more localized fashion. 
Interesting. > On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, pre-existing API, > > > TidStoreBeginIterate(), has already accepted the shared TidStore. The only difference > > > is whether elog(ERROR) exists, but I wonder if it benefits others. Is there another > > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()? > > > > TidStoreBeginIterateShared() is designed for multiple parallel workers > > to iterate a shared TidStore. During an iteration, parallel workers > > share the iteration state and iterate the underlying radix tree while > > taking appropriate locks. Therefore, it's available only for a shared > > TidStore. This is required to implement the parallel heap vacuum, > > where multiple parallel workers do the iteration on the shared > > TidStore. > > > > On the other hand, TidStoreBeginIterate() is designed for a single > > process to iterate a TidStore. It accepts even a shared TidStore as > > you mentioned, but during an iteration there is no inter-process > > coordination such as locking. When it comes to parallel vacuum, > > supporting TidStoreBeginIterate() on a shared TidStore is necessary to > > cover the case where we use only parallel index vacuum but not > > parallel heap scan/vacuum. In this case, we need to store dead tuple > > TIDs on the shared TidStore during heap scan so parallel workers can > > use it during index vacuum. But it's not necessary to use > > TidStoreBeginIterateShared() because only one (leader) process does > > heap vacuum. > > The takeaway I got is that the word "shared" is used for two different > concepts. Future readers are not going to think to look back at this > email for explanation. At the least there needs to be a different word > for the new concept. > > There are quite a lot of additions to radix_tree.h and tidstore.c to > make it work this way, including a new "control object", new locks, > new atomic variables. I'm not sure it's really necessary. Is it > possible to just store the "most recent block heap-vacuumed" in the > shared vacuum state? Then each backend would lock the TID store, > iterate starting at the next block, and unlock the store when it has > enough blocks. Sometimes my brainstorms are unworkable for some reason > I failed to think about, but this way seems 95% simpler -- we would > only need to teach the existing iteration machinery to take a "start > key". Interesting idea. While it might be worth evaluating this idea, given that the phase III accounts only a small portion of the total vacuum execution time, it might be better to switch focusing on parallelizing phase I. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > What I can see from these results was that we might not benefit much > from parallelizing phase III, unfortunately. Although in the best case > the phase III got about 2x speedup, as for the total duration it's > about only 10% speedup. My analysis for these results matches what > John mentioned; phase III is already the fastest phase and accounts > only ~10% of the total execution time, and the overhead of shared > TidStore offsets the speedup of phase III. So, are you proposing to drop the patches for parallelizing phase III for now? If so, are you planning on posting a set of patches just to parallelize phase I? I haven't looked at the prelim refactoring patches to see if they have independent value. What do you think is reasonable for us to try and do in the next few weeks? > > The same commit that made the parallel scanning patch more difficult > > should also reduce the risk of having a large amount of freezing work > > at once in the first place. (There are still plenty of systemic things > > that can go wrong, of course, and it's always good if unpleasant work > > finishes faster.) > > I think that vacuum would still need to scan a large amount of blocks > of the table especially when it is very large and heavily modified. > Parallel heap vacuum (only phase I) would be beneficial in case where > autovacuum could not catch up. And we might want to consider using > parallel heap vacuum also in autovacuum while integrating it with > eagar freeze scan. I'd be interested to hear more about this. - Melanie
On Tue, Feb 25, 2025 at 9:59 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > What I can see from these results was that we might not benefit much > > from parallelizing phase III, unfortunately. Although in the best case > > the phase III got about 2x speedup, as for the total duration it's > > about only 10% speedup. My analysis for these results matches what > > John mentioned; phase III is already the fastest phase and accounts > > only ~10% of the total execution time, and the overhead of shared > > TidStore offsets the speedup of phase III. > > So, are you proposing to drop the patches for parallelizing phase III > for now? If so, are you planning on posting a set of patches just to > parallelize phase I? I haven't looked at the prelim refactoring > patches to see if they have independent value. What do you think is > reasonable for us to try and do in the next few weeks? Given that we have only about one month until the feature freeze, I find that it's realistic to introduce either one parallelism for PG18 and at least we might want to implement the one first that is more beneficial and helpful for users. Since we found that parallel phase III is not very efficient in many cases, I'm thinking that in terms of PG18 development, we might want to switch focus to parallel phase I, and then go for phase III if we have time. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Given that we have only about one month until the feature freeze, I > find that it's realistic to introduce either one parallelism for PG18 > and at least we might want to implement the one first that is more > beneficial and helpful for users. Since we found that parallel phase > III is not very efficient in many cases, I'm thinking that in terms of > PG18 development, we might want to switch focus to parallel phase I, > and then go for phase III if we have time. Okay, well let me know how I can be helpful. Should I be reviewing a version that is already posted? - Melanie
On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Given that we have only about one month until the feature freeze, I > > find that it's realistic to introduce either one parallelism for PG18 > > and at least we might want to implement the one first that is more > > beneficial and helpful for users. Since we found that parallel phase > > III is not very efficient in many cases, I'm thinking that in terms of > > PG18 development, we might want to switch focus to parallel phase I, > > and then go for phase III if we have time. > > Okay, well let me know how I can be helpful. Should I be reviewing a > version that is already posted? Thank you so much. I'm going to submit the latest patches in a few days for parallelizing the phase I. I would appreciate it if you could review that version. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Given that we have only about one month until the feature freeze, I > > > find that it's realistic to introduce either one parallelism for PG18 > > > and at least we might want to implement the one first that is more > > > beneficial and helpful for users. Since we found that parallel phase > > > III is not very efficient in many cases, I'm thinking that in terms of > > > PG18 development, we might want to switch focus to parallel phase I, > > > and then go for phase III if we have time. > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > version that is already posted? > > Thank you so much. I'm going to submit the latest patches in a few > days for parallelizing the phase I. I would appreciate it if you could > review that version. > I've attached the updated patches that make the phase I (heap scanning) parallel. I'll share the benchmark results soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v10-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v10-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v10-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
- v10-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v10-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > Given that we have only about one month until the feature freeze, I > > > > find that it's realistic to introduce either one parallelism for PG18 > > > > and at least we might want to implement the one first that is more > > > > beneficial and helpful for users. Since we found that parallel phase > > > > III is not very efficient in many cases, I'm thinking that in terms of > > > > PG18 development, we might want to switch focus to parallel phase I, > > > > and then go for phase III if we have time. > > > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > > version that is already posted? > > > > Thank you so much. I'm going to submit the latest patches in a few > > days for parallelizing the phase I. I would appreciate it if you could > > review that version. > > > > I've attached the updated patches that make the phase I (heap > scanning) parallel. I'll share the benchmark results soon. > I've attached the benchmark test results. Overall, with the parallel heap scan (phase I), vacuum got much faster. On the other hand, looking at each phase I can see performance regressions in some cases: First, we can see a regression on a table with one index due to the overhead of the shared TidStore. Currently, we disable parallel index vacuuming if the table has only one index, as the leader process always takes one index. With this patch, we enable parallel heap scan even if parallel index vacuuming is disabled, ending up using the shared TidStore. In the benchmark test, while the regression due to that overhead is about 25%, the speedup from the parallel heap scan is 50% or more, so the performance numbers are good overall. I think we can improve the shared TidStore in the future. Another performance regression I can see in the results is that the heap vacuum phase (phase III) got slower with the patch. It's weird to me since I don't touch the code of the heap vacuum phase. I'm still investigating the cause. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Another performance regression I can see in the results is that heap > vacuum phase (phase III) got slower with the patch. It's weired to me > since I don't touch the code of heap vacuum phase. I'm still > investigating the cause. I have investigated this regression. I've confirmed that In both scenarios (patched and unpatched), the entire table and its associated indexes were loaded into the shared buffer before the vacuum. Then, the 'perf record' analysis, focused specifically on the heap vacuum phase of the patched code, revealed numerous soft page faults occurring: 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel | |--52.44%--lazy_vacuum_heap_rel | | | |--46.33%--lazy_vacuum_heap_page (inlined) | | | | | |--32.42%--heap_page_is_all_visible (inlined) | | | | | | | |--26.46%--HeapTupleSatisfiesVacuum | | | | HeapTupleSatisfiesVacuumHorizon | | | | HeapTupleHeaderXminCommitted (inlined) | | | | | | | | | --18.52%--page_fault | | | | do_page_fault | | | | __do_page_fault | | | | handle_mm_fault | | | | __handle_mm_fault | | | | handle_pte_fault | | | | | | | | | |--16.53%--filemap_map_pages | | | | | | | | | | | --2.63%--alloc_set_pte | | | | | pfn_pte | | | | | | | | | --1.99%--pmd_page_vaddr | | | | | | | --1.99%--TransactionIdPrecedes I did not observe these page faults in the 'perf record' results for the HEAD version. Furthermore, when I disabled parallel heap vacuum while keeping parallel index vacuuming enabled, the regression disappeared. Based on these findings, the likely cause of the regression appears to be that during parallel heap vacuum operations, table blocks were loaded into the shared buffer by parallel vacuum workers. However, in the heap vacuum phase, the leader process needed to process all blocks, resulting in soft page faults while creating Page Table Entries (PTEs). Without the patch, the backend process had already created PTEs during the heap scan, thus preventing these faults from occurring during the heap vacuum phase. It appears to be an inherent side effect of utilizing parallel queries. Given this understanding, it's likely an acceptable trade-off that we can accommodate. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Some minor review comments for patch v10-0001. ====== src/include/access/tableam.h 1. struct IndexInfo; +struct ParallelVacuumState; +struct ParallelContext; +struct ParallelWorkerContext; struct SampleScanState; Use alphabetical order for consistency with existing code. ~~~ 2. + /* + * Estimate the size of shared memory that the parallel table vacuum needs + * for AM + * 2a. Missing period (.) 2b. Change the wording to be like below, for consistency with the other 'estimate' function comments, and for consistency with the comment where this function is implemented. Estimate the size of shared memory needed for a parallel table vacuum of this relation. ~~~ 3. + * The state_out is the output parameter so that an arbitrary data can be + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. Typo? "an arbitrary data" ~~~ 4. General/Asserts All the below functions have a comment saying "Not called if parallel table vacuum is disabled." - parallel_vacuum_estimate - parallel_vacuum_initialize - parallel_vacuum_initialize_worker - parallel_vacuum_collect_dead_items But, it's only a comment. I wondered if they should all have some Assert as an integrity check on that. ~~~ 5. +/* + * Return the number of parallel workers for a parallel vacuum scan of this + * relation. + */ "Return the number" or "Compute the number"? The comment should match the comment in the fwd declaration of this function. ~~~ 6. +/* + * Perform a parallel vacuums scan to collect dead items. + */ 6a. "Perform" or "Execute"? The comment should match the one in the fwd declaration of this function. 6b. Typo "vacuums" ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. FYI. I am observing the following test behaviour: I apply patch v10-0001, do a clean rebuild and run 'make check', and all tests are OK. Then, after I apply just patch v10-0002 on top of 0001, do a clean rebuild and run 'make check' there are many test fails. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Sawada-San. Here are some review comments for patch v10-0002. ====== src/backend/access/heap/heapam_handler.c 1. .scan_bitmap_next_block = heapam_scan_bitmap_next_block, .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, .scan_sample_next_block = heapam_scan_sample_next_block, - .scan_sample_next_tuple = heapam_scan_sample_next_tuple + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, + + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, Pretty much every other heam AM method here has a 'heapam_' prefix. Is it better to be consistent and call the new function 'heapam_parallel_vacuum_compute_workers' instead of 'heap_parallel_vacuum_compute_workers'? ====== src/backend/access/heap/vacuumlazy.c 2. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Disabled so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) +{ + return 0; +} + Instead of saying "Disabled so far", the function comment maybe should say: "Return 0 means parallel heap vacuum is disabled." Then the comment doesn't need to churn later when the function gets implemented in later patches. ====== src/backend/commands/vacuumparallel.c 3. +/* The kind of parallel vacuum work */ +typedef enum +{ + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkItem; + Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for the field name: 'work_phase' seems more appropriate. ~~~ 4. + /* + * Processing indexes or removing dead tuples from the table. + */ + PVWorkItem work_item; Missing question mark for this comment? ~~~ 5. + /* + * The number of workers for parallel table vacuuming. If > 0, the + * parallel table vacuum is enabled. + */ + int nworkers_for_table; + I guess this field will never be negative. So is it simpler to modify the comment to say: "If 0, parallel table vacuum is disabled." ~~~ parallel_vacuum_init: 6. + /* A parallel vacuum must be requested */ Assert(nrequested_workers >= 0); It's not very intuitive to say the user requested a parallel vacuum when the 'nrequested_workers' is 0. I felt some more commentary is needed here or in the function header; it seems nrequested_workers == 0 has a special meaning of having the system decide the number of workers. ~~~ parallel_vacuum_compute_workers: 7. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, bool *will_parallel_vacuum) { int nindexes_parallel = 0; int nindexes_parallel_bulkdel = 0; int nindexes_parallel_cleanup = 0; - int parallel_workers; + int nworkers_table = 0; + int nworkers_index = 0; + + *nworkers_table_p = 0; 7a. AFAICT you can just remove the variable 'nworkers_table', and instead call the 'nworkers_table_p' parameter as 'nworkers_table' ~ 7b. IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now that the patch introduces parallel table vacuuming it makes this existing 'will_parallel_vacuum' name generic/misleading. Maybe it now needs to be called 'will_parallel_index_vacuum' or similar (in all places). ~~~ 8. + if (nrequested > 0) + { + /* + * If the parallel degree is specified, accept that as the number of + * parallel workers for table vacuum (though still cap at + * max_parallel_maintenance_workers). 
+ */ + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); + } + else + { + /* Compute the number of workers for parallel table scan */ + nworkers_table = table_parallel_vacuum_compute_workers(rel, + max_parallel_maintenance_workers); + + Assert(nworkers_table <= max_parallel_maintenance_workers); + } + The Assert can be outside of the if/else because it is the same for both. ~~~ 9. /* No index supports parallel vacuum */ - if (nindexes_parallel <= 0) - return 0; - - /* Compute the parallel degree */ - parallel_workers = (nrequested > 0) ? - Min(nrequested, nindexes_parallel) : nindexes_parallel; + if (nindexes_parallel > 0) + { + /* Take into account the requested number of workers */ + nworkers_index = (nrequested > 0) ? + Min(nrequested, nindexes_parallel) : nindexes_parallel; - /* Cap by max_parallel_maintenance_workers */ - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); + /* Cap by max_parallel_maintenance_workers */ + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); + } "No index supports..." seems to be an old comment that is not correct for this new code block. ~~~ 10. + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); +} + + Double blank line after function 'parallel_vacuum_process_table' ~~~ main: 11. + /* Initialize AM-specific vacuum state for parallel table vacuuming */ + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + ParallelWorkerContext pwcxt; + + pwcxt.toc = toc; + pwcxt.seg = seg; + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, + &state); + } + Wondering if this code can be done in the same if block later: "if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" ~~~ 12. + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) + { + /* Scan the table to collect dead items */ + parallel_vacuum_process_table(&pvs, state); + } + else + { + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); + + /* Process indexes to perform vacuum/cleanup */ + parallel_vacuum_process_safe_indexes(&pvs); + } Would this if/else be better implemented as a 'switch' for the possible phases? ====== Kind Regards, Peter Smith. Fujitsu Australia
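For what it's worth, the switch suggested in comment 12 could look like the short sketch below. The enum values and the two helper functions are the ones quoted from the v10 patch, so this is only a restatement of the quoted if/else, not new behavior.

/* Sketch of review comment 12: dispatch on the work phase with a switch */
switch (pvs.shared->work_item)
{
    case PV_WORK_ITEM_COLLECT_DEAD_ITEMS:
        /* Scan the table to collect dead items */
        parallel_vacuum_process_table(&pvs, state);
        break;

    case PV_WORK_ITEM_PROCESS_INDEXES:
        /* Process indexes to perform vacuum/cleanup */
        parallel_vacuum_process_safe_indexes(&pvs);
        break;

    default:
        elog(ERROR, "unexpected parallel vacuum work phase %d",
             (int) pvs.shared->work_item);
}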
On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 1:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 25, 2025 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > Given that we have only about one month until the feature freeze, I > > > > > find that it's realistic to introduce either one parallelism for PG18 > > > > > and at least we might want to implement the one first that is more > > > > > beneficial and helpful for users. Since we found that parallel phase > > > > > III is not very efficient in many cases, I'm thinking that in terms of > > > > > PG18 development, we might want to switch focus to parallel phase I, > > > > > and then go for phase III if we have time. > > > > > > > > Okay, well let me know how I can be helpful. Should I be reviewing a > > > > version that is already posted? > > > > > > Thank you so much. I'm going to submit the latest patches in a few > > > days for parallelizing the phase I. I would appreciate it if you could > > > review that version. > > > > > > > I've attached the updated patches that make the phase I (heap > > scanning) parallel. I'll share the benchmark results soon. > > > > I've attached the benchmark test results. > > Overall, with the parallel heap scan (phase I), the vacuum got speedup > much. On the other hand, looking at each phase I can see performance > regressions in some cases: > > First, we can see the regression on a table with one index due to > overhead of the shared TidStore. Currently, we disable parallel index > vacuuming if the table has only one index as the leader process always > takes one index. With this patch, we enable parallel heap scan even if > the parallel index vacuuming is disabled, ending up using the shared > TidStore. In the benchmark test, while the regression due to that > overhead is about ~25% the speedup by parallel heap scan is 50%~, so > the performance number is good overall. I think we can improve the > shared TidStore in the future. > > Another performance regression I can see in the results is that heap > vacuum phase (phase III) got slower with the patch. It's weired to me > since I don't touch the code of heap vacuum phase. I'm still > investigating the cause. > Discussing with Amit offlist, I've run another benchmark test where no data is loaded on the shared buffer. In the previous test, I loaded all table blocks before running vacuum, so it was the best case. The attached test results showed the worst case. Overall, while the numbers seem not stable, the phase I got sped up a bit, but not as scalable as expected, which is not surprising. Please note that the test results shows that the phase III also got sped up but this is because in parallel vacuum we use more ring buffers than the single process vacuum. So we need to compare the only phase I time in terms of the benefit of the parallelism. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Thu, Mar 6, 2025 at 5:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Some minor review comments for patch v10-0001. > > ====== > src/include/access/tableam.h > > 1. > struct IndexInfo; > +struct ParallelVacuumState; > +struct ParallelContext; > +struct ParallelWorkerContext; > struct SampleScanState; > > Use alphabetical order for consistency with existing code. > > ~~~ > > 2. > + /* > + * Estimate the size of shared memory that the parallel table vacuum needs > + * for AM > + * > > 2a. > Missing period (.) > > 2b. > Change the wording to be like below, for consistency with the other > 'estimate' function comments, and for consistency with the comment > where this function is implemented. > > Estimate the size of shared memory needed for a parallel table vacuum > of this relation. > > ~~~ > > 3. > + * The state_out is the output parameter so that an arbitrary data can be > + * passed to the subsequent callback, parallel_vacuum_remove_dead_items. > > Typo? "an arbitrary data" > > ~~~ > > 4. General/Asserts > > All the below functions have a comment saying "Not called if parallel > table vacuum is disabled." > - parallel_vacuum_estimate > - parallel_vacuum_initialize > - parallel_vacuum_initialize_worker > - parallel_vacuum_collect_dead_items > > But, it's only a comment. I wondered if they should all have some > Assert as an integrity check on that. > > ~~~ > > 5. > +/* > + * Return the number of parallel workers for a parallel vacuum scan of this > + * relation. > + */ > > "Return the number" or "Compute the number"? > The comment should match the comment in the fwd declaration of this function. > > ~~~ > > 6. > +/* > + * Perform a parallel vacuums scan to collect dead items. > + */ > > 6a. > "Perform" or "Execute"? > The comment should match the one in the fwd declaration of this function. > > 6b. > Typo "vacuums" > Thank you for reviewing the patch. I'll address these comments and submit the updated version patches soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 6, 2025 at 10:56 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-San. Here are some review comments for patch v10-0002. Thank you for reviewing the patch. > > ====== > src/backend/access/heap/heapam_handler.c > > 1. > .scan_bitmap_next_block = heapam_scan_bitmap_next_block, > .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, > .scan_sample_next_block = heapam_scan_sample_next_block, > - .scan_sample_next_tuple = heapam_scan_sample_next_tuple > + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, > + > + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, > > Pretty much every other heam AM method here has a 'heapam_' prefix. Is > it better to be consistent and call the new function > 'heapam_parallel_vacuum_compute_workers' instead of > 'heap_parallel_vacuum_compute_workers'? Hmm, given that the existing vacuum-related callback, heap_vacuum_rel, uses 'heap' I guess it's okay with 'heap_' prefix for parallel vacuum callbacks. > > ====== > src/backend/access/heap/vacuumlazy.c > > 2. > +/* > + * Compute the number of workers for parallel heap vacuum. > + * > + * Disabled so far. > + */ > +int > +heap_parallel_vacuum_compute_workers(Relation rel, int max_workers) > +{ > + return 0; > +} > + > > Instead of saying "Disabled so far", the function comment maybe should say: > "Return 0 means parallel heap vacuum is disabled." > > Then the comment doesn't need to churn later when the function gets > implemented in later patches. Updated. > > ====== > src/backend/commands/vacuumparallel.c > > 3. > +/* The kind of parallel vacuum work */ > +typedef enum > +{ > + PV_WORK_ITEM_PROCESS_INDEXES, /* index vacuuming or cleanup */ > + PV_WORK_ITEM_COLLECT_DEAD_ITEMS, /* collect dead tuples */ > +} PVWorkItem; > + > > Isn't this more like a PVWorkPhase instead of PVWorkItem? Ditto for > the field name: 'work_phase' seems more appropriate. Agreed. > > ~~~ > > 4. > + /* > + * Processing indexes or removing dead tuples from the table. > + */ > + PVWorkItem work_item; > > Missing question mark for this comment? Updated. > > ~~~ > > 5. > + /* > + * The number of workers for parallel table vacuuming. If > 0, the > + * parallel table vacuum is enabled. > + */ > + int nworkers_for_table; > + > > I guess this field will never be negative. So is it simpler to modify > the comment to say: > "If 0, parallel table vacuum is disabled." Agreed. > > ~~~ > > parallel_vacuum_init: > > 6. > + /* A parallel vacuum must be requested */ > Assert(nrequested_workers >= 0); > > It's not very intuitive to say the user requested a parallel vacuum > when the 'nrequested_workers' is 0. I felt some more commentary is > needed here or in the function header; it seems nrequested_workers == > 0 has a special meaning of having the system decide the number of > workers. Added some comments. > > ~~~ > > parallel_vacuum_compute_workers: > > 7. > static int > -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int > nrequested, > +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, > + int nrequested, int *nworkers_table_p, > bool *will_parallel_vacuum) > { > int nindexes_parallel = 0; > int nindexes_parallel_bulkdel = 0; > int nindexes_parallel_cleanup = 0; > - int parallel_workers; > + int nworkers_table = 0; > + int nworkers_index = 0; > + > + *nworkers_table_p = 0; > > 7a. 
> AFAICT you can just remove the variable 'nworkers_table', and instead > call the 'nworkers_table_p' parameter as 'nworkers_table' Yes, but I wanted to not modify the output parameter many times. > > ~ > > 7b. > IIUC this 'will_parallel_vacuum' only has meaning for indexes, but now > that the patch introduces parallel table vacuuming it makes this > existing 'will_parallel_vacuum' name generic/misleading. Maybe it now > needs to be called 'will_parallel_index_vacuum' or similar (in all > places). Agreed. > > ~~~ > > 8. > + if (nrequested > 0) > + { > + /* > + * If the parallel degree is specified, accept that as the number of > + * parallel workers for table vacuum (though still cap at > + * max_parallel_maintenance_workers). > + */ > + nworkers_table = Min(nrequested, max_parallel_maintenance_workers); > + } > + else > + { > + /* Compute the number of workers for parallel table scan */ > + nworkers_table = table_parallel_vacuum_compute_workers(rel, > + max_parallel_maintenance_workers); > + > + Assert(nworkers_table <= max_parallel_maintenance_workers); > + } > + > > The Assert can be outside of the if/else because it is the same for both. This part has been removed by modifying the parallel degree computation logic. > > ~~~ > > 9. > /* No index supports parallel vacuum */ > - if (nindexes_parallel <= 0) > - return 0; > - > - /* Compute the parallel degree */ > - parallel_workers = (nrequested > 0) ? > - Min(nrequested, nindexes_parallel) : nindexes_parallel; > + if (nindexes_parallel > 0) > + { > + /* Take into account the requested number of workers */ > + nworkers_index = (nrequested > 0) ? > + Min(nrequested, nindexes_parallel) : nindexes_parallel; > > - /* Cap by max_parallel_maintenance_workers */ > - parallel_workers = Min(parallel_workers, max_parallel_maintenance_workers); > + /* Cap by max_parallel_maintenance_workers */ > + nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers); > + } > > "No index supports..." seems to be an old comment that is not correct > for this new code block. Removed. > > ~~~ > > 10. > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > +} > + > + > > Double blank line after function 'parallel_vacuum_process_table' Removed. > > ~~~ > > main: > > 11. > + /* Initialize AM-specific vacuum state for parallel table vacuuming */ > + if (shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + ParallelWorkerContext pwcxt; > + > + pwcxt.toc = toc; > + pwcxt.seg = seg; > + table_parallel_vacuum_initialize_worker(rel, &pvs, &pwcxt, > + &state); > + } > + > > Wondering if this code can be done in the same if block later: "if > (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS)" If we do that, we would end up calling InstrStartParallelQuery() before the worker initialization. I guess we want to avoid counting these usage during the initialization. > > ~~~ > > 12. > + if (pvs.shared->work_item == PV_WORK_ITEM_COLLECT_DEAD_ITEMS) > + { > + /* Scan the table to collect dead items */ > + parallel_vacuum_process_table(&pvs, state); > + } > + else > + { > + Assert(pvs.shared->work_item == PV_WORK_ITEM_PROCESS_INDEXES); > + > + /* Process indexes to perform vacuum/cleanup */ > + parallel_vacuum_process_safe_indexes(&pvs); > + } > > Would this if/else be better implemented as a 'switch' for the possible phases? > Okay, changed. I've attached the updated version patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v11-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v11-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v11-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v11-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v11-0002-vacuumparallel.c-Support-parallel-table-vacuumin.patch
Hi Sawada-San, Here are some review comments for patch v11-0001. ====== src/backend/access/heap/vacuumlazy.c 1. +/* + * Compute the number of workers for parallel heap vacuum. + * + * Return 0 to disable parallel vacuum so far. + */ +int +heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested) You don't need to say "so far". ====== src/backend/access/table/tableamapi.c 2. Assert(routine->relation_vacuum != NULL); + Assert(routine->parallel_vacuum_compute_workers != NULL); Assert(routine->scan_analyze_next_block != NULL); Is it better to keep these Asserts in the same order that the TableAmRoutine fields are assigned (in heapam_handler.c)? ~~~ 3. + /* + * Callbacks for parallel vacuum are also optional (except for + * parallel_vacuum_compute_workers). But one callback implies presence of + * the others. + */ + Assert(((((routine->parallel_vacuum_estimate == NULL) == + (routine->parallel_vacuum_initialize == NULL)) == + (routine->parallel_vacuum_initialize_worker == NULL)) == + (routine->parallel_vacuum_collect_dead_items == NULL))); /also optional/optional/ ====== src/include/access/heapam.h +extern int heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested); 4. wrong tab/space after 'int'. ====== src/include/access/tableam.h 5. + /* + * Compute the number of parallel workers for parallel table vacuum. The + * parallel degree for parallel vacuum is further limited by + * max_parallel_maintenance_workers. The function must return 0 to disable + * parallel table vacuum. + * + * 'nworkers_requested' is a >=0 number and the requested number of + * workers. This comes from the PARALLEL option. 0 means to choose the + * parallel degree based on the table AM specific factors such as table + * size. + */ + int (*parallel_vacuum_compute_workers) (Relation rel, + int nworkers_requested); The comment here says "This comes from the PARALLEL option." and "0 means to choose the parallel degree...". But, the PG docs [1] says "To disable this feature, one can use PARALLEL option and specify parallel workers as zero.". These two descriptions "disable this feature" (PG docs) and letting the system "choose the parallel degree" (code comment) don't sound the same. Should this 0001 patch update the PG documentation for the effect of setting PARALLEL value zero? ~~~ 6. + /* + * Initialize DSM space for parallel table vacuum. + * + * Not called if parallel table vacuum is disabled. + * + * Optional callback, but either all other parallel vacuum callbacks need + * to exist, or neither. + */ "or neither"? Also, saying "all other" seems incorrect because parallel_vacuum_compute_workers callback must exist event if parallel_vacuum_initialize does not exist. IMO you meant to say "all optional", and "or none". SUGGESTION: Optional callback. Either all optional parallel vacuum callbacks need to exist, or none. (this same issue is repeated in multiple places). ====== [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
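One possible way to express the "all optional callbacks or none" rule from comment 3 without the chained equalities is to count the callbacks that were provided; this is only an illustrative alternative using the callback names quoted above, not code from the patch.

/* Sketch: the optional parallel-vacuum callbacks must be given as a group */
int nprovided = (routine->parallel_vacuum_estimate != NULL) +
                (routine->parallel_vacuum_initialize != NULL) +
                (routine->parallel_vacuum_initialize_worker != NULL) +
                (routine->parallel_vacuum_collect_dead_items != NULL);

Assert(nprovided == 0 || nprovided == 4);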
Hi Sawada-San. Here are some review comments for patch v11-0002 ====== Commit message. 1. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. This function implementation was moved into patch 0001, so probably this part of the commit message comment also belongs now in patch 0001. ====== src/backend/commands/vacuumparallel.c 2. + * For processing indexes in parallel, individual indexes are processed by one + * vacuum process. We launch parallel worker processes at the start of parallel index + * bulk-deletion and index cleanup and once all indexes are processed, the parallel + * worker processes exit. + * "are processed by one vacuum process." -- Did you mean "are processed by separate vacuum processes." ? ~~~ 3. + * + * Each time we process table or indexes in parallel, the parallel context is + * re-initialized so that the same DSM can be used for multiple passes of table vacuum + * or index bulk-deletion and index cleanup. Maybe I am mistaken, but it seems like the logic is almost always re-initializing this. I wonder if it might be simpler to just remove the 'need_reinitialize_dsm' field and initialize unconditionally. ~~~ 4. + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. + * * On success, return parallel vacuum state. Otherwise return NULL. */ SUGGESTION nrequested_workers is the requested parallel degree (>=0). 0 means that... ~~~ 5. static int -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum) IIUC the returns for this function seem inconsistent. AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. Why not make this a void function, but introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'? ====== Kind Regards, Peter Smith. Fujitsu Australia
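Regarding comment 5 above, one possible shape for the signature is sketched below: make the function void and report both parallel degrees through output parameters. The table-side computation mirrors the hunks quoted earlier in the thread; the index-counting part is elided, and the exact parameter names are only assumptions, not taken from the patch.

/* Sketch of review comment 5: report both degrees via output parameters */
static void
parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes,
                                int nrequested,
                                int *nworkers_table_p, int *nworkers_index_p,
                                bool *idx_will_parallel_vacuum)
{
    int         nindexes_parallel = 0;  /* filled by the elided index checks */
    int         nworkers_table;
    int         nworkers_index = 0;

    /* table side: honor the user's request, otherwise ask the table AM */
    if (nrequested > 0)
        nworkers_table = Min(nrequested, max_parallel_maintenance_workers);
    else
        nworkers_table = table_parallel_vacuum_compute_workers(rel,
                                        max_parallel_maintenance_workers);

    /*
     * ... count indexes supporting parallel vacuum into nindexes_parallel,
     * setting idx_will_parallel_vacuum[] as in the existing code ...
     */

    if (nindexes_parallel > 0)
    {
        nworkers_index = (nrequested > 0) ?
            Min(nrequested, nindexes_parallel) : nindexes_parallel;
        nworkers_index = Min(nworkers_index, max_parallel_maintenance_workers);
    }

    *nworkers_table_p = nworkers_table;
    *nworkers_index_p = nworkers_index;
}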
On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Another performance regression I can see in the results is that heap > > vacuum phase (phase III) got slower with the patch. It's weired to me > > since I don't touch the code of heap vacuum phase. I'm still > > investigating the cause. > > I have investigated this regression. I've confirmed that In both > scenarios (patched and unpatched), the entire table and its associated > indexes were loaded into the shared buffer before the vacuum. Then, > the 'perf record' analysis, focused specifically on the heap vacuum > phase of the patched code, revealed numerous soft page faults > occurring: > > 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel > | > |--52.44%--lazy_vacuum_heap_rel > | | > | |--46.33%--lazy_vacuum_heap_page (inlined) > | | | > | | |--32.42%--heap_page_is_all_visible (inlined) > | | | | > | | | |--26.46%--HeapTupleSatisfiesVacuum > | | | | > HeapTupleSatisfiesVacuumHorizon > | | | | > HeapTupleHeaderXminCommitted (inlined) > | | | | | > | | | | --18.52%--page_fault > | | | | do_page_fault > | | | | > __do_page_fault > | | | | > handle_mm_fault > | | | | > __handle_mm_fault > | | | | > handle_pte_fault > | | | | | > | | | | > |--16.53%--filemap_map_pages > | | | | | | > | | | | | > --2.63%--alloc_set_pte > | | | | | > pfn_pte > | | | | | > | | | | > --1.99%--pmd_page_vaddr > | | | | > | | | --1.99%--TransactionIdPrecedes > > I did not observe these page faults in the 'perf record' results for > the HEAD version. Furthermore, when I disabled parallel heap vacuum > while keeping parallel index vacuuming enabled, the regression > disappeared. Based on these findings, the likely cause of the > regression appears to be that during parallel heap vacuum operations, > table blocks were loaded into the shared buffer by parallel vacuum > workers. > In the previous paragraph, you mentioned that the entire table and its associated indexes were loaded into the shared buffer before the vacuum. If that is true, then why does the parallel vacuum need to reload the table blocks into shared buffers? > However, in the heap vacuum phase, the leader process needed > to process all blocks, resulting in soft page faults while creating > Page Table Entries (PTEs). Without the patch, the backend process had > already created PTEs during the heap scan, thus preventing these > faults from occurring during the heap vacuum phase. > This part is again not clear to me because I am assuming all the data exists in shared buffers before the vacuum, so why the page faults will occur in the first place. -- With Regards, Amit Kapila.
On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Discussing with Amit offlist, I've run another benchmark test where no > data is loaded on the shared buffer. In the previous test, I loaded > all table blocks before running vacuum, so it was the best case. The > attached test results showed the worst case. > > Overall, while the numbers seem not stable, the phase I got sped up a > bit, but not as scalable as expected, which is not surprising. > Sorry, but it is difficult for me to understand this data because it doesn't contain the schema or details like what exactly is a fraction. It is also not clear how the workers are divided among heap and indexes, like do we use parallelism for both phases of heap or only first phase and do we reuse those workers for index vacuuming. These tests were probably discussed earlier, but it would be better to either add a summary of the required information to understand the results or at least a link to a previous email that has such details. Please > note that the test results shows that the phase III also got sped up > but this is because in parallel vacuum we use more ring buffers than > the single process vacuum. So we need to compare the only phase I time > in terms of the benefit of the parallelism. > Does phase 3 also use parallelism? If so, can we try to divide the ring buffers among workers or at least try vacuum with an increased number of ring buffers. This would be good to do for both the phases, if they both use parallelism. -- With Regards, Amit Kapila.
Hi Sawada-San. Some review comments for patch v11-0003. ====== src/backend/access/heap/vacuumlazy.c 1. +typedef struct LVScanData +{ + BlockNumber rel_pages; /* total number of pages */ + + BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */ + + /* + * Count of all-visible blocks eagerly scanned (for logging only). This + * does not include skippable blocks scanned due to SKIP_PAGES_THRESHOLD. + */ + BlockNumber eager_scanned_pages; + + BlockNumber removed_pages; /* # pages removed by relation truncation */ + BlockNumber new_frozen_tuple_pages; /* # pages with newly frozen tuples */ + + + /* # pages newly set all-visible in the VM */ + BlockNumber vm_new_visible_pages; + + /* + * # pages newly set all-visible and all-frozen in the VM. This is a + * subset of vm_new_visible_pages. That is, vm_new_visible_pages includes + * all pages set all-visible, but vm_new_visible_frozen_pages includes + * only those which were also set all-frozen. + */ + BlockNumber vm_new_visible_frozen_pages; + + /* # all-visible pages newly set all-frozen in the VM */ + BlockNumber vm_new_frozen_pages; + + BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */ + BlockNumber missed_dead_pages; /* # pages with missed dead tuples */ + BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ + + /* Counters that follow are only for scanned_pages */ + int64 tuples_deleted; /* # deleted from table */ + int64 tuples_frozen; /* # newly frozen */ + int64 lpdead_items; /* # deleted from indexes */ + int64 live_tuples; /* # live tuples remaining */ + int64 recently_dead_tuples; /* # dead, but not yet removable */ + int64 missed_dead_tuples; /* # removable, but not removed */ + + /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid. */ + TransactionId NewRelfrozenXid; + MultiXactId NewRelminMxid; + bool skippedallvis; +} LVScanData; + 1a. Double blank line after 'new_frozen_tuple_pages' field. ~ 1b. I know you have only refactored existing code but IMO, the use of "#" meaning "Number of" or "Count of" doesn't seem nice. It *may* be justifiable to prevent wrapping of the short comments on the same line as the fields, but for all the other larger comments it looks strange not to be spelled out properly as words. ~~~ heap_vacuum_rel: 2. /* Initialize page counters explicitly (be tidy) */ - vacrel->scanned_pages = 0; - vacrel->eager_scanned_pages = 0; - vacrel->removed_pages = 0; - vacrel->new_frozen_tuple_pages = 0; - vacrel->lpdead_item_pages = 0; - vacrel->missed_dead_pages = 0; - vacrel->nonempty_pages = 0; + scan_data = palloc(sizeof(LVScanData)); + scan_data->scanned_pages = 0; + scan_data->eager_scanned_pages = 0; + scan_data->removed_pages = 0; + scan_data->new_frozen_tuple_pages = 0; + scan_data->lpdead_item_pages = 0; + scan_data->missed_dead_pages = 0; + scan_data->nonempty_pages = 0; + scan_data->tuples_deleted = 0; + scan_data->tuples_frozen = 0; + scan_data->lpdead_items = 0; + scan_data->live_tuples = 0; + scan_data->recently_dead_tuples = 0; + scan_data->missed_dead_tuples = 0; + scan_data->vm_new_visible_pages = 0; + scan_data->vm_new_visible_frozen_pages = 0; + scan_data->vm_new_frozen_pages = 0; + scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel); + vacrel->scan_data = scan_data; Or, you could replace most of these assignments by using palloc0, and write a comment to say all the counters were zapped to 0. ~~~ 3. 
+ scan_data->vm_new_frozen_pages = 0; + scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel); + vacrel->scan_data = scan_data; /* dead_items_alloc allocates vacrel->dead_items later on */ That comment about dead_items_alloc seems misplaced now because it is grouped with all the scan_data stuff. ====== Kind Regards, Peter Smith. Fujitsu Australia
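A small sketch of the palloc0 suggestion in comment 2: zero the whole counter struct in one allocation, and keep explicit assignments only for fields whose initial values are not zero (exactly which fields those are is per the patch, not shown here).

/* Sketch of review comment 2: zero the counter struct in one go */
LVScanData *scan_data;

scan_data = (LVScanData *) palloc0(sizeof(LVScanData));

/* all block/tuple counters start at zero thanks to palloc0 */
scan_data->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
/* NewRelfrozenXid, NewRelminMxid, etc. still get their explicit values here */

vacrel->scan_data = scan_data;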
On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 5, 2025 at 6:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Mar 3, 2025 at 3:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Another performance regression I can see in the results is that heap > > > vacuum phase (phase III) got slower with the patch. It's weired to me > > > since I don't touch the code of heap vacuum phase. I'm still > > > investigating the cause. > > > > I have investigated this regression. I've confirmed that In both > > scenarios (patched and unpatched), the entire table and its associated > > indexes were loaded into the shared buffer before the vacuum. Then, > > the 'perf record' analysis, focused specifically on the heap vacuum > > phase of the patched code, revealed numerous soft page faults > > occurring: > > > > 62.37% 13.90% postgres postgres [.] lazy_vacuum_heap_rel > > | > > |--52.44%--lazy_vacuum_heap_rel > > | | > > | |--46.33%--lazy_vacuum_heap_page (inlined) > > | | | > > | | |--32.42%--heap_page_is_all_visible (inlined) > > | | | | > > | | | |--26.46%--HeapTupleSatisfiesVacuum > > | | | | > > HeapTupleSatisfiesVacuumHorizon > > | | | | > > HeapTupleHeaderXminCommitted (inlined) > > | | | | | > > | | | | --18.52%--page_fault > > | | | | do_page_fault > > | | | | > > __do_page_fault > > | | | | > > handle_mm_fault > > | | | | > > __handle_mm_fault > > | | | | > > handle_pte_fault > > | | | | | > > | | | | > > |--16.53%--filemap_map_pages > > | | | | | | > > | | | | | > > --2.63%--alloc_set_pte > > | | | | | > > pfn_pte > > | | | | | > > | | | | > > --1.99%--pmd_page_vaddr > > | | | | > > | | | --1.99%--TransactionIdPrecedes > > > > I did not observe these page faults in the 'perf record' results for > > the HEAD version. Furthermore, when I disabled parallel heap vacuum > > while keeping parallel index vacuuming enabled, the regression > > disappeared. Based on these findings, the likely cause of the > > regression appears to be that during parallel heap vacuum operations, > > table blocks were loaded into the shared buffer by parallel vacuum > > workers. > > > > In the previous paragraph, you mentioned that the entire table and its > associated indexes were loaded into the shared buffer before the > vacuum. If that is true, then why does the parallel vacuum need to > reload the table blocks into shared buffers? Hmm, my above sentences are wrong. All pages are loaded into the shared buffer before the vacuum test. In parallel heap vacuum cases, the leader process reads a part of the table during phase I whereas in single-process vacuum case, the process reads the entire table. So there will be differences in the phase III in terms of PTEs as described below. > > > However, in the heap vacuum phase, the leader process needed > > to process all blocks, resulting in soft page faults while creating > > Page Table Entries (PTEs). Without the patch, the backend process had > > already created PTEs during the heap scan, thus preventing these > > faults from occurring during the heap vacuum phase. > > > > This part is again not clear to me because I am assuming all the data > exists in shared buffers before the vacuum, so why the page faults > will occur in the first place. IIUC PTEs are process-local data. So even if physical pages are loaded to PostgreSQL's shared buffer (and paga caches), soft page faults (or minor page faults)[1] can occur if these pages are not yet mapped in its page table. 
Regards, [1] https://en.wikipedia.org/wiki/Page_fault#Minor -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 7, 2025 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Discussing with Amit offlist, I've run another benchmark test where no > > data is loaded on the shared buffer. In the previous test, I loaded > > all table blocks before running vacuum, so it was the best case. The > > attached test results showed the worst case. > > > > Overall, while the numbers seem not stable, the phase I got sped up a > > bit, but not as scalable as expected, which is not surprising. > > > > Sorry, but it is difficult for me to understand this data because it > doesn't contain the schema or details like what exactly is a fraction. > It is also not clear how the workers are divided among heap and > indexes, like do we use parallelism for both phases of heap or only > first phase and do we reuse those workers for index vacuuming. These > tests were probably discussed earlier, but it would be better to > either add a summary of the required information to understand the > results or at least a link to a previous email that has such details. The testing configurations are: max_wal_size = 50GB shared_buffers = 25GB max_parallel_maintenance_workers = 10 max_parallel_workers = 20 max_worker_processes = 30 The test scripts are: ($m and $p are a fraction and a parallel degree, respectively) create unlogged table test_vacuum (a bigint) with (autovacuum_enabled=off); insert into test_vacuum select i from generate_series(1,200000000) s(i); create index idx_0 on test_vacuum (a); create index idx_1 on test_vacuum (a); create index idx_2 on test_vacuum (a); create index idx_3 on test_vacuum (a); create index idx_4 on test_vacuum (a); delete from test_vacuum where mod(a, $m) = 0; vacuum (verbose, parallel $p) test_vacuum; -- measured the execution time > > Please > > note that the test results shows that the phase III also got sped up > > but this is because in parallel vacuum we use more ring buffers than > > the single process vacuum. So we need to compare the only phase I time > > in terms of the benefit of the parallelism. > > > > Does phase 3 also use parallelism? If so, can we try to divide the > ring buffers among workers or at least try vacuum with an increased > number of ring buffers. This would be good to do for both the phases, > if they both use parallelism. No, only phase 1 was parallelized in this test. In parallel vacuum, since it uses (ring_buffer_size * parallel_degree) memory, more pages are loaded during phase 1, increasing cache hits during phase 3. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the updated version patches. I've started trying to review this and realized that, while I'm familiar with heap vacuuming code, I'm not familiar enough with the vacuumparallel.c machinery to be of help without much additional study. As such, I have mainly focused on reading the comments in your code. I think your comment in vacuumlazy.c describing the design could use more detail and a bit of massaging. For example, I don't know what you mean when you say: * We could require different number of parallel vacuum workers for each phase * for various factors such as table size and number of indexes. Does that refer to something you did implement or you are saying we could do that in the future? - Melanie
On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > ring buffers among workers or at least try vacuum with an increased > > number of ring buffers. This would be good to do for both the phases, > > if they both use parallelism. > > No, only phase 1 was parallelized in this test. In parallel vacuum, > since it uses (ring_buffer_size * parallel_degree) memory, more pages > are loaded during phase 1, increasing cache hits during phase 3. > Shouldn't we ideally try with a vacuum without parallelism with ring_buffer_size * parallel_degree to make the comparison better? Also, what could be the reason for the variation in data of phase-I? Do you restart the system after each run to ensure there is nothing in the memory? If not, then shouldn't we try at least a few runs by restarting the system before each run to ensure there is nothing leftover in memory? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > However, in the heap vacuum phase, the leader process needed > > > to process all blocks, resulting in soft page faults while creating > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > already created PTEs during the heap scan, thus preventing these > > > faults from occurring during the heap vacuum phase. > > > > > > > This part is again not clear to me because I am assuming all the data > > exists in shared buffers before the vacuum, so why the page faults > > will occur in the first place. > > IIUC PTEs are process-local data. So even if physical pages are loaded > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > minor page faults)[1] can occur if these pages are not yet mapped in > its page table. > Okay, I got your point. BTW, I noticed that even for the case where all the data is in shared_buffers, the performance improvement for workers greater than two does decrease marginally. Am I reading the data correctly? If so, what is the theory, and do we have recommendations for a parallel degree? -- With Regards, Amit Kapila.
On Mon, Mar 10, 2025 at 5:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sat, Mar 8, 2025 at 1:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the updated version patches. > > I've started trying to review this and realized that, while I'm > familiar with heap vacuuming code, I'm not familiar enough with the > vacuumparallel.c machinery to be of help without much additional > study. As such, I have mainly focused on reading the comments in your > code. Thank you for looking at the patch. > > I think your comment in vacuumlazy.c describing the design could use > more detail and a bit of massaging. > > For example, I don't know what you mean when you say: > > * We could require different number of parallel vacuum workers for each phase > * for various factors such as table size and number of indexes. > > Does that refer to something you did implement or you are saying we > could do that in the future? It referred to the parallel heap vacuum implementation that I wrote. Since the parallel degrees for parallel heap scan and parallel index vacuuming are chosen separately based on different factors, we launch a different number of workers for each phase and they exit at the end of each phase. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > to process all blocks, resulting in soft page faults while creating > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > already created PTEs during the heap scan, thus preventing these > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > exists in shared buffers before the vacuum, so why the page faults > > > will occur in the first place. > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > minor page faults)[1] can occur if these pages are not yet mapped in > > its page table. > > > > Okay, I got your point. BTW, I noticed that even for the case where > all the data is in shared_buffers, the performance improvement for > workers greater than two does decrease marginally. Am I reading the > data correctly? If so, what is the theory, and do we have > recommendations for a parallel degree? Is the decrease you're referring to in the total vacuum execution time? When it comes to the execution time of phase 1, it seems we have good scalability. For example, with 2 workers (i.e., 3 processes working in total, including the leader) it got about a 3x speedup, and with 4 workers about a 5x speedup. Regarding the other phases, phase 3 got slower probably because of the PTE behavior described above, but I haven't investigated why phase 2 also got slightly slower with more than 2 workers. In the current patch, the parallel degree for phase 1 is chosen based on the table size, which is almost the same as the calculation of the degree for a parallel seq scan. But thinking further, we might want to account for the number of all-visible pages and all-frozen pages here so that we can avoid launching many workers for mostly-frozen big tables. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
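As a concrete illustration of the last point (accounting for all-visible and all-frozen pages when choosing the degree for phase 1), a heuristic along these lines could size the worker count on the pages the vacuum will actually scan. This is a hypothetical helper, not code from the patch: the one-worker-per-1GB threshold is arbitrary, and only visibilitymap_count(), RelationGetNumberOfBlocks(), and Min() are existing APIs.

/* Hypothetical sketch: base the phase-1 parallel degree on scannable pages */
static int
parallel_heap_scan_workers_hint(Relation rel, bool aggressive, int max_workers)
{
    BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);
    BlockNumber all_visible;
    BlockNumber all_frozen;
    BlockNumber scannable;
    int         nworkers;

    visibilitymap_count(rel, &all_visible, &all_frozen);

    /* a non-aggressive vacuum skips all-visible pages, an aggressive one only all-frozen */
    scannable = rel_pages - Min(rel_pages, aggressive ? all_frozen : all_visible);

    /* roughly one worker per 1GB of pages to scan (8kB blocks); arbitrary */
    nworkers = scannable / (BlockNumber) (128 * 1024);

    return Min(nworkers, max_workers);
}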
On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sun, Mar 9, 2025 at 11:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Does phase 3 also use parallelism? If so, can we try to divide the > > > ring buffers among workers or at least try vacuum with an increased > > > number of ring buffers. This would be good to do for both the phases, > > > if they both use parallelism. > > > > No, only phase 1 was parallelized in this test. In parallel vacuum, > > since it uses (ring_buffer_size * parallel_degree) memory, more pages > > are loaded during phase 1, increasing cache hits during phase 3. > > > > Shouldn't we ideally try with a vacuum without parallelism with > ring_buffer_size * parallel_degree to make the comparison better? Right. I'll share the benchmark test results with such a configuration. > Also, what could be the reason for the variation in data of phase-I? > Do you restart the system after each run to ensure there is nothing in > the memory? If not, then shouldn't we try at least a few runs by > restarting the system before each run to ensure there is nothing > leftover in memory? I dropped all page caches by executing 'echo 3 > /proc/sys/vm/drop_caches' before each run, and these results are the median of 3 runs. I'll investigate it further. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Some thoughts/questions on the idea: I notice that we are always considering block-level parallelism for heaps and object-level parallelism for indexes. I'm wondering, when multiple tables are being vacuumed together—either because the user has provided a list of tables or has specified a partitioned table with multiple children—does it still make sense to default to block-level parallelism? Or could we consider table-level parallelism in such cases? For example, if there are 4 tables and 6 workers, with 2 tables being small and the other 2 being large, perhaps we could allocate 4 workers to vacuum all 4 tables in parallel. For the larger tables, we could apply block-level parallelism, using more workers for internal parallelism. On the other hand, if all tables are small, we could just apply table-level parallelism without needing block-level parallelism at all. This approach could offer more flexibility, couldn't it? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Some more specific comments on the patch set: -- v11-0001 1. This introduces functions like parallel_vacuum_estimate(), parallel_vacuum_initialize(), etc., but these functions haven't been assigned to the respective members in TableAMRoutine. As shown in the hunk below, only parallel_vacuum_compute_workers is initialized, while parallel_vacuum_estimate and the others are not. These are assigned in later patches, which makes the patch division appear a bit unclean. +++ b/src/backend/access/heap/heapam_handler.c @@ -2688,7 +2688,9 @@ static const TableAmRoutine heapam_methods = { .scan_bitmap_next_block = heapam_scan_bitmap_next_block, .scan_bitmap_next_tuple = heapam_scan_bitmap_next_tuple, .scan_sample_next_block = heapam_scan_sample_next_block, - .scan_sample_next_tuple = heapam_scan_sample_next_tuple + .scan_sample_next_tuple = heapam_scan_sample_next_tuple, + + .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers, }; -- v11-0002 2. In the commit message, you mentioned a function named "parallel_vacuum_remove_dead_items_begin()", but no such function is introduced in the patch; I think you meant 'parallel_vacuum_collect_dead_items_begin()'? 3. I would suggest a rewrite for this: /launched for the table vacuuming and index processing./launched for the table pruning phase and index vacuuming. 4. These comments talk about DSA and DSM, but we had better clarify what we are using DSM and DSA for, or give a reference to the location where that is explained. + * In a parallel vacuum, we perform table scan or both index bulk deletion and + * index cleanup or all of them with parallel worker processes. Different + * numbers of workers are launched for the table vacuuming and index processing. + * ParallelVacuumState contains shared information as well as the memory space + * for storing dead items allocated in the DSA area. + * + * When initializing parallel table vacuum scan, we invoke table AM routines for + * estimating DSM sizes and initializing DSM memory. Parallel table vacuum + * workers invoke the table AM routine for vacuuming the table. 5. Is there any particular reason why marking the dead TID as reusable is not done in parallel? Is it because parallelizing that phase wouldn't make sense due to the work involved, or is there another reason? +/* The kind of parallel vacuum phases */ +typedef enum +{ + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkPhase; 6. In the hunk below, the sentence "nrequested_workers is >= 0 number and the requested parallel degree." doesn't make sense to me; can you rephrase it? + * nrequested_workers is >= 0 number and the requested parallel degree. 0 + * means that the parallel degrees for table and indexes vacuum are decided + * differently. See the comments of parallel_vacuum_compute_workers() for + * details. That's what I got so far; I will continue reviewing 0002 and the remaining patches from the thread. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
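For point 1, once the later patches in the series are applied, the heapam_methods initializer presumably ends up with the whole callback family assigned together, roughly like this (the names beyond heap_parallel_vacuum_compute_workers are inferred from the callback names and may not match the patches exactly):

    .parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
    .parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
    .parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
    .parallel_vacuum_initialize_worker = heap_parallel_vacuum_initialize_worker,
    .parallel_vacuum_collect_dead_items = heap_parallel_vacuum_collect_dead_items,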
On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > to process all blocks, resulting in soft page faults while creating > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > already created PTEs during the heap scan, thus preventing these > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > exists in shared buffers before the vacuum, so why the page faults > > > > will occur in the first place. > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > its page table. > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > all the data is in shared_buffers, the performance improvement for > > workers greater than two does decrease marginally. Am I reading the > > data correctly? If so, what is the theory, and do we have > > recommendations for a parallel degree? > > The decrease you referred to is that the total vacuum execution time? > Right. > When it comes to the execution time of phase 1, it seems we have good > scalability. For example, with 2 workers (i.e.3 workers working > including the leader in total) it got about 3x speed up, and with 4 > workers it got about 5x speed up. Regarding other phases, the phase 3 > got slower probably because of PTEs stuff, but I don't investigate why > the phase 2 also slightly got slower with more than 2 workers. > Could it be possible that now phase-2 needs to access the shared area for TIDs, and some locking/unlocking causes such slowdown? -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Some thoughts/questions on the idea > > I notice that we are always considering block-level parallelism for > heaps and object-level parallelism for indexes. I'm wondering, when > multiple tables are being vacuumed together—either because the user > has provided a list of tables or has specified a partitioned table > with multiple children—does it still make sense to default to > block-level parallelism? Or could we consider table-level parallelism > in such cases? For example, if there are 4 tables and 6 workers, with > 2 tables being small and the other 2 being large, perhaps we could > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > tables, we could apply block-level parallelism, using more workers for > internal parallelism. On the other hand, if all tables are small, we > could just apply table-level parallelism without needing block-level > parallelism at all. This approach could offer more flexibility, isn't > it? > I have not thought from this angle, but it seems we can build this even on top of block-level vacuum for large tables. -- With Regards, Amit Kapila.
On Wed, Mar 12, 2025 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 11:24 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Mar 12, 2025 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Mar 11, 2025 at 5:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Some thoughts/questions on the idea > > > > I notice that we are always considering block-level parallelism for > > heaps and object-level parallelism for indexes. I'm wondering, when > > multiple tables are being vacuumed together—either because the user > > has provided a list of tables or has specified a partitioned table > > with multiple children—does it still make sense to default to > > block-level parallelism? Or could we consider table-level parallelism > > in such cases? For example, if there are 4 tables and 6 workers, with > > 2 tables being small and the other 2 being large, perhaps we could > > allocate 4 workers to vacuum all 4 tables in parallel. For the larger > > tables, we could apply block-level parallelism, using more workers for > > internal parallelism. On the other hand, if all tables are small, we > > could just apply table-level parallelism without needing block-level > > parallelism at all. This approach could offer more flexibility, isn't > > it? > > > > I have not thought from this angle, but it seems we can build this > even on top of block-level vacuum for large tables. Yes, that can be built on top of block-level vacuum. In that case, we can utilize the workers more efficiently, depending on how many workers we have and how many tables need to be vacuumed. And yes, that could also be discussed separately. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 12, 2025 at 3:05 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 12, 2025 at 3:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 11, 2025 at 6:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Mar 10, 2025 at 11:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Sun, Mar 9, 2025 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > However, in the heap vacuum phase, the leader process needed > > > > > > to process all blocks, resulting in soft page faults while creating > > > > > > Page Table Entries (PTEs). Without the patch, the backend process had > > > > > > already created PTEs during the heap scan, thus preventing these > > > > > > faults from occurring during the heap vacuum phase. > > > > > > > > > > > > > > > > This part is again not clear to me because I am assuming all the data > > > > > exists in shared buffers before the vacuum, so why the page faults > > > > > will occur in the first place. > > > > > > > > IIUC PTEs are process-local data. So even if physical pages are loaded > > > > to PostgreSQL's shared buffer (and paga caches), soft page faults (or > > > > minor page faults)[1] can occur if these pages are not yet mapped in > > > > its page table. > > > > > > > > > > Okay, I got your point. BTW, I noticed that even for the case where > > > all the data is in shared_buffers, the performance improvement for > > > workers greater than two does decrease marginally. Am I reading the > > > data correctly? If so, what is the theory, and do we have > > > recommendations for a parallel degree? > > > > The decrease you referred to is that the total vacuum execution time? > > > > Right. > > > When it comes to the execution time of phase 1, it seems we have good > > scalability. For example, with 2 workers (i.e.3 workers working > > including the leader in total) it got about 3x speed up, and with 4 > > workers it got about 5x speed up. Regarding other phases, the phase 3 > > got slower probably because of PTEs stuff, but I don't investigate why > > the phase 2 also slightly got slower with more than 2 workers. > > > > Could it be possible that now phase-2 needs to access the shared area > for TIDs, and some locking/unlocking causes such slowdown? No, TidStore is shared in this case but we don't take a lock on it during phase 2. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
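For reference, the locking asymmetry being described is roughly the following, in terms of the existing TidStore API (a sketch of how the thread describes the patch; variable names are illustrative, and whether the lock calls appear explicitly like this depends on the patch):

    /* Phase 1: a worker holds the exclusive lock only while adding dead TIDs. */
    TidStoreLockExclusive(vacrel->dead_items);
    TidStoreSetBlockOffsets(vacrel->dead_items, blkno, deadoffsets, ndeadoffsets);
    TidStoreUnlock(vacrel->dead_items);

    /* Phase 2: the shared store is only probed, so no lock is taken here. */
    if (TidStoreIsMember(vacrel->dead_items, &tid))
    {
        /* mark the item pointer LP_UNUSED, as lazy_vacuum_heap_page() does */
    }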
On Fri, Mar 7, 2025 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the updated version patches. When testing multiple passes of table vacuuming, I found an issue. With the current patch, both the leader and parallel worker processes stop phase 1 as soon as the shared TidStore size reaches the limit, and then the leader resumes the parallel heap scan after heap vacuuming and index vacuuming. Therefore, as I described in the patch, one tricky part of this patch is that it's possible that we launch fewer workers than the previous time when resuming phase 1 after phases 2 and 3. In this case, since the previous parallel workers might have already allocated some blocks in their chunk, newly launched workers need to take over their parallel scan state. That's why in the patch we store workers' ParallelBlockTableScanWorkerData in shared memory. However, I found my assumption is wrong; in order to take over the previous scan state properly we need to take over not only ParallelBlockTableScanWorkerData but also ReadStream data, as parallel workers might have already queued some blocks for look-ahead in their ReadStream. Looking at the ReadStream code, I find that it's not realistic to store it in shared memory. One plausible solution would be to not use ReadStream in the parallel heap vacuum case but directly use table_block_parallelscan_xxx() instead. It works, but we end up having two different scan methods for the parallel and non-parallel lazy heap scan. I've implemented this idea in the attached v12 patches. Other than the above change, I've fixed some bugs and addressed the review comments I got so far. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v12-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- v12-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v12-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v12-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v12-0005-Support-parallelism-for-collecting-dead-items-du.patch
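The direct table_block_parallelscan_xxx() flow described above amounts to a per-worker loop roughly like the following (a sketch only; the memory-limit check stands in for the patch's real dead-items accounting, and pbscan/pbscanwork are the scan descriptors kept in DSM):

    BlockNumber blkno;

    table_block_parallelscan_startblock_init(rel, pbscanwork, pbscan);

    while ((blkno = table_block_parallelscan_nextpage(rel, pbscanwork, pbscan)) !=
           InvalidBlockNumber)
    {
        Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                             RBM_NORMAL, vacrel->bstrategy);

        /* HOT-prune the page and record its dead item pointers ... */
        ReleaseBuffer(buf);

        /*
         * Stop once the shared TidStore is full; because pbscanwork lives in
         * DSM, whoever resumes phase 1 later can pick this chunk back up.
         */
        if (dead_items_check_memory_limit(vacrel))  /* the patch's limit check */
            break;
    }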
On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > When testing the multi passes of table vacuuming, I found an issue. > With the current patch, both leader and parallel workers process stop > the phase 1 as soon as the shared TidStore size reaches to the limit, > and then the leader resumes the parallel heap scan after heap > vacuuming and index vacuuming. Therefore, as I described in the patch, > one tricky part of this patch is that it's possible that we launch > fewer workers than the previous time when resuming phase 1 after phase > 2 and 3. In this case, since the previous parallel workers might have > already allocated some blocks in their chunk, newly launched workers > need to take over their parallel scan state. That's why in the patch > we store workers' ParallelBlockTableScanWorkerData in shared memory. > However, I found my assumption is wrong; in order to take over the > previous scan state properly we need to take over not only > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > workers might have already queued some blocks for look-ahead in their > ReadStream. Looking at ReadStream codes, I find that it's not > realistic to store it into the shared memory. It seems like one way to solve this would be to add functionality to the read stream to unpin the buffers it has in the buffers queue without trying to continue calling the callback until the stream is exhausted. We have read_stream_reset(), but that is to restart streams that have already been exhausted. Exhausted streams are where the callback has returned InvalidBlockNumber. In the read_stream_reset() cases, the read stream user knows there are more blocks it would like to scan or that it would like to restart the scan from the beginning. Your case is you want to stop trying to exhaust the read stream and just unpin the remaining buffers. As long as the worker which paused phase I knows exactly the last block it processed and can communicate this to whatever worker resumes phase I later, it can initialize vacrel->current_block to the last block processed. > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches. One question is which scenarios will parallel vacuum phase I without AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, I suppose it would be trivial for phase I parallel vacuum to be faster without using read AIO. But it's worth thinking about the tradeoff. - Melanie
Hi, On 2025-03-20 01:35:42 -0700, Masahiko Sawada wrote: > One plausible solution would be that we don't use ReadStream in > parallel heap vacuum cases but directly use > table_block_parallelscan_xxx() instead. It works but we end up having > two different scan methods for parallel and non-parallel lazy heap > scan. I've implemented this idea in the attached v12 patches. I think that's a bad idea - this means we'll never be able to use direct IO for parallel VACUUMs, despite a) The CPU overhead of buffered reads being a problem for VACUUM b) Data ending up in the kernel page cache is rather wasteful for VACUUM, as that's often data that won't otherwise be used again soon. I.e. these reads would particularly benefit from using direct IO. c) Even disregarding DIO, losing the ability to do larger reads, as provided by read streams, loses a fair bit of efficiency (just try doing a pg_prewarm of a large relation with io_combine_limit=1 vs a larger io_combine_limit). Greetings, Andres Freund
On Sat, Mar 22, 2025 at 7:16 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > When testing the multi passes of table vacuuming, I found an issue. > > With the current patch, both leader and parallel workers process stop > > the phase 1 as soon as the shared TidStore size reaches to the limit, > > and then the leader resumes the parallel heap scan after heap > > vacuuming and index vacuuming. Therefore, as I described in the patch, > > one tricky part of this patch is that it's possible that we launch > > fewer workers than the previous time when resuming phase 1 after phase > > 2 and 3. In this case, since the previous parallel workers might have > > already allocated some blocks in their chunk, newly launched workers > > need to take over their parallel scan state. That's why in the patch > > we store workers' ParallelBlockTableScanWorkerData in shared memory. > > However, I found my assumption is wrong; in order to take over the > > previous scan state properly we need to take over not only > > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > > workers might have already queued some blocks for look-ahead in their > > ReadStream. Looking at ReadStream codes, I find that it's not > > realistic to store it into the shared memory. > > It seems like one way to solve this would be to add functionality to > the read stream to unpin the buffers it has in the buffers queue > without trying to continue calling the callback until the stream is > exhausted. > > We have read_stream_reset(), but that is to restart streams that have > already been exhausted. Exhausted streams are where the callback has > returned InvalidBlockNumber. In the read_stream_reset() cases, the > read stream user knows there are more blocks it would like to scan or > that it would like to restart the scan from the beginning. > > Your case is you want to stop trying to exhaust the read stream and > just unpin the remaining buffers. As long as the worker which paused > phase I knows exactly the last block it processed and can communicate > this to whatever worker resumes phase I later, it can initialize > vacrel->current_block to the last block processed. If we use ParallelBlockTableScanDesc with streaming read like the patch did, we would also need to somehow rewind the number of blocks allocated to workers. The problem I had with such usage was that a parallel vacuum worker allocated a new chunk of blocks when doing look-ahead reading and therefore advanced ParallelBlockTableScanDescData.phs_nallocated. In this case, even if we unpin the remaining buffers in the queue by a new functionality and a parallel worker resumes the phase 1 from the last processed block, we would lose some blocks in already allocated chunks unless we rewind ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData data. However, since a worker might have already allocated multiple chunks it would not be easy to rewind these scan state data. Another idea is that parallel workers don't exit phase 1 until it consumes all pinned buffers in the queue, even if the memory usage of TidStore exceeds the limit. It would need to add new functionality to the read stream to disable the look-ahead reading. Since we could use much memory while processing these buffers, exceeding the memory limit, we can trigger this mode when the memory usage of TidStore reaches 70% of the limit or so. 
On the other hand, it means that we would not use the streaming read for the blocks in this mode, which is not efficient. > > > One plausible solution would be that we don't use ReadStream in > > parallel heap vacuum cases but directly use > > table_block_parallelscan_xxx() instead. It works but we end up having > > two different scan methods for parallel and non-parallel lazy heap > > scan. I've implemented this idea in the attached v12 patches. > > One question is which scenarios will parallel vacuum phase I without > AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, > I suppose it would be trivial for phase I parallel vacuum to be faster > without using read AIO. But it's worth thinking about the tradeoff. As Andres pointed out, there are major downsides. So we would need to invent a way to stop and resume the read stream in the middle during parallel scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
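A sketch of the kind of trigger being floated here, in terms of the existing TidStore API (the helper name and the 0.7 factor are placeholders, not code from the patch):

    /* Hypothetical helper: decide when to stop queueing further look-ahead reads. */
    static bool
    dead_items_nearly_full(LVRelState *vacrel)
    {
        return TidStoreMemoryUsage(vacrel->dead_items) >
            vacrel->dead_items_info->max_bytes * 0.7;
    }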
On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > If we use ParallelBlockTableScanDesc with streaming read like the > patch did, we would also need to somehow rewind the number of blocks > allocated to workers. The problem I had with such usage was that a > parallel vacuum worker allocated a new chunk of blocks when doing > look-ahead reading and therefore advanced > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > we unpin the remaining buffers in the queue by a new functionality and > a parallel worker resumes the phase 1 from the last processed block, > we would lose some blocks in already allocated chunks unless we rewind > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > data. However, since a worker might have already allocated multiple > chunks it would not be easy to rewind these scan state data. Ah I didn't realize rewinding the state would be difficult. It seems like the easiest way to make sure those blocks are done is to add them back to the counter somehow. And I don't suppose there is some way to save these not yet done block assignments somewhere and give them to the workers who restart phase I to process on the second pass? > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. It would need to add new functionality to > the read stream to disable the look-ahead reading. Since we could use > much memory while processing these buffers, exceeding the memory > limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. On the other hand, it means that we > would not use the streaming read for the blocks in this mode, which is > not efficient. That might work. And/or maybe you could start decreasing the size of block assignment chunks when the memory usage of TidStore reaches a certain level. I don't know how much that would help or how fiddly it would be. > So we would need to > invent a way to stop and resume the read stream in the middle during > parallel scan. As for needing to add new read stream functionality, we actually probably don't have to. If you use read_stream_end() -> read_stream_reset(), it resets the distance to 0, so then read_stream_next_buffer() should just end up unpinning the buffers and freeing the per buffer data. I think the easiest way to implement this is to think about it as ending a read stream and starting a new one next time you start phase I and not as pausing and resuming the read stream. And anyway, maybe it's better not to keep a bunch of pinned buffers and allocated memory hanging around while doing what could be very long index scans. - Melanie
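Concretely, the end-and-restart pattern suggested here could sit around the phase boundary roughly like this (a sketch that mirrors the existing non-parallel call; the exact arguments, in particular the per-buffer data size, would follow whatever the callback records):

    /* Leaving phase 1 early (memory limit reached): drop the stream entirely. */
    read_stream_end(stream);    /* unpins queued buffers, frees per-buffer data */

    /* ... phase 2 (index vacuum) and phase 3 (heap vacuum) run here ... */

    /* Resuming phase 1: build a fresh stream from the saved scan position. */
    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                        vacrel->bstrategy,
                                        vacrel->rel,
                                        MAIN_FORKNUM,
                                        heap_vac_scan_next_block,
                                        vacrel,
                                        0);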
Hi, On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > Another idea is that parallel workers don't exit phase 1 until it > consumes all pinned buffers in the queue, even if the memory usage of > TidStore exceeds the limit. Yes, that seems a quite reasonable approach to me. > It would need to add new functionality to the read stream to disable the > look-ahead reading. Couldn't your next block callback simply return InvalidBlockNumber once close to the memory limit? > Since we could use much memory while processing these buffers, exceeding the > memory limit, we can trigger this mode when the memory usage of TidStore > reaches 70% of the limit or so. It wouldn't be that much memory, would it? A few 10s-100s of buffers don't increase the size of a TidStore that much? Using 10 parallel vacuum with a m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > On the other hand, it means that we would not use the streaming read for the > blocks in this mode, which is not efficient. I don't follow - why wouldn't you be using streaming read? Greetings, Andres Freund
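In other words, the change can stay inside the block-number callback itself; a sketch of the shape (choose_next_block() is a placeholder for the existing skip-page logic):

    static BlockNumber
    heap_vac_scan_next_block(ReadStream *stream, void *callback_private_data,
                             void *per_buffer_data)
    {
        LVRelState *vacrel = callback_private_data;

        /*
         * Stop feeding the stream once dead_items is (nearly) full; buffers
         * already queued are still processed before phase 1 pauses.
         */
        if (dead_items_check_memory_limit(vacrel))  /* the patch's limit check */
            return InvalidBlockNumber;

        return choose_next_block(vacrel, per_buffer_data);  /* placeholder */
    }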
On Sun, Mar 23, 2025 at 10:01 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Sun, Mar 23, 2025 at 4:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > If we use ParallelBlockTableScanDesc with streaming read like the > > patch did, we would also need to somehow rewind the number of blocks > > allocated to workers. The problem I had with such usage was that a > > parallel vacuum worker allocated a new chunk of blocks when doing > > look-ahead reading and therefore advanced > > ParallelBlockTableScanDescData.phs_nallocated. In this case, even if > > we unpin the remaining buffers in the queue by a new functionality and > > a parallel worker resumes the phase 1 from the last processed block, > > we would lose some blocks in already allocated chunks unless we rewind > > ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData > > data. However, since a worker might have already allocated multiple > > chunks it would not be easy to rewind these scan state data. > > Ah I didn't realize rewinding the state would be difficult. It seems > like the easiest way to make sure those blocks are done is to add them > back to the counter somehow. And I don't suppose there is some way to > save these not yet done block assignments somewhere and give them to > the workers who restart phase I to process on the second pass? It might be possible to store the not-yet-done-blocks in DSA to pass them to the next workers. But it would make the codes more complex. > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. It would need to add new functionality to > > the read stream to disable the look-ahead reading. Since we could use > > much memory while processing these buffers, exceeding the memory > > limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. On the other hand, it means that we > > would not use the streaming read for the blocks in this mode, which is > > not efficient. > > That might work. And/or maybe you could start decreasing the size of > block assignment chunks when the memory usage of TidStore reaches a > certain level. I don't know how much that would help or how fiddly it > would be. I've tried this idea in the attached version patch. I've started with a simple approach; once the TidStore reaches the limit, heap_vac_scan_next_block(), a callback for the read stream, begins to return InvalidBlockNumber. We continue phase 1 until the read stream is exhausted. > > > So we would need to > > invent a way to stop and resume the read stream in the middle during > > parallel scan. > > As for needing to add new read stream functionality, we actually > probably don't have to. If you use read_stream_end() -> > read_stream_reset(), it resets the distance to 0, so then > read_stream_next_buffer() should just end up unpinning the buffers and > freeing the per buffer data. I think the easiest way to implement this > is to think about it as ending a read stream and starting a new one > next time you start phase I and not as pausing and resuming the read > stream. And anyway, maybe it's better not to keep a bunch of pinned > buffers and allocated memory hanging around while doing what could be > very long index scans. You're right. I've studied the read stream code and figured out how to use it. In the attached patch, we end the read stream at the end of phase 1 and start a new read stream, as you suggested. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
- v13-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v13-0005-Support-parallelism-for-collecting-dead-items-du.patch
- v13-0004-Move-GlobalVisState-definition-to-snapmgr_intern.patch
- v13-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v13-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
On Sun, Mar 23, 2025 at 10:13 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-03-23 01:45:35 -0700, Masahiko Sawada wrote: > > Another idea is that parallel workers don't exit phase 1 until it > > consumes all pinned buffers in the queue, even if the memory usage of > > TidStore exceeds the limit. > > Yes, that seems a quite reasonable approach to me. > > > > It would need to add new functionality to the read stream to disable the > > look-ahead reading. > > Couldn't your next block callback simply return InvalidBlockNumber once close > to the memory limit? Good idea. > > > > Since we could use much memory while processing these buffers, exceeding the > > memory limit, we can trigger this mode when the memory usage of TidStore > > reaches 70% of the limit or so. > > It wouldn't be that much memory, would it? A few 10s-100s of buffers don't > increase the size of a TidStore that much? Using 10 parallel vacuum with a > m_w_m of 1MB doesn't make sense, we'd constantly start/stop workers. > It ultimately depends on how big the next size class of DSA segment is, but typically yes. We configure the max DSA segment size based on the maintenance_work_mem. > > > On the other hand, it means that we would not use the streaming read for the > > blocks in this mode, which is not efficient. > > I don't follow - why wouldn't you be using streaming read? I was thinking of an idea like stopping the look-ahead reading at some point and processing pages without the read stream until the TidStore reaches the limit. But it seems simpler if we could stop retrieving the next block for the read stream once the TidStore reaches the limit and then continue phase 1 until the read stream is exhausted. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > You're right. I've studied the read stream code and figured out how to > use it. In the attached patch, we end the read stream at the end of > phase 1 and start a new read stream, as you suggested. I've started looking at this patch set some more. In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD codepath and run out of unskippable blocks in the current chunk and then go back to get another chunk (goto retry) but we are near the memory limit so we can't get another block (!dead_items_check_memory_limit()), could we get an infinite loop? Or even incorrectly encroach on another worker's block? Asking that because of this math end_block = next_block + vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 and we are in the goto retry loop, it seems like we could keep incrementing next_block even when we shouldn't be. I just want to make sure that the skip pages optimization works with the parallel block assignment and the low memory read stream wind-down. I also think you do not need to expose table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is only called in heap-specific code and the logic seems very heap-related. If table AMs want something to skip some blocks, they could easily implement it. On another topic, I think it would be good to have a comment above this code in parallel_lazy_scan_gather_scan_results(), stating why we are very sure it is correct. Assert(TransactionIdIsValid(data->NewRelfrozenXid)); Assert(MultiXactIdIsValid(data->NewRelminMxid)); if (TransactionIdPrecedes(data->NewRelfrozenXid, vacrel->scan_data->NewRelfrozenXid)) vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, vacrel->scan_data->NewRelminMxid)) vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) vacrel->scan_data->nonempty_pages = data->nonempty_pages; vacrel->scan_data->skippedallvis |= data->skippedallvis; Parallel relfrozenxid advancement sounds scary, and scary things are best with comments. Even though the way this works is intuitive, I think it is worth pointing out that this part is important to get right so future programmers know how important it is. One thing I was wondering about is if there are any implications of different workers having different values in their GlobalVisState. GlobalVisState can be updated during vacuum, so even if they start out with the same values, that could diverge. It is probably okay since it just controls what tuples are removable. Some workers may remove fewer tuples than they absolutely could, and this is probably okay. And if it is okay for each worker to have different GlobalVisState then maybe you shouldn't have a GlobalVisState in shared memory. If you look at GlobalVisTestFor() it just returns the memory address of that global variable in the backend. So, it seems like it might be better to just let each parallel worker use their own backend local GlobalVisState and not try to put it in shared memory and copy it from one worker to the other workers when initializing them. I'm not sure. At the very least, there should be a comment explaining why you've done it the way you have done it. - Melanie
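Something along these lines above the gather code might do (a suggested comment, not wording from the patch):

    /*
     * Merge each worker's freeze cutoffs into the leader's.  Every worker
     * tracked the oldest XID/MXID that must survive freezing of the blocks it
     * scanned, so taking the minimum across the leader and all workers yields
     * values that are safe to set relfrozenxid/relminmxid to.  Likewise, if
     * any worker skipped an all-visible (not all-frozen) page, the
     * relation-wide skippedallvis must reflect that, since it forbids
     * advancing relfrozenxid later.
     */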
On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > You're right. I've studied the read stream code and figured out how to > > use it. In the attached patch, we end the read stream at the end of > > phase 1 and start a new read stream, as you suggested. > > I've started looking at this patch set some more. Thank you for reviewing the patch! > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > codepath and run out of unskippable blocks in the current chunk and > then go back to get another chunk (goto retry) but we are near the > memory limit so we can't get another block > (!dead_items_check_memory_limit()), could we get an infinite loop? > > Or even incorrectly encroach on another worker's block? Asking that > because of this math > end_block = next_block + > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; You're right. We should make sure that reset next_block is reset to InvalidBlockNumber at the beginning of the retry loop. > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > and we are in the goto retry loop, it seems like we could keep > incrementing next_block even when we shouldn't be. Right. Will fix. > > I just want to make sure that the skip pages optimization works with > the parallel block assignment and the low memory read stream > wind-down. > > I also think you do not need to expose > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > only called in heap-specific code and the logic seems very > heap-related. If table AMs want something to skip some blocks, they > could easily implement it. Agreed. Will remove it. > > On another topic, I think it would be good to have a comment above > this code in parallel_lazy_scan_gather_scan_results(), stating why we > are very sure it is correct. > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > vacrel->scan_data->NewRelfrozenXid)) > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > vacrel->scan_data->NewRelminMxid)) > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > Parallel relfrozenxid advancement sounds scary, and scary things are > best with comments. Even though the way this works is intuitive, I > think it is worth pointing out that this part is important to get > right so future programmers know how important it is. > > One thing I was wondering about is if there are any implications of > different workers having different values in their GlobalVisState. > GlobalVisState can be updated during vacuum, so even if they start out > with the same values, that could diverge. It is probably okay since it > just controls what tuples are removable. Some workers may remove fewer > tuples than they absolutely could, and this is probably okay. > Good point. > And if it is okay for each worker to have different GlobalVisState > then maybe you shouldn't have a GlobalVisState in shared memory. If > you look at GlobalVisTestFor() it just returns the memory address of > that global variable in the backend. 
So, it seems like it might be > better to just let each parallel worker use their own backend local > GlobalVisState and not try to put it in shared memory and copy it from > one worker to the other workers when initializing them. I'm not sure. > At the very least, there should be a comment explaining why you've > done it the way you have done it. Agreed. IIUC it's not a problem even if parallel workers use their own GlobalVisState. I'll make that change and remove the 0004 patch which exposes GlobalVisState. I'll send the updated patch soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
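With that change, each worker simply does what the non-parallel path already does at startup (sketch):

    /*
     * Each parallel vacuum worker computes its own visibility state.
     * GlobalVisTestFor() just hands back a pointer to backend-local state, so
     * there is nothing meaningful to pass through shared memory; workers may
     * see slightly different horizons, which only means some could remove
     * fewer tuples than strictly possible.
     */
    vacrel->vistest = GlobalVisTestFor(rel);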
On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > You're right. I've studied the read stream code and figured out how to > > > use it. In the attached patch, we end the read stream at the end of > > > phase 1 and start a new read stream, as you suggested. > > > > I've started looking at this patch set some more. > > Thank you for reviewing the patch! > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > codepath and run out of unskippable blocks in the current chunk and > > then go back to get another chunk (goto retry) but we are near the > > memory limit so we can't get another block > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > Or even incorrectly encroach on another worker's block? Asking that > > because of this math > > end_block = next_block + > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > You're right. We should make sure that reset next_block is reset to > InvalidBlockNumber at the beginning of the retry loop. > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > and we are in the goto retry loop, it seems like we could keep > > incrementing next_block even when we shouldn't be. > > Right. Will fix. > > > > > I just want to make sure that the skip pages optimization works with > > the parallel block assignment and the low memory read stream > > wind-down. > > > > I also think you do not need to expose > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > only called in heap-specific code and the logic seems very > > heap-related. If table AMs want something to skip some blocks, they > > could easily implement it. > > Agreed. Will remove it. > > > > > On another topic, I think it would be good to have a comment above > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > are very sure it is correct. > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > vacrel->scan_data->NewRelfrozenXid)) > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > vacrel->scan_data->NewRelminMxid)) > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > best with comments. Even though the way this works is intuitive, I > > think it is worth pointing out that this part is important to get > > right so future programmers know how important it is. > > > > One thing I was wondering about is if there are any implications of > > different workers having different values in their GlobalVisState. > > GlobalVisState can be updated during vacuum, so even if they start out > > with the same values, that could diverge. It is probably okay since it > > just controls what tuples are removable. Some workers may remove fewer > > tuples than they absolutely could, and this is probably okay. > > > > Good point. 
> > > And if it is okay for each worker to have different GlobalVisState > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > you look at GlobalVisTestFor() it just returns the memory address of > > that global variable in the backend. So, it seems like it might be > > better to just let each parallel worker use their own backend local > > GlobalVisState and not try to put it in shared memory and copy it from > > one worker to the other workers when initializing them. I'm not sure. > > At the very least, there should be a comment explaining why you've > > done it the way you have done it. > > Agreed. IIUC it's not a problem even if parallel workers use their own > GlobalVisState. I'll make that change and remove the 0004 patch which > exposes GlobalVisState. > > I'll send the updated patch soon. I've attached the updated patches. This version includes the comments from Melanie, some bug fixes, and comment updates. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Thu, Mar 27, 2025 at 11:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > You're right. I've studied the read stream code and figured out how to > > > > use it. In the attached patch, we end the read stream at the end of > > > > phase 1 and start a new read stream, as you suggested. > > > > > > I've started looking at this patch set some more. > > > > Thank you for reviewing the patch! > > > > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > > codepath and run out of unskippable blocks in the current chunk and > > > then go back to get another chunk (goto retry) but we are near the > > > memory limit so we can't get another block > > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > > > Or even incorrectly encroach on another worker's block? Asking that > > > because of this math > > > end_block = next_block + > > > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > > > You're right. We should make sure that reset next_block is reset to > > InvalidBlockNumber at the beginning of the retry loop. > > > > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > > and we are in the goto retry loop, it seems like we could keep > > > incrementing next_block even when we shouldn't be. > > > > Right. Will fix. > > > > > > > > I just want to make sure that the skip pages optimization works with > > > the parallel block assignment and the low memory read stream > > > wind-down. > > > > > > I also think you do not need to expose > > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > > only called in heap-specific code and the logic seems very > > > heap-related. If table AMs want something to skip some blocks, they > > > could easily implement it. > > > > Agreed. Will remove it. > > > > > > > > On another topic, I think it would be good to have a comment above > > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > > are very sure it is correct. > > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > > vacrel->scan_data->NewRelfrozenXid)) > > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > > vacrel->scan_data->NewRelminMxid)) > > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > > best with comments. Even though the way this works is intuitive, I > > > think it is worth pointing out that this part is important to get > > > right so future programmers know how important it is. > > > > > > One thing I was wondering about is if there are any implications of > > > different workers having different values in their GlobalVisState. > > > GlobalVisState can be updated during vacuum, so even if they start out > > > with the same values, that could diverge. 
It is probably okay since it > > > just controls what tuples are removable. Some workers may remove fewer > > > tuples than they absolutely could, and this is probably okay. > > > > > > > Good point. > > > > > And if it is okay for each worker to have different GlobalVisState > > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > > you look at GlobalVisTestFor() it just returns the memory address of > > > that global variable in the backend. So, it seems like it might be > > > better to just let each parallel worker use their own backend local > > > GlobalVisState and not try to put it in shared memory and copy it from > > > one worker to the other workers when initializing them. I'm not sure. > > > At the very least, there should be a comment explaining why you've > > > done it the way you have done it. > > > > Agreed. IIUC it's not a problem even if parallel workers use their own > > GlobalVisState. I'll make that change and remove the 0004 patch which > > exposes GlobalVisState. > > > > I'll send the updated patch soon. > > I've attached the updated patches. This version includes the comments > from Melanie, some bug fixes, and comment updates. Rebased the patches as they conflicted with recent commits. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Mon, Mar 31, 2025 at 2:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 27, 2025 at 11:40 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 27, 2025 at 8:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Mar 26, 2025 at 1:00 PM Melanie Plageman > > > <melanieplageman@gmail.com> wrote: > > > > > > > > On Mon, Mar 24, 2025 at 7:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > You're right. I've studied the read stream code and figured out how to > > > > > use it. In the attached patch, we end the read stream at the end of > > > > > phase 1 and start a new read stream, as you suggested. > > > > > > > > I've started looking at this patch set some more. > > > > > > Thank you for reviewing the patch! > > > > > > > > > > > In heap_vac_scan_next_block() if we are in the SKIP_PAGES_THRESHOLD > > > > codepath and run out of unskippable blocks in the current chunk and > > > > then go back to get another chunk (goto retry) but we are near the > > > > memory limit so we can't get another block > > > > (!dead_items_check_memory_limit()), could we get an infinite loop? > > > > > > > > Or even incorrectly encroach on another worker's block? Asking that > > > > because of this math > > > > end_block = next_block + > > > > > > > > vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining + 1; > > > > > > You're right. We should make sure that reset next_block is reset to > > > InvalidBlockNumber at the beginning of the retry loop. > > > > > > > > > > > if vacrel->plvstate->scanworker->pbscanwork.phsw_chunk_remaining is 0 > > > > and we are in the goto retry loop, it seems like we could keep > > > > incrementing next_block even when we shouldn't be. > > > > > > Right. Will fix. > > > > > > > > > > > I just want to make sure that the skip pages optimization works with > > > > the parallel block assignment and the low memory read stream > > > > wind-down. > > > > > > > > I also think you do not need to expose > > > > table_block_parallelscan_skip_pages_in_chunk() in the table AM. It is > > > > only called in heap-specific code and the logic seems very > > > > heap-related. If table AMs want something to skip some blocks, they > > > > could easily implement it. > > > > > > Agreed. Will remove it. > > > > > > > > > > > On another topic, I think it would be good to have a comment above > > > > this code in parallel_lazy_scan_gather_scan_results(), stating why we > > > > are very sure it is correct. > > > > Assert(TransactionIdIsValid(data->NewRelfrozenXid)); > > > > Assert(MultiXactIdIsValid(data->NewRelminMxid)); > > > > > > > > if (TransactionIdPrecedes(data->NewRelfrozenXid, > > > > vacrel->scan_data->NewRelfrozenXid)) > > > > vacrel->scan_data->NewRelfrozenXid = data->NewRelfrozenXid; > > > > > > > > if (MultiXactIdPrecedesOrEquals(data->NewRelminMxid, > > > > vacrel->scan_data->NewRelminMxid)) > > > > vacrel->scan_data->NewRelminMxid = data->NewRelminMxid; > > > > > > > > if (data->nonempty_pages < vacrel->scan_data->nonempty_pages) > > > > vacrel->scan_data->nonempty_pages = data->nonempty_pages; > > > > > > > > vacrel->scan_data->skippedallvis |= data->skippedallvis; > > > > > > > > Parallel relfrozenxid advancement sounds scary, and scary things are > > > > best with comments. Even though the way this works is intuitive, I > > > > think it is worth pointing out that this part is important to get > > > > right so future programmers know how important it is. 
> > > > > > > > One thing I was wondering about is if there are any implications of > > > > different workers having different values in their GlobalVisState. > > > > GlobalVisState can be updated during vacuum, so even if they start out > > > > with the same values, that could diverge. It is probably okay since it > > > > just controls what tuples are removable. Some workers may remove fewer > > > > tuples than they absolutely could, and this is probably okay. > > > > > > > > > > Good point. > > > > > > > And if it is okay for each worker to have different GlobalVisState > > > > then maybe you shouldn't have a GlobalVisState in shared memory. If > > > > you look at GlobalVisTestFor() it just returns the memory address of > > > > that global variable in the backend. So, it seems like it might be > > > > better to just let each parallel worker use their own backend local > > > > GlobalVisState and not try to put it in shared memory and copy it from > > > > one worker to the other workers when initializing them. I'm not sure. > > > > At the very least, there should be a comment explaining why you've > > > > done it the way you have done it. > > > > > > Agreed. IIUC it's not a problem even if parallel workers use their own > > > GlobalVisState. I'll make that change and remove the 0004 patch which > > > exposes GlobalVisState. > > > > > > I'll send the updated patch soon. > > > > I've attached the updated patches. This version includes the comments > > from Melanie, some bug fixes, and comment updates. > > Rebased the patches as they conflicted with recent commits. > I've attached the new version patch. There are no major changes; I fixed some typos, improved the comment, and removed duplicated codes. Also, I've updated the commit messages. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachments
On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached the new version patch. There are no major changes; I > fixed some typos, improved the comment, and removed duplicated codes. > Also, I've updated the commit messages. I haven't looked closely at this version but I did notice that you do not document that parallel vacuum disables eager scanning. Imagine you are a user who has set the eager freeze related table storage option (vacuum_max_eager_freeze_failure_rate) and you schedule a regular parallel vacuum. Now that table storage option does nothing. - Melanie
On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've attached the new version patch. There are no major changes; I > > fixed some typos, improved the comment, and removed duplicated codes. > > Also, I've updated the commit messages. > > I haven't looked closely at this version but I did notice that you do > not document that parallel vacuum disables eager scanning. Imagine you > are a user who has set the eager freeze related table storage option > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > parallel vacuum. Now that table storage option does nothing. Good point. That restriction should be mentioned in the documentation. I'll update the patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 4, 2025 at 5:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I haven't looked closely at this version but I did notice that you do > > not document that parallel vacuum disables eager scanning. Imagine you > > are a user who has set the eager freeze related table storage option > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > parallel vacuum. Now that table storage option does nothing. > > Good point. That restriction should be mentioned in the documentation. > I'll update the patch. Yea, I mean, to be honest, when I initially replied to the thread saying I thought temporarily disabling eager scanning for parallel heap vacuuming was viable, I hadn't looked at the patch yet and thought that there was a separate way to enable the new parallel heap vacuum (separate from the parallel option for the existing parallel index vacuuming). I don't like that this disables functionality that worked when I pushed the eager scanning feature. - Melanie
Hi, On 2025-04-04 14:34:53 -0700, Masahiko Sawada wrote: > On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > I've attached the new version patch. There are no major changes; I > > > fixed some typos, improved the comment, and removed duplicated codes. > > > Also, I've updated the commit messages. > > > > I haven't looked closely at this version but I did notice that you do > > not document that parallel vacuum disables eager scanning. Imagine you > > are a user who has set the eager freeze related table storage option > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > parallel vacuum. Now that table storage option does nothing. > > Good point. That restriction should be mentioned in the documentation. > I'll update the patch. I don't think we commonly accept that a new feature B regresses a pre-existing feature A, particularly not if feature B is enabled by default. Why would that be OK here? The justification in the code: + * One might think that it would make sense to use the eager scanning even + * during parallel lazy vacuum, but parallel vacuum is available only in + * VACUUM command and would not be something that happens frequently, + * which seems not fit to the purpose of the eager scanning. Also, it + * would require making the code complex. So it would make sense to + * disable it for now. feels not at all convincing to me. There e.g. are lots of places that run nightly vacuums. I don't think it's ok to just disable eager scanning in such a case, as it would mean that the "freeze cliff" would end up being *higher* because of the nightly vacuums than if just plain autovacuum would have been used. I think it was already a mistake to allow the existing vacuum parallelism to be introduced without integrating it with autovacuum. I don't think we should go further down that road. Greetings, Andres Freund
On Sat, Apr 5, 2025 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-04-04 14:34:53 -0700, Masahiko Sawada wrote: > > On Fri, Apr 4, 2025 at 11:05 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > > On Tue, Apr 1, 2025 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > I've attached the new version patch. There are no major changes; I > > > > fixed some typos, improved the comment, and removed duplicated codes. > > > > Also, I've updated the commit messages. > > > > > > I haven't looked closely at this version but I did notice that you do > > > not document that parallel vacuum disables eager scanning. Imagine you > > > are a user who has set the eager freeze related table storage option > > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > > parallel vacuum. Now that table storage option does nothing. > > > > Good point. That restriction should be mentioned in the documentation. > > I'll update the patch. > > I don't think we commonly accept that a new feature B regresses a pre-existing > feature A, particularly not if feature B is enabled by default. Why would that > be OK here? The eager freeze scan is the pre-existing feature but it's pretty new code that was pushed just a couple months ago. I didn't want to make the newly introduced code complex further in one major release especially if it's in a vacuum area. While I agree that disabling eager freeze scans during parallel heap vacuum is not very user-friendly, there are still many cases where parallel heap vacuum helps even without the eager freeze scan. FYI the parallel heap scan can be disabled by setting min_parallel_table_scan_size. So I thought we can incrementally improve this part. > > > The justification in the code: > + * One might think that it would make sense to use the eager scanning even > + * during parallel lazy vacuum, but parallel vacuum is available only in > + * VACUUM command and would not be something that happens frequently, > + * which seems not fit to the purpose of the eager scanning. Also, it > + * would require making the code complex. So it would make sense to > + * disable it for now. > > feels not at all convincing to me. There e.g. are lots of places that run > nightly vacuums. I don't think it's ok to just disable eager scanning in such > a case, as it would mean that the "freeze cliff" would end up being *higher* > because of the nightly vacuums than if just plain autovacuum would have been > used. That's a fair argument. > I think it was already a mistake to allow the existing vacuum parallelism to > be introduced without integrating it with autovacuum. I don't think we should > go further down that road. Okay, I think we can consider how to proceed with this patch including the above point in the v19 development. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sun, Apr 6, 2025 at 1:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The eager freeze scan is the pre-existing feature but it's pretty new > code that was pushed just a couple months ago. I didn't want to make > the newly introduced code complex further in one major release > especially if it's in a vacuum area. While I agree that disabling > eager freeze scans during parallel heap vacuum is not very > user-friendly, there are still many cases where parallel heap vacuum > helps even without the eager freeze scan. FYI the parallel heap scan > can be disabled by setting min_parallel_table_scan_size. So I thought > we can incrementally improve this part. min_parallel_table_scan_size also affects parallel sequential scans, though. And AFAIK, that can only be configured globally (i.e. not for a specific table). I wonder if there is a clear way to make the use of parallelism configurable for specific phases of vacuum. - Melanie
Hi Sawada-san. I was revisiting this thread after a long time. I found most of my previous review comments from v11-0001 were not yet addressed. I can't tell if they are deliberately left out, or if they are accidentally overlooked. Please see the details below. On Mon, Mar 10, 2025 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote: > > ====== > src/backend/access/table/tableamapi.c > > 2. > Assert(routine->relation_vacuum != NULL); > + Assert(routine->parallel_vacuum_compute_workers != NULL); > Assert(routine->scan_analyze_next_block != NULL); > > Is it better to keep these Asserts in the same order that the > TableAmRoutine fields are assigned (in heapam_handler.c)? > Not yet addressed? > ~~~ > > 3. > + /* > + * Callbacks for parallel vacuum are also optional (except for > + * parallel_vacuum_compute_workers). But one callback implies presence of > + * the others. > + */ > + Assert(((((routine->parallel_vacuum_estimate == NULL) == > + (routine->parallel_vacuum_initialize == NULL)) == > + (routine->parallel_vacuum_initialize_worker == NULL)) == > + (routine->parallel_vacuum_collect_dead_items == NULL))); > > /also optional/optional/ > Not yet addressed? > ====== > src/include/access/heapam.h > > +extern int heap_parallel_vacuum_compute_workers(Relation rel, int > nworkers_requested); > > 4. > wrong tab/space after 'int'. Not yet addressed? > > ====== > src/include/access/tableam.h > > 5. > + /* > + * Compute the number of parallel workers for parallel table vacuum. The > + * parallel degree for parallel vacuum is further limited by > + * max_parallel_maintenance_workers. The function must return 0 to disable > + * parallel table vacuum. > + * > + * 'nworkers_requested' is a >=0 number and the requested number of > + * workers. This comes from the PARALLEL option. 0 means to choose the > + * parallel degree based on the table AM specific factors such as table > + * size. > + */ > + int (*parallel_vacuum_compute_workers) (Relation rel, > + int nworkers_requested); > > The comment here says "This comes from the PARALLEL option." and "0 > means to choose the parallel degree...". But, the PG docs [1] says "To > disable this feature, one can use PARALLEL option and specify parallel > workers as zero.". > > These two descriptions "disable this feature" (PG docs) and letting > the system "choose the parallel degree" (code comment) don't sound the > same. Should this 0001 patch update the PG documentation for the > effect of setting PARALLEL value zero? Not yet addressed? > > ~~~ > > 6. > + /* > + * Initialize DSM space for parallel table vacuum. > + * > + * Not called if parallel table vacuum is disabled. > + * > + * Optional callback, but either all other parallel vacuum callbacks need > + * to exist, or neither. > + */ > > "or neither"? > > Also, saying "all other" seems incorrect because > parallel_vacuum_compute_workers callback must exist event if > parallel_vacuum_initialize does not exist. > > IMO you meant to say "all optional", and "or none". > > SUGGESTION: > Optional callback. Either all optional parallel vacuum callbacks need > to exist, or none. > > (this same issue is repeated in multiple places). Not yet addressed? ====== > [1] https://www.postgresql.org/docs/devel/sql-vacuum.html Kind Regards, Peter Smith. Fujitsu Australia
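For reference, the "one callback implies presence of the others" condition that the Assert under comment #3 documents can also be spelled as pairwise checks, which state the "all or none" rule directly. This is only an illustrative sketch; it assumes the parallel_vacuum_* fields added by the patch under review, which do not exist in a stock TableAmRoutine:

```
#include "postgres.h"

#include "access/tableam.h"

/*
 * Sketch only: either all of the optional parallel vacuum callbacks are
 * provided, or none of them are.  The parallel_vacuum_* fields come from
 * the patch, not from the released table AM API.
 */
static inline void
check_parallel_vacuum_callbacks(const TableAmRoutine *routine)
{
	bool		has_estimate = (routine->parallel_vacuum_estimate != NULL);

	Assert(has_estimate == (routine->parallel_vacuum_initialize != NULL));
	Assert(has_estimate == (routine->parallel_vacuum_initialize_worker != NULL));
	Assert(has_estimate == (routine->parallel_vacuum_collect_dead_items != NULL));
}
```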
Hi Sawada-san, Here are some review comments for the patch v16-0002 ====== Commit message 1. Missing period. /cleanup, ParallelVacuumState/cleanup. ParallelVacuumState/ ~~~ 2. Typo /paralel/parallel/ ~~~ 3. Heap table AM disables the parallel heap vacuuming for now, but an upcoming patch uses it. IIUC, this "disabling" was implemented in patch 0001, so maybe this comment also belongs in patch 0001. ====== src/backend/commands/vacuumparallel.c 4. * This file contains routines that are intended to support setting up, using, - * and tearing down a ParallelVacuumState. + * and tearing down a ParallelVacuumState. ParallelVacuumState contains shared + * information as well as the memory space for storing dead items allocated in + * the DSA area. We launch * What's the "We launch" incomplete sentence meant to say? ~~~ 5. +typedef enum +{ + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ +} PVWorkPhase; + AFAIK, these different "phases" are the ones already described in the vacuumlazy.c header comment -- there they are named as I, II, III. IMO it might be better to try to keep these phase enums together with the phase descriptions, as well as giving the values all names similar to those existing descriptions. Maybe something like: PHASE_I_VACUUM_RELATION_COLLECT_DEAD PHASE_II_VACUUM_INDEX_COLLECT_DEAD PHASE_III_VACUUM_REAP_DEAD ~~~ 6. -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int nrequested, - bool *will_parallel_vacuum) +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, + int nrequested, int *nworkers_table_p, + bool *idx_will_parallel_vacuum, void *state) I had asked this once before ([1] comment #5) IIUC the returns for this function seem inconsistent. AFAIK, it previously would return the number of workers for parallel index vacuuming. But now (after this patch) the return value is returned Max(nworkers_table, nworkers_index). Meanwhile, the number of workers for parallel table vacuuming is returned as a by-reference parameter 'nworkers_table_p'. In other words, it is returning the number of workers in 2 different ways. It would seem tidier to make this a void function, and introduce another parameter 'nworkers_index_p', similar to 'nworkers_table_p'. The caller can do the parallel_workers = Max(nworkers_table_p, nworkers_index_p); ~~~ parallel_vacuum_process_all_indexes 7. * to finish, or we might get incomplete data.) */ if (nworkers > 0) - { - /* Wait for all vacuum workers to finish */ - WaitForParallelWorkersToFinish(pvs->pcxt); - - for (int i = 0; i < pvs->pcxt->nworkers_launched; i++) - InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]); - } + parallel_vacuum_end_worke_phase(pvs); 7a. The code comment that precedes this was describing code that is now all removed (moved into the parallel_vacuum_end_worke_phase), so probably that comment needs to be changed. ~ 7b. Typo: parallel_vacuum_end_worke_phase (worke?) ~~~ parallel_vacuum_collect_dead_items_end: 8. + /* Wait for parallel workers to finish */ + parallel_vacuum_end_worke_phase(pvs); typo (worke) ~~~ 9. + /* Decrement the worker count for the leader itself */ + if (VacuumActiveNWorkers) + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); Is this OK? Or did the prior parallel_vacuum_end_worke_phase call already set VacuumActiveNWorkers to NULL. ~~~ parallel_vacuum_process_table: 10. + /* + * We have completed the table vacuum so decrement the active worker + * count. 
+ */ + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); Is this OK? (Similar question to the previous review comment #9) Can the table_parallel_vacuum_collect_dead_items end up calling parallel_vacuum_end_worke_phase, so by time we get here the VacuumActiveNWorkers might be NULL? ~~~ parallel_vacuum_begin_work_phase: 11. +/* + * Launch parallel vacuum workers for the given phase. If at least one + * worker launched, enable the shared vacuum delay costing. + */ Maybe this comment should also say a call to this needs to be balanced by a call to the other function: parallel_vacuum_end_work_phase. ~~~ 12. +static void +parallel_vacuum_end_worke_phase(ParallelVacuumState *pvs) Typo (worke) ====== [1] https://www.postgresql.org/message-id/CAHut%2BPs9yTGk2TWdjgLQEzGfPTWafKT0H-Q6WvYh5UKNW0pCvQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Dear Sawada-san, Thanks for updating the patch. I have been reviewing it, and below are my comments so far. 01. Sorry if I forgot something, but is there a reason that parallel_vacuum_compute_workers() is mandatory? My fresh eye felt that the caller could check for NULL and treat it as zero workers if the API is not provided. 02. table_parallel_vacuum_estimate We can assert that state is not NULL. 03. table_parallel_vacuum_initialize Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate() has already been called before, and the current assertion might not be enough because others can set a random value here. Other functions like table_rescan_tidrange() check an internal flag of the data structure, which is good for me. How do you feel about checking that pcxt->estimator has a non-zero value? 04. table_parallel_vacuum_initialize_worker Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been done? 05. table_parallel_vacuum_collect_dead_items Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker() has been done? 06. table_parallel_vacuum_initialize_worker The comment atop the function needs to be updated. 07. While playing with the patch, I found that more parallel vacuum workers can be launched than there are pages: ``` postgres=# CREATE TABLE var (id int); CREATE TABLE postgres=# INSERT INTO var VALUES (generate_series(1, 10000)); INSERT 0 10000 postgres=# VACUUM (parallel 80, verbose) var ; INFO: vacuuming "postgres.public.var" INFO: launched 80 parallel vacuum workers for collecting dead tuples (planned: 80) INFO: finished vacuuming "postgres.public.var": index scans: 0 pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned ... ``` I suppose the minimum chunk size is one page, so this situation could hurt performance. How do you feel about capping the value with rel->rd_rel->relpages in heap_parallel_vacuum_compute_workers()? This value is not always up-to-date, but it seems like a good candidate. Best regards, Hayato Kuroda FUJITSU LIMITED
On Sun, Apr 6, 2025 at 10:27 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-san. > > I was revisiting this thread after a long time. I found most of my > previous review comments from v11-0001 were not yet addressed. I can't > tell if they are deliberately left out, or if they are accidentally > overlooked. Please see the details below. Sorry I missed to address or reply these comments. > > On Mon, Mar 10, 2025 at 3:05 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > ====== > > src/backend/access/table/tableamapi.c > > > > 2. > > Assert(routine->relation_vacuum != NULL); > > + Assert(routine->parallel_vacuum_compute_workers != NULL); > > Assert(routine->scan_analyze_next_block != NULL); > > > > Is it better to keep these Asserts in the same order that the > > TableAmRoutine fields are assigned (in heapam_handler.c)? > > > > Not yet addressed? Will fix. > > > ~~~ > > > > 3. > > + /* > > + * Callbacks for parallel vacuum are also optional (except for > > + * parallel_vacuum_compute_workers). But one callback implies presence of > > + * the others. > > + */ > > + Assert(((((routine->parallel_vacuum_estimate == NULL) == > > + (routine->parallel_vacuum_initialize == NULL)) == > > + (routine->parallel_vacuum_initialize_worker == NULL)) == > > + (routine->parallel_vacuum_collect_dead_items == NULL))); > > > > /also optional/optional/ > > > > Not yet addressed? Will fix. > > > ====== > > src/include/access/heapam.h > > > > +extern int heap_parallel_vacuum_compute_workers(Relation rel, int > > nworkers_requested); > > > > 4. > > wrong tab/space after 'int'. > > Not yet addressed? Will fix. > > > > > ====== > > src/include/access/tableam.h > > > > 5. > > + /* > > + * Compute the number of parallel workers for parallel table vacuum. The > > + * parallel degree for parallel vacuum is further limited by > > + * max_parallel_maintenance_workers. The function must return 0 to disable > > + * parallel table vacuum. > > + * > > + * 'nworkers_requested' is a >=0 number and the requested number of > > + * workers. This comes from the PARALLEL option. 0 means to choose the > > + * parallel degree based on the table AM specific factors such as table > > + * size. > > + */ > > + int (*parallel_vacuum_compute_workers) (Relation rel, > > + int nworkers_requested); > > > > The comment here says "This comes from the PARALLEL option." and "0 > > means to choose the parallel degree...". But, the PG docs [1] says "To > > disable this feature, one can use PARALLEL option and specify parallel > > workers as zero.". > > > > These two descriptions "disable this feature" (PG docs) and letting > > the system "choose the parallel degree" (code comment) don't sound the > > same. Should this 0001 patch update the PG documentation for the > > effect of setting PARALLEL value zero? > > Not yet addressed? It often happens that we treat a value differently when the user inputs the value and when we use it internally. I think the comment should follow how to use VacuumParams.nworkers internally, which is described in the comment as follow: /* * The number of parallel vacuum workers. 0 by default which means choose * based on the number of indexes. -1 indicates parallel vacuum is * disabled. */ int nworkers; So it seems no problem to me. > > > > > ~~~ > > > > 6. > > + /* > > + * Initialize DSM space for parallel table vacuum. > > + * > > + * Not called if parallel table vacuum is disabled. > > + * > > + * Optional callback, but either all other parallel vacuum callbacks need > > + * to exist, or neither. 
> > + */ > > > > "or neither"? > > > > Also, saying "all other" seems incorrect because > > parallel_vacuum_compute_workers callback must exist event if > > parallel_vacuum_initialize does not exist. > > > > IMO you meant to say "all optional", and "or none". > > > > SUGGESTION: > > Optional callback. Either all optional parallel vacuum callbacks need > > to exist, or none. > > > > (this same issue is repeated in multiple places). > > Not yet addressed? Will fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Apr 7, 2025 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Sawada-san, > > Here are some review comments for the patch v16-0002 > > ====== > Commit message > > 1. > Missing period. > /cleanup, ParallelVacuumState/cleanup. ParallelVacuumState/ > ~~~ > > 2. > Typo /paralel/parallel/ > > ~~~ > > 3. > Heap table AM disables the parallel heap vacuuming for now, but an > upcoming patch uses it. > > IIUC, this "disabling" was implemented in patch 0001, so maybe this > comment also belongs in patch 0001. > > ====== > src/backend/commands/vacuumparallel.c > > 4. > * This file contains routines that are intended to support setting up, using, > - * and tearing down a ParallelVacuumState. > + * and tearing down a ParallelVacuumState. ParallelVacuumState contains shared > + * information as well as the memory space for storing dead items allocated in > + * the DSA area. We launch > * > > What's the "We launch" incomplete sentence meant to say? Will fix the above points.. > 5. > +typedef enum > +{ > + PV_WORK_PHASE_PROCESS_INDEXES, /* index vacuuming or cleanup */ > + PV_WORK_PHASE_COLLECT_DEAD_ITEMS, /* collect dead tuples */ > +} PVWorkPhase; > + > > AFAIK, these different "phases" are the ones already described in the > vacuumlazy.c header comment -- there they are named as I, II, III. > > IMO it might be better to try to keep these phase enums together with > the phase descriptions, as well as giving the values all names similar > to those existing descriptions. > > Maybe something like: > PHASE_I_VACUUM_RELATION_COLLECT_DEAD > PHASE_II_VACUUM_INDEX_COLLECT_DEAD > PHASE_III_VACUUM_REAP_DEAD I'm not sure it's better. vacuumparallel.c is independent of heap AM so it's not necessarily true for every table AM that phase I, II, and III correspond to lazy vacuum's phases. For example, some table AM might want to have parallelism for its own phase that cannot correspond to none of lazy vacuum's phases. > > ~~~ > > 6. > -parallel_vacuum_compute_workers(Relation *indrels, int nindexes, int > nrequested, > - bool *will_parallel_vacuum) > +parallel_vacuum_compute_workers(Relation rel, Relation *indrels, int nindexes, > + int nrequested, int *nworkers_table_p, > + bool *idx_will_parallel_vacuum, void *state) > > I had asked this once before ([1] comment #5) > > IIUC the returns for this function seem inconsistent. AFAIK, it > previously would return the number of workers for parallel index > vacuuming. But now (after this patch) the return value is returned > Max(nworkers_table, nworkers_index). Meanwhile, the number of workers > for parallel table vacuuming is returned as a by-reference parameter > 'nworkers_table_p'. In other words, it is returning the number of > workers in 2 different ways. > > It would seem tidier to make this a void function, and introduce > another parameter 'nworkers_index_p', similar to 'nworkers_table_p'. > The caller can do the parallel_workers = Max(nworkers_table_p, > nworkers_index_p); Hmm, parallel_vacuum_compute_workers() is a non-exposed function and its sole caller parallel_vacuum_init() would not use nworkers_index_p. I think we can do that change you suggested when the caller needs to use the number of parallel workers for index vacuuming. > > ~~~ > > parallel_vacuum_process_all_indexes > > 7. > * to finish, or we might get incomplete data.) 
> */ > if (nworkers > 0) > - { > - /* Wait for all vacuum workers to finish */ > - WaitForParallelWorkersToFinish(pvs->pcxt); > - > - for (int i = 0; i < pvs->pcxt->nworkers_launched; i++) > - InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]); > - } > + parallel_vacuum_end_worke_phase(pvs); > > 7a. > The code comment that precedes this was describing code that is now > all removed (moved into the parallel_vacuum_end_worke_phase), so > probably that comment needs to be changed. > > ~ > > 7b. > Typo: parallel_vacuum_end_worke_phase (worke?) > > ~~~ > > parallel_vacuum_collect_dead_items_end: > > 8. > + /* Wait for parallel workers to finish */ > + parallel_vacuum_end_worke_phase(pvs); > > typo (worke) > > ~~~ Will fix the above comments. > > 9. > + /* Decrement the worker count for the leader itself */ > + if (VacuumActiveNWorkers) > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > > Is this OK? Or did the prior parallel_vacuum_end_worke_phase call > already set VacuumActiveNWorkers to NULL. True. Will remove this part. > > ~~~ > > parallel_vacuum_process_table: > > 10. > + /* > + * We have completed the table vacuum so decrement the active worker > + * count. > + */ > + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); > > Is this OK? (Similar question to the previous review comment #9) Can > the table_parallel_vacuum_collect_dead_items end up calling > parallel_vacuum_end_worke_phase, so by time we get here the > VacuumActiveNWorkers might be NULL? True. Will remove this part. > > ~~~ > > parallel_vacuum_begin_work_phase: > > 11. > +/* > + * Launch parallel vacuum workers for the given phase. If at least one > + * worker launched, enable the shared vacuum delay costing. > + */ > > Maybe this comment should also say a call to this needs to be balanced > by a call to the other function: parallel_vacuum_end_work_phase. Will fix. > > ~~~ > > 12. > +static void > +parallel_vacuum_end_worke_phase(ParallelVacuumState *pvs) > > Typo (worke) > Will fix. Thank you for the comments! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 18, 2025 at 2:49 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > Thanks for updating the patch. I have been reviewing it, and below are my comments so far. Thank you for reviewing the patch! > > 01. > Sorry if I forgot something, but is there a reason that > parallel_vacuum_compute_workers() is mandatory? My fresh eye felt that the caller > could check for NULL and treat it as zero workers if the API is not provided. I thought that making this function mandatory would be a good way for table AM authors to indicate their intentions in terms of parallel vacuum. On the other hand, I see your point; it would not bother table AMs that are not going to support parallel vacuum. I'd like to hear further opinions on this. > > 02. table_parallel_vacuum_estimate > We can assert that state is not NULL. I think the state can be NULL depending on the caller. It should not be mandatory. > > 03. table_parallel_vacuum_initialize > > Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate() > has already been called before, and the current assertion might not be enough because > others can set a random value here. Other functions like table_rescan_tidrange() > check an internal flag of the data structure, which is good for me. How do you > feel about checking that pcxt->estimator has a non-zero value? > > 04. table_parallel_vacuum_initialize_worker > Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been > done? > > 05. table_parallel_vacuum_collect_dead_items > Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker() > has been done? IIUC table_parallel_vacuum_initialize() is called only by vacuumparallel.c, which is not controllable from outside of it. How can others set a random value to nworkers? > > 06. table_parallel_vacuum_initialize_worker > The comment atop the function needs to be updated. Did you refer to the following comment? /* * Initialize AM-specific vacuum state for worker processes. */ static inline void table_parallel_vacuum_initialize_worker(Relation rel, struct ParallelVacuumState *pvs, struct ParallelWorkerContext *pwcxt, void **state_out) Which part of the comment do you think we need to update? > > 07. > While playing with the patch, I found that more parallel vacuum workers can be launched than > there are pages: > > ``` > postgres=# CREATE TABLE var (id int); > CREATE TABLE > postgres=# INSERT INTO var VALUES (generate_series(1, 10000)); > INSERT 0 10000 > postgres=# VACUUM (parallel 80, verbose) var ; > INFO: vacuuming "postgres.public.var" > INFO: launched 80 parallel vacuum workers for collecting dead tuples (planned: 80) > INFO: finished vacuuming "postgres.public.var": index scans: 0 > pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned > ... > ``` > > I suppose the minimum chunk size is one page, so this situation could hurt performance. > How do you feel about capping the value with rel->rd_rel->relpages in heap_parallel_vacuum_compute_workers()? > This value is not always up-to-date, but it seems like a good candidate. Thank you for testing the patch. Yes, we should avoid that. I think it would be better to cap the number of workers based on the number of chunks of blocks. I'm going to propose a new parallel scan method for parallel lazy scan, so it would make sense to choose a saner value based on it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
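To make the cap being discussed here concrete, below is a minimal sketch (not the patch's actual code) of clamping the planned worker count by the number of chunks the scan can hand out; since the page count bounds the chunk count, this also covers the relpages-based cap suggested above. The function name and chunk size follow the patch, the size-based default is elided, and the further clamp by max_parallel_maintenance_workers is assumed to happen in vacuumparallel.c:

```
#include "postgres.h"

#include "utils/rel.h"

#define PARALLEL_LV_CHUNK_SIZE	4096	/* chunk size used by the patch */

/* Sketch only: never plan more workers than there are chunks to scan. */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int nworkers_requested)
{
	int32		relpages = rel->rd_rel->relpages;	/* may be slightly stale */
	int			nchunks;
	int			nworkers;

	/* Requested degree (PARALLEL n), or a size-based default (elided here). */
	nworkers = (nworkers_requested > 0) ? nworkers_requested : 2;

	/* One worker per chunk is the most that can ever be useful. */
	nchunks = (relpages > 0) ? (relpages - 1) / PARALLEL_LV_CHUNK_SIZE + 1 : 1;
	if (nworkers > nchunks)
		nworkers = nchunks;

	/* The caller further clamps this by max_parallel_maintenance_workers. */
	return nworkers;
}
```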
On Sat, Apr 5, 2025 at 1:17 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Apr 4, 2025 at 5:35 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I haven't looked closely at this version but I did notice that you do > > > not document that parallel vacuum disables eager scanning. Imagine you > > > are a user who has set the eager freeze related table storage option > > > (vacuum_max_eager_freeze_failure_rate) and you schedule a regular > > > parallel vacuum. Now that table storage option does nothing. > > > > Good point. That restriction should be mentioned in the documentation. > > I'll update the patch. > > Yea, I mean, to be honest, when I initially replied to the thread > saying I thought temporarily disabling eager scanning for parallel > heap vacuuming was viable, I hadn't looked at the patch yet and > thought that there was a separate way to enable the new parallel heap > vacuum (separate from the parallel option for the existing parallel > index vacuuming). I don't like that this disables functionality that > worked when I pushed the eager scanning feature. Thank you for sharing your opinion. I think this is one of the main points that we need to deal with to complete this patch. After considering various approaches to integrating parallel heap vacuum and eager freeze scanning, one viable solution would be to implement a dedicated parallel scan mechanism for parallel lazy scan, rather than relying on the table_block_parallelscan_xxx() facility. This approach would involve dividing the table into chunks of 4,096 blocks, same as eager freeze scanning, where each parallel worker would perform eager freeze scanning while maintaining its own local failure count and a shared success count. This straightforward approach offers an additional advantage: since the chunk size remains constant, we can implement the SKIP_PAGES_THRESHOLD optimization consistently throughout the table, including its final sections. However, this approach does present certain challenges. First, we would need to maintain a separate implementation of lazy vacuum's parallel scan alongside the table_block_parallelscan_XXX() facility, potentially increasing maintenance overhead. Additionally, the fixed chunk size across the entire table might prove less efficient when processing blocks near the table's end compared to the dynamic approach used by table_block_parallelscan_nextpage(). Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
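As a rough illustration of the bookkeeping sketched above (all names here are hypothetical, not taken from the patch): each worker keeps a private failure counter that it resets at every 4,096-block region boundary, while the table-wide success count lives in shared memory and is bumped atomically, so any worker can notice when the overall success budget has been used up.

```
#include "postgres.h"

#include "port/atomics.h"

#define EAGER_SCAN_REGION_SIZE	4096	/* same region size as eager freeze scanning */

/* Hypothetical shared state: one table-wide success budget. */
typedef struct ParallelEagerScanShared
{
	uint32		success_cap;	/* read-only after initialization */
	pg_atomic_uint32 successes; /* all-visible pages successfully frozen */
} ParallelEagerScanShared;

/* Hypothetical per-worker state: failures are tracked locally, per region. */
typedef struct EagerScanWorkerState
{
	uint32		failures_in_region; /* reset whenever a new region starts */
} EagerScanWorkerState;

/* Record the outcome of one eagerly scanned all-visible page. */
static void
eager_scan_record_outcome(ParallelEagerScanShared *shared,
						  EagerScanWorkerState *ws, bool frozen)
{
	if (frozen)
		pg_atomic_add_fetch_u32(&shared->successes, 1);
	else
		ws->failures_in_region++;
}

/* Should this worker keep eagerly scanning all-visible pages? */
static bool
eager_scan_keep_going(ParallelEagerScanShared *shared,
					  EagerScanWorkerState *ws, uint32 region_failure_cap)
{
	return pg_atomic_read_u32(&shared->successes) < shared->success_cap &&
		ws->failures_in_region < region_failure_cap;
}
```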
On Mon, Jun 30, 2025 at 9:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've rebased the patches to the current HEAD. Hi Sawada-san, I've started reviewing this. I initially meant to review the eager scan part, but I think it would be easier to do that if it were in a separate patch on top of the rest of the parallel heap vacuuming patches -- just during review. I had a hard time telling which parts of the design (and code) were necessitated by the eager scanning feature. I started reviewing the rest of the code and this is some initial feedback on it: LV --- I don't think we should use "LV" and "lazy vacuum" anymore. Let's call it what it is: "heap vacuuming". We don't need to rename vacuumlazy.c right now, but if you are making new structs or substantially altering old ones, I wouldn't use "LV". table AM callbacks ------------------------- I think there is a fundamental tension here related to whether or not vacuumparallel.c should be table-AM agnostic. All of its code is invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c is index AM-agnostic but heap-specific. I think this is what makes your table AM callbacks feel odd. Other table AM callbacks are invoked from general code -- for example from executor code that is agnostic to the underlying table AMs. But, all of the functions you added and the associated callbacks are invoked below heap_vacuum_rel() (as is all of vacuumparallel.c). duplicated information across structs ----------------------------------------------- I think a useful preliminary patch to this set would be to refactor LVRelState into multiple structs that each represent the static input, working state, and output of vacuum. You start to do this in one of your patches with LVScanData but it doesn't go far enough. In fact, I think there are members of all the parallel data structures you introduced and of those that are already present that overlap with members of the LVRelState and we could start putting these smaller structs. For my examples, I am referring both to existing code and code you added. For example, relnamespace, relname, and indname in LVRelState and ParallelVacuumState and ParallelVacuumState->heaprel and LVRelState->rel. Seems like you could have a smaller struct that is accessible to the vacuum error callback and also to the users in LVRelState. (Side note: In your patch, I find it confusing that these members of LVRelState are initialized in heap_parallel_vacuum_collect_dead_items() instead of heap_parallel_vacuum_initialize_worker().) And even for cases where the information has to be passed from the leader to the worker, there is no reason you can't have the same struct but in shared memory for the parallel case and local memory for the serial case. For example with the struct members "aggressive", "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or ParallelLVScanDesc->nblocks and LVRelState->rel_pages. Ideally the entire part of the LVRelState that is an unchanging input/reference data is in a struct which is in local memory for the serial and local parallel case and a single read-only location in the shared parallel case. Or, for the shared case, if there is a reason not to read from the shared memory, we copy them over to a local instance of the same struct. Maybe it is called HeapVacInput or similar. There are a bunch of other instances like this. For example, the leader initializes LVScanData->NewRelfrozenXid from vacrel->cutoffs.OldestXmin, then uses this to set ParallelLVShared->NewRelfrozenXid. 
Then the worker copies it from ParallelLVShared->NewRelfrozenXid back to LVScanData->NewRelfrozenXid. But the worker has access to ParallelLVShared->cutoffs. So you could either get it from there or allow the workers to access the first LVScanData and have that always belong to the leader. Either way, I don't think you should need NewRelfrozenXid in ParallelLVShared. Another related concern: I don't understand why ParallelLVScanWorkerData->nallocated has to be in shared memory. I get that each worker has to keep track for itself how many blocks it has "processed" so far (either skipped or scanned) and I get that there has to be some coordination variable that keeps track of whether or not all blocks have been processed and where the next chunk is. But why does the worker's nallocated have to be in shared memory? It doesn't seem like the leader accesses it. (Side note: I'm rather confused by why ParallelLVScanDesc is its own thing [instead of part of ParallelLVShared] -- not to mention its chunk_size member appears to be unused.) Separately, I think the fact that PVShared and ParallelVacuumState stayed almost untouched (and continue to have general, parallel-sounding names) with your patch but now mostly deal with stuff about indexes while most other parallel vacuuming content is in other structs is confusing. I think we need to consider which members in PVShared and ParallelVacuumState are phase II only and put that in an appropriately named structure or get more of the heap vacuuming specific members in those generally named structures. - Melanie
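To make the suggested split slightly more concrete, one possible shape (struct and field names are hypothetical; the members are the ones discussed above) is a read-only input struct that can live in backend-local memory for serial vacuum or in DSM for parallel vacuum, next to a separate per-scan result struct that the leader later merges:

```
#include "postgres.h"

#include "commands/vacuum.h"
#include "storage/block.h"

/* Unchanging inputs: local in the serial case, a single copy in DSM when parallel. */
typedef struct HeapVacInput
{
	bool		aggressive;
	bool		skipwithvm;
	BlockNumber rel_pages;
	struct VacuumCutoffs cutoffs;
} HeapVacInput;

/* Per-scan outputs: one per worker in the parallel case, merged by the leader. */
typedef struct HeapVacScanResult
{
	TransactionId NewRelfrozenXid;
	MultiXactId NewRelminMxid;
	BlockNumber nonempty_pages;
	bool		skippedallvis;
} HeapVacScanResult;
```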
On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Jun 30, 2025 at 9:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've rebased the patches to the current HEAD. > > Hi Sawada-san, > > I've started reviewing this. I initially meant to review the eager > scan part, but I think it would be easier to do that if it were in a > separate patch on top of the rest of the parallel heap vacuuming > patches -- just during review. I had a hard time telling which parts > of the design (and code) were necessitated by the eager scanning > feature. Thank you for reviewing the patch! I'll separate the changes of the eager scanning part from the main patch in the next version patch. > LV > --- > I don't think we should use "LV" and "lazy vacuum" anymore. Let's call > it what it is: "heap vacuuming". We don't need to rename vacuumlazy.c > right now, but if you are making new structs or substantially altering > old ones, I wouldn't use "LV". Agreed. I'll change accordingly in the next version patch. > > table AM callbacks > ------------------------- > I think there is a fundamental tension here related to whether or not > vacuumparallel.c should be table-AM agnostic. All of its code is > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > is index AM-agnostic but heap-specific. > > I think this is what makes your table AM callbacks feel odd. Other > table AM callbacks are invoked from general code -- for example from > executor code that is agnostic to the underlying table AMs. But, all > of the functions you added and the associated callbacks are invoked > below heap_vacuum_rel() (as is all of vacuumparallel.c). What does it exactly mean to make vacuumparallel.c index AM-agnostic but heap-specific? I imagine we change the vacuumparallel.c so that it calls heap's functions for parallel vacuuming such as initializing DSM and parallel vacuum worker's entry function etc. But I'm not sure how it works with non-heap table AMs that uses vacuumparallel.c. I've tried to implement parallel heap scanning in vacuumlazy.c before but I realized that there could be a lot of duplications between vacuumlazy.c and vacuumparallel.c. One possible change would be to have lazyvacuum.c pass the set of callbacks to parallel_vacuum_init() instead of having them as table AM callbacks. This removes the weirdness of the associated table AM callbacks being invoked below heap_vacuum_rel() but it doesn't address your point "vacuumparallel.c is index AM-agnostic but heap-specific". > duplicated information across structs > ----------------------------------------------- > I think a useful preliminary patch to this set would be to refactor > LVRelState into multiple structs that each represent the static input, > working state, and output of vacuum. You start to do this in one of > your patches with LVScanData but it doesn't go far enough. > > In fact, I think there are members of all the parallel data structures > you introduced and of those that are already present that overlap with > members of the LVRelState and we could start putting these smaller > structs. For my examples, I am referring both to existing code and > code you added. > > For example, relnamespace, relname, and indname in LVRelState and > ParallelVacuumState and ParallelVacuumState->heaprel and > LVRelState->rel. Seems like you could have a smaller struct that is > accessible to the vacuum error callback and also to the users in > LVRelState. 
I guess it's also related to the point of whether we should make vacuumparallel.c heap-specific. Currently, since vacuumparallel.c is independent from heap AM and has its own error callback parallel_vacuum_error_callback(), it has duplicated information like heaprel and indrels. Given vacuumparallel.c should not rely on the heap-specific struct, I imagine that we store that information in vacuumparallel.c side and have heap access to it. But which means it's workable only during parallel vacuum. > (Side note: In your patch, I find it confusing that these members of > LVRelState are initialized in > heap_parallel_vacuum_collect_dead_items() instead of > heap_parallel_vacuum_initialize_worker().) Yeah, I think at least some fields such as dbname, relnamespace, and relname should be initialized in heap_parallel_vacuum_initialize_worker(). > > And even for cases where the information has to be passed from the > leader to the worker, there is no reason you can't have the same > struct but in shared memory for the parallel case and local memory for > the serial case. For example with the struct members "aggressive", > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > Ideally the entire part of the LVRelState that is an unchanging > input/reference data is in a struct which is in local memory for the > serial and local parallel case and a single read-only location in the > shared parallel case. Or, for the shared case, if there is a reason > not to read from the shared memory, we copy them over to a local > instance of the same struct. Maybe it is called HeapVacInput or > similar. ParallelLVShared is created to pass information to parallel vacuum workers while keeping LVRelStates able to work locally. Suppose that we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", and "rel_pages", LVRelState would have to have a pointer to a HeapVacInput instance that is either on local memory or shared memory. Since we also need to pass other information such as initial_chunk_size and eager scanning related information to the parallel vacuum worker, we would have to create something like ParallelLVShared as the patch does. As a result, we would have two structs that need to be shared on the shared buffer. Is that kind of what you meant? or did you mean that we include parallel vacuum related fields too to HeapVacInput struct? > > There are a bunch of other instances like this. For example, the > leader initializes LVScanData->NewRelfrozenXid from > vacrel->cutoffs.OldestXmin, then uses this to set > ParallelLVShared->NewRelfrozenXid. Then the worker copies it from > ParallelLVShared->NewRelfrozenXid back to LVScanData->NewRelfrozenXid. > But the worker has access to ParallelLVShared->cutoffs. So you could > either get it from there or allow the workers to access the first > LVScanData and have that always belong to the leader. Either way, I > don't think you should need NewRelfrozenXid in ParallelLVShared. You're right. Both NewRelFrozenXid and NewRelminMxid don't need to be shared and can be removed from ParallelLVShared. > > Another related concern: I don't understand why > ParallelLVScanWorkerData->nallocated has to be in shared memory. 
I get > that each worker has to keep track for itself how many blocks it has > "processed" so far (either skipped or scanned) and I get that there > has to be some coordination variable that keeps track of whether or > not all blocks have been processed and where the next chunk is. But > why does the worker's nallocated have to be in shared memory? It > doesn't seem like the leader accesses it. ParallelLVScanWorkerData->nallocated works in the same way as ParallelBlockTableScanDescData.phs_nallocated. That is, it controls how many blocks has been allocated to any of the workers so far. When allocating a new chunk to scan, each worker gets ParallelLVScanWorkerData->nallocated value and adds the chunk size to it in an atomic operation (see parallel_lazy_scan_get_nextpage()). > > (Side note: I'm rather confused by why ParallelLVScanDesc is its own > thing [instead of part of ParallelLVShared] -- not to mention its > chunk_size member appears to be unused.) I wanted to keep ParallelLVShared() mostly read-only and aimed to share the information from the leader to the workers, whereas ParallelLVScanDesc() is a pure scan state maintained by each worker (and the leader). chunk_size is used when allocating the new chunk (see parallel_lazy_scan_get_nextpage()). It's initialized by vacrel->plvstate->shared->initial_chunk_size since the first chunk size could vary depending on eager scanning state, and is updated to the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. > > Separately, I think the fact that PVShared and ParallelVacuumState > stayed almost untouched (and continue to have general, > parallel-sounding names) with your patch but now mostly deal with > stuff about indexes while most other parallel vacuuming content is in > other structs is confusing. I think we need to consider which members > in PVShared and ParallelVacuumState are phase II only and put that in > an appropriately named structure or get more of the heap vacuuming > specific members in those generally named structures. Yeah, fields such as reltuples, estimated_count, and maintenance_work_mem_worker are used only for phase II while some fields such as relid, TidStore related fields, and vacuum delays related fields are commonly used in phase I and II. I'll consider creating a new struct to store phase II only information. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
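For readers following along, the allocation scheme described above can be pictured like this (assumed names; the real logic is in the patch's parallel_lazy_scan_get_nextpage()): a shared atomic counter records how many blocks have been handed out so far, and each worker claims its next chunk with a fetch-and-add, in the spirit of phs_nallocated for parallel block table scans.

```
#include "postgres.h"

#include "port/atomics.h"
#include "storage/block.h"

#define PARALLEL_LV_CHUNK_SIZE	4096

/* Shared by all workers; analogous to phs_nallocated in a parallel block scan. */
typedef struct ParallelScanAllocator
{
	pg_atomic_uint64 nallocated;	/* number of blocks handed out so far */
} ParallelScanAllocator;

/*
 * Claim the next chunk.  Returns the first block of the chunk, or
 * InvalidBlockNumber once the whole relation has been allocated;
 * *chunk_end_p is set to one past the chunk's last block.
 */
static BlockNumber
claim_next_chunk(ParallelScanAllocator *alloc, BlockNumber rel_pages,
				 BlockNumber *chunk_end_p)
{
	uint64		start;

	start = pg_atomic_fetch_add_u64(&alloc->nallocated, PARALLEL_LV_CHUNK_SIZE);
	if (start >= rel_pages)
		return InvalidBlockNumber;

	*chunk_end_p = (BlockNumber) Min(start + PARALLEL_LV_CHUNK_SIZE, rel_pages);
	return (BlockNumber) start;
}
```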
On Fri, Jul 18, 2025 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > I think there is a fundamental tension here related to whether or not > > vacuumparallel.c should be table-AM agnostic. All of its code is > > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > > is index AM-agnostic but heap-specific. > > > > I think this is what makes your table AM callbacks feel odd. Other > > table AM callbacks are invoked from general code -- for example from > > executor code that is agnostic to the underlying table AMs. But, all > > of the functions you added and the associated callbacks are invoked > > below heap_vacuum_rel() (as is all of vacuumparallel.c). > > What does it exactly mean to make vacuumparallel.c index AM-agnostic > but heap-specific? I imagine we change the vacuumparallel.c so that it > calls heap's functions for parallel vacuuming such as initializing DSM > and parallel vacuum worker's entry function etc. But I'm not sure how > it works with non-heap table AMs that uses vacuumparallel.c. I don't understand what you mean by non-heap table AMs using vacuumparallel.c. Do you mean that they are using it like a library? I don't understand why the existing parallel index vacuumming code added table AM callbacks either. They are only invoked from below heap_vacuum_rel(). Unless I am missing something? From looking at tableam.h, the parallel vacuum callbacks seem to be the only table AM callbacks that are not invoked from table AM agnostic code. What was the point of adding them? I can see how other table AMs implementing vacuum may want to use some of the vacuumparallel.c functions exposed in the vacuum.h header file to save time and code for the index vacuuming portion of vacuuming. But why have table AM callbacks that aren't invoked unless you are using heap? More generally, vacuuming is a very table AM-specific operation. You often wouldn't need any of the same data structures. Think about an undo-based table AM -- you don't need dead items or anything like that. Or the phase ordering -- other table AMs may not need to vacuum the table then the indexes then the table. Something like a scan, it just has to yield tuples to the executor. But vacuum seems inextricably linked with the on-disk layout. Most of the vacuum.c code is just concerned with parsing options, checking permissions, determining precedence with GUCs/table options, and doing catalog operations. > One possible change would be to have lazyvacuum.c pass the set of > callbacks to parallel_vacuum_init() instead of having them as table AM > callbacks. This removes the weirdness of the associated table AM > callbacks being invoked below heap_vacuum_rel() but it doesn't address > your point "vacuumparallel.c is index AM-agnostic but heap-specific". I do agree you are in a tough spot if we keep the index parallel vacuuming callbacks in tableam.h. It is weird to call a heap-specific function from within a table AM callback. I guess my first question is which table AMs are using the existing callbacks and how are they doing that since I don't see places where they are called from non heap code. If they use the vacuum.h exposed vacuumparallel.c functions as a library, they could do that without the callbacks in tableam.h. But I think I must just be missing something. 
> > And even for cases where the information has to be passed from the > > leader to the worker, there is no reason you can't have the same > > struct but in shared memory for the parallel case and local memory for > > the serial case. For example with the struct members "aggressive", > > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > > > Ideally the entire part of the LVRelState that is an unchanging > > input/reference data is in a struct which is in local memory for the > > serial and local parallel case and a single read-only location in the > > shared parallel case. Or, for the shared case, if there is a reason > > not to read from the shared memory, we copy them over to a local > > instance of the same struct. Maybe it is called HeapVacInput or > > similar. > > ParallelLVShared is created to pass information to parallel vacuum > workers while keeping LVRelStates able to work locally. Suppose that > we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", > and "rel_pages", LVRelState would have to have a pointer to a > HeapVacInput instance that is either on local memory or shared memory. > Since we also need to pass other information such as > initial_chunk_size and eager scanning related information to the > parallel vacuum worker, we would have to create something like > ParallelLVShared as the patch does. As a result, we would have two > structs that need to be shared on the shared buffer. Is that kind of > what you meant? or did you mean that we include parallel vacuum > related fields too to HeapVacInput struct? Yes, I think generally this is what I am saying. Honestly, I think the two biggest problems right now are that 1) the existing LVRelState mixes too many different types of data (e.g. output stats and input data) and that 2) the structs your patch introduces don't have names making it clear enough what they are. What is ParallelLVScanDesc? We don't have the convention of a scan descriptor for vacuum, so is the point here that it is only used for a single "scan" of the heap table? That sounds like all of phase I to me. Or why is the static input data structure just called ParallelLVShared? Nothing about that indicates read-only or static. When I was imagining how to make it more clear, I was thinking something like what you are saying above. There are two types of input read-only data passed from the leader to the workers. One is parallel-only and one is needed in both the parallel and serial cases. I was suggesting having the common input read-only fields in the same type of struct. That would mean having two types of input data structures in the parallel-case -- one parallel only and one common. It's possible that that is more confusing. I think it also depends on how the workers end up accessing the data. For input read-only data you can either have workers access a single copy in shared memory (and have a pointer that points to shared memory in the parallel case and local memory in the serial case) or you can copy the data over to local memory so that the parallel and serial cases use it in the same data structure. You are doing mostly the latter but the setup is spread around enough that it isn't immediately clear that is what is happening. One thing that might be worth trying is moving the setup steps in heap_vacuum_rel() into some kind of helper function that can be called both in heap_vacuum_rel() by the leader and also in the worker setup. 
Then it might make it clear what only the leader does (e.g. call vacuum_get_cutoffs()) and what everyone does. I'm just brainstorming here, though. I think if you just make a new version of the patch with the eager scanning separated out and the names of the structs clarified, I could try to be more specific. I found the spread of members across the different structs hard to parse, but I'm not 100% sure yet what would make it easier. > I wanted to keep ParallelLVShared() mostly read-only and aimed to > share the information from the leader to the workers, whereas > ParallelLVScanDesc() is a pure scan state maintained by each worker > (and the leader). I'm not sure what scan state means here. nallocated seems to be a coordination member. But nblocks seems to be used in places that you have access to the LVRelState, which has rel_pages, so why wouldn't you just use that? > chunk_size is used when allocating the new chunk (see > parallel_lazy_scan_get_nextpage()). It's initialized by > vacrel->plvstate->shared->initial_chunk_size since the first chunk > size could vary depending on eager scanning state, and is updated to > the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. I see that ParallelLVScanWorkerData->chunk_size is used in that way, but I don't see where ParallelLVScanDesc->chunk_size is used. Also, why does ParallelLVScanWorkerData->chunk_size need to be in shared memory? Isn't the worker just using it locally? - Melanie
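One way to picture the helper suggested at the end of the message above (the function name is hypothetical, and the sketch assumes it lives in vacuumlazy.c where LVRelState is defined): both the leader in heap_vacuum_rel() and each worker during its initialization run the same setup of the fields everyone needs, while leader-only steps such as calling vacuum_get_cutoffs() remain in heap_vacuum_rel().

```
#include "postgres.h"

#include "utils/lsyscache.h"
#include "utils/rel.h"

/* Hypothetical: the part of LVRelState setup common to leader and workers. */
static void
heap_vacuum_init_common(LVRelState *vacrel, Relation rel)
{
	vacrel->rel = rel;
	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
	vacrel->relname = pstrdup(RelationGetRelationName(rel));
	vacrel->indname = NULL;

	/* ... zero the per-scan counters, set up the error callback info, etc. ... */
}
```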
On Mon, Jul 21, 2025 at 7:28 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Jul 18, 2025 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Jul 17, 2025 at 1:39 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > > > I think there is a fundamental tension here related to whether or not > > > vacuumparallel.c should be table-AM agnostic. All of its code is > > > invoked below heap_vacuum_rel(). I would argue that vacuumparallel.c > > > is index AM-agnostic but heap-specific. > > > > > > I think this is what makes your table AM callbacks feel odd. Other > > > table AM callbacks are invoked from general code -- for example from > > > executor code that is agnostic to the underlying table AMs. But, all > > > of the functions you added and the associated callbacks are invoked > > > below heap_vacuum_rel() (as is all of vacuumparallel.c). > > > > What does it exactly mean to make vacuumparallel.c index AM-agnostic > > but heap-specific? I imagine we change the vacuumparallel.c so that it > > calls heap's functions for parallel vacuuming such as initializing DSM > > and parallel vacuum worker's entry function etc. But I'm not sure how > > it works with non-heap table AMs that uses vacuumparallel.c. > > I don't understand what you mean by non-heap table AMs using > vacuumparallel.c. Do you mean that they are using it like a library? I > don't understand why the existing parallel index vacuumming code added > table AM callbacks either. They are only invoked from below > heap_vacuum_rel(). Unless I am missing something? From looking at > tableam.h, the parallel vacuum callbacks seem to be the only table AM > callbacks that are not invoked from table AM agnostic code. What was > the point of adding them? Given your suggestion that we should make vacuumparallel.c index AM-agnostic but heap-specific, I imagined that parallel_vacuum_init() calls a function defined in vacuumlazy.c that initializes the DSM space for parallel heap vacuum. However, since parallel_vacuum_init() is exposed, IIUC every table AM can use it. Therefore, for example a non-heap table AM calls parallel_vacuum_init() in its own relation_vacuum callback in order to vacuum its indexes in parallel, it means eventually to call heap's DSM-initialization function, which won't work. > I can see how other table AMs implementing vacuum may want to use some > of the vacuumparallel.c functions exposed in the vacuum.h header file > to save time and code for the index vacuuming portion of vacuuming. > But why have table AM callbacks that aren't invoked unless you are > using heap? > > More generally, vacuuming is a very table AM-specific operation. You > often wouldn't need any of the same data structures. Think about an > undo-based table AM -- you don't need dead items or anything like > that. Or the phase ordering -- other table AMs may not need to vacuum > the table then the indexes then the table. Something like a scan, it > just has to yield tuples to the executor. But vacuum seems > inextricably linked with the on-disk layout. Most of the vacuum.c code > is just concerned with parsing options, checking permissions, > determining precedence with GUCs/table options, and doing catalog > operations. I guess it ultimately depends on how other table AMs define what to do in the VACUUM command (or autovacuum). For instance, the columnar table AM simply updates the relation statistics during the vacuum command[1]. 
Since it can efficiently collect statistics such as the number of live tuples, it completes this process quickly. If an AM (not limited to the columnar table AM) needs to gather more comprehensive statistics that might require scanning the entire table, the feature introduced by this patch -- multiple workers scanning the table while collecting data -- could prove beneficial and it might want to use it in relation_vacuum callback. The patch introduces the parallel_vacuum_collect_dead_items callback, but it's not restricted to collecting only dead tuples. It can be used for gathering various types of data. Regarding phase ordering, both the existing vacuumparallel.c features and the new parallel table vacuuming feature are flexible in their execution sequence. If a table AM needs to invoke parallel index vacuuming before parallel table vacuuming, it can call the corresponding functions in that order. > > > One possible change would be to have lazyvacuum.c pass the set of > > callbacks to parallel_vacuum_init() instead of having them as table AM > > callbacks. This removes the weirdness of the associated table AM > > callbacks being invoked below heap_vacuum_rel() but it doesn't address > > your point "vacuumparallel.c is index AM-agnostic but heap-specific". > > I do agree you are in a tough spot if we keep the index parallel > vacuuming callbacks in tableam.h. It is weird to call a heap-specific > function from within a table AM callback. I guess my first question is > which table AMs are using the existing callbacks and how are they > doing that since I don't see places where they are called from non > heap code. If they use the vacuum.h exposed vacuumparallel.c functions > as a library, they could do that without the callbacks in tableam.h. > But I think I must just be missing something. Citus columnar table and orioledb are open-source table AM implementations I know, and it seems that orioledb's vacuum codes are somewhat similar to heap's one[2] (I don't know the details of orioledb though); it initializes parallel vacuum by calling vacuum_parallel_init() in its relation_vacuum callback, orioledb_vacuum_rel(). The reason why I added some callbacks as table AM callbacks in the patch is that I could not find other better places. Currently, vacuumparallel.c handles several critical operations for parallel vacuuming: allocating and initializing DSM space for parallel index vacuuming, initializing and launching parallel workers, and managing various auxiliary tasks such as configuring vacuum delays and setting maintenance_work_mem for workers. Given these existing functionalities, I aimed to implement the parallel heap vacuum worker launch through the same vacuumparallel.c codebase, maintaining a consistent interface. To achieve this integration, vacuumparallel.c requires access to heap-specific functions, and defining them as table AM callbacks emerged as the cleanest solution. My personal vision is to maintain vacuumparallel.c's neutrality regarding both index AM and table AM implementations, while providing a unified interface for table AMs that wish to leverage parallel operations-- whether for table vacuuming, index vacuuming, or both. > > > > And even for cases where the information has to be passed from the > > > leader to the worker, there is no reason you can't have the same > > > struct but in shared memory for the parallel case and local memory for > > > the serial case. 
For example with the struct members "aggressive", > > > "skipwithvm", and "cutoffs" in LVRelState and ParallelLVShared. Or > > > ParallelLVScanDesc->nblocks and LVRelState->rel_pages. > > > > > > Ideally the entire part of the LVRelState that is an unchanging > > > input/reference data is in a struct which is in local memory for the > > > serial and local parallel case and a single read-only location in the > > > shared parallel case. Or, for the shared case, if there is a reason > > > not to read from the shared memory, we copy them over to a local > > > instance of the same struct. Maybe it is called HeapVacInput or > > > similar. > > > > ParallelLVShared is created to pass information to parallel vacuum > > workers while keeping LVRelStates able to work locally. Suppose that > > we create HeapVacInput including "aggressive", "cutoff", "skipwithvm", > > and "rel_pages", LVRelState would have to have a pointer to a > > HeapVacInput instance that is either on local memory or shared memory. > > Since we also need to pass other information such as > > initial_chunk_size and eager scanning related information to the > > parallel vacuum worker, we would have to create something like > > ParallelLVShared as the patch does. As a result, we would have two > > structs that need to be shared on the shared buffer. Is that kind of > > what you meant? or did you mean that we include parallel vacuum > > related fields too to HeapVacInput struct? > > Yes, I think generally this is what I am saying. > > Honestly, I think the two biggest problems right now are that 1) the > existing LVRelState mixes too many different types of data (e.g. > output stats and input data) and that 2) the structs your patch > introduces don't have names making it clear enough what they are. > > What is ParallelLVScanDesc? We don't have the convention of a scan > descriptor for vacuum, so is the point here that it is only used for a > single "scan" of the heap table? That sounds like all of phase I to > me. Yes, ParallelLVScanDesc is added to control the working state during do_parallel_lazy_scan_heap(), a parallel variant of phase I. > Or why is the static input data structure just called > ParallelLVShared? Nothing about that indicates read-only or static. > > When I was imagining how to make it more clear, I was thinking > something like what you are saying above. There are two types of input > read-only data passed from the leader to the workers. One is > parallel-only and one is needed in both the parallel and serial cases. > I was suggesting having the common input read-only fields in the same > type of struct. That would mean having two types of input data > structures in the parallel-case -- one parallel only and one common. > It's possible that that is more confusing. > > I think it also depends on how the workers end up accessing the data. > For input read-only data you can either have workers access a single > copy in shared memory (and have a pointer that points to shared memory > in the parallel case and local memory in the serial case) or you can > copy the data over to local memory so that the parallel and serial > cases use it in the same data structure. You are doing mostly the > latter but the setup is spread around enough that it isn't immediately > clear that is what is happening. > > One thing that might be worth trying is moving the setup steps in > heap_vacuum_rel() into some kind of helper function that can be called > both in heap_vacuum_rel() by the leader and also in the worker setup. 
> Then it might make it clear what only the leader does (e.g. call > vacuum_get_cutoffs()) and what everyone does. I'm just brainstorming > here, though. > > I think if you just make a new version of the patch with the eager > scanning separated out and the names of the structs clarified, I could > try to be more specific. I found the spread of members across the > different structs hard to parse, but I'm not 100% sure yet what would > make it easier. Okay, I'll try to clarify the names and the above idea in the next version patch. > > > I wanted to keep ParallelLVShared() mostly read-only and aimed to > > share the information from the leader to the workers, whereas > > ParallelLVScanDesc() is a pure scan state maintained by each worker > > (and the leader). > > I'm not sure what scan state means here. nallocated seems to be a > coordination member. But nblocks seems to be used in places that you > have access to the LVRelState, which has rel_pages, so why wouldn't > you just use that? Right, we can use rel_pages instead. > > > chunk_size is used when allocating the new chunk (see > > parallel_lazy_scan_get_nextpage()). It's initialized by > > vacrel->plvstate->shared->initial_chunk_size since the first chunk > > size could vary depending on eager scanning state, and is updated to > > the fixed size PARALLEL_LV_CHUNK_SIZE from the next chunk. > > I see that ParallelLVScanWorkerData->chunk_size is used in that way, > but I don't see where ParallelLVScanDesc->chunk_size is used. It seems that ParallelLVScanDesc->chunk_size is not used at all, so I'll remove it. > Also, > why does ParallelLVScanWorkerData->chunk_size need to be in shared > memory? Isn't the worker just using it locally? I put it in shared memory for the case where the ParallelLVScanWorkerData is re-used by another worker after resuming phase I. But the worker can simply use PARALLEL_LV_CHUNK_SIZE in that case as the first chunk should have already been allocated. I'll remove it. Regards, [1] https://github.com/citusdata/citus/blob/main/src/backend/columnar/columnar_tableam.c#L1058 [2] https://github.com/orioledb/orioledb/blob/main/src/tableam/vacuum.c#L353 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
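To make the chunk-based block allocation being discussed easier to follow, here is a minimal, self-contained sketch of the scheme as described in this thread. All of the names below (SharedScanState, WorkerScanState, worker_next_block, FIXED_CHUNK_SIZE) are hypothetical stand-ins, not the patch's actual code; in the patch, PARALLEL_LV_CHUNK_SIZE, initial_chunk_size and the shared nallocated counter play the corresponding roles. Each worker claims a contiguous chunk of blocks from a shared atomic counter; the first chunk it claims may have a special size, after which it falls back to the fixed size.

    /* Hypothetical sketch only; not the patch's actual code. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define FIXED_CHUNK_SIZE 2048          /* stand-in for PARALLEL_LV_CHUNK_SIZE */
    #define INVALID_BLOCK    UINT32_MAX    /* returned when the scan is finished */

    typedef struct SharedScanState
    {
        _Atomic uint32_t nallocated;       /* next block not yet handed out */
        uint32_t nblocks;                  /* relation size in blocks */
        uint32_t initial_chunk_size;       /* first chunk may differ in size */
    } SharedScanState;

    typedef struct WorkerScanState
    {
        uint32_t cur_block;                /* next block to return from the chunk */
        uint32_t last_block;               /* one past the end of the chunk */
        uint32_t chunk_size;               /* size to request for the next chunk */
    } WorkerScanState;

    static void
    worker_scan_init(const SharedScanState *shared, WorkerScanState *ws)
    {
        ws->cur_block = 0;
        ws->last_block = 0;                            /* forces a chunk claim */
        ws->chunk_size = shared->initial_chunk_size;   /* first chunk may be special */
    }

    /* Return the next block this worker should process, or INVALID_BLOCK. */
    static uint32_t
    worker_next_block(SharedScanState *shared, WorkerScanState *ws)
    {
        if (ws->cur_block >= ws->last_block)
        {
            /* Claim the next chunk of blocks from the shared counter. */
            uint32_t start = atomic_fetch_add(&shared->nallocated, ws->chunk_size);

            if (start >= shared->nblocks)
                return INVALID_BLOCK;                  /* no blocks left */

            ws->cur_block = start;
            ws->last_block = start + ws->chunk_size;
            if (ws->last_block > shared->nblocks)
                ws->last_block = shared->nblocks;

            /* After the first chunk, fall back to the fixed chunk size. */
            ws->chunk_size = FIXED_CHUNK_SIZE;
        }

        return ws->cur_block++;
    }

Under this shape, only the shared counter needs coordination; the per-worker cursor and the chunk size it will request next are purely local, which is why the question of keeping chunk_size in shared memory comes up above.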
Hi, On 2025-07-21 12:41:49 -0700, Masahiko Sawada wrote: > The reason why I added some callbacks as table AM callbacks in the > patch is that I could not find other better places. Currently, > vacuumparallel.c handles several critical operations for parallel > vacuuming: allocating and initializing DSM space for parallel index > vacuuming, initializing and launching parallel workers, and managing > various auxiliary tasks such as configuring vacuum delays and setting > maintenance_work_mem for workers. Given these existing > functionalities, I aimed to implement the parallel heap vacuum worker > launch through the same vacuumparallel.c codebase, maintaining a > consistent interface. To achieve this integration, vacuumparallel.c > requires access to heap-specific functions, and defining them as table > AM callbacks emerged as the cleanest solution. I don't think that can be the right solution. Vacuuming is a table-am specific operation and thus already happens within a table-am's own code. It would be rather wrong to have tableam.h indirected calls below heapam specific code. That is not to say you can't have callbacks or such, it just doesn't make sense for those callbacks to be at the level of tableam. If you want to make vacuumparallel support parallel table vacuuming for multiple table AMs (I'm somewhat doubtful that's a good idea), you could do that by having a vacuumparallel.c specific callback struct. Greetings, Andres Freund
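For concreteness, here is a rough sketch of what a vacuumparallel.c-specific callback struct along these lines could look like. Everything in it is hypothetical (the names, members, and signatures are simplified to plain C types for illustration); it is only meant to show the shape of such an interface, not to propose the actual one.

    #include <stddef.h>

    /* Hypothetical sketch only; none of these names exist in the tree. */
    typedef struct ParallelTableVacuumCallbacks
    {
        /* How much DSM space the table AM wants for its parallel scan state. */
        size_t (*estimate_dsm)(void *rel, int nworkers);

        /* Initialize the table AM's portion of the DSM segment. */
        void (*initialize_dsm)(void *rel, void *table_dsm, int nworkers);

        /* Per-worker routine for the "collect dead items" (phase I) work. */
        void (*collect_dead_items)(void *rel, void *table_dsm, void *dead_items);

        /* Optional per-worker routine for the second table pass (phase III). */
        void (*vacuum_table)(void *rel, void *table_dsm, void *dead_items);
    } ParallelTableVacuumCallbacks;

A table AM would hand such a struct to parallel_vacuum_init() (or a new variant of it), and vacuumparallel.c would invoke the members from its worker entry point, without tableam.h having to know about any of it.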
On Mon, Jul 21, 2025 at 4:49 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2025-07-21 12:41:49 -0700, Masahiko Sawada wrote: > > The reason why I added some callbacks as table AM callbacks in the > > patch is that I could not find other better places. Currently, > > vacuumparallel.c handles several critical operations for parallel > > vacuuming: allocating and initializing DSM space for parallel index > > vacuuming, initializing and launching parallel workers, and managing > > various auxiliary tasks such as configuring vacuum delays and setting > > maintenance_work_mem for workers. Given these existing > > functionalities, I aimed to implement the parallel heap vacuum worker > > launch through the same vacuumparallel.c codebase, maintaining a > > consistent interface. To achieve this integration, vacuumparallel.c > > requires access to heap-specific functions, and defining them as table > > AM callbacks emerged as the cleanest solution. > > I don't think that can be the right solution. Vacuuming is a table-am specific > operation and thus already happens within a table-am's own code. It would be > rather wrong to have tableam.h indirected calls below heapam specific code. > > That is not to say you can't have callbacks or such, it just doesn't make > sense for those callbacks to be at the level of tableam. If you want to make > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > somewhat doubtful that's a good idea), you could do that by having a > vacuumparallel.c specific callback struct. Thank you for the comments! Do you think it makes sense to implement the above idea that we launch parallel vacuum workers for heap through the same vacuumparallel.c codebase and maintain the consistent interface with parallel index vacuuming APIs? I've considered an idea of implementing parallel heap vacuum independent of vacuumparallel.c, but I find some difficulties and duplications. For instance, vacuumparallel.c already does many operations like initializing DSM space, setting vacuum delays, and launching parallel vacuum workers. Probably we can expose ParallelVacuumContext so that vacuumlazy.c can use it to launch parallel workers for heap vacuuming, but I'm not sure it's a good idea. We also need to expose other works that vacuumparallel.c does such as setting ring_buffer size and cost-based vacuum delay too and vacuumlazy.c needs to use them appropriately. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > Do you think it makes sense to implement the above idea that we launch > parallel vacuum workers for heap through the same vacuumparallel.c > codebase and maintain the consistent interface with parallel index > vacuuming APIs? Yes, that might make sense. But wiring it up via tableam doesn't make sense. Greetings, Andres Freund
On Mon, Jul 21, 2025 at 7:49 PM Andres Freund <andres@anarazel.de> wrote: > That is not to say you can't have callbacks or such, it just doesn't make > sense for those callbacks to be at the level of tableam. If you want to make > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > somewhat doubtful that's a good idea), you could do that by having a > vacuumparallel.c specific callback struct. I'm not doubtful that this is a good idea. There are a number of tableams around these days that are "heap except whatever", where (I suspect) the size of "whatever" ranges from quite small to moderately large. I imagine that such efforts end up duplicating a lot of heapam code and probably always will; but if we can avoid increasing that amount, I think it's a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jul 25, 2025 at 10:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 21, 2025 at 7:49 PM Andres Freund <andres@anarazel.de> wrote: > > That is not to say you can't have callbacks or such, it just doesn't make > > sense for those callbacks to be at the level of tableam. If you want to make > > vacuumparallel support parallel table vacuuming for multiple table AMs (I'm > > somewhat doubtful that's a good idea), you could do that by having a > > vacuumparallel.c specific callback struct. > > I'm not doubtful that this is a good idea. There are a number of > tableams around these days that are "heap except whatever", where (I > suspect) the size of "whatever" ranges from quite small to moderately > large. I imagine that such efforts end up duplicating a lot of heapam > code and probably always will; but if we can avoid increasing that > amount, I think it's a good idea. Based on our previous discussions, I have contemplated modifying the patch to define the callback functions for parallel table vacuum within a structure in vacuumparallel.c instead of using table AM's callbacks. This approach would require the leader to pass both the library name and handler function name to vacuumparallel.c, enabling workers to locate the handler function via load_external_function() and access the callback functions. Although it's technically feasible, I'm not sure that the design is elegant; while table AM seeking to use parallel index vacuuming can simply invoke parallel_vacuum_init(), those requiring parallel table vacuum would need to both provide the handler function and supply the library and handler function names. An alternative, more straightforward approach would be to implement a dedicated ParallelContext in vacuumlazy.c specifically for parallel heap vacuum, distinct from the ParallelVacuumContext in vacuumparallel.c. Under this design, vacuumparallel.c would be exclusively dedicated to parallel index vacuuming, while the parallel heap vacuum implementation would be contained within vacuumlazy.c, eliminating the need for a handler function. While this solution seems more elegant, it would result in code duplication between vacuumlazy.c and vacuumparallel.c, particularly in areas concerning worker initialization and cost-based delays. At present, I am inclined toward the latter solution, though I have yet to implement it. I welcome any feedback on these approaches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
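To illustrate the first approach mentioned above (the leader passes a library name and handler function name so that workers can look up the callbacks themselves), here is a self-contained sketch. In PostgreSQL the lookup would go through load_external_function(); plain dlopen()/dlsym() is used below only so the example compiles on its own, and all other names are hypothetical.

    #include <dlfcn.h>
    #include <stddef.h>

    /* Opaque callback struct provided by the table AM (see the earlier sketch). */
    typedef struct ParallelTableVacuumCallbacks ParallelTableVacuumCallbacks;

    /* The table AM exports one "handler" symbol that returns its callbacks. */
    typedef const ParallelTableVacuumCallbacks *(*ParallelTableVacuumHandler)(void);

    /*
     * Worker side: the leader put the library and handler names into the DSM;
     * the worker resolves the handler and asks it for the callback struct.
     * The handle is intentionally kept open for the life of the worker.
     */
    static const ParallelTableVacuumCallbacks *
    resolve_table_vacuum_callbacks(const char *library, const char *funcname)
    {
        void *handle = dlopen(library, RTLD_NOW);

        if (handle == NULL)
            return NULL;

        /* Casting an object pointer to a function pointer is the usual POSIX idiom. */
        ParallelTableVacuumHandler handler =
            (ParallelTableVacuumHandler) dlsym(handle, funcname);

        return handler ? handler() : NULL;
    }

This is the part that makes the first approach feel heavier for callers: a table AM that only wants parallel index vacuuming never needs such a handler, while one that wants parallel table vacuum has to provide both the handler and the names used to find it.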
On Wed, Jul 23, 2025 at 12:06 PM Andres Freund <andres@anarazel.de> wrote: > > On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > > Do you think it makes sense to implement the above idea that we launch > > parallel vacuum workers for heap through the same vacuumparallel.c > > codebase and maintain the consistent interface with parallel index > > vacuuming APIs? > > Yes, that might make sense. But wiring it up via tableam doesn't make sense. If you do parallel worker setup below heap_vacuum_rel(), then how are you supposed to use those workers to do non-heap table vacuuming? All the code in vacuumparallel.c is invoked from below lazy_scan_heap(), so I don't see how having a vacuumparallel.c-specific callback struct solves the layering violation. It seems like parallel index vacuuming setup would have to be done in vacuum_rel() if we want to reuse the same parallel workers for the table vacuuming and index vacuuming phases and allow for different table AMs to vacuum the tables in their own way using these parallel workers. - Melanie
On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jul 23, 2025 at 12:06 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2025-07-22 11:44:29 -0700, Masahiko Sawada wrote: > > > Do you think it makes sense to implement the above idea that we launch > > > parallel vacuum workers for heap through the same vacuumparallel.c > > > codebase and maintain the consistent interface with parallel index > > > vacuuming APIs? > > > > Yes, that might make sense. But wiring it up via tableam doesn't make sense. > > If you do parallel worker setup below heap_vacuum_rel(), then how are > you supposed to use those workers to do non-heap table vacuuming? IIUC non-heap tables can call parallel_vacuum_init() in its relation_vacuum table AM callback implementation in order to initialize parallel table vacuum, parallel index vacuum, or both. > > All the code in vacuumparallel.c is invoked from below > lazy_scan_heap(), so I don't see how having a > vacuumparallel.c-specific callback struct solves the layering > violation. I think the layering problem we discussed is about where the callbacks are declared; it seems odd that we add new table AM callbacks that are invoked only by another table AM callback. IIUC we invoke all codes in vacuumparallel.c in vacuumlazy.c even today. If we think we think this design has a problem in terms of layering of functions, we can refactor it as a separate patch. > It seems like parallel index vacuuming setup would have to be done in > vacuum_rel() if we want to reuse the same parallel workers for the > table vacuuming and index vacuuming phases and allow for different > table AMs to vacuum the tables in their own way using these parallel > workers. Hmm, let me clarify your idea as I'm confused. If the parallel context used for both table vacuuming and index vacuuming is set up in vacuum_rel(), its DSM would need to have some data too required by table AM to do parallel table vacuuming. In order to do that, table AMs somehow need to tell the necessary DSM size at least. How do table AMs tell that to parallel vacuum initialization function (e.g., parallel_vacuum_init()) in vacuum_rel()? Also, if we set up the parallel context in vacuum_rel(), we would somehow need to pass it to the relation_vacuum table AM callback so that they can use it during their own vacuum operation. Do you mean to pass it via table_relation_vacuum()? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > If you do parallel worker setup below heap_vacuum_rel(), then how are > > you supposed to use those workers to do non-heap table vacuuming? > > IIUC non-heap tables can call parallel_vacuum_init() in its > relation_vacuum table AM callback implementation in order to > initialize parallel table vacuum, parallel index vacuum, or both. Yep, it's just that that is more like a library helper function. Currently, in master, the flow is: leader does: heap_vacuum_rel() dead_items_alloc() parallel_vacuum_init() passes parallel_vacuum_main() to parallel context lazy_scan_heap() ... parallel_vacuum_process_all_indexes() LaunchParallelWorkers() ... parallel_vacuum_process_[un]safe_indexes() For the leader, I don't remember where your patch moves LaunchParallelWorkers() -- since it would have to move to before the first heap vacuuming phase. When I was imagining what made the most sense to make parallelism agnostic of heap, it seemed like it would be moving LaunchParallelWorkers() above heap_vacuum_rel(). But I recognize that is a pretty big change, and I haven't tried it, so I don't know all the complications. (sounds like you mention some below). Basically, right now in master, we expose some functions for parallel index vacuuming that people can call in their own table implementation of table vacuuming, but we don't invoke any of them in table-AM agnostic code. That may work fine for just index vacuuming, but now that we are trying to make parts of the table vacuuming itself extensible, the interface feels more awkward. I won't push the idea further because I'm not even sure if it works, but the thing that felt the most natural to me was pushing parallelism above the table AM. > > All the code in vacuumparallel.c is invoked from below > > lazy_scan_heap(), so I don't see how having a > > vacuumparallel.c-specific callback struct solves the layering > > violation. > > I think the layering problem we discussed is about where the callbacks > are declared; it seems odd that we add new table AM callbacks that are > invoked only by another table AM callback. IIUC we invoke all codes in > vacuumparallel.c in vacuumlazy.c even today. If we think we think this > design has a problem in terms of layering of functions, we can > refactor it as a separate patch. Sure, but we are doubling down on it. I think parallel heap vacuuming would have a completely different interface if that layering worked differently. But, again, I don't want to put this additional, extremely complex barrier in the way. I brought it up because it's what occurred to me when reviewing this. But, I think it's fine to look for an achievable compromise. > > It seems like parallel index vacuuming setup would have to be done in > > vacuum_rel() if we want to reuse the same parallel workers for the > > table vacuuming and index vacuuming phases and allow for different > > table AMs to vacuum the tables in their own way using these parallel > > workers. > > Hmm, let me clarify your idea as I'm confused. If the parallel > context used for both table vacuuming and index vacuuming is set up in > vacuum_rel(), its DSM would need to have some data too required by > table AM to do parallel table vacuuming. In order to do that, table > AMs somehow need to tell the necessary DSM size at least. 
How do table > AMs tell that to parallel vacuum initialization function (e.g., > parallel_vacuum_init()) in vacuum_rel()? Right, you'd have to gather more information sooner (number of indexes, etc). > Also, if we set up the parallel context in vacuum_rel(), we would > somehow need to pass it to the relation_vacuum table AM callback so > that they can use it during their own vacuum operation. Do you mean to > pass it via table_relation_vacuum()? Yes, something like that. But, that would require introducing more knowledge of index vacuuming into that layer. I chatted with Andres about this a bit off-list and he suggested passing a callback for vacuuming the table to parallel_vacuum_init(). I don't know exactly how this would play with parallel_vacuum_main()'s current functionality, though. I think this is a similar approach to what you mentioned in your earlier comment > Based on our previous discussions, I have contemplated modifying the > patch to define the callback functions for parallel table vacuum > within a structure in vacuumparallel.c instead of using table AM's > callbacks. This approach would require the leader to pass both the > library name and handler function name to vacuumparallel.c, enabling > workers to locate the handler function via load_external_function() > and access the callback functions. Although it's technically feasible, > I'm not sure that the design is elegant; while table AM seeking to use > parallel index vacuuming can simply invoke parallel_vacuum_init(), > those requiring parallel table vacuum would need to both provide the > handler function and supply the library and handler function names. Are there other drawbacks to requiring the library and handler function names? I'd find it a bit awkward to pass ParallelWorkerMain as the bgw_libary/function name in LaunchParallelWorkers and then have another point where you pass other library and function names, but I'm not really sure how to resolve that. In your patch, for the workers, I know you added a state machine to parallel_vacuum_main() so that it would do table and index vacuuming. That makes sense to me. But doing it this way seems like it is limiting extensibility. But finding somewhere to pass the callback that doesn't feel like an awkward addition to the existing function passed to parallel workers is also hard. - Melanie
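As a side note on the state-machine point, here is a minimal, self-contained sketch of the general shape being described: workers sit in a loop and dispatch on a phase value that the leader advances. The phase names, the polling loop, and the stub work functions are hypothetical simplifications (a real implementation would block on a barrier or condition variable and call into the table/index AM code for each phase).

    #include <stdatomic.h>

    /* Hypothetical phases; the real patch's naming and set of phases may differ. */
    typedef enum
    {
        PV_PHASE_IDLE,
        PV_PHASE_COLLECT_DEAD_ITEMS,   /* scan the table, remember dead TIDs */
        PV_PHASE_VACUUM_INDEXES,       /* bulk-delete the collected TIDs */
        PV_PHASE_VACUUM_TABLE,         /* second table pass */
        PV_PHASE_DONE
    } ParallelVacuumPhase;

    typedef struct
    {
        _Atomic int phase;             /* current phase, advanced by the leader */
    } ParallelVacuumShared;

    /* Stub work routines; a real worker would call the AM callbacks here. */
    static void collect_dead_items(void) { }
    static void vacuum_indexes(void) { }
    static void vacuum_table(void) { }

    static void
    parallel_vacuum_worker_loop(ParallelVacuumShared *shared)
    {
        int last_seen = PV_PHASE_IDLE;

        for (;;)
        {
            int phase = atomic_load(&shared->phase);

            if (phase == last_seen)
                continue;              /* placeholder for waiting on the leader */
            last_seen = phase;

            switch (phase)
            {
                case PV_PHASE_COLLECT_DEAD_ITEMS:
                    collect_dead_items();
                    break;
                case PV_PHASE_VACUUM_INDEXES:
                    vacuum_indexes();
                    break;
                case PV_PHASE_VACUUM_TABLE:
                    vacuum_table();
                    break;
                case PV_PHASE_DONE:
                    return;            /* all phases finished, worker exits */
                default:
                    break;
            }
        }
    }

The extensibility question above is essentially whether the bodies of those case arms are hard-coded for heap or looked up through callbacks supplied by the table AM.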
On 9/8/25 17:40, Melanie Plageman wrote: > On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman >> <melanieplageman@gmail.com> wrote: >> >>> If you do parallel worker setup below heap_vacuum_rel(), then how are >>> you supposed to use those workers to do non-heap table vacuuming? >> >> IIUC non-heap tables can call parallel_vacuum_init() in its >> relation_vacuum table AM callback implementation in order to >> initialize parallel table vacuum, parallel index vacuum, or both. > > Yep, it's just that that is more like a library helper function. > Currently, in master, the flow is: > > leader does: > heap_vacuum_rel() > dead_items_alloc() > parallel_vacuum_init() > passes parallel_vacuum_main() to parallel context > lazy_scan_heap() > ... > parallel_vacuum_process_all_indexes() > LaunchParallelWorkers() > ... > parallel_vacuum_process_[un]safe_indexes() > > For the leader, I don't remember where your patch moves > LaunchParallelWorkers() -- since it would have to move to before the > first heap vacuuming phase. When I was imagining what made the most > sense to make parallelism agnostic of heap, it seemed like it would be > moving LaunchParallelWorkers() above heap_vacuum_rel(). But I > recognize that is a pretty big change, and I haven't tried it, so I > don't know all the complications. (sounds like you mention some > below). > > Basically, right now in master, we expose some functions for parallel > index vacuuming that people can call in their own table implementation > of table vacuuming, but we don't invoke any of them in table-AM > agnostic code. That may work fine for just index vacuuming, but now > that we are trying to make parts of the table vacuuming itself > extensible, the interface feels more awkward. > > I won't push the idea further because I'm not even sure if it works, > but the thing that felt the most natural to me was pushing parallelism > above the table AM. > > ... I took a quick look at the patch this week. I don't have a very strong opinion on the changes to table AM API, and I somewhat agree with this impression. It's not clear to me why we should be adding callbacks that are AM-specific (and only ever called from that particular AM) to the common AM interface. I keep thinking about how we handle parallelism in index builds. The index AM API did not get a bunch of new callbacks, it's all handled within the existing ambuild() callback. Shouldn't we be doing something like that for relation_vacuum()? That is, each AM-specific relation_vacuum() would be responsible for starting / stopping workers and all that. relation_vacuum would need some flag indicating it's OK to use parallelism, of course. We might add some shared infrastructure to make that more convenient, of course. This is roughly how index builds handle parallelism. It's not perfect (e.g. the plan_create_index_workers can do unexpected stuff for non-btree indexes). Interestingly enough, this seems to be the exact opposite of what Melanie proposed above, i.e. moving the parallelism "above" the table AM. Which AFAICS means we'd keep the new table AM callbacks, but move the calls "up" above the AM-specific parts. That should be possible, and maybe even cleaner. But ISTM it'd require a much more extensive refactoring of how vacuum works. Haven't tried, ofc. I also repeated the stress test / benchmark, measuring impact with different number of indexes, amount of deleted tuples, etc. 
Attached is a PDF summarizing the results for each part of the patch series (with different numbers of workers). For these tests, the improvements are significant only with no indexes - then the 0004 patch saves 30-50%. But as soon as indexes are added, the index cleanup starts to dominate. It's just an assessment, I'm not saying we shouldn't parallelize the heap cleanup. I assume there are workloads for which the patches will make much more of a difference. But, what would such cases look like? If I want to maximize the impact of this patch series, what should I do? FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very extensive, so I did that (in order to run the tests); attached is the result as v19. This also reminds me that I've heard a couple of complaints that we don't allow parallelism in autovacuum. We have parallel index vacuum, but it's disabled in autovacuum (and users really don't want to run manual vacuum). That question is almost entirely orthogonal to this patch, of course. I'm not suggesting this patch has to (or even should) do anything about it. But I wonder if you have any thoughts on that? I believe the reason why parallelism is disabled in autovacuum is that we want autovacuum to be a background process, with minimal disruption to user workload. It probably wouldn't be that hard to allow autovacuum to do parallel stuff, but it feels similar to adding autovacuum workers. That's rarely the solution, without increasing the cost limit. regards -- Tomas Vondra
Attachments
- v19-0005-Add-more-parallel-vacuum-tests.patch
- v19-0004-Support-parallelism-for-collecting-dead-items-du.patch
- v19-0003-Move-lazy-heap-scan-related-variables-to-new-str.patch
- v19-0002-vacuumparallel.c-Support-parallel-vacuuming-for-.patch
- v19-0001-Introduces-table-AM-APIs-for-parallel-table-vacu.patch
- create.sql
- parallel-vacuum-test.sh
- parallel-heap-vacuum.pdf
On Wed, Sep 17, 2025 at 7:25 AM Tomas Vondra <tomas@vondra.me> wrote: > I took a quick look at the patch this week. I don't have a very strong > opinion on the changes to table AM API, and I somewhat agree with this > impression. It's not clear to me why we should be adding callbacks that > are AM-specific (and only ever called from that particular AM) to the > common AM interface. We clearly should not do that. > I keep thinking about how we handle parallelism in index builds. The > index AM API did not get a bunch of new callbacks, it's all handled > within the existing ambuild() callback. Shouldn't we be doing something > like that for relation_vacuum()? I have a feeling that we might have made the wrong decision there. That approach will probably require a good deal of code to be duplicated for each AM. I'm not sure what the final solution should look like here, but we want the common parts like worker setup to use common code, while allowing each AM to insert its own logic in the places where that is needed. The challenge in my view is to figure out how best to arrange things so as to make that possible. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/17/25 18:01, Robert Haas wrote: > On Wed, Sep 17, 2025 at 7:25 AM Tomas Vondra <tomas@vondra.me> wrote: >> I took a quick look at the patch this week. I don't have a very strong >> opinion on the changes to table AM API, and I somewhat agree with this >> impression. It's not clear to me why we should be adding callbacks that >> are AM-specific (and only ever called from that particular AM) to the >> common AM interface. > > We clearly should not do that. > >> I keep thinking about how we handle parallelism in index builds. The >> index AM API did not get a bunch of new callbacks, it's all handled >> within the existing ambuild() callback. Shouldn't we be doing something >> like that for relation_vacuum()? > > I have a feeling that we might have made the wrong decision there. > That approach will probably require a good deal of code to be > duplicated for each AM. I'm not sure what the final solution should > look like here, but we want the common parts like worker setup to use > common code, while allowing each AM to insert its own logic in the > places where that is needed. The challenge in my view is to figure out > how best to arrange things so as to make that possible. > But a lot of the parallel-mode setup is already wrapped in some API (for example LaunchParallelWorkers, WaitForParallelWorkersToAttach, CreateParallelContext, ...). I guess we might "invert" how the parallel builds work - invent a set of callbacks / API an index AM would need to implement to support parallel builds. And then those callbacks would be called from a single "parallel index build" routine. But I don't think there's a lot of duplicated code, at least based on my experience with implementing parallel builds for BRIN and GIN. Look at the BRIN code, for example. Most of the parallel stuff happens in _brin_begin_parallel. Maybe more of it could be generalized a bit more (some of the shmem setup?). But most of it is tied to the AM-specific state / how parallel builds work for that particular AM. regards -- Tomas Vondra
On Wed, Sep 17, 2025 at 12:23 PM Tomas Vondra <tomas@vondra.me> wrote: > Look at the BRIN code, for example. Most of the parallel stuff happens > in _brin_begin_parallel. Maybe more of it could be generalized a bit > more (some of the shmem setup?). But most of it is tied to the > AM-specific state / how parallel builds work for that particular AM. Well, the code for PARALLEL_KEY_WAL_USAGE, PARALLEL_KEY_BUFFER_USAGE, and PARALLEL_KEY_QUERY_TEXT is duplicated, for instance. That's not a ton of code, perhaps, but it may evolve over time, and having to keep copies for a bunch of different AMs in sync is not ideal. -- Robert Haas EDB: http://www.enterprisedb.com
On 9/17/25 18:32, Robert Haas wrote: > On Wed, Sep 17, 2025 at 12:23 PM Tomas Vondra <tomas@vondra.me> wrote: >> Look at the BRIN code, for example. Most of the parallel stuff happens >> in _brin_begin_parallel. Maybe more of it could be generalized a bit >> more (some of the shmem setup?). But most of it is tied to the >> AM-specific state / how parallel builds work for that particular AM. > > Well, the code for PARALLEL_KEY_WAL_USAGE, PARALLEL_KEY_BUFFER_USAGE, > and PARALLEL_KEY_QUERY_TEXT is duplicated, for instance. That's not a > ton of code, perhaps, but it may evolve over time, and having to keep > copies for a bunch of different AMs in sync is not ideal. > True. And I agree it's not great it might break if we need to setup the wal/buffer usage tracking a bit differently (and forget to update all the places, or even can't do that in custom AMs). I suppose we could wrap that in a helper, and call that? That's what I meant by "maybe we could generalize the shmem setup". The alternative would be to have a single AM-agnostic place doing parallel builds with any index AM, and calls "AM callbacks" instead. AFAICS that's pretty much how Melanie imagines the parallel vacuum to work (at least that's how I understand it). I'm not sure which way is better. I'm terrible in designing APIs. For the parallel heap vacuum, the patches seem to be a bit 50:50. The per-AM callbacks are there, but each AM still has to do the custom code anyway. -- Tomas Vondra
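To make the "wrap it in a helper" idea a bit more concrete, here is a self-contained sketch of the pattern that the duplicated PARALLEL_KEY_WAL_USAGE / PARALLEL_KEY_BUFFER_USAGE code follows: one counter slot per worker in shared memory, each worker adds its local counts into its own slot on exit, and the leader folds all slots into its totals afterwards. The struct and function names below are hypothetical stand-ins for the real BufferUsage/WalUsage fields and the InstrAccumParallelQuery()-style accumulation, and plain calloc() stands in for the DSM/shm_toc allocation.

    #include <stdlib.h>

    /* Hypothetical stand-in for the per-worker BufferUsage/WalUsage counters. */
    typedef struct UsageCounters
    {
        long blocks_hit;
        long blocks_read;
        long wal_records;
        long wal_bytes;
    } UsageCounters;

    /* Leader: reserve one slot per worker (in the real code, DSM via shm_toc). */
    static UsageCounters *
    usage_array_create(int nworkers)
    {
        return calloc((size_t) nworkers, sizeof(UsageCounters));
    }

    /* Worker: fold its local counters into its own slot before exiting. */
    static void
    usage_report(UsageCounters *slots, int worker_number, const UsageCounters *local)
    {
        slots[worker_number].blocks_hit  += local->blocks_hit;
        slots[worker_number].blocks_read += local->blocks_read;
        slots[worker_number].wal_records += local->wal_records;
        slots[worker_number].wal_bytes   += local->wal_bytes;
    }

    /* Leader: accumulate every worker's slot once they have all exited. */
    static void
    usage_accumulate(UsageCounters *total, const UsageCounters *slots, int nworkers)
    {
        for (int i = 0; i < nworkers; i++)
        {
            total->blocks_hit  += slots[i].blocks_hit;
            total->blocks_read += slots[i].blocks_read;
            total->wal_records += slots[i].wal_records;
            total->wal_bytes   += slots[i].wal_bytes;
        }
    }

A helper with roughly this shape (estimate and initialize on the leader side, report on the worker side, accumulate on the leader side) could in principle be shared by parallel index builds, parallel vacuum, and future callers, instead of each AM repeating the toc bookkeeping.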
On Wed, Sep 17, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote: > True. And I agree it's not great it might break if we need to setup the > wal/buffer usage tracking a bit differently (and forget to update all > the places, or even can't do that in custom AMs). Exactly. > I suppose we could wrap that in a helper, and call that? That's what I > meant by "maybe we could generalize the shmem setup". Yep, that's one possibility. > The alternative would be to have a single AM-agnostic place doing > parallel builds with any index AM, and calls "AM callbacks" instead. > AFAICS that's pretty much how Melanie imagines the parallel vacuum to > work (at least that's how I understand it). And that's the other possibility. > I'm not sure which way is better. I'm terrible in designing APIs. I'm not sure either. I don't think I'm terrible at designing APIs (and you probably aren't either) but I don't know enough about this particular problem space to be certain what's best. > For the parallel heap vacuum, the patches seem to be a bit 50:50. The > per-AM callbacks are there, but each AM still has to do the custom code > anyway. Unless I misunderstand, that seems like the worst of all possible situations. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 17, 2025 at 4:25 AM Tomas Vondra <tomas@vondra.me> wrote: > > On 9/8/25 17:40, Melanie Plageman wrote: > > On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> > >> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman > >> <melanieplageman@gmail.com> wrote: > >> > >>> If you do parallel worker setup below heap_vacuum_rel(), then how are > >>> you supposed to use those workers to do non-heap table vacuuming? > >> > >> IIUC non-heap tables can call parallel_vacuum_init() in its > >> relation_vacuum table AM callback implementation in order to > >> initialize parallel table vacuum, parallel index vacuum, or both. > > > > Yep, it's just that that is more like a library helper function. > > Currently, in master, the flow is: > > > > leader does: > > heap_vacuum_rel() > > dead_items_alloc() > > parallel_vacuum_init() > > passes parallel_vacuum_main() to parallel context > > lazy_scan_heap() > > ... > > parallel_vacuum_process_all_indexes() > > LaunchParallelWorkers() > > ... > > parallel_vacuum_process_[un]safe_indexes() > > > > For the leader, I don't remember where your patch moves > > LaunchParallelWorkers() -- since it would have to move to before the > > first heap vacuuming phase. When I was imagining what made the most > > sense to make parallelism agnostic of heap, it seemed like it would be > > moving LaunchParallelWorkers() above heap_vacuum_rel(). But I > > recognize that is a pretty big change, and I haven't tried it, so I > > don't know all the complications. (sounds like you mention some > > below). > > > > Basically, right now in master, we expose some functions for parallel > > index vacuuming that people can call in their own table implementation > > of table vacuuming, but we don't invoke any of them in table-AM > > agnostic code. That may work fine for just index vacuuming, but now > > that we are trying to make parts of the table vacuuming itself > > extensible, the interface feels more awkward. > > > > I won't push the idea further because I'm not even sure if it works, > > but the thing that felt the most natural to me was pushing parallelism > > above the table AM. > > > > ... > > I took a quick look at the patch this week. I don't have a very strong > opinion on the changes to table AM API, and I somewhat agree with this > impression. It's not clear to me why we should be adding callbacks that > are AM-specific (and only ever called from that particular AM) to the > common AM interface. I agree that we should not add AM-specific callbacks to the AM interface. > I keep thinking about how we handle parallelism in index builds. The > index AM API did not get a bunch of new callbacks, it's all handled > within the existing ambuild() callback. Shouldn't we be doing something > like that for relation_vacuum()? > > That is, each AM-specific relation_vacuum() would be responsible for > starting / stopping workers and all that. relation_vacuum would need > some flag indicating it's OK to use parallelism, of course. We might add > some shared infrastructure to make that more convenient, of course. > > This is roughly how index builds handle parallelism. It's not perfect > (e.g. the plan_create_index_workers can do unexpected stuff for > non-btree indexes). In parallel vacuum cases, I believe we need to consider that vacuumparallel.c (particularly ParallelVacuumContext) effectively serves as a shared infrastructure. 
It handles several core functions, including the initialization of shared dead items storage (shared TidStore), worker management, buffer/WAL usage, vacuum-delay settings, and provides APIs for parallel index vacuuming/cleanup. If we were to implement parallel heap vacuum naively in vacuumlazy.c, we would likely end up with significant code duplication. To avoid this redundancy, my idea is to extend vacuumparallel.c to support parallel table vacuum. This approach would allow table AMs to use vacuumparallel.c for parallel table vacuuming, parallel index vacuuming, or both. While I agree that adding new table AM callbacks for parallel table vacuum to the table AM interface isn't a good idea, I think passing the necessary callbacks for parallel heap vacuum to vacuumparallel.c through parallel_vacuum_init() follows a similar principle. This means that all information needed for parallel table/index vacuuming would remain within vacuumparallel.c, and table AMs could trigger parallel table vacuum, parallel index vacuuming, and parallel index cleanup through the APIs provided by vacuumparallel.c. The alternative approach, as you suggested, would be to have the table AM's relation_vacuum() handle the parallel processing implementation while utilizing the infrastructure and APIs provided by vacuumparallel.c. Based on the feedback received, I'm now evaluating what shared information and helper functions would be necessary for this latter approach. > I also repeated the stress test / benchmark, measuring impact with > different number of indexes, amount of deleted tuples, etc. Attached is > a PDF summarizing the results for each part of the patch series (with > different number of workers). For this tests, the improvements are > significant only with no indexes - then the 0004 patch saves 30-50%. But > as soon as indexes are added, the index cleanup starts to dominate. > > It's just an assessment, I'm not saying we shouldn't parallelize the > heap cleanup. I assume there are workloads for which patches will make > much more difference. But, what would such cases look like? If I want to > maximize the impact of this patch series, what should I do? Another use case is executing vacuum without index cleanup for faster freezing XIDs on a large table. > > FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very > extensive, so I did that (in order to run the tests), attached is the > result as v19. Thank you. > This also reminds I heard a couple complaints we don't allow parallelism > in autovacuum. We have parallel index vacuum, but it's disabled in > autovacuum (and users really don't want to run manual vacuum). > > That question is almost entirely orthogonal to this patch, of course. > I'm not suggesting this patch has (or even should) to do anything about > it. But I wonder if you have any thoughts on that? > > I believe the reason why parallelism is disabled in autovacuum is that > we want autovacuum to be a background process, with minimal disruption > to user workload. It probably wouldn't be that hard to allow autovacuum > to do parallel stuff, but it feels similar to adding autovacuum workers. > That's rarely the solution, without increasing the cost limit. A patch enabling autovacuum to use parallel vacuum capabilities has already been proposed[1], and I've been reviewing it. There seems to be consensus that users would benefit from parallel index vacuuming during autovacuum in certain scenarios. 
As an initial implementation, parallel vacuum in autovacuum will be disabled by default, but users can enable it by specifying parallel degrees through a new storage option for specific tables where they want parallel vacuuming. For your information, while the implementation itself is relatively straightforward, we're still facing one unresolved issue; the system doesn't support reloading the configuration file during parallel mode, but it is necessary for autovacuum to update vacuum cost delay parameters[2]. Regards, [1] https://www.postgresql.org/message-id/CACG%3DezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw%3Dcw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAD21AoD6HhraqhOgkQJOrr0ixZkAZuqJRpzGv-B%2B_-ad6d5aPw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: > I believe the reason why parallelism is disabled in autovacuum is that > we want autovacuum to be a background process, with minimal disruption > to user workload. It probably wouldn't be that hard to allow autovacuum > to do parallel stuff, but it feels similar to adding autovacuum workers. > That's rarely the solution, without increasing the cost limit. I continue to find this argument extremely unconvincing. It's very common for autovacuum to be continuously busy with the one large table that has a bunch of indexes. Vacuuming that one table is what prevents the freeze horizon from moving forward / prevents getting out of anti-wraparound territory in time. Greetings, Andres Freund
Hi, On 2025-09-17 12:01:47 -0400, Robert Haas wrote: > > I keep thinking about how we handle parallelism in index builds. The > > index AM API did not get a bunch of new callbacks, it's all handled > > within the existing ambuild() callback. Shouldn't we be doing something > > like that for relation_vacuum()? > > I have a feeling that we might have made the wrong decision there. You might be right. > That approach will probably require a good deal of code to be > duplicated for each AM. I'm not sure what the final solution should > look like here, but we want the common parts like worker setup to use > common code, while allowing each AM to insert its own logic in the > places where that is needed. The challenge in my view is to figure out > how best to arrange things so as to make that possible. I've actually been thinking about the structure of the index build code recently. There have been a bunch of requests, one recently on the list, to build multiple indexes with one table scan - our current code structure would make that a pretty hard feature to develop. Greetings, Andres Freund
On 9/18/25 01:22, Andres Freund wrote: > Hi, > > On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: >> I believe the reason why parallelism is disabled in autovacuum is that >> we want autovacuum to be a background process, with minimal disruption >> to user workload. It probably wouldn't be that hard to allow autovacuum >> to do parallel stuff, but it feels similar to adding autovacuum workers. >> That's rarely the solution, without increasing the cost limit. > > I continue to find this argument extremely unconvincing. It's very common for > autovacuum to be continuously be busy with the one large table that has a > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > to move forward / prevents getting out of anti-wraparound territory in time. > OK. I'm not claiming the argument is correct, I was mostly asking if this was the argument for not allowing parallelism in autovacuum. I don't doubt an autovacuum worker can get "stuck" on a huge table, holding back the freeze horizon. But does it happen even with an increased cost limit? And is the bottleneck I/O or CPU? If it's vacuum_cost_limit, then the right "fix" is increasing the limit. Just adding workers improves nothing. If it's waiting on I/O, then adding workers is not going to help much. With a CPU bottleneck it might, though. Does that match the cases you saw? regards -- Tomas Vondra
On 9/18/25 01:18, Masahiko Sawada wrote: > On Wed, Sep 17, 2025 at 4:25 AM Tomas Vondra <tomas@vondra.me> wrote: >> >> On 9/8/25 17:40, Melanie Plageman wrote: >>> On Wed, Aug 27, 2025 at 2:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> >>>> On Tue, Aug 26, 2025 at 8:55 AM Melanie Plageman >>>> <melanieplageman@gmail.com> wrote: >>>> >>>>> If you do parallel worker setup below heap_vacuum_rel(), then how are >>>>> you supposed to use those workers to do non-heap table vacuuming? >>>> >>>> IIUC non-heap tables can call parallel_vacuum_init() in its >>>> relation_vacuum table AM callback implementation in order to >>>> initialize parallel table vacuum, parallel index vacuum, or both. >>> >>> Yep, it's just that that is more like a library helper function. >>> Currently, in master, the flow is: >>> >>> leader does: >>> heap_vacuum_rel() >>> dead_items_alloc() >>> parallel_vacuum_init() >>> passes parallel_vacuum_main() to parallel context >>> lazy_scan_heap() >>> ... >>> parallel_vacuum_process_all_indexes() >>> LaunchParallelWorkers() >>> ... >>> parallel_vacuum_process_[un]safe_indexes() >>> >>> For the leader, I don't remember where your patch moves >>> LaunchParallelWorkers() -- since it would have to move to before the >>> first heap vacuuming phase. When I was imagining what made the most >>> sense to make parallelism agnostic of heap, it seemed like it would be >>> moving LaunchParallelWorkers() above heap_vacuum_rel(). But I >>> recognize that is a pretty big change, and I haven't tried it, so I >>> don't know all the complications. (sounds like you mention some >>> below). >>> >>> Basically, right now in master, we expose some functions for parallel >>> index vacuuming that people can call in their own table implementation >>> of table vacuuming, but we don't invoke any of them in table-AM >>> agnostic code. That may work fine for just index vacuuming, but now >>> that we are trying to make parts of the table vacuuming itself >>> extensible, the interface feels more awkward. >>> >>> I won't push the idea further because I'm not even sure if it works, >>> but the thing that felt the most natural to me was pushing parallelism >>> above the table AM. >>> >>> ... >> >> I took a quick look at the patch this week. I don't have a very strong >> opinion on the changes to table AM API, and I somewhat agree with this >> impression. It's not clear to me why we should be adding callbacks that >> are AM-specific (and only ever called from that particular AM) to the >> common AM interface. > > I agree that we should not add AM-specific callbacks to the AM interface. > >> I keep thinking about how we handle parallelism in index builds. The >> index AM API did not get a bunch of new callbacks, it's all handled >> within the existing ambuild() callback. Shouldn't we be doing something >> like that for relation_vacuum()? >> >> That is, each AM-specific relation_vacuum() would be responsible for >> starting / stopping workers and all that. relation_vacuum would need >> some flag indicating it's OK to use parallelism, of course. We might add >> some shared infrastructure to make that more convenient, of course. >> >> This is roughly how index builds handle parallelism. It's not perfect >> (e.g. the plan_create_index_workers can do unexpected stuff for >> non-btree indexes). > > In parallel vacuum cases, I believe we need to consider that > vacuumparallel.c (particularly ParallelVacuumContext) effectively > serves as a shared infrastructure. 
It handles several core functions, > including the initialization of shared dead items storage (shared > TidStore), worker management, buffer/WAL usage, vacuum-delay settings, > and provides APIs for parallel index vacuuming/cleanup. If we were to > implement parallel heap vacuum naively in vacuumlazy.c, we would > likely end up with significant code duplication. > > To avoid this redundancy, my idea is to extend vacuumparallel.c to > support parallel table vacuum. This approach would allow table AMs to > use vacuumparallel.c for parallel table vacuuming, parallel index > vacuuming, or both. While I agree that adding new table AM callbacks > for parallel table vacuum to the table AM interface isn't a good idea, > I think passing the necessary callbacks for parallel heap vacuum to > vacuumparallel.c through parallel_vacuum_init() follows a similar > principle. This means that all information needed for parallel > table/index vacuuming would remain within vacuumparallel.c, and table > AMs could trigger parallel table vacuum, parallel index vacuuming, and > parallel index cleanup through the APIs provided by vacuumparallel.c. > > The alternative approach, as you suggested, would be to have the table > AM's relation_vacuum() handle the parallel processing implementation > while utilizing the infrastructure and APIs provided by > vacuumparallel.c. Based on the feedback received, I'm now evaluating > what shared information and helper functions would be necessary for > this latter approach. > OK >> I also repeated the stress test / benchmark, measuring impact with >> different number of indexes, amount of deleted tuples, etc. Attached is >> a PDF summarizing the results for each part of the patch series (with >> different number of workers). For this tests, the improvements are >> significant only with no indexes - then the 0004 patch saves 30-50%. But >> as soon as indexes are added, the index cleanup starts to dominate. >> >> It's just an assessment, I'm not saying we shouldn't parallelize the >> heap cleanup. I assume there are workloads for which patches will make >> much more difference. But, what would such cases look like? If I want to >> maximize the impact of this patch series, what should I do? > > Another use case is executing vacuum without index cleanup for faster > freezing XIDs on a large table. > Good point. I forgot about this case, where we skip index cleanup. >> >> FWIW the patch needs rebasing, there's a bit of bitrot. It wasn't very >> extensive, so I did that (in order to run the tests), attached is the >> result as v19. > > Thank you. > >> This also reminds I heard a couple complaints we don't allow parallelism >> in autovacuum. We have parallel index vacuum, but it's disabled in >> autovacuum (and users really don't want to run manual vacuum). >> >> That question is almost entirely orthogonal to this patch, of course. >> I'm not suggesting this patch has (or even should) to do anything about >> it. But I wonder if you have any thoughts on that? >> >> I believe the reason why parallelism is disabled in autovacuum is that >> we want autovacuum to be a background process, with minimal disruption >> to user workload. It probably wouldn't be that hard to allow autovacuum >> to do parallel stuff, but it feels similar to adding autovacuum workers. >> That's rarely the solution, without increasing the cost limit. > > A patch enabling autovacuum to use parallel vacuum capabilities has > already been proposed[1], and I've been reviewing it. 
There seems to > be consensus that users would benefit from parallel index vacuuming > during autovacuum in certain scenarios. As an initial implementation, > parallel vacuum in autovacuum will be disabled by default, but users > can enable it by specifying parallel degrees through a new storage > option for specific tables where they want parallel vacuuming. > Thanks, the v18 patch seems generally sensible, but I only skimmed it. I'll try to take a closer look ... I agree it probably makes sense to disable that by default. One thing that bothers me is that this makes predicting autovacuum behavior even more difficult (but that's not the fault of the patch). > For your information, while the implementation itself is relatively > straightforward, we're still facing one unresolved issue; the system > doesn't support reloading the configuration file during parallel mode, > but it is necessary for autovacuum to update vacuum cost delay > parameters[2]. > Hmmm, that's annoying :-( If it happens only for max_stack_depth (when setting it based on environment) maybe we could simply skip that when in parallel mode? But it's ugly, and we probably will have the same issue for any other GUC_ALLOW_IN_PARALLEL option in a file ... regards -- Tomas Vondra
Hi, On 2025-09-18 02:00:50 +0200, Tomas Vondra wrote: > On 9/18/25 01:22, Andres Freund wrote: > > Hi, > > > > On 2025-09-17 13:25:11 +0200, Tomas Vondra wrote: > >> I believe the reason why parallelism is disabled in autovacuum is that > >> we want autovacuum to be a background process, with minimal disruption > >> to user workload. It probably wouldn't be that hard to allow autovacuum > >> to do parallel stuff, but it feels similar to adding autovacuum workers. > >> That's rarely the solution, without increasing the cost limit. > > > > I continue to find this argument extremely unconvincing. It's very common for > > autovacuum to be continuously be busy with the one large table that has a > > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > > to move forward / prevents getting out of anti-wraparound territory in time. > > > > OK. I'm not claiming the argument is correct, I mostly asking if this > was the argument for not allowing parallelism in autovacuum. > I don't doubt an autovacuum worker can get "stuck" on a huge table, > holding back the freeze horizon. But does it happen even with an > increased cost limit? And is the bottleneck I/O or CPU? > If it's vacuum_cost_limit, then the right "fix" is increasing the limit. > Just adding workers improves nothing. If it's waiting on I/O, then > adding workers is not going to help much. With CPU bottleneck it might, > though. Does that match the cases you saw? Cost limit can definitely sometimes be the issue and indeed in that case increasing parallelism is the wrong answer. But I've repeatedly seen autovac being bottlenecked by CPU and/or I/O. Greetings, Andres Freund
On Wed, Sep 17, 2025 at 7:22 PM Andres Freund <andres@anarazel.de> wrote: > I continue to find this argument extremely unconvincing. It's very common for > autovacuum to be continuously be busy with the one large table that has a > bunch of indexes. Vacuuming that one table is what prevents the freeze horizon > to move forward / prevents getting out of anti-wraparound territory in time. The problem is that we're not smart enough to know whether this is the case or not. It's also fairly common for tables to get starved because all of the autovacuum workers are busy. Until we get a real scheduling system for autovacuum, I don't see this area improving very much. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Sep 17, 2025 at 5:24 PM Tomas Vondra <tomas@vondra.me> wrote: > > > > On 9/18/25 01:18, Masahiko Sawada wrote: > > For your information, while the implementation itself is relatively > > straightforward, we're still facing one unresolved issue; the system > > doesn't support reloading the configuration file during parallel mode, > > but it is necessary for autovacuum to update vacuum cost delay > > parameters[2]. > > > > Hmmm, that's annoying :-( > > If it happens only for max_stack_depth (when setting it based on > environment) maybe we could simply skip that when in parallel mode? But > it's ugly, and we probably will have the same issue for any other > GUC_ALLOW_IN_PARALLEL option in a file ... We have the same issue for all non-GUC_ALLOW_IN_PARALLEL parameters. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com