Thread: Fix parallel vacuum buffer usage reporting


Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
Hi,

With a database set up with pgbench, we add an additional index:
CREATE INDEX ON pgbench_accounts(abalance)

And trigger several updates and a vacuum to reach a stable number of
dirtied pages:
UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts

The vacuum will report the following:
INFO:  vacuuming "postgres.public.pgbench_accounts"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
...
buffer usage: 122 hits, 165 misses, 4 dirtied

4 pages were reported dirtied. However, we have 5 dirtied blocks at
the end of the vacuum when looking at pg_buffercache:

SELECT c.relname, b.relfilenode
FROM pg_buffercache b
LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE isdirty = true;
            relname            | relfilenode
-------------------------------+-------------
 pg_class                      |        1259
 pgbench_accounts              |       16400
 pgbench_accounts              |       16400
 pgbench_accounts_pkey         |       16408
 pgbench_accounts_abalance_idx |       16480

The missing dirty block comes from the parallel worker vacuuming the
abalance index. Running vacuum with parallel disabled will give the
correct result.

Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track
buffer usage. However, those values are not collected at the end of
parallel vacuum workers, leading to an incorrect buffer count.

Those vacuum-specific globals are redundant with the existing
pgBufferUsage and are only used in the verbose output. This patch
removes them and replaces them with pgBufferUsage, which is already
correctly collected at the end of parallel workers, fixing the buffer
count.

Regards,
Anthonin

Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Masahiko Sawada
Date:
Hi,

Thank you for the report.

On Fri, Feb 9, 2024 at 6:10 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> Hi,
>
> With a db setup with pgbench, we add an additional index:
> CREATE INDEX ON pgbench_accounts(abalance)
>
> And trigger several updates and vacuum to reach a stable amount of
> dirtied pages:
> UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
> VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts
>
> The vacuum will report the following:
> INFO:  vacuuming "postgres.public.pgbench_accounts"
> INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
> INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
> ...
> buffer usage: 122 hits, 165 misses, 4 dirtied
>
> 4 pages were reported dirtied. However, we have 5 dirtied blocks at
> the end of the vacuum when looking at pg_buffercache:
>
> SELECT c.relname, b.relfilenode
> FROM pg_buffercache b
> LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
> WHERE isdirty = true;
>             relname            | relfilenode
> -------------------------------+-------------
>  pg_class                      |        1259
>  pgbench_accounts              |       16400
>  pgbench_accounts              |       16400
>  pgbench_accounts_pkey         |       16408
>  pgbench_accounts_abalance_idx |       16480
>
> The missing dirty block comes from the parallel worker vacuuming the
> abalance index. Running vacuum with parallel disabled will give the
> correct result.
>
> Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track
> buffer usage. However, those values are not collected at the end of
> parallel vacuum workers, leading to incorrect buffer count.

True. I think we should also fix it on back branches.

>
> Those vacuum specific globals are redundant with the existing
> pgBufferUsage and only used in the verbose output. This patch removes
> them and replaces them by pgBufferUsage which is already correctly
> collected at the end of parallel workers, fixing the buffer count.

It seems to make sense to remove VacuumPageHit and friends, but only on
the master branch, if we can use BufferUsage instead.

As for the proposed patch, the following part should handle the temp tables too:

                        appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                                         read_rate, write_rate);
                        appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
-                                        (long long) AnalyzePageHit,
-                                        (long long) AnalyzePageMiss,
-                                        (long long) AnalyzePageDirty);
+                                        (long long) bufferusage.shared_blks_hit,
+                                        (long long) bufferusage.shared_blks_read,
+                                        (long long) bufferusage.shared_blks_dirtied);
                        appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
Hi, 

Thanks for the review.
 
On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
As for the proposed patch, the following part should handle the temp tables too:

True, I've missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a specific output for reporting local blocks instead of combining them in the same output.

Regards,
Anthonin
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Alena Rybakina
Date:

Hi, thank you for your work on this subject.

While I was reviewing your code, I noticed that your patch conflicts with another patch [0] that has been committed.

[0] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

On 28.03.2024 13:12, Anthonin Bonnefoy wrote:
Hi, 

Thanks for the review.
 
On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
As for the proposed patch, the following part should handle the temp tables too:

True, I've missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a specific output for reporting local blocks instead of combining them in the same output.

Regards,
Anthonin
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

Hi, thank you for your work with this subject.

While I was reviewing your code, I noticed that your patch conflicts with another patch [0] that has been committed.

[0] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com


I've rebased the patch and also split the changes:
1: Use pgBufferUsage in Vacuum and Analyze block reporting
2: Stop tracking buffer usage in the now unused VacuumPage{Hit,Miss,Dirty} variables
3: Remove declarations of VacuumPage{Hit,Miss,Dirty}
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Masahiko Sawada
Date:
On Mon, Apr 22, 2024 at 5:07 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>>
>> Hi, thank you for your work with this subject.
>>
>> While I was reviewing your code, I noticed that your patch conflicts with another patch [0] that has been committed.
>>
>> [0]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
>
>
> I've rebased the patch and also split the changes:

Thank you for updating the patch!

> 1: Use pgBufferUsage in Vacuum and Analyze block reporting

I think that if the analyze command doesn't have the same issue, we
don't need to change it. Making the vacuum and analyze output
consistent is a good point, but I'd like to avoid doing unnecessary
changes in back branches. I think the patch set would contain:

(a) make lazy vacuum use BufferUsage instead of
VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
(b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
variables for consistency and simplicity (only for HEAD, if we agree).

BTW I realized that VACUUM VERBOSE running on a temp table always
shows the number of dirtied buffers being 0, which seems to be a bug.
The patch (a) will resolve it as well.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
Thanks for the review! 
 
I think that if the analyze command doesn't have the same issue, we
don't need to change it.
Good point. I wrongly assumed that analyze was also impacted, but there's no parallel analyze, so the block count is correct.
 
(a) make lazy vacuum use BufferUsage instead of
VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
(b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
variables for consistency and simplicity (only for HEAD, if we agree).
I've isolated the fix in the first patch. I've kept the analyze change and the removal of the VacuumPage* variables split for clarity's sake.

BTW I realized that VACUUM VERBOSE running on a temp table always
shows the number of dirtied buffers being 0, which seems to be a bug.
The patch (a) will resolve it as well.
Indeed, it is visible with the following test:

SET temp_buffers='1024';
CREATE TEMPORARY TABLE vacuum_fix (aid int, bid int);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 1000000);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET bid = bid+1;
VACUUM (VERBOSE, INDEX_CLEANUP ON) vacuum_fix;

Pre-patch:
  avg read rate: 254.751 MB/s, avg write rate: 0.029 MB/s
  buffer usage: 8853 hits, 8856 misses, 1 dirtied
  WAL usage: 1 records, 1 full page images, 3049 bytes
The dirtied page is from pg_class (so not a local buffer) 

With the fix: 
  avg read rate: 250.230 MB/s, avg write rate: 250.230 MB/s
  buffer usage: 8853 hits, 8856 misses, 8856 dirtied
  WAL usage: 1 records, 1 full page images, 3049 bytes
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Alena Rybakina
Date:
On 22.04.2024 11:07, Anthonin Bonnefoy wrote:
On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:

Hi, thank you for your work with this subject.

While I was reviewing your code, I noticed that your patch conflicts with another patch [0] that been committed.

[0] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com


I've rebased the patch and also split the changes:
1: Use pgBufferUsage in Vacuum and Analyze block reporting
2: Stop tracking buffer usage in the now unused VacuumPage{Hit,Miss,Dirty} variables
3: Remove declarations of VacuumPage{Hit,Miss,Dirty}

Hi! Thank you for your work, and I have reviewed your corrections.

I tested the main postgres branch with and without your fix using a script I wrote. It consists of five scenarios, and I compared the logs between the original master branch and the master branch with your patch:

1. I added 1 million rows to the table and deleted 10% of them. I ran vacuum verbose and didn't see any differences.
buffer usage: 12585 hits, 0 misses, 4 dirtied
2. I opened another connection with a repeatable read transaction through the dblink extension and ran a query updating the records in the table under test. Later, I ran vacuum verbose again and also didn't see any differences.
buffer usage: 19424 hits, 0 misses, 6 dirtied
3. I committed the transaction and ran vacuum verbose again. Everything is fine in the logs.
buffer usage: 22960 hits, 0 misses, 11456 dirtied
4. I deleted all the data from the table and later started vacuum verbose again. The number of pages in the buffer matched with your patch too.
5. I inserted 1 million rows into the table and updated them, and I found a difference between the original master version and the version with your patch:
with your patch: buffer usage: 32315 hits, 606 misses, 1547 dirtied
original version: buffer usage: 32348 hits, 573 misses, 1456 dirtied
I suppose something wasn't calculated.

The same script was run, but using vacuum verbose analyze, and I saw the difference again in the fifth step:
with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
master: buffer usage: 32346 hits, 573 misses, 1360 dirtied

I have attached a test file (vacuum_check_logs.sql) and four log files: two with the vacuum verbose command (vacuum_log and vacuum_log_with_patch files) and two others with the vacuum verbose analyze command (vacuum_analyze_test_with_patch and vacuum_analyze_test files). Both test approaches show logs with and without your changes.
1: Use pgBufferUsage in Vacuum and Analyze block reporting
I think that if the analyze command doesn't have the same issue, we
don't need to change it. Making the vacuum and the analyze consistent
is a good point but I'd like to avoid doing unnecessary changes in
back branches. I think the patch set would contain:

(a) make lazy vacuum use BufferUsage instead of
VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
(b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
variables for consistency and simplicity (only for HEAD, if we agree).

BTW I realized that VACUUM VERBOSE running on a temp table always
shows the number of dirtied buffers being 0, which seems to be a bug.
The patch (a) will resolve it as well.

I agree with that. I think we can leave these changes to the analyze command for master, but I also doubt the need to backport these changes to back branches.
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:

On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
I tested the main postgres branch with and without your fix using a script that was written by me. It consists of five scenarios and I made a comparison in the logs between the original version of the master branch and the master branch with your patch:
 Hi! Thanks for the tests.

I have attached a test file (vacuum_check_logs.sql)
Looking at the script, you won't trigger the problem. The reporting issue will only happen with a parallel index vacuum, which requires at least 2 indexes [0]. You will need to create an additional index.

The same script was run, but using vacuum verbose analyze, and I saw the difference again in the fifth step:
with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
Isn't there a chance for the checkpointer to run during this time? That could make the conditions between the two runs slightly different and explain the change in buffer report.

 

Re: Fix parallel vacuum buffer usage reporting

From:
Alena Rybakina
Date:

Hi!

The same script was run, but using vacuum verbose analyze, and I saw the difference again in the fifth step:
with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
Isn't there a chance for the checkpointer to run during this time? That could make the conditions between the two runs slightly different and explain the change in buffer report.

Thank you for the link; I accounted for it in my next experiments.

I repeated the test without checkpoint processing and with a single index, and the number of pages in the buffer almost matched:

master branch: buffer usage: 32315 hits, 606 misses, 4486 dirtied

with applied patch v4 version: buffer usage: 32315 hits, 606 misses, 4489 dirtied

I think you are right: the problem was interference from the checkpoint process (by the way, I checked the first version of the patch). To cut a long story short, everything is fine now with one index.

Just in case, I'll explain: I considered this case because your patch could have influenced it too.

On 25.04.2024 10:17, Anthonin Bonnefoy wrote:

On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
I tested the main postgres branch with and without your fix using a script that was written by me. It consists of five scenarios and I made a comparison in the logs between the original version of the master branch and the master branch with your patch:
 Hi! Thanks for the tests.

I have attached a test file (vacuum_check_logs.sql)
The reporting issue will only happen if there's a parallel index vacuum and it will only happen if there's at least 2 indexes [0]. You will need to create an additional index.

Speaking of the problem, I added another index, repeated the test, and found a significant difference:

  • I found it when I committed the transaction (3):

master: 2964 hits, 0 misses, 0 dirtied

with applied patch v4: buffer usage: 33013 hits, 0 misses, 3 dirtied

  • When I deleted all the data from the table and later started vacuum verbose again (4):

master: buffer usage: 51486 hits, 0 misses, 0 dirtied

with applied patch v4: buffer usage: 77924 hits, 0 misses, 0 dirtied

  • When I inserted 1 million rows into the table and updated them (5):

master: buffer usage: 27904 hits, 5021 misses, 1777 dirtied

with applied patch v4: buffer usage: 41051 hits, 9973 misses, 2564 dirtied

As I see, the number of pages is significantly higher than in the master branch and, frankly, I couldn't fully figure out whether it is a mistake or not. I attached a test script (vacuum_checks_logs.sql) with two indexes and no checkpoints. I also attached log files: the first one (vacuum_test) is the result of testing on the master branch, the second is with your patch applied (vacuum_test_v4).

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Masahiko Sawada
Date:
On Fri, Apr 26, 2024 at 9:12 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Hi!
>>
>> The same script was run, but using vacuum verbose analyze, and I saw the difference again in the fifth step:
>> with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
>> master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
>
> Isn't there a chance for the checkpointer to run during this time? That could make the conditions between the two
> runs slightly different and explain the change in buffer report.
>
> [0]
https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831
>
> Looking at the script, you won't trigger the problem.
>
> Thank you for the link I accounted it in my next experiments.
>
> I repeated the test without processing checkpoints with a single index, and the number of pages in the buffer used
> almost matched:
>
> master branch: buffer usage: 32315 hits, 606 misses, 4486 dirtied
>
> with applied patch v4 version: buffer usage: 32315 hits, 606 misses, 4489 dirtied
>
> I think you are right - the problem was interfering with the checkpoint process, by the way I checked the first
> version patch. To cut a long story short, everything is fine now with one index.
>
> Just in case, I'll explain: I considered this case because your patch could have impact influenced it too.
>
> On 25.04.2024 10:17, Anthonin Bonnefoy wrote:
>
>
> On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>>
>> I tested the main postgres branch with and without your fix using a script that was written by me. It consists of
>> five scenarios and I made a comparison in the logs between the original version of the master branch and the master
>> branch with your patch:
>
>  Hi! Thanks for the tests.
>
>> I have attached a test file (vacuum_check_logs.sql)
>
> The reporting issue will only happen if there's a parallel index vacuum and it will only happen if there's at least 2
> indexes [0]. You will need to create an additional index.
>
> Speaking of the problem, I added another index and repeated the test and found a significant difference:
>
> I found it when I commited the transaction (3):
>
> master: 2964 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version: buffer usage: 33013 hits, 0 misses, 3 dirtied
>
> When I deleted all the data from the table and later started vacuum verbose again (4):
>
> master: buffer usage: 51486 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version:buffer usage: 77924 hits, 0 misses, 0 dirtied
>
> when I inserted 1 million data into the table and updated it (5):
>
> master:buffer usage: 27904 hits, 5021 misses, 1777 dirtied
>
> with applied patch v4 version:buffer usage: 41051 hits, 9973 misses, 2564 dirtied
>
> As I see, the number of pages is significantly more than it was in the master branch and, frankly, I couldn't fully
> figure out if it was a mistake or not.

I think that the patch fixes the problem correctly.

I've run pgindent and updated the commit message. I realized that
parallel vacuum was introduced in pg13, but buffer usage reporting in
the VACUUM command was implemented in pg15. Therefore, in pg13 and
pg14, VACUUM (PARALLEL) is available but VACUUM (PARALLEL, VERBOSE)
doesn't show the buffer usage report. Autovacuum does show the buffer
usage report, but parallel autovacuum is not supported. Therefore, we
should backpatch it down to 15, not 13.

I'm going to push the patch down to pg15, barring any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
I've done some additional tests to validate the reported numbers. Using pg_statio, it's possible to get the minimum number of block hits (full script attached).

-- Save block hits before vacuum
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables where relname='vestat' \gset
vacuum (verbose, index_cleanup on) vestat;
-- Check the difference
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit - :heap_blks_hit as delta_heap_hit,
       idx_blks_hit - :idx_blks_hit as delta_idx_hit,
       heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit as sum
    FROM pg_statio_all_tables where relname='vestat';

Output:
...
buffer usage: 14676 hits, 0 misses, 667 dirtied
buffer usage (new): 16081 hits, 0 misses, 667 dirtied
...
 -[ RECORD 1 ]--+------
delta_heap_hit | 9747
delta_idx_hit  | 6325
sum            | 16072

From pg_statio, we had 16072 block hits for the relation + indexes.
Pre-patch, we are under-reporting with 14676.
Post-patch, we have 16081. The 9 additional block hits come from vacuum accessing catalog tables like pg_class or pg_class_oid_index.

Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Masahiko Sawada
Date:
On Tue, Apr 30, 2024 at 3:34 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> I've done some additional tests to validate the reported numbers. Using pg_statio, it's possible to get the minimum
> number of block hits (full script attached).
>
> -- Save block hits before vacuum
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables where relname='vestat' \gset
> vacuum (verbose, index_cleanup on) vestat;
> -- Check the difference
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit - :heap_blks_hit as delta_heap_hit,
>        idx_blks_hit - :idx_blks_hit as delta_idx_hit,
>        heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit as sum
>     FROM pg_statio_all_tables where relname='vestat';
>
> Output:
> ...
> buffer usage: 14676 hits, 0 misses, 667 dirtied
> buffer usage (new): 16081 hits, 0 misses, 667 dirtied
> ...
>  -[ RECORD 1 ]--+------
> delta_heap_hit | 9747
> delta_idx_hit  | 6325
> sum            | 16072
>
> From pg_statio, we had 16072 blocks for the relation + indexes.
> Pre-patch, we are under reporting with 14676.
> Post-patch, we have 16081. The 9 additional block hits come from vacuum accessing catalog tables like pg_class or
> pg_class_oid_index.
>

Thank you for further testing! I've pushed the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Fix parallel vacuum buffer usage reporting

From:
Alena Rybakina
Date:

Hi!

On 30.04.2024 05:18, Masahiko Sawada wrote:
On Fri, Apr 26, 2024 at 9:12 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
Hi!
The same script was run, but using vacuum verbose analyze, and I saw the difference again in the fifth step:
with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
Isn't there a chance for the checkpointer to run during this time? That could make the conditions between the two runs slightly different and explain the change in buffer report.

[0] https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831

Looking at the script, you won't trigger the problem.

Thank you for the link I accounted it in my next experiments.

I repeated the test without processing checkpoints with a single index, and the number of pages in the buffer used almost matched:

master branch: buffer usage: 32315 hits, 606 misses, 4486 dirtied

with applied patch v4 version: buffer usage: 32315 hits, 606 misses, 4489 dirtied

I think you are right - the problem was interfering with the checkpoint process, by the way I checked the first version patch. To cut a long story short, everything is fine now with one index.

Just in case, I'll explain: I considered this case because your patch could have impact influenced it too.

On 25.04.2024 10:17, Anthonin Bonnefoy wrote:


On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina <lena.ribackina@yandex.ru> wrote:
I tested the main postgres branch with and without your fix using a script that was written by me. It consists of five scenarios and I made a comparison in the logs between the original version of the master branch and the master branch with your patch:
 Hi! Thanks for the tests.

I have attached a test file (vacuum_check_logs.sql)
The reporting issue will only happen if there's a parallel index vacuum and it will only happen if there's at least 2 indexes [0]. You will need to create an additional index.

Speaking of the problem, I added another index and repeated the test and found a significant difference:

I found it when I commited the transaction (3):

master: 2964 hits, 0 misses, 0 dirtied

with applied patch v4 version: buffer usage: 33013 hits, 0 misses, 3 dirtied

When I deleted all the data from the table and later started vacuum verbose again (4):

master: buffer usage: 51486 hits, 0 misses, 0 dirtied

with applied patch v4 version:buffer usage: 77924 hits, 0 misses, 0 dirtied

when I inserted 1 million data into the table and updated it (5):

master:buffer usage: 27904 hits, 5021 misses, 1777 dirtied

with applied patch v4 version:buffer usage: 41051 hits, 9973 misses, 2564 dirtied

As I see, the number of pages is significantly more than it was in the master branch and ,frankly, I couldn't fully figure out if it was a mistake or not.
I think that the patch fixes the problem correctly.

I've run pgindent and updated the commit message. I realized that
parallel vacuum was introduced in pg13 but buffer usage reporting in
VACUUM command was implemented in pg15. Therefore, in pg13 and pg14,
VACUUM (PARALLEL) is available but VACUUM (PARALLEL, VERBOSE) doesn't
show the buffer usage report. Autovacuum does show the buffer usage
report but parallel autovacuum is not supported. Therefore, we should
backpatch it down to 15, not 13.

I'm going to push the patch down to pg15, barring any objections.

Regards,
I agree with you about backpatching, and I saw that the patch works correctly.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fix parallel vacuum buffer usage reporting

From:
Anthonin Bonnefoy
Date:
On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for further testing! I've pushed the patch.
Thanks!

Here is the rebased version of the follow-up patch removing the VacuumPage variables. Though I'm not sure if I should create a dedicated mail thread, since the bug was fixed and the follow-up is more of a refactoring. What do you think?
Attachments

Re: Fix parallel vacuum buffer usage reporting

From:
Masahiko Sawada
Date:
On Fri, May 3, 2024 at 3:41 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
> On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> Thank you for further testing! I've pushed the patch.
>
> Thanks!
>
> Here is the rebased version for the follow-up patch removing VacuumPage variables. Though I'm not sure if I should
> create a dedicated mail thread since the bug was fixed and the follow-up is more of a refactoring. What do you think?

I'd suggest starting a new thread or changing the subject as the
current subject no longer matches what we're discussing.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Fix parallel vacuum buffer usage reporting

From:
Nazir Bilal Yavuz
Date:
Hi,

Thank you for working on this!

On Wed, 1 May 2024 at 06:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for further testing! I've pushed the patch.

I realized a behaviour change while looking at the 'Use pgBufferUsage
for block reporting in analyze' thread [1]. Since that change applies
here as well, I thought it better to mention it here.

Before this commit, VacuumPageMiss did not count a block if its read
had already been completed by another backend [2]. Now,
'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
counts every block attempted to be read, possibly double counting if
someone else has already completed the read. I do not know which
behaviour is correct, but I wanted to mention this.

[1] https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com

[2] In the WaitReadBuffers() function, see comment:
        /*
         * Skip this block if someone else has already completed it.  If an
         * I/O is already in progress in another backend, this will wait for
         * the outcome: either done, or something went wrong and we will
         * retry.
         */

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Fix parallel vacuum buffer usage reporting

From:
Alena Rybakina
Date:
Hi! I could try to check it with a test, but I want to ask you about
the details, because I'm not sure that I completely understand the test case.

Do you mean that we need two backends, and one of them deletes the
tuples before the other one runs vacuum?

On 10.05.2024 13:25, Nazir Bilal Yavuz wrote:
> Hi,
>
> Thank you for working on this!
>
> On Wed, 1 May 2024 at 06:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for further testing! I've pushed the patch.
> I noticed a behaviour change while looking at the 'Use pgBufferUsage for
> block reporting in analyze' thread [1]. Since that change applies here
> as well, I thought it would be better to mention it here.
>
> Before this commit, VacuumPageMiss did not count blocks whose reads had
> already been completed by other backends [2]. Now,
> 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> counts every block a backend attempts to read, possibly double counting
> blocks someone else has already read in. I do not know which
> behaviour is correct, but I wanted to mention it.
>
> [1] https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com
>
> [2] In the WaitReadBuffers() function, see comment:
>          /*
>           * Skip this block if someone else has already completed it.  If an
>           * I/O is already in progress in another backend, this will wait for
>           * the outcome: either done, or something went wrong and we will
>           * retry.
>           */
>
-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fix parallel vacuum buffer usage reporting

From
Nazir Bilal Yavuz
Date:
Hi,

On Fri, 10 May 2024 at 14:49, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
>
> Hi! I could try to check it with a test, but I want to ask you about the
> details, because I'm not sure that I completely understand the test case.
>
> You mean that we need two backends, and that one of them deletes the
> tuples before the other one runs vacuum, right?
>

I think triggering a parallel vacuum is enough. I am able to see the
differences with the following.

You can apply the attached diff file to see the differences between
the previous version and the patched version. Then, run these queries:

CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
(autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 1000000);
CREATE INDEX a_idx on vacuum_fix (aid);
CREATE INDEX b_idx on vacuum_fix (bid);
CREATE INDEX c_idx on vacuum_fix (cid);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1;
VACUUM (VERBOSE, PARALLEL 2) vacuum_fix;

After that I saw:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
...
...
buffer usage: 29343 hits, 9580 misses in the previous version, 14165
misses in the patched version, 14262 dirtied

The patched version counts 14165 misses, while the previous version
counts 9580 misses in this specific example.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Fix parallel vacuum buffer usage reporting

From
Nazir Bilal Yavuz
Date:
Hi,

On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 14:49, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
> >
> > Hi! I could try to check it with a test, but I want to ask you about the
> > details, because I'm not sure that I completely understand the test case.
> >
> > You mean that we need two backends, and that one of them deletes the
> > tuples before the other one runs vacuum, right?
> >
>
> I think triggering a parallel vacuum is enough. I am able to see the
> differences with the following.
>
> You can apply the attached diff file to see the differences between
> the previous version and the patched version. Then, run these queries:
>
> CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
> (autovacuum_enabled=false);
> INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 1000000);
> CREATE INDEX a_idx on vacuum_fix (aid);
> CREATE INDEX b_idx on vacuum_fix (bid);
> CREATE INDEX c_idx on vacuum_fix (cid);
> VACUUM vacuum_fix;
> UPDATE vacuum_fix SET aid = aid + 1;
> VACUUM (VERBOSE, PARALLEL 2) vacuum_fix;
>
> After that I saw:
>
> INFO:  vacuuming "test.public.vacuum_fix"
> INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
> INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
> ...
> ...
> buffer usage: 29343 hits, 9580 misses in the previous version, 14165
> misses in the patched version, 14262 dirtied
>
> The patched version counts 14165 misses, while the previous version
> counts 9580 misses in this specific example.

I am sorry, I showed the wrong thing: this is exactly what is fixed
in this patch. Actually, I do not know how to trigger the behaviour
change itself; I am currently looking for a way and will share if
anything comes to mind.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Fix parallel vacuum buffer usage reporting

From
Nazir Bilal Yavuz
Date:
Hi,

On Fri, 10 May 2024 at 16:55, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Fri, 10 May 2024 at 14:49, Alena Rybakina <lena.ribackina@yandex.ru> wrote:
> > >
> > > Hi! I could try to check it with a test, but I want to ask you about the
> > > details, because I'm not sure that I completely understand the test case.
> > >
> > > You mean that we need two backends, and that one of them deletes the
> > > tuples before the other one runs vacuum, right?
> > >

There need to be some other backend(s) trying to read the same
buffers as the ongoing VACUUM operation. I think it works now, but the
reproduction steps are a bit racy. See:

1- Build Postgres with the attached diff; it is the same
see_previous_output.diff that I shared two mails ago.

2- Run Postgres, all settings are default.

3- Use two client backends; let's call them client backends A and B.

4- On A client backend, run:

CREATE TABLE vacuum_fix (aid int, bid int) with (autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 20000000);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1, bid = bid + 1;

5- This step is a bit racy: the SQL commands below need to be run at
the same time. The aim is for the VACUUM on client backend A and the
SELECT on client backend B to read the same buffers at the same time,
so that some of the buffers are double counted.

First, run the VACUUM on client backend A; immediately after starting
the VACUUM, run the SELECT on client backend B.

A client backend:
VACUUM VERBOSE vacuum_fix;

B client backend:
SELECT * from vacuum_fix WHERE aid = -1;
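
For reference, the two sessions in step 5 can also be driven from a
single terminal. This is only a sketch: it assumes a running local
server that psql can reach with default connection settings, and it
inherits the same race as the manual steps.

```shell
# Sketch: launch the two commands of step 5 nearly simultaneously.
# Client backend A: the instrumented VACUUM (INFO lines go to stderr).
psql -c "VACUUM VERBOSE vacuum_fix;" &
# Client backend B: a sequential scan over the same table.
psql -c "SELECT * FROM vacuum_fix WHERE aid = -1;" > /dev/null &
wait  # wait for both backends to finish
```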

This is the output of the VACUUM VERBOSE on my end:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 0
pages: 0 removed, 176992 remain, 176992 scanned (100.00% of total)
...
...
buffer usage: 254181 hits, 99030 misses in the previous version, 99865
misses in the patched version, 106830 dirtied
...
VACUUM
Time: 2578.217 ms (00:02.578)

This VACUUM does not run in parallel, so this test case does not
trigger what is fixed in this thread. As can be seen, there is a
difference of ~1000 buffers.

I am not sure if there is an easier way to reproduce this, but I hope this helps.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments

Re: Fix parallel vacuum buffer usage reporting

From
Masahiko Sawada
Date:
On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Wed, 1 May 2024 at 06:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for further testing! I've pushed the patch.
>
> I noticed a behaviour change while looking at the 'Use pgBufferUsage for
> block reporting in analyze' thread [1]. Since that change applies here
> as well, I thought it would be better to mention it here.
>
> Before this commit, VacuumPageMiss did not count blocks whose reads had
> already been completed by other backends [2]. Now,
> 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> counts every block a backend attempts to read, possibly double counting
> blocks someone else has already read in.

True. IIUC there is such a difference only in HEAD, not in v15 and
v16. The following comment in WaitReadBuffers() says that it is
traditional behavior to count blocks as read even if someone else has
already completed the I/O:

    /*
     * We count all these blocks as read by this backend.  This is traditional
     * behavior, but might turn out to be not true if we find that someone
     * else has beaten us and completed the read of some of these blocks.  In
     * that case the system globally double-counts, but we traditionally don't
     * count this as a "hit", and we don't have a separate counter for "miss,
     * but another backend completed the read".
     */

So I think using pgBufferUsage for (parallel) vacuum is a legitimate
usage and makes it more consistent with other parallel operations.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Fix parallel vacuum buffer usage reporting

From
Nazir Bilal Yavuz
Date:
Hi,

On Fri, 10 May 2024 at 19:09, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you for working on this!
> >
> > On Wed, 1 May 2024 at 06:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Thank you for further testing! I've pushed the patch.
> >
> > I noticed a behaviour change while looking at the 'Use pgBufferUsage for
> > block reporting in analyze' thread [1]. Since that change applies here
> > as well, I thought it would be better to mention it here.
> >
> > Before this commit, VacuumPageMiss did not count blocks whose reads had
> > already been completed by other backends [2]. Now,
> > 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> > counts every block a backend attempts to read, possibly double counting
> > blocks someone else has already read in.
>
> True. IIUC there is such a difference only in HEAD, not in v15 and
> v16. The following comment in WaitReadBuffers() says that it is
> traditional behavior to count blocks as read even if someone else has
> already completed the I/O:
>
>     /*
>      * We count all these blocks as read by this backend.  This is traditional
>      * behavior, but might turn out to be not true if we find that someone
>      * else has beaten us and completed the read of some of these blocks.  In
>      * that case the system globally double-counts, but we traditionally don't
>      * count this as a "hit", and we don't have a separate counter for "miss,
>      * but another backend completed the read".
>      */
>
> So I think using pgBufferUsage for (parallel) vacuum is a legitimate
> usage and makes it more consistent with other parallel operations.

That sounds logical. Thank you for the clarification.

--
Regards,
Nazir Bilal Yavuz
Microsoft