Discussion: Speed up pg_checksums in cases where checksum already set
The attached patch makes an optimization to pg_checksums which prevents rewriting the block if the checksum is already what we expect. This can lead to much faster runs in cases where it is already set (e.g. enabled -> disabled -> enabled, external helper process, interrupted runs, future parallel processes). There is also an effort to not sync the data directory if no changes were written. Finally, added a bit more output on how many files were actually changed, e.g.:
Checksum operation completed
Files scanned: 1236
Blocks scanned: 23283
Files modified: 38
Blocks modified: 19194
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster
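
To give a feel for the idea without opening the patch: the sketch below is a simplified, standalone illustration, not the actual code. BLOCK_SIZE, compute_checksum(), and the fixed checksum offset are stand-ins for PostgreSQL's BLCKSZ, pg_checksum_page(), and the page header, and the real patch also handles new/empty pages and short reads.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8192                 /* stands in for BLCKSZ */

/* Toy checksum; the real tool uses pg_checksum_page() on the page header. */
static uint16_t
compute_checksum(const unsigned char *block, uint32_t blkno)
{
    uint32_t    sum = blkno;

    for (int i = 2; i < BLOCK_SIZE; i++)        /* skip the stored checksum bytes */
        sum = sum * 31 + block[i];
    return (uint16_t) sum;
}

/*
 * Scan one file, rewriting only the blocks whose stored checksum does not
 * already match the expected value.  Returns the number of blocks rewritten.
 */
static long
enable_checksums_in_file(FILE *fp, long *blocks_scanned)
{
    unsigned char block[BLOCK_SIZE];
    uint32_t    blkno = 0;
    long        modified = 0;

    while (fread(block, 1, BLOCK_SIZE, fp) == BLOCK_SIZE)
    {
        uint16_t    stored;
        uint16_t    expected = compute_checksum(block, blkno);

        memcpy(&stored, block, sizeof(stored));     /* checksum at a fixed offset */
        if (stored != expected)
        {
            memcpy(block, &expected, sizeof(expected));
            fseek(fp, -(long) BLOCK_SIZE, SEEK_CUR);    /* back to this block */
            fwrite(block, 1, BLOCK_SIZE, fp);
            fseek(fp, 0, SEEK_CUR);                 /* reposition before next read */
            modified++;
        }
        blkno++;
    }
    *blocks_scanned = blkno;
    return modified;
}

int
main(int argc, char **argv)
{
    FILE       *fp;
    long        scanned = 0;
    long        modified;

    if (argc < 2 || (fp = fopen(argv[1], "r+b")) == NULL)
        return 1;
    modified = enable_checksums_in_file(fp, &scanned);
    fclose(fp);
    printf("Blocks scanned:  %ld\nBlocks modified: %ld\n", scanned, modified);
    return 0;
}

The key point is the stored-vs-expected comparison: when the two already match, the block is never written back, which is what saves the I/O on a cluster whose checksums are already in place.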
Cheers,
Greg
On Thu, May 27, 2021 at 5:24 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> The attached patch makes an optimization to pg_checksums which prevents
> rewriting the block if the checksum is already what we expect. This can
> lead to much faster runs in cases where it is already set (e.g. enabled ->
> disabled -> enabled, external helper process, interrupted runs, future
> parallel processes). There is also an effort to not sync the data directory
> if no changes were written. Finally, added a bit more output on how many
> files were actually changed, e.g.:

I don't know how often this will actually help, as people probably aren't toggling the checksum state that often, but it seems like a good idea overall. The patch looks sensible to me.
On Wed, May 26, 2021 at 05:23:55PM -0400, Greg Sabino Mullane wrote:
> The attached patch makes an optimization to pg_checksums which prevents
> rewriting the block if the checksum is already what we expect. This can
> lead to much faster runs in cases where it is already set (e.g. enabled ->
> disabled -> enabled, external helper process, interrupted runs, future
> parallel processes).

Makes sense.

> There is also an effort to not sync the data directory
> if no changes were written. Finally, added a bit more output on how many
> files were actually changed, e.g.:

-    if (do_sync)
+    if (do_sync && total_files_modified)
     {
         pg_log_info("syncing data directory");
         fsync_pgdata(DataDir, PG_VERSION_NUM);

Here, I am on the edge. It could be an advantage to force a flush of the data folder anyway, no? Say, all the pages have a correct checksum and they are in the OS cache, but they may not have been flushed yet. That would emulate what initdb -S does already.

> Checksum operation completed
> Files scanned: 1236
> Blocks scanned: 23283
> Files modified: 38
> Blocks modified: 19194
> pg_checksums: syncing data directory
> pg_checksums: updating control file
> Checksums enabled in cluster

The addition of the number of files modified looks like an advantage.
--
Michael
In one of the checksum patches, there was an understanding that the pages should be written even if the checksum is correct, to handle replicas.

From the v19 patch:
https://www.postgresql.org/message-id/F7AFCFCD-8F77-4546-8D42-C7F675A4B680%40yesql.se

+ * Mark the buffer as dirty and force a full page write. We have to
+ * re-write the page to WAL even if the checksum hasn't changed,
+ * because if there is a replica it might have a slightly different
+ * version of the page with an invalid checksum, caused by unlogged
+ * changes (e.g. hintbits) on the master happening while checksums
+ * were off. This can happen if there was a valid checksum on the page
+ * at one point in the past, so only when checksums are first on, then
+ * off, and then turned on again.

pg_checksums(1) says:

| When using a replication setup with tools which perform direct copies of relation file blocks (for example pg_rewind(1)), enabling or disabling checksums can lead to page
| corruptions in the shape of incorrect checksums if the operation is not done consistently across all nodes. When enabling or disabling checksums in a replication setup, it
| is thus recommended to stop all the clusters before switching them all consistently. Destroying all standbys, performing the operation on the primary and finally recreating
| the standbys from scratch is also safe.

Does your patch complicate things for the "stop all the clusters before switching them all" case?
--
Justin
On Wed, May 26, 2021 at 09:29:43PM -0500, Justin Pryzby wrote:
> In one of the checksum patches, there was an understanding that the pages
> should be written even if the checksum is correct, to handle replicas.
>
> From the v19 patch:
> https://www.postgresql.org/message-id/F7AFCFCD-8F77-4546-8D42-C7F675A4B680%40yesql.se
> + * Mark the buffer as dirty and force a full page write. We have to
> + * re-write the page to WAL even if the checksum hasn't changed,
> + * because if there is a replica it might have a slightly different
> + * version of the page with an invalid checksum, caused by unlogged
> + * changes (e.g. hintbits) on the master happening while checksums
> + * were off. This can happen if there was a valid checksum on the page
> + * at one point in the past, so only when checksums are first on, then
> + * off, and then turned on again.

I am not really following the line of argument here. pg_checksums relies on the fact that the cluster has been safely shut down before running. So, if this comes to standbys, they would have reached a consistent point, and the shutdown makes sure that all pages are flushed.
--
Michael
Thanks for the quick replies, everyone.
On Wed, May 26, 2021 at 10:17 PM Michael Paquier <michael@paquier.xyz> wrote:
> -    if (do_sync)
> +    if (do_sync && total_files_modified)
>
> Here, I am on the edge. It could be an advantage to force a flush of
> the data folder anyway, no?
I was originally on the fence about including this as well, but since the database is shut down and already in a consistent state, there seems to be no advantage to syncing if we have not made any changes. Things are no better or worse than when we arrived. However, the real-world use case of running pg_checksums --enable and getting no changed blocks is probably fairly rare, so if there is a strong objection, I'm happy reverting to just (do_sync). (I'm not sure how cheap a sync is; I assume it's low impact as the database is shut down, so I guess it becomes a "might as well while we are here"?)
On Wed, May 26, 2021 at 10:29 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> In one of the checksum patches, there was an understanding that the pages
> should be written even if the checksum is correct, to handle replicas.
> ...
> Does your patch complicate things for the "stop all the clusters before
> switching them all" case?
I cannot imagine how it would, but, like Michael, I'm not really understanding the reasoning here. We only run when the cluster is safely shut down, so no WAL or dirty buffers need concern us :). Of course, once the postmaster is up and running, fiddling with checksums becomes vastly more complicated, as evidenced by that thread. I'm happy sticking to and speeding up the offline version for now.
Cheers,
Greg
On Thu, May 27, 2021 at 10:29:14AM -0400, Greg Sabino Mullane wrote:
> I was originally on the fence about including this as well, but since the
> database is shut down and already in a consistent state, there seems to be
> no advantage to syncing if we have not made any changes. Things are no
> better or worse than when we arrived. However, the real-world use case of
> running pg_checksums --enable and getting no changed blocks is probably
> fairly rare, so if there is a strong objection, I'm happy reverting to
> just (do_sync). (I'm not sure how cheap a sync is; I assume it's low
> impact as the database is shut down, so I guess it becomes a "might as
> well while we are here"?)

I understand that this should be rare, but I don't want to take any bets either. With this patch, we could finish with cases where some pages are still in the OS cache but don't get flushed, because a previous cancellation left the cluster in a state where all the page checksums had been written out but a portion of the files were not synced. A follow-up run of pg_checksums would see that all the pages are correct, but would incorrectly conclude that no sync is required.
--
Michael
Fair enough; thanks for the feedback. Attached is a new version that does an unconditional sync (well, unless do_sync is false, a flag I am not particularly fond of).
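
In other words, the final flush in this version is gated only on do_sync (the -N/--no-sync switch), not on whether anything was rewritten. A minimal sketch of that shape, with sync_directory() standing in for the fsync_pgdata() call the real tool makes:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for fsync_pgdata(DataDir, PG_VERSION_NUM) in the real tool. */
static void
sync_directory(void)
{
    printf("pg_checksums: syncing data directory\n");
}

static void
finish_run(bool do_sync, long blocks_written)
{
    (void) blocks_written;      /* deliberately not consulted */

    /*
     * Flush whenever -N was not given, even if nothing was rewritten: an
     * earlier interrupted run may have left correct checksums sitting in
     * the OS cache without ever having been flushed to disk.
     */
    if (do_sync)
        sync_directory();
}

int
main(void)
{
    finish_run(true, 0);        /* no blocks rewritten, still syncs */
    return 0;
}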
Cheers,
Greg
Newer version attached that adds a small documentation tweak as well.
Cheers,
Greg
On Wed, Jun 02, 2021 at 05:09:36PM -0400, Greg Sabino Mullane wrote:
> Newer version attached that adds a small documentation tweak as well.

-   enabling checksums, every file in the cluster is rewritten in-place.
+   enabling checksums, every file in the cluster with a changed checksum is
+   rewritten in-place.

This doc addition is a bit confusing, as it could mean that each file has just one single checksum. We could be more precise, say: "When enabling checksums, each relation file block with a changed checksum is rewritten in place."

Should we also mention that the sync happens even if no blocks are rewritten, based on the reasoning upthread (aka we'd better do the final flush, as an interrupted pg_checksums may leave a portion of the files not flushed)?
--
Michael
On Fri, Jun 18, 2021 at 1:57 AM Michael Paquier <michael@paquier.xyz> wrote:
> This doc addition is a bit confusing, as it could mean that each file
> has just one single checksum. We could be more precise, say:
> "When enabling checksums, each relation file block with a changed
> checksum is rewritten in place."
Agreed, I like that wording. New patch attached.
> Should we also mention that the sync happens even if no blocks are
> rewritten, based on the reasoning upthread (aka we'd better do the
> final flush, as an interrupted pg_checksums may leave a portion of the
> files not flushed)?
I don't know that we need to bother: the default is already to sync and one has to go out of one's way using the -N argument to NOT sync, so I think it's a pretty safe assumption to everyone (except those who read my first version of my patch!) that syncing always happens.
Cheers,
Greg
On Fri, Jun 18, 2021 at 08:01:17PM -0400, Greg Sabino Mullane wrote:
> I don't know that we need to bother: the default is already to sync and one
> has to go out of one's way using the -N argument to NOT sync, so I think
> it's a pretty safe assumption to everyone (except those who read my first
> version of my patch!) that syncing always happens.

Perhaps you are right to keep it simple. If people would like to document that more precisely, it could always be changed if necessary. What you have here sounds pretty much right to me.
--
Michael
On Wed, Jun 23, 2021 at 09:39:37AM +0900, Michael Paquier wrote:
> Perhaps you are right to keep it simple. If people would like to
> document that more precisely, it could always be changed if
> necessary. What you have here sounds pretty much right to me.

So, I was looking at this one today, and got confused by the names of the counters once the patch was in place, as this leads to having things like "blocks" and "total_blocks_modified", which is a bit confusing since "blocks" stands for the number of blocks scanned, including new pages. I have simply suffixed "files" and "blocks" with "_scanned" to be more consistent with the new counters that are named "_written", giving the attached. We still need the per-file counter in scan_file(), with the global counters updated at the end of a file scan, for the sake of the file counter, of course. Does that look fine to you?
--
Michael
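
Roughly, the renaming amounts to the following (a sketch only: the actual variable set in pg_checksums.c may differ, the real code uses PostgreSQL's int64 rather than int64_t, and account_file() is a hypothetical helper shown just to illustrate where the per-file counter gets folded into the globals):

#include <stdint.h>
#include <stdio.h>

/* Global totals reported at the end of the run. */
static int64_t files_scanned = 0;
static int64_t blocks_scanned = 0;
static int64_t files_written = 0;
static int64_t blocks_written = 0;

/*
 * Called once per relation file after it has been fully scanned, so that
 * files_written counts whole files while blocks_written counts blocks.
 */
static void
account_file(int64_t blocks_in_file, int64_t blocks_written_in_file)
{
    files_scanned++;
    blocks_scanned += blocks_in_file;
    blocks_written += blocks_written_in_file;
    if (blocks_written_in_file > 0)
        files_written++;
}

int
main(void)
{
    account_file(100, 0);       /* all checksums already correct */
    account_file(50, 50);       /* file fully rewritten */
    printf("Files scanned:  %lld\nBlocks scanned: %lld\n",
           (long long) files_scanned, (long long) blocks_scanned);
    printf("Files written:  %lld\nBlocks written: %lld\n",
           (long long) files_written, (long long) blocks_written);
    return 0;
}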
On Tue, Jun 29, 2021 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:
Does that look fine to you?
Looks great, I appreciate the renaming.
Cheers,
Greg
On Tue, Jun 29, 2021 at 09:10:30AM -0400, Greg Sabino Mullane wrote:
> Looks great, I appreciate the renaming.

Applied, then.
--
Michael