Discussion: Multiple full page writes in a single checkpoint?
Cluster file encryption plans to use the LSN and page number as the nonce for heap/index pages. I am looking into the use of a unique nonce during hint bit changes. (You need to use a new nonce for re-encrypting a page that changes.)

log_hint_bits already gives us a unique nonce for the first hint bit change on a page during a checkpoint, but we only encrypt on page write to the file system, so I am researching if log_hint_bits will already generate a unique LSN for every page write to the file system, even if there are multiple hint-bit-caused page writes to the file system during a single checkpoint. (We already know this works for multiple checkpoints.)

Our docs on full_page_writes state:

	When this parameter is on, the <productname>PostgreSQL</productname>
	server writes the entire content of each disk page to WAL during the
	first modification of that page after a checkpoint.

and the docs on wal_log_hints state:

	When this parameter is <literal>on</literal>, the
	<productname>PostgreSQL</productname> server writes the entire content
	of each disk page to WAL during the first modification of that page
	after a checkpoint, even for non-critical modifications of so-called
	hint bits.

However, imagine these steps:

1. checkpoint starts
2. page is modified by row or hint bit change
3. page gets a new LSN and is marked as dirty
4. page image is flushed to WAL
5. page is written to disk and marked as clean
6. page is modified by data or hint bit change
7. page gets a new LSN and is marked as dirty
8. page image is flushed to WAL
9. checkpoint completes
10. page is written to disk and marked as clean

Is the above case valid, and would it cause two full page writes to WAL? More specifically, wouldn't it cause every write of the page to the file system to use a new LSN?
If so, this means wal_log_hints is sufficient to guarantee a new nonce for every page image, even for multiple hint bit changes and page writes during a single checkpoint, and there is then no need for a hint bit counter on the page --- the unique LSN does that for us. I know log_hint_bits was designed to fix torn pages, but it seems to also do exactly what cluster file encryption needs.

If the above is all true, should we update the docs, READMEs, or C comments about this? I think the cluster file encryption patch would at least need to document that we need to keep this behavior, because I don't think log_hint_bits needs to behave this way for checksum purposes because of the way full page writes are processed during crash recovery.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
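Editorial sketch: the message above relies on the pair (page LSN, page number) being unique for every encrypted page image. The following is a minimal illustration of that idea, not code from any patch in this thread. The 16-byte layout, `NONCE_LEN`, `build_page_nonce()`, and `nonce_reused()` are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 8-byte page LSN, 4-byte block number, 4 zero bytes. */
#define NONCE_LEN 16

/* Pack the LSN and block number big-endian so the result does not depend
 * on the byte order of the host. */
void
build_page_nonce(uint64_t page_lsn, uint32_t block_no,
                 unsigned char out[NONCE_LEN])
{
    assert(out != NULL);
    memset(out, 0, NONCE_LEN);
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char) (page_lsn >> (56 - 8 * i));
    for (int i = 0; i < 4; i++)
        out[8 + i] = (unsigned char) (block_no >> (24 - 8 * i));
}

/* Returns 1 if two writes of the same block would use the same nonce. */
int
nonce_reused(uint64_t lsn_a, uint64_t lsn_b, uint32_t block_no)
{
    unsigned char a[NONCE_LEN];
    unsigned char b[NONCE_LEN];

    build_page_nonce(lsn_a, block_no, a);
    build_page_nonce(lsn_b, block_no, b);
    return memcmp(a, b, NONCE_LEN) == 0;
}
```

Under this layout, two writes of one block differ in nonce only if the page LSN differs, which is why the question of whether every file system write sees a new LSN matters.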
Hi,

On 2021-02-03 18:05:56 -0500, Bruce Momjian wrote:
> log_hint_bits already gives us a unique nonce for the first hint bit
> change on a page during a checkpoint, but we only encrypt on page write
> to the file system, so I am researching if log_hint_bits will already
> generate a unique LSN for every page write to the file system, even if
> there are multiple hint-bit-caused page writes to the file system during
> a single checkpoint. (We already know this works for multiple
> checkpoints.)

No, it won't:

> However, imagine these steps:
>
> 1. checkpoint starts
> 2. page is modified by row or hint bit change
> 3. page gets a new LSN and is marked as dirty
> 4. page image is flushed to WAL
> 5. page is written to disk and marked as clean
> 6. page is modified by data or hint bit change
> 7. page gets a new LSN and is marked as dirty
> 8. page image is flushed to WAL
> 9. checkpoint completes
> 10. page is written to disk and marked as clean
>
> Is the above case valid, and would it cause two full page writes to WAL?
> More specifically, wouldn't it cause every write of the page to the file
> system to use a new LSN?

No. 8) won't happen. Look e.g. at XLogSaveBufferForHint():

	/*
	 * Update RedoRecPtr so that we can make the right decision
	 */
	RedoRecPtr = GetRedoRecPtr();

	/*
	 * We assume page LSN is first data on *every* page that can be passed to
	 * XLogInsert, whether it has the standard page layout or not. Since we're
	 * only holding a share-lock on the page, we must take the buffer header
	 * lock when we look at the LSN.
	 */
	lsn = BufferGetLSNAtomic(buffer);

	if (lsn <= RedoRecPtr)
		/* wal log hint bit */

The RedoRecPtr is determined at 1) and doesn't change between 4) and 8). The LSN for 4) has to be *past* the RedoRecPtr from 1). Therefore we don't do another FPW.

Changing this is *completely* infeasible. In a lot of workloads it'd cause a *massive* explosion of WAL volume. Like quadratically. You'll need to find another way to generate a nonce.
In the non-hint bit case you'll automatically have a higher LSN in 7/8 though. So you won't need to do anything about getting a higher nonce.

For the hint bit case in 8 you could consider just using any LSN generated after 4 (preferably already flushed to disk) - but that seems somewhat ugly from a debuggability POV :/. Alternatively you could just create a tiny WAL record to get a new LSN, but that'll sometimes trigger new WAL flushes when the pages are dirtied.

Greetings,

Andres Freund
On Wed, Feb 3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > Is the above case valid, and would it cause two full page writes to WAL?
> > More specifically, wouldn't it cause every write of the page to the file
> > system to use a new LSN?
>
> No. 8) won't happen. Look e.g. at XLogSaveBufferForHint():
>
> 	/*
> 	 * Update RedoRecPtr so that we can make the right decision
> 	 */
> 	RedoRecPtr = GetRedoRecPtr();
>
> 	/*
> 	 * We assume page LSN is first data on *every* page that can be passed to
> 	 * XLogInsert, whether it has the standard page layout or not. Since we're
> 	 * only holding a share-lock on the page, we must take the buffer header
> 	 * lock when we look at the LSN.
> 	 */
> 	lsn = BufferGetLSNAtomic(buffer);
>
> 	if (lsn <= RedoRecPtr)
> 		/* wal log hint bit */
>
> The RedoRecPtr is determined at 1. and doesn't change between 4) and
> 8). The LSN for 4) has to be *past* the RedoRecPtr from 1). Therefore we
> don't do another FPW.

OK, so, what is happening is that it knows the page LSN is after the start of the current checkpoint (the redo point), so it knows not to do a full page write again? Smart, and makes sense.

> Changing this is *completely* infeasible. In a lot of workloads it'd
> cause a *massive* explosion of WAL volume. Like quadratically. You'll
> need to find another way to generate a nonce.

Do we often do multiple writes to the file system of the same page during a single checkpoint, particularly only-hint-bit-modified pages? I didn't think so.

> In the non-hint bit case you'll automatically have a higher LSN in 7/8
> though. So you won't need to do anything about getting a higher nonce.

Yes, I was counting on that. :-)

> For the hint bit case in 8 you could consider just using any LSN generated
> after 4 (preferably already flushed to disk) - but that seems somewhat
> ugly from a debuggability POV :/. Alternatively you could just create a
> tiny WAL record to get a new LSN, but that'll sometimes trigger new WAL
> flushes when the pages are dirtied.
Yes, that would make sense. I do need the first full page write during a checkpoint to be sure I don't have torn pages that have some part of the page encrypted with one LSN and a second part with a different LSN. You are right that I don't need a second full page write during the same checkpoint because a torn page would just restore the first full page write and throw away the second LSN and hint bit changes, which is fine.

I hadn't gotten to ask about that until I found if the previous assumptions were true, which they were not. Is the logical approach here to modify XLogSaveBufferForHint() so if a page write is not needed, to create a dummy WAL record that just increments the WAL location and updates the page LSN? (Is there a small WAL record I should reuse?)

I can try to add a hint-bit-page-write page counter, but that might overflow, and then we will need a way to change the LSN anyway.

I am researching this so I can give a clear report on the impact of adding this feature. I will update the wiki once we figure this out.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Hi,

On 2021-02-03 19:21:25 -0500, Bruce Momjian wrote:
> On Wed, Feb 3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > Changing this is *completely* infeasible. In a lot of workloads it'd
> > cause a *massive* explosion of WAL volume. Like quadratically. You'll
> > need to find another way to generate a nonce.
>
> Do we often do multiple writes to the file system of the same page
> during a single checkpoint, particularly only-hint-bit-modified pages?
> I didn't think so.

It can easily happen. Consider ringbuffer using scans (like vacuum, seqscan) - they'll force the buffer out to disk soon after it's been dirtied. And often will read the same page again a short bit later. Or just any workload that's a bit bigger than shared buffers (but data is in the OS cache). Subsequent scans will often have new hint bits to set.

> Is the logical approach here to modify XLogSaveBufferForHint() so if a
> page write is not needed, to create a dummy WAL record that just
> increments the WAL location and updates the page LSN?
> (Is there a small WAL record I should reuse?)

I think an explicit record type would be better. Or a hint record without an associated FPW.

> I can try to add a hint-bit-page-write page counter, but that might
> overflow, and then we will need a way to change the LSN anyway.

That's just a question of width...

Greetings,

Andres Freund
On Wed, Feb 3, 2021 at 05:00:19PM -0800, Andres Freund wrote:
> Hi,
>
> On 2021-02-03 19:21:25 -0500, Bruce Momjian wrote:
> > On Wed, Feb 3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > > Changing this is *completely* infeasible. In a lot of workloads it'd
> > > cause a *massive* explosion of WAL volume. Like quadratically. You'll
> > > need to find another way to generate a nonce.
> >
> > Do we often do multiple writes to the file system of the same page
> > during a single checkpoint, particularly only-hint-bit-modified pages?
> > I didn't think so.
>
> It can easily happen. Consider ringbuffer using scans (like vacuum,
> seqscan) - they'll force the buffer out to disk soon after it's been
> dirtied. And often will read the same page again a short bit later. Or
> just any workload that's a bit bigger than shared buffers (but data is
> in the OS cache). Subsequent scans will often have new hint bits to
> set.

Oh, good point.

> > Is the logical approach here to modify XLogSaveBufferForHint() so if a
> > page write is not needed, to create a dummy WAL record that just
> > increments the WAL location and updates the page LSN?
> > (Is there a small WAL record I should reuse?)
>
> I think an explicit record type would be better. Or a hint record
> without an associated FPW.

OK.

> > I can try to add a hint-bit-page-write page counter, but that might
> > overflow, and then we will need a way to change the LSN anyway.
>
> That's just a question of width...

Yeah, the hint bit counter is just delaying the inevitable, plus it changes the page format, which I am trying to avoid. Also, I need this dummy record only if the page is marked clean, meaning a write to the file system already happened in the current checkpoint --- should not be too bad.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
On Wed, Feb 3, 2021 at 08:07:16PM -0500, Bruce Momjian wrote:
> > > I can try to add a hint-bit-page-write page counter, but that might
> > > overflow, and then we will need a way to change the LSN anyway.
> >
> > That's just a question of width...
>
> Yeah, the hint bit counter is just delaying the inevitable, plus it
> changes the page format, which I am trying to avoid. Also, I need this
> dummy record only if the page is marked clean, meaning a write
> to the file system already happened in the current checkpoint --- should
> not be too bad.

Here is a proof-of-concept patch to do this. Thanks for your help.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Attachments
On Wed, Feb 3, 2021 at 08:07:16PM -0500, Bruce Momjian wrote:
> > > I can try to add a hint-bit-page-write page counter, but that might
> > > overflow, and then we will need a way to change the LSN anyway.
> >
> > That's just a question of width...
>
> Yeah, the hint bit counter is just delaying the inevitable, plus it
> changes the page format, which I am trying to avoid. Also, I need this
> dummy record only if the page is marked clean, meaning a write
> to the file system already happened in the current checkpoint --- should
> not be too bad.

In looking at your comments on Sawada-san's POC patch for buffer encryption:

	https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de

I see that he put a similar function call in exactly the same place I did, but you pointed out that he was inserting into WAL while holding a buffer lock.

I restructured my patch to not make that same mistake, and modified it for non-permanent buffers --- attached.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee