Thread: Multiple full page writes in a single checkpoint?

Multiple full page writes in a single checkpoint?

From: Bruce Momjian
Date:
Cluster file encryption plans to use the LSN and page number as the
nonce for heap/index pages.  I am looking into the use of a unique nonce
during hint bit changes.  (You need to use a new nonce for re-encrypting
a page that changes.)
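
To make the nonce idea concrete, here is a minimal sketch that packs a 64-bit LSN and a 32-bit page (block) number into a fixed-size buffer. The 16-byte length, the big-endian layout, and the names make_page_nonce and nonces_equal are assumptions made for illustration; the thread does not say how the cluster file encryption patch lays out its nonce.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NONCE_LEN 16            /* assumed nonce size, not taken from the patch */

/*
 * Hypothetical helper: build a nonce from the page LSN and block number.
 * Bytes 0..7 hold the LSN, bytes 8..11 the block number, the rest stay zero.
 * Big-endian byte order is used so the result does not depend on the host.
 */
void
make_page_nonce(uint64_t page_lsn, uint32_t block_no, uint8_t nonce[NONCE_LEN])
{
    memset(nonce, 0, NONCE_LEN);
    for (int i = 0; i < 8; i++)
        nonce[i] = (uint8_t) (page_lsn >> (56 - 8 * i));
    for (int i = 0; i < 4; i++)
        nonce[8 + i] = (uint8_t) (block_no >> (24 - 8 * i));
}

/* Two nonces collide only if both the LSN and the block number match. */
bool
nonces_equal(const uint8_t a[NONCE_LEN], const uint8_t b[NONCE_LEN])
{
    return memcmp(a, b, NONCE_LEN) == 0;
}
```

With this layout, rewriting the same block with an unchanged LSN reproduces the same nonce, which is exactly the situation the rest of this message tries to rule out.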

wal_log_hints already gives us a unique nonce for the first hint bit
change on a page during a checkpoint, but we only encrypt on page write
to the file system, so I am researching if wal_log_hints will already
generate a unique LSN for every page write to the file system, even if
there are multiple hint-bit-caused page writes to the file system during
a single checkpoint.  (We already know this works for multiple
checkpoints.)

Our docs on full_page_writes state:

    When this parameter is on, the
    <productname>PostgreSQL</productname> server writes the entire
    content of each disk page to WAL during the first modification
    of that page after a checkpoint.

and on wal_log_hints they state:

    When this parameter is <literal>on</literal>, the
    <productname>PostgreSQL</productname> server writes the entire
    content of each disk page to WAL during the first modification of
    that page after a checkpoint, even for non-critical modifications
    of so-called hint bits.

However, imagine these steps:

1.  checkpoint starts
2.  page is modified by row or hint bit change
3.  page gets a new LSN and is marked as dirty
4.  page image is flushed to WAL
5.  page is written to disk and marked as clean
6.  page is modified by data or hint bit change
7.  page gets a new LSN and is marked as dirty
8.  page image is flushed to WAL
9.  checkpoint completes
10. page is written to disk and marked as clean

Is the above case valid, and would it cause two full page writes to WAL?
More specifically, wouldn't it cause every write of the page to the file
system to use a new LSN?

If so, this means wal_log_hints is sufficient to guarantee a new nonce
for every page image, even for multiple hint bit changes and page writes
during a single checkpoint, and there is then no need for a hint bit
counter on the page --- the unique LSN does that for us.  I know
wal_log_hints was designed to fix torn pages, but it seems to also do
exactly what cluster file encryption needs.

If the above is all true, should we update the docs, READMEs, or C
comments about this?  I think the cluster file encryption patch would at
least need to document that we need to keep this behavior, because I
don't think wal_log_hints needs to behave this way for checksum
purposes because of the way full page writes are processed during crash
recovery.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Multiple full page writes in a single checkpoint?

From: Andres Freund
Date:
Hi,

On 2021-02-03 18:05:56 -0500, Bruce Momjian wrote:
> wal_log_hints already gives us a unique nonce for the first hint bit
> change on a page during a checkpoint, but we only encrypt on page write
> to the file system, so I am researching if wal_log_hints will already
> generate a unique LSN for every page write to the file system, even if
> there are multiple hint-bit-caused page writes to the file system during
> a single checkpoint.  (We already know this works for multiple
> checkpoints.)

No, it won't:

> However, imagine these steps:
> 
> 1.  checkpoint starts
> 2.  page is modified by row or hint bit change
> 3.  page gets a new LSN and is marked as dirty
> 4.  page image is flushed to WAL
> 5.  page is written to disk and marked as clean
> 6.  page is modified by data or hint bit change
> 7.  page gets a new LSN and is marked as dirty
> 8.  page image is flushed to WAL
> 9.  checkpoint completes
> 10. page is written to disk and marked as clean
> 
> Is the above case valid, and would it cause two full page writes to WAL?
> More specifically, wouldn't it cause every write of the page to the file
> system to use a new LSN?

No. 8) won't happen.  Look e.g. at XLogSaveBufferForHint():

    /*
     * Update RedoRecPtr so that we can make the right decision
     */
    RedoRecPtr = GetRedoRecPtr();

    /*
     * We assume page LSN is first data on *every* page that can be passed to
     * XLogInsert, whether it has the standard page layout or not. Since we're
     * only holding a share-lock on the page, we must take the buffer header
     * lock when we look at the LSN.
     */
    lsn = BufferGetLSNAtomic(buffer);

    if (lsn <= RedoRecPtr)
        /* wal log hint bit */

The RedoRecPtr is determined at 1. and doesn't change between 4) and
8). The LSN for 4) has to be *past* the RedoRecPtr from 1). Therefore we
don't do another FPW.
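
The effect of that comparison can be reproduced with a toy model. This is only a sketch: ToyPage, toy_hint_bit_change, and the pretend 8192-byte WAL advance are invented for illustration, and XLogRecPtr is redefined locally as a stand-in for the real type. Only the `lsn <= RedoRecPtr` test mirrors the snippet quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* local stand-in for the 64-bit WAL position */

/* Toy page: just the LSN and a count of full page images logged for it. */
typedef struct ToyPage
{
    XLogRecPtr  lsn;
    int         fpw_count;
} ToyPage;

/*
 * Model a hint-bit-only change.  A full page image is logged only if the
 * page LSN is not past the redo pointer of the current checkpoint.
 * Returns true if an image was logged (and the page LSN advanced).
 */
bool
toy_hint_bit_change(ToyPage *page, XLogRecPtr redo_rec_ptr,
                    XLogRecPtr *wal_insert_pos)
{
    if (page->lsn <= redo_rec_ptr)
    {
        *wal_insert_pos += 8192;        /* pretend the image takes one page */
        page->lsn = *wal_insert_pos;
        page->fpw_count++;
        return true;
    }
    return false;                       /* LSN stays as it was */
}
```

Calling toy_hint_bit_change twice with the same redo pointer logs an image the first time only; the second call leaves the page LSN untouched, which is step 8 not happening.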


Changing this is *completely* infeasible. In a lot of workloads it'd
cause a *massive* explosion of WAL volume. Like quadratically. You'll
need to find another way to generate a nonce.

In the non-hint bit case you'll automatically have a higher LSN in 7/8
though. So you won't need to do anything about getting a higher nonce.

For the hint bit case in 8 you could consider just using any LSN generated
after 4 (preferably already flushed to disk) - but that seems somewhat
ugly from a debuggability POV :/. Alternatively you could just create
a tiny WAL record to get a new LSN, but that'll sometimes trigger new WAL
flushes when the pages are dirtied.

Greetings,

Andres Freund



Re: Multiple full page writes in a single checkpoint?

From: Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > Is the above case valid, and would it cause two full page writes to WAL?
> > More specifically, wouldn't it cause every write of the page to the file
> > system to use a new LSN?
> 
> No. 8) won't happen.  Look e.g. at XLogSaveBufferForHint():
> 
>     /*
>      * Update RedoRecPtr so that we can make the right decision
>      */
>     RedoRecPtr = GetRedoRecPtr();
> 
>     /*
>      * We assume page LSN is first data on *every* page that can be passed to
>      * XLogInsert, whether it has the standard page layout or not. Since we're
>      * only holding a share-lock on the page, we must take the buffer header
>      * lock when we look at the LSN.
>      */
>     lsn = BufferGetLSNAtomic(buffer);
> 
>     if (lsn <= RedoRecPtr)
>         /* wal log hint bit */
> 
> The RedoRecPtr is determined at 1. and doesn't change between 4) and
> 8). The LSN for 4) has to be *past* the RedoRecPtr from 1). Therefore we
> don't do another FPW.

OK, so, what is happening is that it knows the page LSN is after the
start of the current checkpoint (the redo point), so it knows not to do
a full page write again?  Smart, and makes sense.

> Changing this is *completely* infeasible. In a lot of workloads it'd
> cause a *massive* explosion of WAL volume. Like quadratically. You'll
> need to find another way to generate a nonce.

Do we often do multiple writes to the file system of the same page
during a single checkpoint, particularly only-hint-bit-modified pages?
I didn't think so.

> In the non-hint bit case you'll automatically have a higher LSN in 7/8
> though. So you won't need to do anything about getting a higher nonce.

Yes, I was counting on that.  :-)

> For the hint bit case in 8 you could consider just using any LSN generated
> after 4 (preferably already flushed to disk) - but that seems somewhat
> ugly from a debuggability POV :/. Alternatively you could just create
> a tiny WAL record to get a new LSN, but that'll sometimes trigger new WAL
> flushes when the pages are dirtied.

Yes, that would make sense.  I do need the first full page write during
a checkpoint to be sure I don't have torn pages that have some part of
the page encrypted with one LSN and a second part with a different LSN. 
You are right that I don't need a second full page write during the same
checkpoint because a torn page would just restore the first full page
write and throw away the second LSN and hint bit changes, which is fine.

I did not want to ask about that until I had found out whether my
previous assumptions were true, which they were not.

Is the logical approach here to modify XLogSaveBufferForHint() so that,
if a page write is not needed, it creates a dummy WAL record that just
increments the WAL location and updates the page LSN?  (Is there a small
WAL record I should reuse?)  I can try to add a hint-bit-page-write page
counter, but that might overflow, and then we will need a way to change
the LSN anyway.

I am researching this so I can give a clear report on the impact of
adding this feature.  I will update the wiki once we figure this out.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Multiple full page writes in a single checkpoint?

From: Andres Freund
Date:
Hi,

On 2021-02-03 19:21:25 -0500, Bruce Momjian wrote:
> On Wed, Feb  3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > Changing this is *completely* infeasible. In a lot of workloads it'd
> > cause a *massive* explosion of WAL volume. Like quadratically. You'll
> > need to find another way to generate a nonce.
>
> Do we often do multiple writes to the file system of the same page
> during a single checkpoint, particularly only-hint-bit-modified pages?
> I didn't think so.

It can easily happen. Consider ringbuffer using scans (like vacuum,
seqscan) - they'll force the buffer out to disk soon after it's been
dirtied. And often will read the same page again a short bit later. Or
just any workload that's a bit bigger than shared buffers (but data is
in the OS cache).  Subsequent scans will often have new hint bits to
set.
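
To get a rough feel for what repeated write-outs of the same page would mean for WAL volume, here is a back-of-the-envelope model. It is a sketch under loud assumptions: ASSUMED_SMALL_REC (32 bytes) is a made-up size for a tiny record, record headers and compression are ignored, and the function names are invented. Only the 8192-byte default page size is a PostgreSQL fact.

```c
#include <stdint.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */
#define ASSUMED_SMALL_REC 32    /* hypothetical size of a tiny WAL record */

/* Current behavior: one full page image per page per checkpoint. */
uint64_t
wal_bytes_fpw_first_write_only(uint64_t pages, uint64_t writes_per_page)
{
    if (writes_per_page == 0)
        return 0;
    return pages * BLCKSZ;
}

/* What a full page image before every write-out would cost. */
uint64_t
wal_bytes_fpw_every_write(uint64_t pages, uint64_t writes_per_page)
{
    return pages * writes_per_page * BLCKSZ;
}

/* One full page image, then a small record for each later write-out. */
uint64_t
wal_bytes_fpw_then_small_records(uint64_t pages, uint64_t writes_per_page)
{
    if (writes_per_page == 0)
        return 0;
    return pages * BLCKSZ
        + pages * (writes_per_page - 1) * ASSUMED_SMALL_REC;
}
```

For 1000 pages each written out 5 times in one checkpoint, the model gives 8,192,000 bytes for the current behavior, 40,960,000 bytes with an image per write, and 8,320,000 bytes with one image plus small records.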


> Is the logical approach here to modify XLogSaveBufferForHint() so that,
> if a page write is not needed, it creates a dummy WAL record that just
> increments the WAL location and updates the page LSN?
> (Is there a small WAL record I should reuse?)

I think an explicit record type would be better. Or a hint record
without an associated FPW.


> I can try to add a hint-bit-page-write page counter, but that might
> overflow, and then we will need a way to change the LSN anyway.

That's just a question of width...
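
As a quick illustration of the width argument, the arithmetic below shows how long a counter of a given width lasts before its values repeat. It assumes a counter that is never reset and a constant rate of hint-bit page writes; both are simplifications, and the function names are invented.

```c
#include <stdint.h>

/* Number of distinct values an unsigned counter of 'bits' width can hold.
 * Valid for 1 <= bits <= 63 in this sketch. */
uint64_t
distinct_counter_values(unsigned bits)
{
    return UINT64_C(1) << bits;
}

/* Whole seconds until the counter would have to reuse a value, given an
 * assumed constant number of counter bumps per second. */
uint64_t
seconds_until_reuse(unsigned bits, uint64_t writes_per_second)
{
    if (writes_per_second == 0)
        return UINT64_MAX;      /* never bumped, never wraps */
    return distinct_counter_values(bits) / writes_per_second;
}
```

At an assumed 100 bumps per second, a 16-bit counter lasts about 655 seconds, while a 32-bit counter lasts about 42.9 million seconds (well over a year).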

Greetings,

Andres Freund



Re: Multiple full page writes in a single checkpoint?

From: Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 05:00:19PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2021-02-03 19:21:25 -0500, Bruce Momjian wrote:
> > On Wed, Feb  3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > > Changing this is *completely* infeasible. In a lot of workloads it'd
> > > cause a *massive* explosion of WAL volume. Like quadratically. You'll
> > > need to find another way to generate a nonce.
> >
> > Do we often do multiple writes to the file system of the same page
> > during a single checkpoint, particularly only-hint-bit-modified pages?
> > I didn't think so.
> 
> It can easily happen. Consider ringbuffer using scans (like vacuum,
> seqscan) - they'll force the buffer out to disk soon after it's been
> dirtied. And often will read the same page again a short bit later. Or
> just any workload that's a bit bigger than shared buffers (but data is
> in the OS cache).  Subsequent scans will often have new hint bits to
> set.

Oh, good point.

> > Is the logical approach here to modify XLogSaveBufferForHint() so that,
> > if a page write is not needed, it creates a dummy WAL record that just
> > increments the WAL location and updates the page LSN?
> > (Is there a small WAL record I should reuse?)
> 
> I think an explicit record type would be better. Or a hint record
> without an associated FPW.

OK.

> > I can try to add a hint-bit-page-write page counter, but that might
> > overflow, and then we will need a way to change the LSN anyway.
> 
> That's just a question of width...

Yeah, the hint bit counter is just delaying the inevitable, plus it
changes the page format, which I am trying to avoid.  Also, I need this
dummy record only if the page is marked clean, meaning a write
to the file system already happened in the current checkpoint --- should
not be too bad.
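
The rule described here can be sketched as a three-way decision. This is a toy model, not the proof-of-concept patch: ToyBuffer, HintAction, toy_set_hint_bit, toy_write_page, and the 8192/32-byte WAL advances are all invented for illustration, and XLogRecPtr is a local stand-in for the real type.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* local stand-in for the 64-bit WAL position */

typedef enum HintAction
{
    HINT_FULL_PAGE_IMAGE,       /* first change since the redo point */
    HINT_DUMMY_RECORD,          /* page was already written out: bump LSN */
    HINT_NO_WAL                 /* page still dirty: current LSN not yet used */
} HintAction;

typedef struct ToyBuffer
{
    XLogRecPtr  lsn;
    bool        dirty;
} ToyBuffer;

/* Model a hint-bit-only change under the proposed rule. */
HintAction
toy_set_hint_bit(ToyBuffer *buf, XLogRecPtr redo, XLogRecPtr *wal_pos)
{
    HintAction  action;

    if (buf->lsn <= redo)
    {
        *wal_pos += 8192;       /* pretend size of a full page image */
        buf->lsn = *wal_pos;
        action = HINT_FULL_PAGE_IMAGE;
    }
    else if (!buf->dirty)
    {
        *wal_pos += 32;         /* assumed size of a dummy record */
        buf->lsn = *wal_pos;
        action = HINT_DUMMY_RECORD;
    }
    else
        action = HINT_NO_WAL;

    buf->dirty = true;
    return action;
}

/* Model a write to the file system; returns the LSN used as the nonce. */
XLogRecPtr
toy_write_page(ToyBuffer *buf)
{
    buf->dirty = false;
    return buf->lsn;
}
```

In this model every call to toy_write_page that follows a hint bit change returns a larger LSN than the previous write, while hint bit changes on a page that is still dirty generate no WAL at all.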

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Multiple full page writes in a single checkpoint?

From: Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 08:07:16PM -0500, Bruce Momjian wrote:
> > > I can try to add a hint-bit-page-write page counter, but that might
> > > overflow, and then we will need a way to change the LSN anyway.
> > 
> > That's just a question of width...
> 
> Yeah, the hint bit counter is just delaying the inevitable, plus it
> changes the page format, which I am trying to avoid.  Also, I need this
> dummy record only if the page is marked clean, meaning a write
> to the file system already happened in the current checkpoint --- should
> not be too bad.

Here is a proof-of-concept patch to do this.  Thanks for your help.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachments

Re: Multiple full page writes in a single checkpoint?

From: Bruce Momjian
Date:
On Wed, Feb  3, 2021 at 08:07:16PM -0500, Bruce Momjian wrote:
> > > I can try to add a hint-bit-page-write page counter, but that might
> > > overflow, and then we will need a way to change the LSN anyway.
> > 
> > That's just a question of width...
> 
> Yeah, the hint bit counter is just delaying the inevitable, plus it
> changes the page format, which I am trying to avoid.  Also, I need this
> dummy record only if the page is marked clean, meaning a write
> to the file system already happened in the current checkpoint --- should
> not be too bad.

In looking at your comments on Sawada-san's POC patch for buffer
encryption:

    https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de

I see that he put a similar function call in exactly the same place I
did, but you pointed out that he was inserting into WAL while holding a
buffer lock.

I restructured my patch to not make that same mistake, and modified it
for non-permanent buffers --- attached.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachments