Discussion: Pre-allocating WAL files
Hi, When running write heavy transactional workloads I've many times observed that one needs to run the benchmarks for quite a while till they get to their steady state performance. The most significant reason for that is that initially WAL files will not get recycled, but need to be freshly initialized. That's 16MB of writes that need to synchronously finish before a small write transaction can even start to be written out... I think there's two useful things we could do: 1) Add pg_wal_preallocate(uint64 bytes) that ensures (bytes + segment_size - 1) / segment_size WAL segments exist from the current point in the WAL. Perhaps with the number of bytes defaulting to min_wal_size if not explicitly specified? 2) Have checkpointer (we want walwriter to run with low latency to flush out async commits etc) occasionally check if WAL files need to be pre-allocated. Checkpointer already tracks the amount of WAL that's expected to be generated till the end of the checkpoint, so it seems like it's a pretty good candidate to do so. To keep checkpointer pre-allocating when idle we could signal it whenever a record has crossed a segment boundary. With a plain pgbench run I see a 2.5x reduction in throughput in the periods where we initialize WAL files. Greetings, Andres Freund
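For reference, a minimal standalone sketch of the rounding arithmetic behind the proposed pg_wal_preallocate(bytes). WAL_SEGMENT_SIZE and segments_to_preallocate() are illustrative names for this sketch only, not existing PostgreSQL symbols:

    #include <stdint.h>
    #include <stdio.h>

    #define WAL_SEGMENT_SIZE (UINT64_C(16) * 1024 * 1024)   /* default 16MB segment */

    /* ceiling division: (bytes + segment_size - 1) / segment_size */
    static uint64_t
    segments_to_preallocate(uint64_t bytes)
    {
        return (bytes + WAL_SEGMENT_SIZE - 1) / WAL_SEGMENT_SIZE;
    }

    int
    main(void)
    {
        /* e.g. a min_wal_size of 80MB maps to 5 segments of 16MB */
        printf("%llu\n",
               (unsigned long long) segments_to_preallocate(UINT64_C(80) * 1024 * 1024));
        return 0;
    }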
On 12/25/20, 12:09 PM, "Andres Freund" <andres@anarazel.de> wrote: > When running write heavy transactional workloads I've many times > observed that one needs to run the benchmarks for quite a while till > they get to their steady state performance. The most significant reason > for that is that initially WAL files will not get recycled, but need to > be freshly initialized. That's 16MB of writes that need to synchronously > finish before a small write transaction can even start to be written > out... > > I think there's two useful things we could do: > > 1) Add pg_wal_preallocate(uint64 bytes) that ensures (bytes + > segment_size - 1) / segment_size WAL segments exist from the current > point in the WAL. Perhaps with the number of bytes defaulting to > min_wal_size if not explicitly specified? > > 2) Have checkpointer (we want walwriter to run with low latency to flush > out async commits etc) occasionally check if WAL files need to be > pre-allocated. > > Checkpointer already tracks the amount of WAL that's expected to be > generated till the end of the checkpoint, so it seems like it's a > pretty good candidate to do so. > > To keep checkpointer pre-allocating when idle we could signal it > whenever a record has crossed a segment boundary. > > > With a plain pgbench run I see a 2.5x reduction in throughput in the > periods where we initialize WAL files. I've been exploring this independently a bit and noticed this message. Attached is a proof-of-concept patch for a separate "WAL allocator" process that maintains a pool of WAL-segment-sized files that can be claimed whenever a new segment file is needed. An early version of this patch attempted to spread the I/O like non-immediate checkpoints do, but I couldn't point to any real benefit from doing so, and it complicated things quite a bit. I like the idea of trying to bake this into an existing process such as the checkpointer. I'll admit that creating a new process just for WAL pre-allocation feels a bit heavy-handed, but it was a nice way to keep this stuff modularized. I can look into moving this functionality into the checkpointer process if this is something that folks are interested in. Nathan
Attachments
On Mon, Jun 7, 2021 at 8:48 PM Bossart, Nathan <bossartn@amazon.com> wrote: > > On 12/25/20, 12:09 PM, "Andres Freund" <andres@anarazel.de> wrote: > > When running write heavy transactional workloads I've many times > > observed that one needs to run the benchmarks for quite a while till > > they get to their steady state performance. The most significant reason > > for that is that initially WAL files will not get recycled, but need to > > be freshly initialized. That's 16MB of writes that need to synchronously > > finish before a small write transaction can even start to be written > > out... > > > > I think there's two useful things we could do: > > > > 1) Add pg_wal_preallocate(uint64 bytes) that ensures (bytes + > > segment_size - 1) / segment_size WAL segments exist from the current > > point in the WAL. Perhaps with the number of bytes defaulting to > > min_wal_size if not explicitly specified? > > > > 2) Have checkpointer (we want walwriter to run with low latency to flush > > out async commits etc) occasionally check if WAL files need to be > > pre-allocated. > > > > Checkpointer already tracks the amount of WAL that's expected to be > > generated till the end of the checkpoint, so it seems like it's a > > pretty good candidate to do so. > > > > To keep checkpointer pre-allocating when idle we could signal it > > whenever a record has crossed a segment boundary. > > > > > > With a plain pgbench run I see a 2.5x reduction in throughput in the > > periods where we initialize WAL files. > > I've been exploring this independently a bit and noticed this message. > Attached is a proof-of-concept patch for a separate "WAL allocator" > process that maintains a pool of WAL-segment-sized files that can be > claimed whenever a new segment file is needed. An early version of > this patch attempted to spread the I/O like non-immediate checkpoints > do, but I couldn't point to any real benefit from doing so, and it > complicated things quite a bit. > > I like the idea of trying to bake this into an existing process such > as the checkpointer. I'll admit that creating a new process just for > WAL pre-allocation feels a bit heavy-handed, but it was a nice way to > keep this stuff modularized. I can look into moving this > functionality into the checkpointer process if this is something that > folks are interested in. Thanks for posting the patch, the patch no more applies on Head: Applying: wal segment pre-allocation error: patch failed: src/backend/access/transam/xlog.c:3283 error: src/backend/access/transam/xlog.c: patch does not apply Can you rebase the patch and post, it might help if someone is picking it up for review. Regards, Vignesh
On 7/5/21, 9:52 AM, "vignesh C" <vignesh21@gmail.com> wrote: > Thanks for posting the patch, the patch no more applies on Head: > Applying: wal segment pre-allocation > error: patch failed: src/backend/access/transam/xlog.c:3283 > error: src/backend/access/transam/xlog.c: patch does not apply > > Can you rebase the patch and post, it might help if someone is picking > it up for review. I've attached a rebased version of the patch. Nathan
Attachments
On 7/9/21, 2:10 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > I've attached a rebased version of the patch. Here's a newer rebased version of the patch. Nathan
Attachments
On 8/6/21, 1:27 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > Here's a newer rebased version of the patch. Rebasing again to keep http://commitfest.cputube.org/ happy. Nathan
Attachments
On 8/31/21, 10:27 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > Rebasing again to keep http://commitfest.cputube.org/ happy. Another rebase. Nathan
Attachments
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, failed
Spec compliant:           not tested
Documentation:            not tested

Hi! We've looked through the code and everything looks good except few minor things:

1). Using dedicated bg worker seems not optimal, it introduces a lot of redundant code for little single action. We'd join initial proposal of Andres to implement it as an extra fuction of checkpointer.
2). In our view, it is better to shift #define PREALLOCSEGDIR outside the function body.
3). It is better to have at least small comments on the functions GetNumPreallocatedWalSegs and SetNumPreallocatedWalSegs.

We've also tested the performance difference between the master branch and this patch and noticed no significant difference in performance. We used pgbench with some sort of "standard" settings:

$ pgbench -c50 -s50 -T200 -P1 -r postgres

...and with...

$ pgbench -c100 -s50 -T200 -P1 -r postgres

When looking at the per-second output of pgbench we saw regular spikes of latency (around a 5-10x increase), and this pattern was similar with and without the patch. The overall average latency for the 200 sec of pgbench also looks pretty much the same with and without the patch. Could you provide your testing setup so we can see the effect, please?

The check-world run was successful. Overall the patch looks good, but in our view it would be better to have experimental evidence of the performance improvements before it is committed.

---
Best regards,
Maxim Orlov, Pavel Borisov.

The new status of this patch is: Waiting on Author
On 10/6/21, 5:20 AM, "Maxim Orlov" <m.orlov@postgrespro.ru> wrote: > We've looked through the code and everything looks good except few minor things: > 1). Using dedicated bg worker seems not optimal, it introduces a lot of redundant code for little single action. > We'd join initial proposal of Andres to implement it as an extra fuction of checkpointer. Thanks for taking a look! I have been thinking about the right place to put this logic. My first thought is that it sounds like something that ought to go in the WAL writer process, but as Andres noted upthread, that is undesirable because it'll add latency for asynchronous commits. The checkpointer process seems to be another candidate, but I'm not totally sure how it'll fit in. My patch works by maintaining a small pool of pre- allocated segments that is quickly replenished whenever one is used. If the checkpointer is spending most of its time checkpointing, this small pool could remain empty for long periods of time. To keep pre- allocating WAL while we're checkpointing, perhaps we could add another function like CheckpointWriteDelay() that is called periodically. There still might be several operations in CheckPointGuts() that take a while and leave the segment pool empty, but maybe that's okay for now. Nathan
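As a rough illustration of that approach, here is a minimal sketch of a pool top-up routine the checkpointer could call from its write loop, in the same spirit as CheckpointWriteDelay(). GetNumPreallocatedWalSegs() is a function from the posted patch; PreallocateOneWalSeg() and wal_prealloc_pool_size are assumed names used only for this sketch:

    #include <stdbool.h>

    extern int  GetNumPreallocatedWalSegs(void);     /* from the posted patch */
    extern bool PreallocateOneWalSeg(void);          /* assumed helper */
    static int  wal_prealloc_pool_size = 2;          /* assumed GUC-style knob */

    /* Called periodically while a checkpoint is in progress, so the small
     * pool of pre-allocated segments does not stay empty for long. */
    static void
    TopUpWalSegmentPool(void)
    {
        while (GetNumPreallocatedWalSegs() < wal_prealloc_pool_size)
        {
            /* stop early if pre-allocation fails, e.g. out of disk space */
            if (!PreallocateOneWalSeg())
                break;
        }
    }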
On 10/6/21, 9:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > I have been thinking about the right place to put this logic. My > first thought is that it sounds like something that ought to go in the > WAL writer process, but as Andres noted upthread, that is undesirable > because it'll add latency for asynchronous commits. The checkpointer > process seems to be another candidate, but I'm not totally sure how > it'll fit in. My patch works by maintaining a small pool of pre- > allocated segments that is quickly replenished whenever one is used. > If the checkpointer is spending most of its time checkpointing, this > small pool could remain empty for long periods of time. To keep pre- > allocating WAL while we're checkpointing, perhaps we could add another > function like CheckpointWriteDelay() that is called periodically. > There still might be several operations in CheckPointGuts() that take > a while and leave the segment pool empty, but maybe that's okay for > now. Here is a first attempt at adding the pre-allocation logic to the checkpointer. I went ahead and just used CheckpointWriteDelay() for pre-allocating during checkpoints. I've done a few pgbench runs, and it seems to work pretty well. Initialization is around 15% faster, and I'm seeing about a 5% increase in TPS with a simple-update workload with wal_recycle turned off. Of course, these improvements go away once segments can be recycled. Nathan
Attachments
On 10/8/21, 1:55 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > Here is a first attempt at adding the pre-allocation logic to the > checkpointer. I went ahead and just used CheckpointWriteDelay() for > pre-allocating during checkpoints. I've done a few pgbench runs, and > it seems to work pretty well. Initialization is around 15% faster, > and I'm seeing about a 5% increase in TPS with a simple-update > workload with wal_recycle turned off. Of course, these improvements > go away once segments can be recycled. Here is a rebased version of this patch set. I'm getting the sense that there isn't a whole lot of interest in this feature, so I'll likely withdraw it if it goes too much longer without traction. Nathan
Attachments
On Thu, Nov 11, 2021 at 12:29 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > On 10/8/21, 1:55 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > > Here is a first attempt at adding the pre-allocation logic to the > > checkpointer. I went ahead and just used CheckpointWriteDelay() for > > pre-allocating during checkpoints. I've done a few pgbench runs, and > > it seems to work pretty well. Initialization is around 15% faster, > > and I'm seeing about a 5% increase in TPS with a simple-update > > workload with wal_recycle turned off. Of course, these improvements > > go away once segments can be recycled. > > Here is a rebased version of this patch set. I'm getting the sense > that there isn't a whole lot of interest in this feature, so I'll > likely withdraw it if it goes too much longer without traction. As I mentioned in the other thread at [1], let's continue the discussion here. Why can't the walwriter pre-allocate some of the WAL segments instead of a new background process? Of course, it might delay the main functionality of the walwriter i.e. flush and sync the WAL files, but having checkpointer do the pre-allocation makes it do another extra task. Here the amount of walwriter work vs checkpointer work, I'm not sure which one does more work compared to the other. Another idea could be to let walwrtier or checkpointer pre-allocate the WAL files whichever seems free as-of-the-moment when the WAL segment pre-allocation request comes. We can go further to let the user choose which process i.e. checkpointer or walwrtier do the pre-allocation with a GUC maybe? [1] - https://www.postgresql.org/message-id/CALj2ACVqYJX9JugooRC1chb2sHqv-C9mYEBE1kxwn%2BTn9vY42A%40mail.gmail.com Regards, Bharath Rupireddy.
On 12/7/21, 12:29 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > Why can't the walwriter pre-allocate some of the WAL segments instead > of a new background process? Of course, it might delay the main > functionality of the walwriter i.e. flush and sync the WAL files, but > having checkpointer do the pre-allocation makes it do another extra > task. Here the amount of walwriter work vs checkpointer work, I'm not > sure which one does more work compared to the other. The argument against adding it to the WAL writer is that we want it to run with low latency to flush asynchronous commits. If we added WAL pre-allocation to the WAL writer, there could periodically be large delays. > Another idea could be to let walwrtier or checkpointer pre-allocate > the WAL files whichever seems free as-of-the-moment when the WAL > segment pre-allocation request comes. We can go further to let the > user choose which process i.e. checkpointer or walwrtier do the > pre-allocation with a GUC maybe? My latest patch set [0] adds WAL pre-allocation to the checkpointer. In that patch set, WAL pre-allocation is done both outside of checkpoints as well as during checkpoints (via CheckPointWriteDelay()). Nathan [0] https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com
On 12/7/21, 9:35 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > On 12/7/21, 12:29 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: >> Why can't the walwriter pre-allocate some of the WAL segments instead >> of a new background process? Of course, it might delay the main >> functionality of the walwriter i.e. flush and sync the WAL files, but >> having checkpointer do the pre-allocation makes it do another extra >> task. Here the amount of walwriter work vs checkpointer work, I'm not >> sure which one does more work compared to the other. > > The argument against adding it to the WAL writer is that we want it to > run with low latency to flush asynchronous commits. If we added WAL > pre-allocation to the WAL writer, there could periodically be large > delays. To your point on trying to avoid giving the checkpointer extra tasks (basically what we are talking about on the other thread [0]), WAL pre-allocation might not be of much concern because it will generally be a small, fixed (and configurable) amount of work, and it can be performed concurrently with the checkpoint. Plus, WAL pre-allocation should ordinarily be phased out as WAL segments become eligible for recycling. IMO it's not comparable to tasks like CheckPointSnapBuild() that can delay checkpointing for a long time. Nathan [0] https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
> pre-allocating during checkpoints. I've done a few pgbench runs, and
> it seems to work pretty well. Initialization is around 15% faster,
> and I'm seeing about a 5% increase in TPS with a simple-update
> workload with wal_recycle turned off. Of course, these improvements
> go away once segments can be recycled.
I've checked the patch v7. It applies cleanly, code is good, check-world tests passed without problems.
I think it's ok to use checkpointer for this and the overall patch can be committed. But the seen performance gain makes me think again before adding this feature. I did tests myself a couple of months ago and got similar results.
Really don't know whether it is worth the effort.
Wish you and all hackers happy New Year!
I did check the patch too and found it to be ok. Check and check-world are passed.
Overall idea seems to be good in my opinion, but I'm not sure where is the optimal place to put the pre-allocation.
On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> pre-allocating during checkpoints. I've done a few pgbench runs, and
>> it seems to work pretty well. Initialization is around 15% faster,
>> and I'm seeing about a 5% increase in TPS with a simple-update
>> workload with wal_recycle turned off. Of course, these improvements
>> go away once segments can be recycled.
>
> I've checked the patch v7. It applies cleanly, code is good, check-world tests passed without problems.
> I think it's ok to use checkpointer for this and the overall patch can be committed. But the seen performance gain makes me think again before adding this feature. I did tests myself a couple of months ago and got similar results.
> Really don't know whether it is worth the effort.
> Wish you and all hackers happy New Year!
---
Best regards,
Maxim Orlov.
On 12/30/21, 3:52 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote: > I did check the patch too and found it to be ok. Check and check-world are passed. > Overall idea seems to be good in my opinion, but I'm not sure where is the optimal place to put the pre-allocation. > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: >> I've checked the patch v7. It applies cleanly, code is good, check-world tests passed without problems. >> I think it's ok to use checkpointer for this and the overall patch can be committed. But the seen performance gain makes me think again before adding this feature. I did tests myself a couple of months ago and got similar results. >> Really don't know whether it is worth the effort. Thank you both for your review. Nathan
On Thu, Jan 6, 2022 at 3:39 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > On 12/30/21, 3:52 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote: > > I did check the patch too and found it to be ok. Check and check-world are passed. > > Overall idea seems to be good in my opinion, but I'm not sure where is the optimal place to put the pre-allocation. > > > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > >> I've checked the patch v7. It applies cleanly, code is good, check-world tests passed without problems. > >> I think it's ok to use checkpointer for this and the overall patch can be committed. But the seen performance gain makes me think again before adding this feature. I did tests myself a couple of months ago and got similar results. > >> Really don't know whether it is worth the effort. > > Thank you both for your review. It may have been discussed earlier, let me ask this here - IIUC the whole point of pre-allocating WAL files is that creating new WAL files of wal_segment_size requires us to write zero-filled empty pages to the disk which is costly. With the advent of fallocate/posix_fallocate, isn't file allocation going to be much faster on platforms where fallocate is supported? IIRC, the "Asynchronous and "direct" IO support for PostgreSQL." has a way to use fallocate. If at all, we move ahead and use fallocate, then the whole point of pre-allocating WAL files becomes unnecessary? Having said above, the idea of pre-allocating WAL files is still relevant, given the portability of fallocate/posix_fallocate. Regards, Bharath Rupireddy.
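To make the comparison concrete, here is a minimal standalone sketch (not PostgreSQL code) of allocating a 16MB file with posix_fallocate() versus writing zero-filled pages; whether fallocate actually ends up faster for WAL is platform- and filesystem-dependent, as the message notes:

    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)
    #define BLCKSZ   8192

    int
    main(void)
    {
        char  buf[BLCKSZ];
        off_t written;
        int   fd;

        /* posix_fallocate(): reserve the blocks without pushing 16MB of
         * zeroes through the page cache */
        fd = open("seg.fallocate", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0 || posix_fallocate(fd, 0, SEG_SIZE) != 0)
            return 1;
        close(fd);

        /* the traditional approach: write zero-filled pages until the file
         * reaches the full segment size */
        fd = open("seg.zerofill", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return 1;
        memset(buf, 0, sizeof(buf));
        for (written = 0; written < SEG_SIZE; written += BLCKSZ)
            if (write(fd, buf, BLCKSZ) != BLCKSZ)
                return 1;
        close(fd);
        return 0;
    }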
On Sat, Jan 15, 2022 at 1:36 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Jan 6, 2022 at 3:39 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > > > On 12/30/21, 3:52 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote: > > > I did check the patch too and found it to be ok. Check and check-world are passed. > > > Overall idea seems to be good in my opinion, but I'm not sure where is the optimal place to put the pre-allocation. > > > > > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > > >> I've checked the patch v7. It applies cleanly, code is good, check-world tests passed without problems. > > >> I think it's ok to use checkpointer for this and the overall patch can be committed. But the seen performance gain makes me think again before adding this feature. I did tests myself a couple of months ago and got similar results. > > >> Really don't know whether it is worth the effort. > > > > Thank you both for your review. > > It may have been discussed earlier, let me ask this here - IIUC the > whole point of pre-allocating WAL files is that creating new WAL files > of wal_segment_size requires us to write zero-filled empty pages to > the disk which is costly. With the advent of > fallocate/posix_fallocate, isn't file allocation going to be much > faster on platforms where fallocate is supported? IIRC, the > "Asynchronous and "direct" IO support for PostgreSQL." has a way to > use fallocate. If at all, we move ahead and use fallocate, then the > whole point of pre-allocating WAL files becomes unnecessary? > > Having said above, the idea of pre-allocating WAL files is still > relevant, given the portability of fallocate/posix_fallocate. Adding one more point: do we have any numbers like how much total time WAL files allocation usually takes, maybe under a high-write load server? Regards, Bharath Rupireddy.
On Thu, Dec 30, 2021 at 02:51:10PM +0300, Maxim Orlov wrote: > I did check the patch too and found it to be ok. Check and check-world are > passed. FYI: this is currently failing in cfbot on linux. https://cirrus-ci.com/task/4934371210690560 https://api.cirrus-ci.com/v1/artifact/task/4934371210690560/log/src/test/regress/regression.diffs DROP TABLESPACE regress_tblspace_renamed; +ERROR: tablespace "regress_tblspace_renamed" is not empty -- Justin
On Tue, Mar 01, 2022 at 08:40:44AM -0600, Justin Pryzby wrote: > FYI: this is currently failing in cfbot on linux. > > https://cirrus-ci.com/task/4934371210690560 > https://api.cirrus-ci.com/v1/artifact/task/4934371210690560/log/src/test/regress/regression.diffs > > DROP TABLESPACE regress_tblspace_renamed; > +ERROR: tablespace "regress_tblspace_renamed" is not empty I believe this is due to an existing bug. This patch set seems to influence the timing to make it more likely. I'm tracking the fix here: https://commitfest.postgresql.org/37/3544/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
It seems unlikely that this will be committed for v15, so I've adjusted the commitfest entry to v16 and moved it to the next commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Mar 17, 2022 at 04:12:12PM -0700, Nathan Bossart wrote: > It seems unlikely that this will be committed for v15, so I've adjusted the > commitfest entry to v16 and moved it to the next commitfest. rebased -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachments
On Fri, Apr 08, 2022 at 01:30:03PM -0700, Nathan Bossart wrote: > On Thu, Mar 17, 2022 at 04:12:12PM -0700, Nathan Bossart wrote: >> It seems unlikely that this will be committed for v15, so I've adjusted the >> commitfest entry to v16 and moved it to the next commitfest. > > rebased It's now been over a year since I first posted a patch in this thread, and I still sense very little interest for this feature. I intend to mark it as Withdrawn at the end of this commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Jul 14, 2022 at 11:34:07AM -0700, Nathan Bossart wrote: > It's now been over a year since I first posted a patch in this thread, and > I still sense very little interest for this feature. I intend to mark it > as Withdrawn at the end of this commitfest. Done. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi Nathan, Come from [0] and thanks for working on this. Here are some design review/question after my first going through the patch. 1. walwriter vs checkpointer? I prefer to walwriter for now because.. a. checkpointer is hard to do it in a timely manner either because checkpoint itself may take a long time or the checkpoint_timeout is much bigger than commit_delay. but walwriter could do this timely. I think this is an important consideration for this feature. b. We want walwriter to run with low latency to flush out async commits. This is true, but preallocating a wal doesn't increase the latency too much. After all, even user uses the aysnc commit, the walfile allocating is done by walwriter already in our current implementation. 2. How many xlogfile should be preallocated by checkpointer/walwriter once. In your patch it is controled by wal-preallocate-max-size. How about just preallocate *the next one* xlogfile for the simplification purpose? 3. Why is the purpose of preallocated_segments directory? what in my mind is we just prellocate the normal filename so that XLogWrite could open it directly. This is same as what wal_recycle does and we can reuse the same strategy to clean up them if they are not needed anymore. So the poc in my mind for this feature is: - keep track the latested reallocated (by wal_recycle or preallocated) logfile in XLogCtl. - logwriter check current wal insert_pos and prellocate the *next one* walfile if it doesn't preallocated yet. - we need to handle race condition carefully between wal_recycle, user backend and preallocation. [0] https://www.postgresql.org/message-id/Z46BwCNAEjLyW85Z%40nathan -- Best Regards Andy Fan
On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote: > Come from [0] and thanks for working on this. Here are some design > review/question after my first going through the patch. Thanks for taking a look. > 1. walwriter vs checkpointer? I prefer to walwriter for now because.. > > a. checkpointer is hard to do it in a timely manner either because > checkpoint itself may take a long time or the checkpoint_timeout > is much bigger than commit_delay. but walwriter could do this timely. > I think this is an important consideration for this feature. > > b. We want walwriter to run with low latency to flush out async > commits. This is true, but preallocating a wal doesn't increase the > latency too much. After all, even user uses the aysnc commit, the walfile > allocating is done by walwriter already in our current implementation. I attempted to deal with this by having pre-allocation requests set the checkpointer's latch and performing the pre-allocation within the checkpointer's main loop and during write delays. However, checkpointing does a number of other things that could just as easily delay pre-allocation, so it's probably worth considering the WAL writer. > 2. How many xlogfile should be preallocated by checkpointer/walwriter > once. In your patch it is controled by wal-preallocate-max-size. How > about just preallocate *the next one* xlogfile for the simplification > purpose? We could probably start with something like that. IIRC it was difficult to create workloads where you'd need more than 1-2 at a time, provided whatever is pre-allocating refills the pool quickly. > 3. Why is the purpose of preallocated_segments directory? what in my > mind is we just prellocate the normal filename so that XLogWrite could > open it directly. This is same as what wal_recycle does and we can reuse > the same strategy to clean up them if they are not needed anymore. The purpose is to limit the use of pre-allocated segments to only situations where WAL recycling is not sufficient. Basically, if writing a record would require a new segment to be created, we can quickly pull a pre-allocated one instead of creating it ourselves. Besides simplifying matters, this prevents a lot of unnecessary pre-allocation, since many workloads will almost never need anything beyond the recycled segments. -- nathan
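For illustration, a rough sketch of the "claim a segment from the pool" fast path described above: try to rename() a pooled file into place and fall back to creating the segment if the pool is empty. The helper and path names are assumptions for this sketch, not identifiers from the patch:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool
    claim_preallocated_segment(const char *pooled_path, const char *wal_path)
    {
        if (rename(pooled_path, wal_path) == 0)
            return true;        /* fast path: reuse a pre-allocated file */
        if (errno == ENOENT)
            return false;       /* pool empty: fall back to creating the file */
        return false;           /* any other failure: also fall back */
    }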
On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote: > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote: >> 3. Why is the purpose of preallocated_segments directory? what in my >> mind is we just prellocate the normal filename so that XLogWrite could >> open it directly. This is same as what wal_recycle does and we can reuse >> the same strategy to clean up them if they are not needed anymore. > > The purpose is to limit the use of pre-allocated segments to only > situations where WAL recycling is not sufficient. Basically, if writing a > record would require a new segment to be created, we can quickly pull a > pre-allocated one instead of creating it ourselves. Besides simplifying > matters, this prevents a lot of unnecessary pre-allocation, since many > workloads will almost never need anything beyond the recycled segments. That being said, it would be nice to avoid the fsync() overhead to move a pre-allocated WAL into place. My first instinct is that would be substantially more complicated and may not actually improve matters all that much, but I agree that it's worth exploring. -- nathan
Hi, On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote: > On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote: > > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote: > >> 3. Why is the purpose of preallocated_segments directory? what in my > >> mind is we just prellocate the normal filename so that XLogWrite could > >> open it directly. This is same as what wal_recycle does and we can reuse > >> the same strategy to clean up them if they are not needed anymore. > > > > The purpose is to limit the use of pre-allocated segments to only > > situations where WAL recycling is not sufficient. Basically, if writing a > > record would require a new segment to be created, we can quickly pull a > > pre-allocated one instead of creating it ourselves. Besides simplifying > > matters, this prevents a lot of unnecessary pre-allocation, since many > > workloads will almost never need anything beyond the recycled segments. I don't really understand that argument - we should be able to predict rather precisely whether we need to preallocate or not. We have the recent WAL "fill rate", we know the end of the WAL and we can easily track how far ahead of the current point we have allocated. Why preallocate when we have a large reserve of "future" segments? Why preallocate in a separate directory when we have no future segments? > That being said, it would be nice to avoid the fsync() overhead to move a > pre-allocated WAL into place. My first instinct is that would be > substantially more complicated and may not actually improve matters all > that much, but I agree that it's worth exploring. FWIW, I've seen the fsyncs around recycling being a rather substantial bottleneck. To the point of the main benefit of larger segments being the reduction in number of fsyncs at the end of a checkpoint. I think we should be able to make the fsyncs a lot more efficient by batching them, first rename a bunch of files, then fsync them and the directory. The current pattern bascially requires a separate filesystem jouranl flush for each WAL segment. Greetings, Andres Freund
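As a rough sketch of the prediction Andres describes, the decision could compare the reserve of already-existing "future" segments against the recent fill rate; every name below is an assumption for illustration, not existing PostgreSQL code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    static bool
    need_preallocation(XLogRecPtr insert_pos,      /* current end of WAL */
                       XLogRecPtr last_alloc_end,  /* end of last existing segment */
                       uint64_t bytes_per_sec,     /* recent WAL fill rate */
                       uint64_t seg_size,
                       double lead_seconds)        /* how far ahead we want to stay */
    {
        uint64_t reserve = (last_alloc_end > insert_pos) ? (last_alloc_end - insert_pos) : 0;
        uint64_t wanted = (uint64_t) (bytes_per_sec * lead_seconds);

        /* keep at least one segment and enough headroom for the expected rate */
        return reserve < wanted || reserve < seg_size;
    }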
Andres Freund <andres@anarazel.de> writes: Hi, > FWIW, I've seen the fsyncs around recycling being a rather substantial > bottleneck. To the point of the main benefit of larger segments being the > reduction in number of fsyncs at the end of a checkpoint. I think we should > be able to make the fsyncs a lot more efficient by batching them, first rename > a bunch of files, then fsync them and the directory. The current pattern > bascially requires a separate filesystem jouranl flush for each WAL segment. For education purpose, how to fsync files in batch? 'man fsync' tells me user can only fsync one file each time. int fsync(int fd); The fsync manual seems not saying fsync on a directory would fsync all the files under that directory. -- Best Regards Andy Fan
On Tue, Jan 21, 2025 at 11:23:06AM -0500, Andres Freund wrote: > On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote: >> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote: >> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote: >> >> 3. Why is the purpose of preallocated_segments directory? what in my >> >> mind is we just prellocate the normal filename so that XLogWrite could >> >> open it directly. This is same as what wal_recycle does and we can reuse >> >> the same strategy to clean up them if they are not needed anymore. >> > >> > The purpose is to limit the use of pre-allocated segments to only >> > situations where WAL recycling is not sufficient. Basically, if writing a >> > record would require a new segment to be created, we can quickly pull a >> > pre-allocated one instead of creating it ourselves. Besides simplifying >> > matters, this prevents a lot of unnecessary pre-allocation, since many >> > workloads will almost never need anything beyond the recycled segments. > > I don't really understand that argument - we should be able to predict rather > precisely whether we need to preallocate or not. We have the recent WAL "fill > rate", we know the end of the WAL and we can easily track how far ahead of the > current point we have allocated. Why preallocate when we have a large reserve > of "future" segments? Why preallocate in a separate directory when we have no > future segments? If we can indeed reliably predict whether we need pre-allocation, then sure, let's just create future segments directly in pg_wal. I'm not sure we could reliably predict whether WAL will be recycled in time, so we might pre-allocate a bit more than necessary, but that's not too terrible. My "pooling" approach was intended to keep the pre-allocation to a minimum (IME you really only need a couple at any given time) and to avoid the guesswork involved in predicting. >> That being said, it would be nice to avoid the fsync() overhead to move a >> pre-allocated WAL into place. My first instinct is that would be >> substantially more complicated and may not actually improve matters all >> that much, but I agree that it's worth exploring. > > FWIW, I've seen the fsyncs around recycling being a rather substantial > bottleneck. To the point of the main benefit of larger segments being the > reduction in number of fsyncs at the end of a checkpoint. I think we should > be able to make the fsyncs a lot more efficient by batching them, first rename > a bunch of files, then fsync them and the directory. The current pattern > bascially requires a separate filesystem jouranl flush for each WAL segment. +1, these kinds of fsync() patterns should be fixed. -- nathan
On Wed, Jan 22, 2025 at 01:14:22AM +0000, Andy Fan wrote: > Andres Freund <andres@anarazel.de> writes: >> FWIW, I've seen the fsyncs around recycling being a rather substantial >> bottleneck. To the point of the main benefit of larger segments being the >> reduction in number of fsyncs at the end of a checkpoint. I think we should >> be able to make the fsyncs a lot more efficient by batching them, first rename >> a bunch of files, then fsync them and the directory. The current pattern >> bascially requires a separate filesystem jouranl flush for each WAL segment. > > For education purpose, how to fsync files in batch? 'man fsync' tells me > user can only fsync one file each time. > > int fsync(int fd); > > The fsync manual seems not saying fsync on a directory would fsync all > the files under that directory. I think Andres means that we should wait until the end of recycling to fsync() the directory so that we aren't flushing it for every single recycled segment. This sort of batching approach could also work well with pre_sync_fname(), so that by the time we actually call fsync() on the files, it has very little to do. -- nathan
Hi,

On 2025-01-22 01:14:22 +0000, Andy Fan wrote:
> Andres Freund <andres@anarazel.de> writes:
> > FWIW, I've seen the fsyncs around recycling being a rather substantial
> > bottleneck. To the point of the main benefit of larger segments being the
> > reduction in number of fsyncs at the end of a checkpoint. I think we should
> > be able to make the fsyncs a lot more efficient by batching them, first rename
> > a bunch of files, then fsync them and the directory. The current pattern
> > bascially requires a separate filesystem jouranl flush for each WAL segment.
>
> For education purpose, how to fsync files in batch? 'man fsync' tells me
> user can only fsync one file each time.
>
>     int fsync(int fd);
>
> The fsync manual seems not saying fsync on a directory would fsync all
> the files under that directory.

Right now we do something that essentially boils down to

    // recycle WAL file oldname1
    fsync(open(oldname1));
    rename(oldname1, newname1);
    fsync(open(newname1));
    fsync(open("pg_wal"));

    // recycle WAL file oldname2
    fsync(open(oldname2));
    rename(oldname2, newname2);
    fsync(open(newname2));
    fsync(open("pg_wal"));
    ...

    // recycle WAL file oldnameN
    fsync(open(oldnameN));
    rename(oldnameN, newnameN);
    fsync(open(newnameN));
    fsync(open("pg_wal"));
    ...

Most of the time the fsync on oldname won't have to do any IO (because presumably we'll have flushed it before), but the rename obviously requires a metadata update and thus the fsync will have work to do (whether it's the fsync on newname or the directory will differ between filesystems). This pattern basically forces the filesystem to do at least one journal flush for every single WAL segment. I.e. each recycled segment will have at least the latency of a single synchronous durable write IO.

But if we instead change it to something like this:

    fsync(open(oldname1));
    fsync(open(oldname2));
    ..
    fsync(open(oldnameN));

    rename(oldname1, newname1);
    rename(oldname2, newname2);
    ..
    rename(oldnameN, newnameN);

    fsync(open(newname1));
    fsync(open(newname2));
    ..
    fsync(open(newnameN));

    fsync(open("pg_wal"));

Most filesystems will be able to combine many of the journal flushes triggered by the renames into much bigger journal flushes. That means the overall time for recycling is much lower than the earlier one, since there are far fewer synchronous durable writes.

Here's a rough approximation of the effect using shell commands:

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done;sync;for i in $(seq 1 $N); do mv test.$i.old test.$i.new; sync; done;)

real    0m7.218s
user    0m0.431s
sys     0m4.892s

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done;sync;for i in $(seq 1 $N); do mv test.$i.old test.$i.new; done; sync)

real    0m2.678s
user    0m0.282s
sys     0m2.402s

The only difference between the two versions is that the latter can combine the journal flushes, due to the sync happening outside of the loop.

This is a somewhat poor approximation of how this would work in postgres, including likely exaggerating the gain (I think sync flushes the filesystem superblock too), but it does show the principle.

Greetings,

Andres Freund
On Wed, Jan 22, 2025 at 11:21:03AM -0500, Andres Freund wrote: > fsync(open(oldname1)); > fsync(open(oldname2)); > .. > fsync(open(oldnameN)); > > rename(oldname1, newname1); > rename(oldname2, newname2); > .. > rename(oldnameN, newnameN); > > fsync(open(newname1)); > fsync(open(newname2)); > .. > fsync(open(newnameN)); > > fsync(open("pg_wal")); What is the purpose of syncing the file before the rename? -- nathan
Hi,

On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote:
> On Wed, Jan 22, 2025 at 11:21:03AM -0500, Andres Freund wrote:
> > fsync(open(oldname1));
> > fsync(open(oldname2));
> > ..
> > fsync(open(oldnameN));
> >
> > rename(oldname1, newname1);
> > rename(oldname2, newname2);
> > ..
> > rename(oldnameN, newnameN);
> >
> > fsync(open(newname1));
> > fsync(open(newname2));
> > ..
> > fsync(open(newnameN));
> >
> > fsync(open("pg_wal"));
>
> What is the purpose of syncing the file before the rename?

It's from the general durable_rename() code. The reason it's there is that it's required for the "atomically replace a file" use case. Imagine the following:

    create_and_fill("somefile.tmp");
    rename("somefile.tmp", "somefile");
    fsync("somefile.tmp");
    fsync(".");

If you crash (OS/HW level) in the wrong moment (between rename() taking effect in-memory and the fsyncs), you might end up with "somefile" pointing to the *new* file, because the rename took effect, but the new file's content not having reached disk yet. I.e. "somefile" will be empty. Whether that's possible depends on filesystem semantics (e.g. on ext4 it's possible with data=writeback, I think it's always possible on xfs).

In contrast to that, if you fsync("somefile.tmp") before the rename, a crash between rename() and the later fsyncs will have "somefile" either pointing to the *old and valid contents* or the *new and valid contents*, without a chance for an empty file.

However, for the case of WAL recycling, we shouldn't need fsync() before the rename, because we ought to already have done so when creating it (c.f. XLogFileInitInternal()) or when recycling it last time.

I suspect the theoretically superfluous fsync() won't have a meaningful performance impact most of the time though, because

a) There shouldn't be any dirty data for the file, obviously we need to have flushed the WAL past the recycled segment

b) Except for the first to-be-recycled segment, we just fsynced after the last rename, so there won't be any filesystem journal data that needs to be flushed

I'm not entirely sure about a) though - depending on mount options it's possible that the fsync() will flush file modification times when using wal_sync_method=fdatasync. But even if that's possibly reachable, I doubt it'll be common, due to a checkpoint having to complete between the WAL flush and recycling. Could be worth experimenting with.

Greetings,

Andres
On Wed, Jan 22, 2025 at 01:00:08PM -0500, Andres Freund wrote: > On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote: >> What is the purpose of syncing the file before the rename? > > It's from the general durable_rename() code. The reason it's there that it's > required for "atomically replace a file" use case. Imagine the following: > > create_and_fill("somefile.tmp"); > rename("somefile.tmp", "somefile"); > fsync("somefile.tmp"); > fsync("."); > > If you crash (OS/HW level) in the wrong moment (between rename() taking effect > in-memory and the fsyncs), you might end up with "somefile" pointing to the > *new* file, because the rename took affect, but the new file's content not > having reached disk yet. I.e. "somefile" will be empty. Whether that's > possible depends on filesystem semantics (e.g. on ext4 it's possible with > data=writeback, I think it's always possible on xfs). > > In contrast to that, if you fsync("somefile.tmp") before the rename, a crash > between rename() and the later fsyncs will have "somefile" either pointing to > the *old and valid contents* or the *new and valid contents*, without a chance > for an empty file. Got it, thanks for explaining. If the contents are sync'd before the rename(), do we still need to fsync() it again afterwards, too? I'd expect that to ordinarily not have much to do, but perhaps I'm forgetting about some metadata that isn't covered by the fsync() on the directory. > However, for the case of WAL recycling, we shouldn't need fsync() before the > rename, because we ought to already have done so when creating > (c.f. XLogFileInitInternal() or when recycling it last time. Makes sense. > I suspect the theoretically superfluous fsync() won't have a meaningful > performance impact most of the time though, because > > a) There shouldn't be any dirty data for the file, obviously we need to have > flushed the WAL past the recycled segment > > b) Except for the first to-be-recycled segment, we just fsynced after the last > rename, so there won't be any filesystem journal data that needs to be > flushed > > I'm not entirely sure about a) though - depending on mount options it's > possible that the fsync() will flush file modification times when using > wal_sync_method=fdatasync. But even if that's possibly reachable, I doubt > it'll be common, due to a checkpoint having to complete between the WAL flush > and recycling. Could be worth experimenting with. Yeah, I'm not too worried about the performance impact of some superfluous fsync() calls, either, but I wasn't sure I properly understood the fsync() pattern in durable_rename() (and figured it'd be nice to get it documented in the archives). -- nathan