AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

From: Andres Freund
Hi,

I just created a primary with wal_segment_size=512. Then I tried to create a
standby via pg_basebackup. pg_basebackup appeared to just hang for quite a
while, but did eventually complete - taking over a minute for an empty
cluster, even when using -c fast.

In this case I had used wal_sync_method=open_datasync - it's often faster,
and if we want to scale WAL writes further we'll have to use it more widely
(you can't have multiple fdatasyncs in progress and reason about which one
affects what, but you can have multiple O_DSYNC writes in progress at the
same time).
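
To illustrate the difference (a minimal sketch, not PostgreSQL code; error
handling elided, file name and offsets made up):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char buf[8192];

    /* Each O_DSYNC write is durable when pwrite() returns.  Several such
     * writes can be in flight at the same time (e.g. from different
     * processes), and each one's durability is unambiguous. */
    static void
    dsync_style(const char *path, off_t off)
    {
        int         fd = open(path, O_WRONLY | O_DSYNC);

        pwrite(fd, buf, sizeof(buf), off);
        close(fd);
    }

    /* fdatasync() syncs whatever was written before it started.  With
     * several fdatasyncs in flight, you can't reason about which writes a
     * given one made durable. */
    static void
    fdatasync_style(const char *path, off_t off)
    {
        int         fd = open(path, O_WRONLY);

        pwrite(fd, buf, sizeof(buf), off);
        fdatasync(fd);
        close(fd);
    }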

After a bit of confused staring and debugging I figured out that the
problem is that the RequestXLogSwitch() within the code for starting a
basebackup was triggering write-back of the WAL in individual 8kB writes
via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of
these writes is durable - on this drive each takes about 1ms.


Normally we write out WAL in bigger chunks - but as it turns out, we don't
have any logic for doing larger writes when AdvanceXLInsertBuffer() is
called from within GetXLogBuffer(). We just try to make enough space so
that one buffer can be replaced.
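
Schematically, the current behaviour when GetXLogBuffer() has to replace a
dirty buffer is roughly this (heavily simplified from xlog.c, not the
actual code):

    /* loop until the buffer holding the old page has been written out */
    while (LogwrtResult.Write < OldPageRqstPtr)
    {
        XLogwrtRqst WriteRqst;

        LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
        WriteRqst.Write = OldPageRqstPtr;   /* just up to the end of this
                                             * one 8kB page */
        WriteRqst.Flush = 0;
        XLogWrite(WriteRqst, tli, false);   /* with O_DSYNC this is a
                                             * durable, ~1ms write */
        LWLockRelease(WALWriteLock);
    }
    /* now reinitialize the buffer for the new page */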


The times for a single SELECT pg_switch_wal() on this system, when using
open_datasync and a 512MB segment, are:

wal_buffers   time for pg_switch_wal()
16            64s
100           53s
400           13s
600           1.3s
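
For scale: a 512MB segment is 65536 8kB pages, so writing it out one
durable ~1ms page at a time costs about 65536 * 1ms ~= 66s - which lines
up with the 64s measured for wal_buffers=16.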

That's pretty bad.  We don't really benefit from more buffering here, it
just avoids flushing in tiny increments. With a smaller wal_buffers, the
large record written by pg_switch_wal() needs to replace buffers it itself
inserted, and does so one-by-one. If, due to a larger wal_buffers, we never
re-encounter a buffer we inserted ourselves earlier, the problem isn't
present.

This can bite with smaller segments too; it doesn't require large ones.


The reason this doesn't constantly become an issue is that walwriter
normally tries to write out WAL, and if it does, the
AdvanceXLInsertBuffer() calls in backends don't need to write anything
(walwriter also calls AdvanceXLInsertBuffer() directly, but in a way that
won't ever write out data).

In my case, walwriter is actually trying to do something - but it never
gets WALWriteLock. Its semaphore does get set each time
AdvanceXLInsertBuffer() releases WALWriteLock, but on this system walwriter
never succeeds in taking the lwlock before AdvanceXLInsertBuffer() succeeds
in re-acquiring it.
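
Schematically, the interleaving looks something like this (simplified):

    backend, in AdvanceXLInsertBuffer()    walwriter
    -----------------------------------    ---------
    LWLockAcquire(WALWriteLock)
    write one 8kB page (~1ms, durable)
    LWLockRelease(WALWriteLock)  ------->  semaphore set, wakes up
    LWLockAcquire(WALWriteLock), wins      tries WALWriteLock, loses
    write the next 8kB page ...            back to waiting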


I think it might be a lucky accident that the problem was visible this
blatantly in this one case - I suspect that this behaviour is encountered
during normal operation in the wild, but is much harder to pinpoint there,
because it doesn't happen "exclusively".

E.g. I see much higher throughput when bulk-loading data with larger
wal_buffers when using open_datasync, but basically no difference when
using fdatasync. And there are a lot of wal_buffers_full writes.


To fix this, I suspect we need to make
GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
specific case we even know for sure that we are going to fill a lot more
buffers, so no heuristic would be needed. In other cases, however, we need
some heuristic to decide how much to write out.

Given how *extremely* aggressive we are about flushing out nearly all pending
WAL in XLogFlush(), I'm not sure there's much point in not also being somewhat
aggressive in GetXLogBuffer()->AdvanceXLInsertBuffer().
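
Very roughly, I'm thinking of something like this inside
AdvanceXLInsertBuffer() (a hypothetical sketch, not a patch - N_PAGES_AHEAD
is a made-up knob, and a real version would have to cap the request at a
point up to which all insertions are known to have completed, rather than
waiting for them):

    XLogRecPtr  insertpos = GetXLogInsertRecPtr();
    XLogwrtRqst WriteRqst;

    /* write a batch of fully-filled pages, not just the one being
     * replaced; never include a partially filled page */
    WriteRqst.Write = Min(insertpos - (insertpos % XLOG_BLCKSZ),
                          OldPageRqstPtr + N_PAGES_AHEAD * XLOG_BLCKSZ);
    WriteRqst.Flush = 0;

    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    XLogWrite(WriteRqst, tli, true);    /* flexible: may stop earlier at a
                                         * convenient boundary */
    LWLockRelease(WALWriteLock);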


Greetings,

Andres Freund



Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

From: Heikki Linnakangas
On 10/11/2023 05:54, Andres Freund wrote:
> In this case I had used wal_sync_method=open_datasync - it's often faster,
> and if we want to scale WAL writes further we'll have to use it more widely
> (you can't have multiple fdatasyncs in progress and reason about which one
> affects what, but you can have multiple O_DSYNC writes in progress at the
> same time).

Not sure I understand that. If you issue an fdatasync, it will sync all 
writes that were complete before the fdatasync started. Right? If you 
have multiple fdatasyncs in progress, that's true for each fdatasync. Or 
is there a bottleneck in the kernel with multiple in-progress fdatasyncs 
or something?

> After a bit of confused staring and debugging I figured out that the
> problem is that the RequestXLogSwitch() within the code for starting a
> basebackup was triggering write-back of the WAL in individual 8kB writes
> via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of
> these writes is durable - on this drive each takes about 1ms.

I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() 
is relatively fast. But with open_datasync, it's not.

> To fix this, I suspect we need to make
> GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> specific case we even know for sure that we are going to fill a lot more
> buffers, so no heuristic would be needed. In other cases, however, we need
> some heuristic to decide how much to write out.

+1. Maybe use the same logic as in XLogFlush().

I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It 
would be nice to pass a hard minimum XLogRecPtr that it must write up 
to, but still allow it to write more than that if it's convenient.
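
Something like this, as a hypothetical signature:

    /*
     * Hypothetical contract: must write at least up to MinRqst, may write
     * further, up to OptRqst, if that's convenient (e.g. to complete a
     * larger aligned write).
     */
    static void XLogWrite(XLogwrtRqst MinRqst, XLogwrtRqst OptRqst,
                          TimeLineID tli);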

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

From: Andres Freund
Hi,

On 2023-11-10 17:16:35 +0200, Heikki Linnakangas wrote:
> On 10/11/2023 05:54, Andres Freund wrote:
> > In this case I had used wal_sync_method=open_datasync - it's often
> > faster, and if we want to scale WAL writes further we'll have to use it
> > more widely (you can't have multiple fdatasyncs in progress and reason
> > about which one affects what, but you can have multiple O_DSYNC writes
> > in progress at the same time).
>
> Not sure I understand that. If you issue an fdatasync, it will sync all
> writes that were complete before the fdatasync started. Right? If you have
> multiple fdatasyncs in progress, that's true for each fdatasync. Or is there
> a bottleneck in the kernel with multiple in-progress fdatasyncs or
> something?

Many filesystems only allow a single fdatasync to really be in progress at
a time - they eventually acquire an inode-specific lock.  More problematic
cases include things like a write followed by an fdatasync, followed by a
write of the same block in another process/thread - there's very little
guarantee about which contents of that block are now durable.

But more importantly, using fdatasync doesn't scale, because it effectively
has to flush the entire write cache on the device - which often contains
plenty of other dirty data. Whereas O_DSYNC can use FUA writes, which make
just the individual WAL writes go through the cache, while leaving the rest
of the cache unaffected.


> > After a bit of confused staring and debugging I figured out that the
> > problem is that the RequestXLogSwitch() within the code for starting a
> > basebackup was triggering write-back of the WAL in individual 8kB
> > writes via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync
> > each of these writes is durable - on this drive each takes about 1ms.
>
> I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is
> relatively fast. But with open_datasync, it's not.

I'm not sure that was an explicit assumption rather than just how it worked
out.


> > To fix this, I suspect we need to make
> > GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In
> > this specific case we even know for sure that we are going to fill a
> > lot more buffers, so no heuristic would be needed. In other cases,
> > however, we need some heuristic to decide how much to write out.
>
> +1. Maybe use the same logic as in XLogFlush().

I've actually been wondering about moving all the handling of WALWriteLock to
XLogWrite() and/or a new function called from all the places calling
XLogWrite().

I suspect we can't quite use the same logic in AdvanceXLInsertBuffer() as
we do in XLogFlush() - we never want to trigger flushing out a partially
filled page, for example. Nor do we ever want to unnecessarily wait for a
WAL insertion to complete when we don't have to.


> I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It
> would be nice to pass a hard minimum XLogRecPtr that it must write up to,
> but still allow it to write more than that if it's convenient.

Yes, I've also thought that.  In the AIOified WAL code I ended up tracking
"minimum" and "optimal" write/flush locations.

Greetings,

Andres Freund