Re: WIP: Vectored writeback

From Thomas Munro
Subject Re: WIP: Vectored writeback
Msg-id CA+hUKG+Yh1PNdj+i=V6PAei3VNAMRhhY4CfRV0tdA6q3_cPrww@mail.gmail.com
In reply to WIP: Vectored writeback  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Here is a new straw-man patch set.  I'd already shown the basic
techniques for vectored writes from the buffer pool (FlushBuffers(),
note the "s"), but that was sort of kludged into place while I was
hacking on the lower level bits and pieces, and now I'm building
layers further up.  The main idea is: you can clean buffers with a
"WriteStream", and here are a bunch of example users to show that
working.

A WriteStream is approximately the opposite of a ReadStream (committed
in v17).  You push pinned buffers that you have recently dirtied into
it (they don't have to be dirty still, and it's OK if someone else
cleans a buffer concurrently).  It combines buffers up to
io_combine_limit, but defers writing as long as
possible within some limits to avoid flushing the WAL, and tries to
coordinate with the WAL writer.  The WAL writer interaction is a very
tricky problem, and that aspect is only a toy for now, but it's at
least partially successful (see problems at end).
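
To make the combining step concrete, here is a minimal toy model, not
code from the patch set.  Every name in it (ToyStream, toy_stream_push,
TOY_COMBINE_LIMIT and so on) is hypothetical, block numbers stand in
for pinned buffers, and the WAL-related deferral is left out entirely.
It only shows adjacent blocks being merged into one write until a
limit is reached:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names come from the patch set. */
#define TOY_COMBINE_LIMIT 16    /* stand-in for io_combine_limit, in blocks */
#define TOY_MAX_ISSUED 64       /* size of the log of issued writes */

typedef struct ToyWrite
{
    uint32_t    start_block;    /* first block of the combined write */
    int         nblocks;        /* number of consecutive blocks */
} ToyWrite;

typedef struct ToyStream
{
    ToyWrite    pending;        /* write being built; nblocks == 0 if none */
    ToyWrite    issued[TOY_MAX_ISSUED]; /* writes "sent to the kernel" */
    int         nissued;
} ToyStream;

static void
toy_stream_init(ToyStream *s)
{
    s->pending.nblocks = 0;
    s->nissued = 0;
}

/* Issue the pending write, if any.  A real stream would do the I/O here. */
static void
toy_stream_flush(ToyStream *s)
{
    if (s->pending.nblocks == 0)
        return;
    assert(s->nissued < TOY_MAX_ISSUED);
    s->issued[s->nissued++] = s->pending;
    s->pending.nblocks = 0;
}

/* Push one block, extending the pending write if the block is adjacent. */
static void
toy_stream_push(ToyStream *s, uint32_t block)
{
    /* Start a new write if the pending one is full or not adjacent. */
    if (s->pending.nblocks > 0 &&
        (s->pending.nblocks == TOY_COMBINE_LIMIT ||
         block != s->pending.start_block + (uint32_t) s->pending.nblocks))
        toy_stream_flush(s);

    if (s->pending.nblocks == 0)
        s->pending.start_block = block;
    s->pending.nblocks++;
}
```

Pushing blocks 10, 11, 12 and then 20, followed by a final flush,
yields two writes in this model: one covering three blocks and one
covering a single block.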

The CHECKPOINT code uses the WriteStream API directly.  It creates one
stream per tablespace, so that the existing load balancing algorithm
doesn't defeat the I/O combining algorithm.  Unsurprisingly, it looks
like this:

postgres=# checkpoint;

...
pwritev(18,...,2,0x1499e000) = 131072 (0x20000)
pwrite(18,...,131072,0x149be000) = 131072 (0x20000)
pwrite(18,...,131072,0x149de000) = 131072 (0x20000)
...
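
For readers who haven't used it, the pwritev() call in the trace above
is the vectored form of pwrite(): it takes an array of memory regions
and writes them to consecutive file positions in one system call.
Below is a small standalone sketch of that call, unrelated to the
patch set's code.  The helper names write_two_blocks() and
demo_roundtrip() are made up for illustration, and BLCKSZ is simply
defined here as 8192, PostgreSQL's default block size:

```c
#define _DEFAULT_SOURCE         /* exposes pwritev() and mkstemp() on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Write two separately located blocks to adjacent file positions with a
 * single system call.  Returns the number of bytes written, or -1 on error.
 */
static ssize_t
write_two_blocks(int fd, const char *block_a, const char *block_b,
                 off_t offset)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) block_a;
    iov[0].iov_len = BLCKSZ;
    iov[1].iov_base = (void *) block_b;
    iov[1].iov_len = BLCKSZ;

    return pwritev(fd, iov, 2, offset);
}

/*
 * Illustration only: write two blocks to a temporary file and read them
 * back.  Returns 0 if the file holds block a followed by block b.  A short
 * write is treated as failure here; real code would retry the remainder.
 */
static int
demo_roundtrip(void)
{
    char        path[] = "/tmp/pwritev_demo_XXXXXX";
    static char a[BLCKSZ];
    static char b[BLCKSZ];
    static char readback[2 * BLCKSZ];
    int         fd = mkstemp(path);
    int         ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */
    memset(a, 'A', BLCKSZ);
    memset(b, 'B', BLCKSZ);

    ok = write_two_blocks(fd, a, b, 0) == 2 * BLCKSZ &&
        pread(fd, readback, sizeof(readback), 0) ==
        (ssize_t) sizeof(readback) &&
        memcmp(readback, a, BLCKSZ) == 0 &&
        memcmp(readback + BLCKSZ, b, BLCKSZ) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```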

Sometimes you'll see it signalling the WAL writer.  It builds up a
queue of writes that it doesn't want to perform yet, in the hope of
getting a free ride WRT WAL.

Other places can benefit from a more centrally placed write stream,
indirectly.  Our BAS_BULKWRITE and BAS_VACUUM buffer access strategies
already perform "write-behind".  That's a name I borrowed from OS
kernels: when the kernel has clues that bulk data (for example a big
file copy) is unlikely to be needed again soon, it wants to get it
out of the way before it trashes the whole buffer pool (AKA "scan
resistance"), but it defers just a little bit to perform I/O
combining.  That applies directly here, but we have the additional
concern of delaying the referenced WAL write in the hope that someone
else will do it for us.

In this experiment, I am trying to give that pre-existing behaviour an
explicit name (better names welcome!), and optimise it.  If you're
dirtying buffers in a ring, you'll soon crash into your own tail and
have to write it out, and the tail is very often a run of sequential
blocks due to the scan-like nature of many bulk I/O jobs, so I/O
combining is very effective.  The main problem is that you'll often
have to flush WAL first, which this patch set tries to address to some
extent.  In the strategy write-behind case you don't really need an
LSN reordering queue, just a plain FIFO queue would do, but hopefully
that doesn't cost much.  (Cf CHECKPOINT, which sorts blocks by buffer
tag, but expects LSNs in random order, so it does seem to need
reordering.)
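
As a rough illustration of what an LSN reordering queue could look
like, here is a toy sketch.  It is not the patch set's data structure
and all names are invented.  Pending writes sit in a small binary
min-heap keyed on LSN, and everything whose LSN is already covered by
flushed WAL can be drained without forcing a WAL flush:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names come from the patch set. */
#define TOY_QUEUE_CAPACITY 8

typedef struct ToyPendingWrite
{
    uint64_t    lsn;            /* WAL must be flushed up to here first */
    uint32_t    block;          /* which block the write is for */
} ToyPendingWrite;

typedef struct ToyLsnQueue
{
    ToyPendingWrite heap[TOY_QUEUE_CAPACITY];   /* binary min-heap on lsn */
    int         n;
} ToyLsnQueue;

static void
toy_swap(ToyPendingWrite *a, ToyPendingWrite *b)
{
    ToyPendingWrite tmp = *a;

    *a = *b;
    *b = tmp;
}

/* Add a write to the queue, sifting it up to its place in the heap. */
static void
toy_queue_push(ToyLsnQueue *q, ToyPendingWrite w)
{
    int         i;

    assert(q->n < TOY_QUEUE_CAPACITY);
    i = q->n++;
    q->heap[i] = w;
    while (i > 0 && q->heap[(i - 1) / 2].lsn > q->heap[i].lsn)
    {
        toy_swap(&q->heap[i], &q->heap[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}

/* Remove and return the write with the lowest LSN. */
static ToyPendingWrite
toy_queue_pop(ToyLsnQueue *q)
{
    ToyPendingWrite top;
    int         i = 0;

    assert(q->n > 0);
    top = q->heap[0];
    q->heap[0] = q->heap[--q->n];
    for (;;)
    {
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;
        int         m = i;

        if (l < q->n && q->heap[l].lsn < q->heap[m].lsn)
            m = l;
        if (r < q->n && q->heap[r].lsn < q->heap[m].lsn)
            m = r;
        if (m == i)
            break;
        toy_swap(&q->heap[i], &q->heap[m]);
        i = m;
    }
    return top;
}

/*
 * Move every write whose LSN is already covered by flushed WAL into out[],
 * lowest LSN first, and return how many there were.  These are the writes
 * that get a "free ride".
 */
static int
toy_queue_drain_flushed(ToyLsnQueue *q, uint64_t flushed_lsn,
                        ToyPendingWrite *out)
{
    int         nout = 0;

    while (q->n > 0 && q->heap[0].lsn <= flushed_lsn)
        out[nout++] = toy_queue_pop(q);
    return nout;
}
```

With a plain FIFO the same drain loop would only be correct if LSNs
arrived in increasing order, which is roughly the strategy
write-behind case; the heap is what copes with the random LSN order
described for CHECKPOINT.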

With this patch set, instead of calling ReleaseBuffer() after you've
dirtied a buffer in one of those bulk writing code paths, you can use
StrategyReleaseBuffer(), and the strategy will fire it into the stream
to get I/O combining and LSN reordering; it'll be unpinned later, and
certainly before you get the same buffer back for a new block.  So
those write-behind user patches are very short, they just do
s/ReleaseBuffer/StrategyReleaseBuffer/ plus minor details.
Unsurprisingly, it looks like this:

postgres=# copy t from program 'seq -f %1.0f 1 10000000';

...
pwrite(44,...,131072,0x2f986000) = 131072 (0x20000) <-- streaming write-behind!
pwrite(44,...,131072,0x2f966000) = 131072 (0x20000)
pwrite(44,...,131072,0x2f946000) = 131072 (0x20000)
...

postgres=# vacuum t;

...
pwrite(35,...,131072,0x3fb3e000) = 131072 (0x20000) <-- streaming write-behind!
preadv(35,...,122880}],2,0x3fb7a000) = 131072 (0x20000) <-- from Melanie's patch
pwritev(35,...,2,0x3fb5e000) = 131072 (0x20000)
pread(35,...,131072,0x3fb9a000) = 131072 (0x20000)
...

Next I considered how to get INSERT, UPDATE, DELETE to participate.
The problem is that they use BAS_BULKREAD, even though they might
dirty buffers.  In master, BAS_BULKREAD doesn't do write-behind,
instead it uses the "reject" mechanism: as soon as it smells a dirty
buffer, it escapes the ring and abandons all hope of scan resistance.
As buffer/README says in parentheses:

  Bulk writes work similarly to VACUUM.  Currently this applies only to
  COPY IN and CREATE TABLE AS SELECT.  (Might it be interesting to make
  seqscan UPDATE and DELETE use the bulkwrite strategy?)  For bulk writes
  we use a ring size of 16MB (but not more than 1/8th of shared_buffers).

Hmm... what I'm now thinking is that the distinction might be a little
bogus.  Who knows how much scanned data will finish up being dirtied?
I wonder if it would make more sense to abandon
BAS_BULKREAD/BAS_BULKWRITE, and instead make an adaptive strategy.  A
ring that starts small, and grows/shrinks in response to dirty data
(instead of "rejecting").  That would have at least superficial
similarities to the ARC algorithm, in the "adaptive" bit that controls
ring size (ARC is interested in recency vs frequency, but here it's
more like "we're willing to waste more memory on dirty data, because
we need to keep it around longer, to avoid flushing the WAL, but not
longer than that", which may be a different dimension to value cached
data on, I'm not sure).
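
Purely as a thought experiment, the grow/shrink behaviour could be
pictured like the sketch below.  Nothing like this exists in the patch
set: the names, the doubling/halving policy and the "clean streak"
rule are all assumptions made up for illustration, and a real design
would need measurement rather than guesswork:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of an adaptive ring; all names are invented. */
typedef struct ToyAdaptiveRing
{
    int         size;           /* current ring size, in buffers */
    int         min_size;       /* never shrink below this */
    int         max_size;       /* never grow beyond this */
    int         clean_streak;   /* reuses in a row that needed no WAL flush */
} ToyAdaptiveRing;

static void
toy_ring_init(ToyAdaptiveRing *r, int min_size, int max_size)
{
    assert(min_size > 0 && min_size <= max_size);
    r->size = min_size;
    r->min_size = min_size;
    r->max_size = max_size;
    r->clean_streak = 0;
}

/*
 * Record one buffer reuse.  If reusing the buffer would have forced a WAL
 * flush, the ring is too small to hide WAL latency, so double it.  If a
 * whole ring's worth of reuses went by without that, halve it again.
 */
static void
toy_ring_observe(ToyAdaptiveRing *r, bool reuse_needs_wal_flush)
{
    if (reuse_needs_wal_flush)
    {
        r->clean_streak = 0;
        r->size = (r->size * 2 > r->max_size) ? r->max_size : r->size * 2;
    }
    else if (++r->clean_streak >= r->size)
    {
        r->clean_streak = 0;
        r->size = (r->size / 2 < r->min_size) ? r->min_size : r->size / 2;
    }
}
```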

Of course there must be some workloads/machines where using a strategy
(instead of BAS_BULKREAD when it degrades to BAS_NORMAL behaviour)
will be slower because of WAL flushes, but that's not a fair fight:
the flip side of that coin is that you've trashed the buffer pool,
which is an external cost paid by someone else, i.e. it's anti-social,
and preventing that is BufferAccessStrategy's very raison d'être.

Anyway, in the meantime, I hacked heapam.c to use BAS_BULKWRITE just
to see how it would work with this patch set.  (This causes an
assertion to fail in some test, something about the stats for
different IO contexts that was upset by IOCONTEXT_BULKWRITE, which I
didn't bother to debug, it's only a demo hack.)  Unsurprisingly, it
looks like this:

postgres=# delete from t;

...
pread(25,...,131072,0xc89e000) = 131072 (0x20000)   <-- already committed
pread(25,...,131072,0xc8be000) = 131072 (0x20000)       read-stream behaviour
kill(75954,SIGURG)             = 0 (0x0)            <-- hey WAL writer!
pread(25,...,131072,0xc8de000) = 131072 (0x20000)
pread(25,...,131072,0xc8fe000) = 131072 (0x20000)
...
pwrite(25,...,131072,0x15200000) = 131072 (0x20000) <-- write-behind!
pwrite(25,...,131072,0x151e0000) = 131072 (0x20000)
pwrite(25,...,131072,0x151c0000) = 131072 (0x20000)
...

UPDATE and INSERT conceptually work too, but they suffer from other
stupid page-at-a-time problems around extension so it's more fun to
look at DELETE first.

The whole write-behind notion, and the realisation that we already
have it and should just make it into a "real thing", jumped out at me
while studying Melanie's VACUUM pass 1 and VACUUM pass 2 patches for
adding read streams.  Rebased and attached here.  That required
hacking on the new tidstore.c stuff a bit.  (We failed to get the
VACUUM read stream bits into v17, but the study of that led to the
default BAS_VACUUM size being cranked up to reflect modern realities,
and generally sent me down this rabbit hole for a while.)

Some problems:
* If you wake the WAL writer more often, throughput might actually go
down on high latency storage due to serialisation of WAL flushes.  So
far I have declined to try to write an adaptive algorithm to figure
out whether to do it, and where the threshold should be.  I suspect it
might involve measuring time and hill-climbing...  One option is to
abandon this part (ie just do no worse than master at WAL flushing),
or at least consider that a separate project.
* This might hold too many pins!  It does respect the limit mechanism,
but that can let you have a lot of pins (it's a bit TOCTOU-racy too,
we might need something smarter).  One idea would be to release pins
while writes are in the LSN queue, and reacquire them with
ReadRecentBuffer() as required, since we don't really care if someone
else evicts them in the meantime.
* It seems a bit weird that we *also* have the WritebackContext
machinery.  I could probably subsume that whole mechanism into
write_stream.c.  If you squint, sync_file_range() is a sort of dual of
POSIX_FADV_WILLNEED, which the read counterpart looks after.
* I would like to merge Heikki's bulk write stuff into this somehow,
but I haven't thought about it much yet.

The patches are POC-quality only and certainly have bugs/missed edge
cases/etc.  Thoughts, better ideas, references to writing about this
problem space, etc, welcome.
