Re: WIP: Vectored writeback
From | Thomas Munro
Subject | Re: WIP: Vectored writeback
Date |
Msg-id | CA+hUKG+Yh1PNdj+i=V6PAei3VNAMRhhY4CfRV0tdA6q3_cPrww@mail.gmail.com
In reply to | WIP: Vectored writeback (Thomas Munro <thomas.munro@gmail.com>)
List | pgsql-hackers
Here is a new straw-man patch set. I'd already shown the basic techniques for vectored writes from the buffer pool (FlushBuffers(), note the "s"), but that was sort of kludged into place while I was hacking on the lower-level bits and pieces, and now I'm building the layers further up.

The main idea is: you can clean buffers with a "WriteStream", and here are a bunch of example users to show that working. A WriteStream is approximately the opposite of a ReadStream (committed in v17). You push pinned dirty buffers into it (well, they don't have to be dirty, and it's OK if someone else cleans a buffer concurrently; the point is that you recently dirtied them). It combines buffers up to io_combine_limit, but defers writing as long as possible within some limits to avoid flushing the WAL, and tries to coordinate with the WAL writer. The WAL writer interaction is a very tricky problem, and that aspect is only a toy for now, but it's at least partially successful (see problems at the end).

The CHECKPOINT code uses the WriteStream API directly. It creates one stream per tablespace, so that the existing load balancing algorithm doesn't defeat the I/O combining algorithm. Unsurprisingly, it looks like this:

postgres=# checkpoint;
...
pwritev(18,...,2,0x1499e000) = 131072 (0x20000)
pwrite(18,...,131072,0x149be000) = 131072 (0x20000)
pwrite(18,...,131072,0x149de000) = 131072 (0x20000)
...

Sometimes you'll see it signalling the WAL writer. It builds up a queue of writes that it doesn't want to perform yet, in the hope of getting a free ride WRT WAL.

Other places can benefit from a more centrally placed write stream, indirectly. Our BAS_BULKWRITE and BAS_VACUUM buffer access strategies already perform "write-behind".
That's a name I borrowed from some OS stuff, where the kernel has clues that bulk data (for example a big file copy) will likely not be needed again soon, so you want to get it out of the way before it trashes your whole buffer pool (AKA "scan resistance"), but you want to defer just a little bit to perform I/O combining. That applies directly here, but we have the additional concern of delaying the referenced WAL write in the hope that someone else will do it for us. In this experiment, I am trying to give that pre-existing behaviour an explicit name (better names welcome!), and optimise it.

If you're dirtying buffers in a ring, you'll soon crash into your own tail and have to write it out, and it is very often sequential blocks due to the scan-like nature of many bulk I/O jobs, so I/O combining is very effective. The main problem is that you'll often have to flush WAL first, which this patch set tries to address to some extent. In the strategy write-behind case you don't really need an LSN-reordering queue, just a plain FIFO queue would do, but hopefully that doesn't cost much. (Cf CHECKPOINT, which sorts blocks by buffer tag, but expects LSNs in random order, so it does seem to need reordering.)

With this patch set, instead of calling ReleaseBuffer() after you've dirtied a buffer in one of those bulk writing code paths, you can use StrategyReleaseBuffer(), and the strategy will fire it into the stream to get I/O combining and LSN reordering; it'll be unpinned later, and certainly before you get the same buffer back for a new block. So those write-behind user patches are very short, they just do s/ReleaseBuffer/StrategyReleaseBuffer/ plus minor details. Unsurprisingly, it looks like this:

postgres=# copy t from program 'seq -f %1.0f 1 10000000';
...
pwrite(44,...,131072,0x2f986000) = 131072 (0x20000) <-- streaming write-behind!
pwrite(44,...,131072,0x2f966000) = 131072 (0x20000)
pwrite(44,...,131072,0x2f946000) = 131072 (0x20000)
...
postgres=# vacuum t;
...
pwrite(35,...,131072,0x3fb3e000) = 131072 (0x20000) <-- streaming write-behind!
preadv(35,...,122880}],2,0x3fb7a000) = 131072 (0x20000) <-- from Melanie's patch
pwritev(35,...,2,0x3fb5e000) = 131072 (0x20000)
pread(35,...,131072,0x3fb9a000) = 131072 (0x20000)
...

Next I considered how to get INSERT, UPDATE and DELETE to participate. The problem is that they use BAS_BULKREAD, even though they might dirty buffers. In master, BAS_BULKREAD doesn't do write-behind; instead it uses the "reject" mechanism: as soon as it smells a dirty buffer, it escapes the ring and abandons all hope of scan resistance. As buffer/README says in parentheses:

    Bulk writes work similarly to VACUUM. Currently this applies only to
    COPY IN and CREATE TABLE AS SELECT. (Might it be interesting to make
    seqscan UPDATE and DELETE use the bulkwrite strategy?) For bulk writes
    we use a ring size of 16MB (but not more than 1/8th of shared_buffers).

Hmm... what I'm now thinking is that the distinction might be a little bogus. Who knows how much scanned data will finish up being dirtied? I wonder if it would make more sense to abandon BAS_BULKREAD/BAS_BULKWRITE, and instead make an adaptive strategy: a ring that starts small, and grows/shrinks in response to dirty data (instead of "rejecting"). That would have at least superficial similarities to the ARC algorithm, the "adaptive" bit that controls ring size (it's interested in recency vs frequency, but here it's more like "we're willing to waste more memory on dirty data, because we need to keep it around longer, to avoid flushing the WAL, but not longer than that", which may be a different dimension to value cached data on, I'm not sure).
Of course there must be some workloads/machines where using a strategy (instead of BAS_BULKREAD when it degrades to BAS_NORMAL behaviour) will be slower because of WAL flushes, but that's not a fair fight: the flip side of that coin is that you've trashed the buffer pool, which is an external cost paid by someone else, ie it's anti-social, which is BufferAccessStrategy's very raison d'être.

Anyway, in the meantime, I hacked heapam.c to use BAS_BULKWRITE just to see how it would work with this patch set. (This causes an assertion to fail in some test, something about the stats for different IO contexts being upset by IOCONTEXT_BULKWRITE, which I didn't bother to debug; it's only a demo hack.) Unsurprisingly, it looks like this:

postgres=# delete from t;
...
pread(25,...,131072,0xc89e000) = 131072 (0x20000) <-- already-committed
pread(25,...,131072,0xc8be000) = 131072 (0x20000)     read-stream behaviour
kill(75954,SIGURG) = 0 (0x0) <-- hey WAL writer!
pread(25,...,131072,0xc8de000) = 131072 (0x20000)
pread(25,...,131072,0xc8fe000) = 131072 (0x20000)
...
pwrite(25,...,131072,0x15200000) = 131072 (0x20000) <-- write-behind!
pwrite(25,...,131072,0x151e0000) = 131072 (0x20000)
pwrite(25,...,131072,0x151c0000) = 131072 (0x20000)
...

UPDATE and INSERT conceptually work too, but they suffer from other stupid page-at-a-time problems around extension, so it's more fun to look at DELETE first.

The whole write-behind notion, and the realisation that we already have it and should just make it into a "real thing", jumped out at me while studying Melanie's VACUUM pass 1 and VACUUM pass 2 patches for adding read streams, rebased and attached here. That required hacking on the new tidstore.c stuff a bit. (We failed to get the VACUUM read stream bits into v17, but the study of that led to the default BAS_VACUUM size being cranked up to reflect modern realities, and generally sent me down this rabbit hole for a while.)
Some problems:

* If you wake the WAL writer more often, throughput might actually go down on high-latency storage due to serialisation of WAL flushes. So far I have declined to try to write an adaptive algorithm to figure out whether to do it, and where the threshold should be. I suspect it might involve measuring time and hill-climbing... One option is to abandon this part (ie just do no worse than master at WAL flushing), or at least consider it a separate project.

* This might hold too many pins! It does respect the limit mechanism, but that can let you have a lot of pins (it's a bit TOCTOU-racy too, we might need something smarter). One idea would be to release pins while writes are in the LSN queue, and reacquire them with ReadRecentBuffer() as required, since we don't really care if someone else evicts them in the meantime.

* It seems a bit weird that we *also* have the WritebackContext machinery. I could probably subsume that whole mechanism into write_stream.c. If you squint, sync_file_range() is a sort of dual of POSIX_FADV_WILLNEED, which the read counterpart looks after.

* I would like to merge Heikki's bulk write stuff into this somehow; I haven't yet thought about it much.

The patches are POC-quality only and certainly have bugs/missed edge cases/etc. Thoughts, better ideas, references to writing about this problem space, etc, welcome.
Attachments
- v2-0001-Teach-WritebackContext-to-work-with-block-ranges.patch
- v2-0002-Provide-vectored-variant-of-FlushOneBuffer.patch
- v2-0003-Provide-stream-API-for-cleaning-the-buffer-pool.patch
- v2-0004-Use-streaming-I-O-in-CHECKPOINT.patch
- v2-0005-Use-streaming-I-O-in-VACUUM-first-pass.patch
- v2-0006-Refactor-tidstore.c-memory-management.patch
- v2-0007-Use-streaming-I-O-in-VACUUM-second-pass.patch
- v2-0008-Use-streaming-I-O-in-BAS-to-do-write-behind.patch
- v2-0009-Use-streaming-I-O-in-COPY-via-write-behind.patch
- v2-0010-Use-streaming-I-O-in-VACUUM-via-write-behind.patch
- v2-0011-Use-streaming-I-O-in-DELETE-via-write-behind.-XXX.patch