WIP: Vectored writeback
From | Thomas Munro |
---|---|
Subject | WIP: Vectored writeback |
Date | |
Msg-id | CA+hUKGK1in4FiWtisXZ+Jo-cNSbWjmBcPww3w3DBM+whJTABXA@mail.gmail.com |
Replies | Re: WIP: Vectored writeback |
List | pgsql-hackers |
Hi,

Here are some vectored writeback patches I worked on in the 17 cycle and posted as part of various patch sets, but didn't get into a good enough shape to take further. They "push" vectored writes out, but I think what they need is to be turned inside out and converted into users of a new hypothetical write_stream.c, so that we have a model that will survive contact with asynchronous I/O and would "pull" writes from a stream that controls I/O concurrency. That all seemed a lot less urgent to work on than reads, hence leaving them on ice for now. There is a lot of code that reads, and a small finite amount that writes. I think the patches show some aspects of the problem-space though, and they certainly make checkpointing faster.

They cover 2 out of 5ish ways we write relation data: checkpointing, and strategies AKA ring buffers. They make checkpoints look like this, respecting io_combine_limit, instead of lots of 8kB writes:

    pwritev(9,[...],2,0x0) = 131072 (0x20000)
    pwrite(9,...,131072,0x20000) = 131072 (0x20000)
    pwrite(9,...,131072,0x40000) = 131072 (0x20000)
    pwrite(9,...,131072,0x60000) = 131072 (0x20000)
    pwrite(9,...,131072,0x80000) = 131072 (0x20000)
    ...

Two more ways data gets written back are bgwriter and regular BAS_NORMAL buffer eviction, but they are not such natural candidates for write combining. Well, if you know you're going to write out a buffer, *maybe* it's worth probing the buffer pool to see if adjacent block numbers, before and after, are also present and dirty? I don't know. Or maybe it's better to wait for the tree-based mapping table of legend first, so it becomes cheaper to navigate in block number order.

The 5th way is raw file copy that doesn't go through the buffer pool, such as CREATE DATABASE ... STRATEGY=FILE_COPY, which already works with big writes, and CREATE INDEX via bulk_write.c, which is easily converted to vectored writes, and I plan to push the patches for that shortly. I think those should ultimately become stream-based too.
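To make the write-combining idea concrete, here is a minimal standalone sketch in C. It is not code from the patches. It assumes the caller already has the dirty blocks sorted by block number, and it merges each run of consecutive blocks into one pwritev() call. The names run_length, write_combined, and MAX_COMBINE are hypothetical, and the limit parameter is only a stand-in for io_combine_limit expressed in blocks (16 blocks of 8kB gives the 131072-byte writes seen in the trace).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ      8192    /* block size, as in the 8kB writes mentioned above */
#define MAX_COMBINE 32      /* hypothetical upper bound on blocks per system call */

/*
 * Count how many entries starting at blocks[start] have consecutive block
 * numbers, capped at 'limit'.  'blocks' must be sorted in ascending order.
 */
static int
run_length(const unsigned *blocks, int n, int start, int limit)
{
    int cnt = 0;

    while (start + cnt < n && cnt < limit &&
           blocks[start + cnt] == blocks[start] + (unsigned) cnt)
        cnt++;
    return cnt;
}

/*
 * Write bufs[i] at block offset blocks[i] for all i, issuing one pwritev()
 * per run of adjacent blocks.  Returns the number of system calls made,
 * or -1 on error.  A short write is treated as an error here; real code
 * would retry the remainder.
 */
static int
write_combined(int fd, const unsigned *blocks, char **bufs, int n, int limit)
{
    int ncalls = 0;
    int i = 0;

    assert(limit > 0 && limit <= MAX_COMBINE);

    while (i < n)
    {
        struct iovec iov[MAX_COMBINE];
        int     cnt = run_length(blocks, n, i, limit);
        ssize_t want = (ssize_t) cnt * BLCKSZ;

        /* One iovec entry per buffer; the buffers need not be contiguous. */
        for (int j = 0; j < cnt; j++)
        {
            iov[j].iov_base = bufs[i + j];
            iov[j].iov_len = BLCKSZ;
        }

        if (pwritev(fd, iov, cnt, (off_t) blocks[i] * BLCKSZ) != want)
            return -1;

        ncalls++;
        i += cnt;
    }
    return ncalls;
}
```

With dirty blocks 0, 1, 2, 5, 6, 9 and a limit of 16, this issues three system calls instead of six. This is still the "push" shape; a stream-based design would instead have the I/O layer pull the next run when it has capacity.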
Anyway, I wanted to share these uncommitfest patches, having rebased them over relevant recent commits, so I could leave them in working state in case anyone is interested in this file I/O-level stuff...