Re: Confine vacuum skip logic to lazy_scan_skip
From:        Melanie Plageman
Subject:     Re: Confine vacuum skip logic to lazy_scan_skip
Date:
Msg-id:      CAAKRu_bbkmwAzSBgnezancgJeXrQZXy4G4kBTd+5=cr86H5yew@mail.gmail.com
In reply to: Re: Confine vacuum skip logic to lazy_scan_skip  (Thomas Munro <thomas.munro@gmail.com>)
Replies:     Re: Confine vacuum skip logic to lazy_scan_skip
List:        pgsql-hackers
On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I've rebased the attached v10 over top of the changes to
> > lazy_scan_heap() Heikki just committed and over the v6 streaming read
> > patch set. I started testing them and see that you are right, we no
> > longer pin too many buffers. However, the uncached example below is
> > now slower with streaming read than on master -- it looks to be
> > because it is doing twice as many WAL writes and syncs. I'm still
> > investigating why that is.

--snip--

> 4. For learning/exploration only, I rebased my experimental vectored
> FlushBuffers() patch, which teaches the checkpointer to write relation
> data out using smgrwritev(). The checkpointer explicitly sorts
> blocks, but I think ring buffers should naturally often contain
> consecutive blocks in ring order. Highly experimental POC code pushed
> to a public branch[2], but I am not proposing anything here, just
> trying to understand things. The nicest looking system call trace was
> with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads
> and WAL writes 128kB at a time:
>
> pwrite(32,...,131072,0xfc6000) = 131072 (0x20000)
> fdatasync(32) = 0 (0x0)
> pwrite(27,...,131072,0x6c0000) = 131072 (0x20000)
> pread(27,...,131072,0x73e000) = 131072 (0x20000)
> pwrite(27,...,131072,0x6e0000) = 131072 (0x20000)
> pread(27,...,131072,0x75e000) = 131072 (0x20000)
> pwritev(27,[...],3,0x77e000) = 131072 (0x20000)
> preadv(27,[...],3,0x77e000) = 131072 (0x20000)
>
> That was a fun experiment, but... I recognise that efficient cleaning
> of ring buffers is a Hard Problem requiring more concurrency: it's
> just too late to be flushing that WAL. But we also don't want to
> start writing back data immediately after dirtying pages (cf. OS
> write-behind for big sequential writes in traditional Unixes), because
> we're not allowed to write data out without writing the WAL first and
> we currently need to build up bigger WAL writes to do so efficiently
> (cf. some other systems that can write out fragments of WAL
> concurrently so the latency-vs-throughput trade-off doesn't have to be
> so extreme). So we want to defer writing it, but not too long. We
> need something cleaning our buffers (or at least flushing the
> associated WAL, but preferably also writing the data) not too late and
> not too early, and more in sync with our scan than the WAL writer is.
> What that machinery should look like I don't know (but I believe
> Andres has ideas).

I've attached a WIP v11 streaming vacuum patch set here that is
rebased over master (by Thomas), so that I could add a CF entry for
it. It still has the problem with the extra WAL write and fsync calls
investigated by Thomas above.

Thomas has some work in progress doing streaming write-behind to
alleviate the issues with the buffer access strategy and streaming
reads. When he gets a version of that ready to share, he will start a
new "Streaming Vacuum" thread.

- Melanie
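As a side note for readers following the trace: the pwritev() lines are
single system calls carrying several consecutive 8kB pages at once. A
minimal standalone sketch of that gathering pattern follows (plain
POSIX C, not code from the patch; the file name, page count, and fill
bytes are invented for illustration):

/*
 * Minimal sketch: gather a run of consecutive 8kB pages into one
 * vectored write, as the pwritev() calls in the trace above do.
 * Plain POSIX C, not PostgreSQL code; names and sizes are invented.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */
#define NPAGES 16               /* 16 x 8kB = one 128kB write */

int
main(void)
{
    static char pages[NPAGES][BLCKSZ];
    struct iovec iov[NPAGES];
    ssize_t written;
    int fd;

    fd = open("demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Fill each "page" and point one iovec entry at it. */
    for (int i = 0; i < NPAGES; i++)
    {
        memset(pages[i], 'A' + i, BLCKSZ);
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /*
     * One system call writes all 16 consecutive pages at offset 0,
     * instead of 16 separate 8kB pwrite() calls.  (Real code must
     * loop on short writes.)
     */
    written = pwritev(fd, iov, NPAGES, 0);
    if (written != (ssize_t) NPAGES * BLCKSZ)
    {
        perror("pwritev");
        close(fd);
        return 1;
    }

    printf("wrote %zd bytes in one pwritev()\n", written);
    close(fd);
    return 0;
}

The real checkpointer path has the extra ordering constraint Thomas
describes above: the WAL covering those pages must be flushed before
the data write may be issued, which is why building up runs of
consecutive dirty buffers and bigger WAL writes go hand in hand.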
Attachments