Re: WIP: WAL prefetch (another approach)
From: Thomas Munro
Subject: Re: WIP: WAL prefetch (another approach)
Msg-id: CA+hUKG+2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw@mail.gmail.com
In reply to: Re: WIP: WAL prefetch (another approach) (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> OK, thanks for looking into this. I guess I'll wait for an updated patch
> before testing this further. The storage has limited capacity so I'd
> have to either reduce the amount of data/WAL or juggle with the WAL
> segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my
tests. The main change I have been working on is that there is now just
a single XLogReaderState, so no more double-reading and double-decoding
of the WAL. It provides XLogReadRecord(), as before, but now you can
also read further ahead with XLogReadAhead().

The user interface is much like before, except that the GUCs changed a
bit. They are now:

  recovery_prefetch=on
  recovery_prefetch_fpw=off
  wal_decode_buffer_size=256kB
  maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and
wal_decode_buffer_size much higher than those defaults (see the
illustrative settings at the end of this mail).

There are a few TODOs and questions remaining. One issue I'm wondering
about is whether it is OK that bulky FPI data is now memcpy'd into the
decode buffer, whereas before we sometimes avoided that, when it
happened not to cross a page boundary; I have some ideas on how to do
better (basically two levels of ring buffer), but I haven't looked into
that yet. Another issue is the new 'nowait' API for the page-read
callback; I'm trying to figure out whether that is sufficient, or
whether something more sophisticated, perhaps including a different
return value, is required (a sketch of how the two read functions might
fit together follows at the end of this mail). Another thing I'm
wondering about is whether I have timeline changes adequately handled.

This design opens up a lot of possibilities for future performance
improvements. Some examples:

1. By adding some workspace to decoded records, the prefetcher can
leave breadcrumbs for XLogReadBufferForRedoExtended(), so that it
usually avoids the need for a second buffer mapping table lookup.
Incidentally, this also skips the hot smgropen() calls that Jakub
complained about. I have added an experimental patch like that, but I
need to look into the interlocking some more. (A sketch of the
breadcrumb idea follows below.)

2. By inspecting future records in the record->next chain, a redo
function could merge work in various ways, in quite a simple and
localised way. A couple of examples:

2.1. If there is a sequence of records of the same type touching the
same page, you could process all of them while you hold the page lock
(sketched below).

2.2. If there is a sequence of relation extensions (say, a sequence of
multi-tuple inserts to the end of a relation, as commonly seen in bulk
data loads), then instead of generating many pwrite(8KB of zeroes)
syscalls record-by-record to extend the relation, a single
posix_fallocate(1MB) could extend the file in one shot (sketched
below). Assuming the bgwriter is running and doing a good job, this
would remove most of the system calls from bulk-load recovery.

3. More sophisticated analysis could find records to merge that are a
bit further apart, under carefully controlled conditions; for example,
if you have a sequence like heap-insert, btree-insert, heap-insert,
btree-insert, ..., then a simple next-record scheme like 2 won't see
the opportunities, but something a teensy bit smarter could.

4. Since the decode buffer can be placed in shared memory (decoded
records contain pointers, but they don't point to any other memory
region, with the exception of clearly marked oversized records), we
could begin to contemplate handing work off to other processes, given a
clever dependency analysis scheme and some more infrastructure.
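To make the tuning recommendation above concrete, here is an
illustrative postgresql.conf fragment. The values are assumptions for a
machine with a deep I/O queue, not tuned recommendations from the
patch:

    # postgresql.conf -- illustrative values only
    recovery_prefetch = on
    recovery_prefetch_fpw = off
    wal_decode_buffer_size = 2MB         # patch default: 256kB
    maintenance_io_concurrency = 100     # patch default: 10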
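To make the single-reader API shape concrete, here is a minimal sketch
of how a recovery loop might combine the two calls. Only the names
XLogReadRecord() and XLogReadAhead() come from this mail; the
signatures, the 'nowait' flag placement, and the helper functions are
all invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in types; the real ones live in the patch. */
    typedef struct XLogReaderState XLogReaderState;
    typedef struct DecodedXLogRecord DecodedXLogRecord;

    extern DecodedXLogRecord *XLogReadRecord(XLogReaderState *reader);
    extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *reader,
                                            bool nowait);
    extern void PrefetchReferencedBlocks(DecodedXLogRecord *record);
    extern void ApplyRecord(DecodedXLogRecord *record);

    static void
    recovery_loop(XLogReaderState *reader)
    {
        DecodedXLogRecord *record;

        for (;;)
        {
            DecodedXLogRecord *ahead;

            /*
             * Decode further ahead into the same buffer and start I/O
             * for referenced blocks, without blocking on WAL that has
             * not arrived yet ('nowait' returns NULL in that case,
             * under this sketch's assumptions).
             */
            while ((ahead = XLogReadAhead(reader, true)) != NULL)
                PrefetchReferencedBlocks(ahead);

            /* Consume the next already-decoded record, or block. */
            record = XLogReadRecord(reader);
            if (record == NULL)
                break;          /* end of WAL */
            ApplyRecord(record);
        }
    }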
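As a sketch of the breadcrumb idea in 1. -- hypothetical: the field
name 'recent_buffer', the helpers, and the validation step are all
invented here, not the experimental patch's actual scheme:

    #include <stdbool.h>

    typedef int Buffer;
    #define InvalidBuffer 0

    /* Stand-in for one decoded block reference. */
    typedef struct DecodedBkpBlock
    {
        /* ... relfilenode, fork number, block number ... */
        Buffer      recent_buffer; /* breadcrumb from the prefetcher */
    } DecodedBkpBlock;

    extern Buffer LookupBuffer(DecodedBkpBlock *blk); /* mapping lookup */
    extern bool   BufferStillHoldsBlock(Buffer buf, DecodedBkpBlock *blk);

    static Buffer
    redo_get_buffer(DecodedBkpBlock *blk)
    {
        /*
         * Fast path: trust the breadcrumb if it still points at our
         * block.  It can be stale, so it must be re-checked under the
         * usual interlocking -- the part the mail above says still
         * needs more thought.
         */
        if (blk->recent_buffer != InvalidBuffer &&
            BufferStillHoldsBlock(blk->recent_buffer, blk))
            return blk->recent_buffer;

        return LookupBuffer(blk);   /* slow path: normal lookup */
    }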
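Here is a sketch of 2.1. The record->next chain is from this mail; the
struct layout, helpers, and locking protocol are assumptions:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct DecodedXLogRecord
    {
        struct DecodedXLogRecord *next; /* next decoded record, if any */
        uint8_t     rmid;               /* resource manager id */
        uint8_t     info;               /* record type bits */
        /* ... block references, payload ... */
    } DecodedXLogRecord;

    extern bool same_page_same_type(const DecodedXLogRecord *a,
                                    const DecodedXLogRecord *b);
    extern void apply_to_locked_page(DecodedXLogRecord *record,
                                     void *page);

    /*
     * Apply a run of consecutive same-type records against the same
     * page while the caller holds the page lock once, saving a
     * lock/unlock cycle (and a buffer lookup) per merged record.
     * Returns the last record consumed so the caller can resume
     * after the merged run.
     */
    static DecodedXLogRecord *
    redo_merged_run(DecodedXLogRecord *record, void *locked_page)
    {
        apply_to_locked_page(record, locked_page);
        while (record->next != NULL &&
               same_page_same_type(record, record->next))
        {
            record = record->next;
            apply_to_locked_page(record, locked_page);
        }
        return record;
    }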
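And a sketch of 2.2 -- assuming a caller that has already scanned the
decoded chain to find the highest block number the upcoming extension
records will need (all names here are invented):

    #include <fcntl.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    /*
     * Extend the relation file once to cover everything the lookahead
     * saw, instead of issuing one pwrite() of 8KB of zeroes per
     * record.  Returns 0 on success or an errno value on failure, as
     * posix_fallocate() does.
     */
    static int
    extend_for_lookahead(int fd, uint32_t current_blocks,
                         uint32_t target_blocks)
    {
        if (target_blocks <= current_blocks)
            return 0;           /* already long enough */

        return posix_fallocate(fd,
                               (off_t) current_blocks * BLCKSZ,
                               (off_t) (target_blocks - current_blocks)
                                       * BLCKSZ);
    }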