Обсуждение: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
Hello list, I'm submitting a patch for improving an almost 1h long pause at the start of parallel pg_restore of a big archive. Related discussion has taken place at pgsql-performance mailing list at: https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net I think I explain it rather well in the commit message, so I paste it inline: Improve the performance of parallel pg_restore (-j) from a custom format pg_dump archive that does not include data offsets - typically happening when pg_dump has generated it by writing to stdout instead of a file. In this case pg_restore workers manifest constant looping of reading small sizes (4KB) and seeking forward small lenths (around 10KB for a compressed archive): read(4, "..."..., 4096) = 4096 lseek(4, 55544369152, SEEK_SET) = 55544369152 read(4, "..."..., 4096) = 4096 lseek(4, 55544381440, SEEK_SET) = 55544381440 read(4, "..."..., 4096) = 4096 lseek(4, 55544397824, SEEK_SET) = 55544397824 read(4, "..."..., 4096) = 4096 lseek(4, 55544414208, SEEK_SET) = 55544414208 read(4, "..."..., 4096) = 4096 lseek(4, 55544426496, SEEK_SET) = 55544426496 This happens as each worker scans the whole file until it finds the entry it wants, skipping forward each block. In combination to the small block size of the custom format dump, this causes many seeks and low performance. Fix by avoiding forward seeks for jumps of less than 1MB forward. Do instead sequential reads. Performance gain can be significant, depending on the size of the dump and the I/O subsystem. On my local NVMe drive, read speeds for that phase of pg_restore increased from 150MB/s to 3GB/s. This is my first patch submission, all help is much appreciated. Regards, Dimitris P.S. What is the recommended way to test a change, besides a generic make check? And how do I run selectively only the pg_dump/restore tests, in order to speed up my development routine?
Вложения
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote: > > P.S. What is the recommended way to test a change, besides a generic make > check? And how do I run selectively only the pg_dump/restore tests, in order > to speed up my development routine? I have tested it with: make -C src/bin/pg_dump check It didn't break any test, but I also don't see any difference, the performance boost is noticeable only when restoring a huge archive that is missing offsets. Any volunteer to review this one-line patch? Thanks, Dimitris
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Nathan Bossart
Дата:
On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote: > It didn't break any test, but I also don't see any difference, the > performance boost is noticeable only when restoring a huge archive that is > missing offsets. This seems generally reasonable to me, but how did you decide on 1MB as the threshold? Have you tested other values? Could the best threshold vary based on the workload and hardware? -- nathan
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
Thanks. This is the first value I tried and it works well. In the archive I have all blocks seem to be between 8 and 20KBso the jump forward before the change never even got close to 1MB. Could it be bigger in an uncompressed archive? Orin a future pg_dump that raises the block size? I don't really know, so it is difficult to test such scenario but it madesense to guard against these cases too. I chose 1MB by basically doing a very crude calculation in my mind: when would it be worth seeking forward instead of reading?On very slow drives 60MB/s sequential and 60 IOPS for random reads is a possible speed. In that worst case it wouldbe better to seek() forward for lengths of over 1MB. On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote: >On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote: >> It didn't break any test, but I also don't see any difference, the >> performance boost is noticeable only when restoring a huge archive that is >> missing offsets. > >This seems generally reasonable to me, but how did you decide on 1MB as the >threshold? Have you tested other values? Could the best threshold vary >based on the workload and hardware? >
Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
I just managed to run pgindent, here is v2 with the comment style fixed.
Вложения
[PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote: > I just managed to run pgindent, here is v2 with the comment style fixed. Any feedback on this one-liner? Or is the lack of feedback a clue that I have been missing something important in my patch submission? :-) Should I CC people that are frequent committers to the file? Thanks, Dimitris
Dimitrios Apostolou <jimis@gmx.net> writes: > Any feedback on this one-liner? Or is the lack of feedback a clue that I > have been missing something important in my patch submission? :-) The calendar ;-). At this point we're in feature freeze for v18, so things that aren't bugs aren't likely to get much attention until v19 development opens up (in July, unless things are really going badly with v18). You should add your patch to the July commitfest [1] to make sure we don't lose track of it. regards, tom lane [1] https://commitfest.postgresql.org/53/
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Mon, 14 Apr 2025, Tom Lane wrote: > > You should add your patch to the July commitfest [1] to make sure > we don't lose track of it. I rebased the patch (attached) and created an entry in the commitfest: https://commitfest.postgresql.org/patch/5809/ Thanks! Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
Attached now... On Mon, 9 Jun 2025, Dimitrios Apostolou wrote: > > > On Mon, 14 Apr 2025, Tom Lane wrote: > >> >> You should add your patch to the July commitfest [1] to make sure >> we don't lose track of it. > > I rebased the patch (attached) and created an entry in the commitfest: > > https://commitfest.postgresql.org/patch/5809/ > > > Thanks! > Dimitris > > >
Вложения
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Nathan Bossart
Дата:
On Mon, Jun 09, 2025 at 10:09:57PM +0200, Dimitrios Apostolou wrote: > Fix by avoiding forward seeks for jumps of less than 1MB forward. > Do instead sequential reads. > > Performance gain can be significant, depending on the size of the dump > and the I/O subsystem. On my local NVMe drive, read speeds for that > phase of pg_restore increased from 150MB/s to 3GB/s. I was curious about what exactly was leading to the performance gains you are seeing. This page has an explanation: https://www.mjr19.org.uk/IT/fseek.html I also wrote a couple of test programs to show the difference between fseeko-ing and fread-ing through a file with various sizes. On a Linux machine, I see this: log2(n) | fseeko | fread ---------+---------+------- 1 | 109.288 | 5.528 2 | 54.881 | 2.848 3 | 27.65 | 1.504 4 | 13.953 | 0.834 5 | 7.1 | 0.49 6 | 3.665 | 0.322 7 | 1.944 | 0.244 8 | 1.085 | 0.201 9 | 0.658 | 0.185 10 | 0.443 | 0.175 11 | 0.253 | 0.171 12 | 0.102 | 0.162 13 | 0.075 | 0.13 14 | 0.061 | 0.114 15 | 0.054 | 0.1 So, fseeko() starts winning around 4096 bytes. On macOS, the differences aren't quite as dramatic, but 4096 bytes is the break-even point there, too. I imagine there's a buffer around that size somewhere... This doesn't fully explain the results you are seeing, but it does seem to validate the idea. I'm curious if you see further improvement with even lower thresholds (e.g., 8KB, 16KB, 32KB). -- nathan
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Tue, 10 Jun 2025, Nathan Bossart wrote: > I also wrote a couple of test programs to show the difference between > fseeko-ing and fread-ing through a file with various sizes. On a Linux > machine, I see this: > > log2(n) | fseeko | fread > ---------+---------+------- > 1 | 109.288 | 5.528 > 2 | 54.881 | 2.848 > 3 | 27.65 | 1.504 > 4 | 13.953 | 0.834 > 5 | 7.1 | 0.49 > 6 | 3.665 | 0.322 > 7 | 1.944 | 0.244 > 8 | 1.085 | 0.201 > 9 | 0.658 | 0.185 > 10 | 0.443 | 0.175 > 11 | 0.253 | 0.171 > 12 | 0.102 | 0.162 > 13 | 0.075 | 0.13 > 14 | 0.061 | 0.114 > 15 | 0.054 | 0.1 > > So, fseeko() starts winning around 4096 bytes. On macOS, the differences > aren't quite as dramatic, but 4096 bytes is the break-even point there, > too. I imagine there's a buffer around that size somewhere... Thank you for benchmarking! Before answering in more depth, I'm curious, what read-seek pattern do you see on the system call level (as shown by strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K). Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Nathan Bossart
Дата:
On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote: > Thank you for benchmarking! Before answering in more depth, I'm curious, > what read-seek pattern do you see on the system call level (as shown by > strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K). For fseeko(), sizes less than 4096 produce a repeating pattern of read() calls followed by approximately (4096 / size) lseek() calls. For greater sizes, it's just a stream of lseek(). For fread(), sizes less than 4096 produce a stream of read(fd, "...", 4096), and for greater sizes, the only difference is that the last argument is the size. -- nathan
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Wed, 11 Jun 2025, Nathan Bossart wrote: > On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote: >> what read-seek pattern do you see on the system call level (as shown by >> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K). > > For fseeko(), sizes less than 4096 produce a repeating pattern of read() > calls followed by approximately (4096 / size) lseek() calls. For greater > sizes, it's just a stream of lseek(). This is the opposite of what the link you shared before describes, so maybe glibc has changed its behaviour to improve performance. Anyway, the fact that fseek(>4096) produces a stream of lseek()s, means that most likely no I/O is happening. You need to issue a getc() after each fseeko(), like pg_restore is doing. Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Tue, 10 Jun 2025, Nathan Bossart wrote: > So, fseeko() starts winning around 4096 bytes. On macOS, the differences > aren't quite as dramatic, but 4096 bytes is the break-even point there, > too. I imagine there's a buffer around that size somewhere... > > This doesn't fully explain the results you are seeing, but it does seem to > validate the idea. I'm curious if you see further improvement with even > lower thresholds (e.g., 8KB, 16KB, 32KB). By the way, I might have set the threshold to 1MB in my program, but lowering it won't show a difference in my test case, since the lseek()s I was noticing before the patch were mostly 8-16KB forward. Not sure what is the defining factor for that. Maybe the compression algorithm, or how wide the table is? Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Nathan Bossart
Дата:
On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote: > By the way, I might have set the threshold to 1MB in my program, but > lowering it won't show a difference in my test case, since the lseek()s I > was noticing before the patch were mostly 8-16KB forward. Not sure what is > the defining factor for that. Maybe the compression algorithm, or how wide > the table is? I may have missed it, but could you share what the strace looks like with the patch applied? -- nathan
On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > So, fseeko() starts winning around 4096 bytes. On macOS, the differences > aren't quite as dramatic, but 4096 bytes is the break-even point there, > too. I imagine there's a buffer around that size somewhere... BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to control FILE buffering. I suspect that glibc ignores the size if you pass NULL for my_buffer, so you'd need to allocate it yourself and it should probably be aligned on PG_IO_ALIGN_SIZE for best results (minimising the number of VM pages that must be held/pinned). Then you might be able to get better and less OS-dependent results. I haven't studied this seek business so I have no opinion on that and what a good size would be, but interesting sizes might be rounded to both PG_IO_ALIGN_SIZE and filesystem block size according to fstat(fileno(stream)). IDK, just a thought...
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Fri, 13 Jun 2025, Nathan Bossart wrote: > On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote: >> By the way, I might have set the threshold to 1MB in my program, but >> lowering it won't show a difference in my test case, since the lseek()s I >> was noticing before the patch were mostly 8-16KB forward. Not sure what is >> the defining factor for that. Maybe the compression algorithm, or how wide >> the table is? > > I may have missed it, but could you share what the strace looks like with > the patch applied? read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 12288) = 12288 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 12288) = 12288 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 12288) = 12288 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096 read(4, "..."..., 12288) = 12288 read(4, "..."..., 4096) = 4096 read(4, "..."..., 8192) = 8192 read(4, "..."..., 4096) = 4096
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Sat, 14 Jun 2025, Dimitrios Apostolou wrote: > On Fri, 13 Jun 2025, Nathan Bossart wrote: > >> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote: >>> By the way, I might have set the threshold to 1MB in my program, but >>> lowering it won't show a difference in my test case, since the lseek()s I >>> was noticing before the patch were mostly 8-16KB forward. Not sure what >>> is >>> the defining factor for that. Maybe the compression algorithm, or how >>> wide >>> the table is? >> >> I may have missed it, but could you share what the strace looks like with >> the patch applied? > > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 12288) = 12288 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 12288) = 12288 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 12288) = 12288 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 12288) = 12288 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 8192) = 8192 > read(4, "..."..., 4096) = 4096 This was from pg_restoring a zstd-compressed custom format dump. Out of curiosity I've tried the same with an uncompressed dump (--compress=none). Surprisingly it seems the blocksize is even smaller. With my patched pg_restore I only get 4K reads and nothing else on the strace output. read(4, "..."..., 4096) = 4096 read(4, "..."..., 4096) = 4096 read(4, "..."..., 4096) = 4096 read(4, "..."..., 4096) = 4096 read(4, "..."..., 4096) = 4096 read(4, "..."..., 4096) = 4096 The unpatched pg_restore gives me the weirdest output ever: read(4, "..."..., 4096) = 4096 lseek(4, 98527916032, SEEK_SET) = 98527916032 lseek(4, 98527916032, SEEK_SET) = 98527916032 lseek(4, 98527916032, SEEK_SET) = 98527916032 lseek(4, 98527916032, SEEK_SET) = 98527916032 lseek(4, 98527916032, SEEK_SET) = 98527916032 lseek(4, 98527916032, SEEK_SET) = 98527916032 [ ... repeats about 80 times ...] read(4, "..."..., 4096) = 4096 lseek(4, 98527920128, SEEK_SET) = 98527920128 lseek(4, 98527920128, SEEK_SET) = 98527920128 lseek(4, 98527920128, SEEK_SET) = 98527920128 lseek(4, 98527920128, SEEK_SET) = 98527920128 [ ... repeats ... ] Seeing this, I think we should really consider raising the pg_dump block size like Tom suggested on a previous thread. Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
Hi Nathan, I've noticed you've set yourself as a reviewer of this patch in the commitfest. I appreciate it, but you might want to combine it with another simple patch [1] that speeds up the same part of pg_restore: the initial full scan on TOC-less archives. [1] https://commitfest.postgresql.org/patch/5817/ On Saturday 2025-06-14 00:04, Nathan Bossart wrote: > > On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote: >> By the way, I might have set the threshold to 1MB in my program, but >> lowering it won't show a difference in my test case, since the lseek()s I >> was noticing before the patch were mostly 8-16KB forward. Not sure what is >> the defining factor for that. Maybe the compression algorithm, or how wide >> the table is? > > I may have missed it, but could you share what the strace looks like with > the patch applied? I hope you've seen my response here, with special focus on the small block size that both compressed and uncompressed custom format archives have. I have been needing to pg_restore 10TB TOC-less dumps recently, and it's a pain to do the full scan, even with both of my patches applied. Maybe the block size could be a command line option of pg_dump, so that one could set it to sizes like 100MB, which sounds like a normal block from the perspective of a 10TB gigantic dump. Regards, Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
От
Dimitrios Apostolou
Дата:
On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote: > Out of curiosity I've tried the same with an uncompressed dump > (--compress=none). Surprisingly it seems the blocksize is even smaller. > > With my patched pg_restore I only get 4K reads and nothing else on the strace > output. > > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 4096) = 4096 > read(4, "..."..., 4096) = 4096 To clarify this output again, I have a huge uncompressed custom format dump without TOC (because pg_dump was writing to stdout), and at this point pg_restore is going through the whole archive to find the items it needs. Allow me to explain what goes on at this point since I have better insight now. The code in question, in pg_backup_custom.c: /* * Skip data from current file position. * Data blocks are formatted as an integer length, followed by data. * A zero length indicates the end of the block. */ static void _skipData(ArchiveHandle *AH) { lclContext *ctx = (lclContext *) AH->formatData; size_t blkLen; char *buf = NULL; int buflen = 0; blkLen = ReadInt(AH); while (blkLen != 0) { /* Sequential access is usually faster, so avoid seeking if the * jump forward is shorter than 1MB. */ if (ctx->hasSeek && blkLen > 1024 * 1024) { if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0) pg_fatal("error during file seek: %m"); } else { if (blkLen > buflen) { free(buf); buf = (char *) pg_malloc(blkLen); buflen = blkLen; } if (fread(buf, 1, blkLen, AH->FH) != blkLen) { if (feof(AH->FH)) pg_fatal("could not read from input file: end of file"); else pg_fatal("could not read from input file: %m"); } } blkLen = ReadInt(AH); } free(buf); } blkLen is almost always a number around 35 to 38. So fread() is called all the time doing reads of about ~35 bytes. Then ReadInt() is actually doing getc() a few times. And it loops over. Libc is doing buffering of 4k, and that's how we end up seeing so many 4k reads. This also explains the ~80 lseek() between each 4k read() on the unpatched version, mentioned in previous email. I've tried setvbuf() like Thomas Munro suggested and I saw a significant improvement by allocating and using a 1MB buffer for libc stream buffering. Question that remains: where is pg_dump setting this ~35B length block? Is that easy to change without breaking old versions? Thanks in advance, Dimitris