Discussion: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
Hello list,

I'm submitting a patch for improving an almost 1h long pause at the start
of parallel pg_restore of a big archive. Related discussion has taken
place at pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it
inline:


Improve the performance of parallel pg_restore (-j) from a custom format
pg_dump archive that does not include data offsets - typically happening
when pg_dump has generated it by writing to stdout instead of a file.

In this case pg_restore workers exhibit a constant loop of reading
small sizes (4KB) and seeking forward short lengths (around 10KB for a
compressed archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET)         = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET)         = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET)         = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET)         = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET)         = 55544426496

This happens because each worker scans the whole file until it finds the
entry it wants, skipping forward one block at a time. In combination
with the small block size of the custom format dump, this causes many
seeks and low performance.

Fix by replacing forward seeks with sequential reads for jumps shorter
than 1MB.

Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.


This is my first patch submission, all help is much appreciated.
Regards,
Dimitris


P.S.  What is the recommended way to test a change, besides a generic make
check? And how do I run selectively only the pg_dump/restore tests, in
order to speed up my development routine?


Attachments

Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S.  What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

   make  -C src/bin/pg_dump  check

It didn't break any test, but I also don't see any difference, the
performance boost is noticeable only when restoring a huge archive that is
missing offsets.

Any volunteer to review this one-line patch?

Thanks,
Dimitris




On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
> It didn't break any test, but I also don't see any difference, the
> performance boost is noticeable only when restoring a huge archive that is
> missing offsets.

This seems generally reasonable to me, but how did you decide on 1MB as the
threshold?  Have you tested other values?  Could the best threshold vary
based on the workload and hardware?

-- 
nathan



Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
Thanks. This is the first value I tried and it works well. In the archive I have, all blocks seem to be between 8 and
20KB, so the jump forward before the change never even got close to 1MB. Could it be bigger in an uncompressed archive?
Or in a future pg_dump that raises the block size? I don't really know, so it is difficult to test such a scenario, but
it made sense to guard against these cases too.

I chose 1MB by doing a very crude calculation in my head: when would it be worth seeking forward instead of reading?
On very slow drives, 60MB/s sequential throughput and 60 IOPS for random reads is a plausible speed. In that worst case
it would be better to seek() forward only for lengths of over 1MB.

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold?  Have you tested other values?  Could the best threshold vary
>based on the workload and hardware?
>



Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
I just managed to run pgindent, here is v2 with the comment style fixed.
Attachments

[PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:

> I just managed to run pgindent, here is v2 with the comment style fixed.

Any feedback on this one-liner? Or is the lack of feedback a clue that I
have been missing something important in my patch submission? :-)

Should I CC people that are frequent committers to the file?


Thanks,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Any feedback on this one-liner? Or is the lack of feedback a clue that I 
> have been missing something important in my patch submission? :-)

The calendar ;-).  At this point we're in feature freeze for v18,
so things that aren't bugs aren't likely to get much attention
until v19 development opens up (in July, unless things are really
going badly with v18).

You should add your patch to the July commitfest [1] to make sure
we don't lose track of it.

            regards, tom lane

[1] https://commitfest.postgresql.org/53/




On Mon, 14 Apr 2025, Tom Lane wrote:

>
> You should add your patch to the July commitfest [1] to make sure
> we don't lose track of it.

I rebased the patch (attached) and created an entry in the commitfest:

https://commitfest.postgresql.org/patch/5809/


Thanks!
Dimitris




Attached now...

On Mon, 9 Jun 2025, Dimitrios Apostolou wrote:

>
>
> On Mon, 14 Apr 2025, Tom Lane wrote:
>
>>
>>  You should add your patch to the July commitfest [1] to make sure
>>  we don't lose track of it.
>
> I rebased the patch (attached) and created an entry in the commitfest:
>
> https://commitfest.postgresql.org/patch/5809/
>
>
> Thanks!
> Dimitris
>
>
>
Attachments
On Mon, Jun 09, 2025 at 10:09:57PM +0200, Dimitrios Apostolou wrote:
> Fix by avoiding forward seeks for jumps of less than 1MB forward.
> Do instead sequential reads.
> 
> Performance gain can be significant, depending on the size of the dump
> and the I/O subsystem. On my local NVMe drive, read speeds for that
> phase of pg_restore increased from 150MB/s to 3GB/s.

I was curious about what exactly was leading to the performance gains you
are seeing.  This page has an explanation:

    https://www.mjr19.org.uk/IT/fseek.html

I also wrote a couple of test programs to show the difference between
fseeko-ing and fread-ing through a file with various sizes.  On a Linux
machine, I see this:

     log2(n) | fseeko  | fread
    ---------+---------+-------
           1 | 109.288 | 5.528
           2 |  54.881 | 2.848
           3 |   27.65 | 1.504
           4 |  13.953 | 0.834
           5 |     7.1 |  0.49
           6 |   3.665 | 0.322
           7 |   1.944 | 0.244
           8 |   1.085 | 0.201
           9 |   0.658 | 0.185
          10 |   0.443 | 0.175
          11 |   0.253 | 0.171
          12 |   0.102 | 0.162
          13 |   0.075 |  0.13
          14 |   0.061 | 0.114
          15 |   0.054 |   0.1

So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
aren't quite as dramatic, but 4096 bytes is the break-even point there,
too.  I imagine there's a buffer around that size somewhere...

This doesn't fully explain the results you are seeing, but it does seem to
validate the idea.  I'm curious if you see further improvement with even
lower thresholds (e.g., 8KB, 16KB, 32KB). 

-- 
nathan



On Tue, 10 Jun 2025, Nathan Bossart wrote:

> I also wrote a couple of test programs to show the difference between
> fseeko-ing and fread-ing through a file with various sizes.  On a Linux
> machine, I see this:
>
>     log2(n) | fseeko  | fread
>    ---------+---------+-------
>           1 | 109.288 | 5.528
>           2 |  54.881 | 2.848
>           3 |   27.65 | 1.504
>           4 |  13.953 | 0.834
>           5 |     7.1 |  0.49
>           6 |   3.665 | 0.322
>           7 |   1.944 | 0.244
>           8 |   1.085 | 0.201
>           9 |   0.658 | 0.185
>          10 |   0.443 | 0.175
>          11 |   0.253 | 0.171
>          12 |   0.102 | 0.162
>          13 |   0.075 |  0.13
>          14 |   0.061 | 0.114
>          15 |   0.054 |   0.1
>
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

Thank you for benchmarking! Before answering in more depth, I'm curious,
what read-seek pattern do you see on the system call level (as
shown by strace)? In pg_restore it was a constant loop of
read(4K)-lseek(8-16K).

Dimitris



On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
> Thank you for benchmarking! Before answering in more depth, I'm curious,
> what read-seek pattern do you see on the system call level (as shown by
> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).

For fseeko(), sizes less than 4096 produce a repeating pattern of read()
calls followed by approximately (4096 / size) lseek() calls.  For greater
sizes, it's just a stream of lseek().  For fread(), sizes less than 4096
produce a stream of read(fd, "...", 4096), and for greater sizes, the only
difference is that the last argument is the size.

-- 
nathan



On Wed, 11 Jun 2025, Nathan Bossart wrote:

> On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
>> what read-seek pattern do you see on the system call level (as shown by
>> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).
>
> For fseeko(), sizes less than 4096 produce a repeating pattern of read()
> calls followed by approximately (4096 / size) lseek() calls.  For greater
> sizes, it's just a stream of lseek().

This is the opposite of what the link you shared before describes, so
maybe glibc has changed its behaviour to improve performance.

Anyway, the fact that fseeko(>4096) produces a stream of lseek()s means
that most likely no I/O is happening. You need to issue a getc() after
each fseeko(), like pg_restore does.


Dimitris




On Tue, 10 Jun 2025, Nathan Bossart wrote:

> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...
>
> This doesn't fully explain the results you are seeing, but it does seem to
> validate the idea.  I'm curious if you see further improvement with even
> lower thresholds (e.g., 8KB, 16KB, 32KB).

By the way, I might have set the threshold to 1MB in my program, but
lowering it won't show a difference in my test case, since the lseek()s I
was noticing before the patch were mostly 8-16KB forward. Not sure what is
the defining factor for that. Maybe the compression algorithm, or how wide
the table is?


Dimitris




On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
> By the way, I might have set the threshold to 1MB in my program, but
> lowering it won't show a difference in my test case, since the lseek()s I
> was noticing before the patch were mostly 8-16KB forward. Not sure what is
> the defining factor for that. Maybe the compression algorithm, or how wide
> the table is?

I may have missed it, but could you share what the strace looks like with
the patch applied?

-- 
nathan



On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to
control FILE buffering.  I suspect that glibc ignores the size if you
pass NULL for my_buffer, so you'd need to allocate it yourself and it
should probably be aligned on PG_IO_ALIGN_SIZE for best results
(minimising the number of VM pages that must be held/pinned).  Then
you might be able to get better and less OS-dependent results.  I
haven't studied this seek business so I have no opinion on that and
what a good size would be, but interesting sizes might be
rounded to both PG_IO_ALIGN_SIZE and filesystem block size according
to fstat(fileno(stream)).  IDK, just a thought...



On Fri, 13 Jun 2025, Nathan Bossart wrote:

> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096





On Sat, 14 Jun 2025, Dimitrios Apostolou wrote:

> On Fri, 13 Jun 2025, Nathan Bossart wrote:
>
>>  On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>>>  By the way, I might have set the threshold to 1MB in my program, but
>>>  lowering it won't show a difference in my test case, since the lseek()s I
>>>  was noticing before the patch were mostly 8-16KB forward. Not sure what
>>>  is
>>>  the defining factor for that. Maybe the compression algorithm, or how
>>>  wide
>>>  the table is?
>>
>>  I may have missed it, but could you share what the strace looks like with
>>  the patch applied?
>
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096


This was from pg_restoring a zstd-compressed custom format dump.

Out of curiosity I've tried the same with an uncompressed dump
(--compress=none). Surprisingly it seems the blocksize is even smaller.

With my patched pg_restore I only get 4K reads and nothing else on
the strace output.

read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096

The unpatched pg_restore gives me the weirdest output ever:

read(4, "..."..., 4096) = 4096
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
[ ... repeats about 80 times ...]
read(4, "..."..., 4096) = 4096
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
[ ... repeats ... ]



Seeing this, I think we should really consider raising the pg_dump block
size, as Tom suggested in a previous thread.


Dimitris




Hi Nathan, I've noticed you've set yourself as a reviewer of this patch
in the commitfest. I appreciate it, but you might want to combine it
with another simple patch [1] that speeds up the same part of
pg_restore: the initial full scan on TOC-less archives.

[1] https://commitfest.postgresql.org/patch/5817/


On Saturday 2025-06-14 00:04, Nathan Bossart wrote:
>
> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

I hope you've seen my response here, with special focus on the small
block size that both compressed and uncompressed custom format archives
have.

I have had to pg_restore 10TB TOC-less dumps recently, and the full
scan is painful even with both of my patches applied.

Maybe the block size could be a command line option of pg_dump, so that
one could set it to sizes like 100MB, which sounds like a normal block
from the perspective of a 10TB gigantic dump.

Regards,
Dimitris




On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

> Out of curiosity I've tried the same with an uncompressed dump
> (--compress=none). Surprisingly it seems the blocksize is even smaller.
>
> With my patched pg_restore I only get 4K reads and nothing else on the strace
> output.
>
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096


To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on here, since I have better
insight now.

The code in question, in pg_backup_custom.c:


/*
 * Skip data from current file position.
 * Data blocks are formatted as an integer length, followed by data.
 * A zero length indicates the end of the block.
 */
static void
_skipData(ArchiveHandle *AH)
{
	lclContext *ctx = (lclContext *) AH->formatData;
	size_t		blkLen;
	char	   *buf = NULL;
	int			buflen = 0;

	blkLen = ReadInt(AH);
	while (blkLen != 0)
	{
		/*
		 * Sequential access is usually faster, so avoid seeking if the
		 * jump forward is shorter than 1MB.
		 */
		if (ctx->hasSeek && blkLen > 1024 * 1024)
		{
			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
				pg_fatal("error during file seek: %m");
		}
		else
		{
			if (blkLen > buflen)
			{
				free(buf);
				buf = (char *) pg_malloc(blkLen);
				buflen = blkLen;
			}
			if (fread(buf, 1, blkLen, AH->FH) != blkLen)
			{
				if (feof(AH->FH))
					pg_fatal("could not read from input file: end of file");
				else
					pg_fatal("could not read from input file: %m");
			}
		}

		blkLen = ReadInt(AH);
	}

	free(buf);
}


blkLen is almost always a number around 35 to 38, so fread() is called
constantly with reads of ~35 bytes. Then ReadInt() does a few getc()
calls, and the loop repeats.

Libc buffers in 4KB chunks, and that's how we end up seeing so many
4KB reads. This also explains the ~80 lseek() calls between each 4KB
read() in the unpatched version, mentioned in a previous email.

I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.

The question that remains: where does pg_dump decide on this ~35-byte
block length? Is that easy to change without breaking old versions?


Thanks in advance,
Dimitris