Discussion: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
Hello list,

I'm submitting a patch for improving an almost 1h long pause at the start
of parallel pg_restore of a big archive. Related discussion has taken
place at pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it
inline:


Improve the performance of parallel pg_restore (-j) from a custom format
pg_dump archive that does not include data offsets - typically happening
when pg_dump has generated it by writing to stdout instead of a file.

In this case pg_restore workers exhibit a constant loop of reading
small sizes (4KB) and seeking forward short lengths (around 10KB for a
compressed archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET)         = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET)         = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET)         = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET)         = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET)         = 55544426496

This happens because each worker scans the whole file until it finds the
entry it wants, skipping forward one block at a time. In combination
with the small block size of the custom format dump, this causes many
seeks and low performance.

Fix by replacing forward seeks with sequential reads for jumps shorter
than 1MB.

Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.


This is my first patch submission, all help is much appreciated.
Regards,
Dimitris


P.S.  What is the recommended way to test a change, besides a generic make
check? And how do I run selectively only the pg_dump/restore tests, in
order to speed up my development routine?


Attachments

Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S.  What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

   make  -C src/bin/pg_dump  check

It didn't break any test, but I also don't see any difference, the
performance boost is noticeable only when restoring a huge archive that is
missing offsets.

Any volunteer to review this one-line patch?

Thanks,
Dimitris




On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
> It didn't break any test, but I also don't see any difference, the
> performance boost is noticeable only when restoring a huge archive that is
> missing offsets.

This seems generally reasonable to me, but how did you decide on 1MB as the
threshold?  Have you tested other values?  Could the best threshold vary
based on the workload and hardware?

-- 
nathan



Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
Thanks. This is the first value I tried and it works well. In the archive I have, all blocks seem to be between 8 and
20KB, so the jump forward before the change never even got close to 1MB. Could it be bigger in an uncompressed archive?
Or in a future pg_dump that raises the block size? I don't really know, so it is difficult to test such a scenario, but
it made sense to guard against these cases too.

I chose 1MB by doing a very crude calculation in my head: when would it be worth seeking forward instead of reading?
On very slow drives, 60MB/s sequential throughput and 60 IOPS for random reads is a plausible speed. In that worst case
it would be better to seek() forward only for lengths of over 1MB.

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold?  Have you tested other values?  Could the best threshold vary
>based on the workload and hardware?
>



Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
I just managed to run pgindent, here is v2 with the comment style fixed.
Attachments

[PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:
Dimitrios Apostolou
Date:
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:

> I just managed to run pgindent, here is v2 with the comment style fixed.

Any feedback on this one-liner? Or is the lack of feedback a clue that I
have been missing something important in my patch submission? :-)

Should I CC people that are frequent committers to the file?


Thanks,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Any feedback on this one-liner? Or is the lack of feedback a clue that I 
> have been missing something important in my patch submission? :-)

The calendar ;-).  At this point we're in feature freeze for v18,
so things that aren't bugs aren't likely to get much attention
until v19 development opens up (in July, unless things are really
going badly with v18).

You should add your patch to the July commitfest [1] to make sure
we don't lose track of it.

            regards, tom lane

[1] https://commitfest.postgresql.org/53/




On Mon, 14 Apr 2025, Tom Lane wrote:

>
> You should add your patch to the July commitfest [1] to make sure
> we don't lose track of it.

I rebased the patch (attached) and created an entry in the commitfest:

https://commitfest.postgresql.org/patch/5809/


Thanks!
Dimitris




Attached now...

On Mon, 9 Jun 2025, Dimitrios Apostolou wrote:

>
>
> On Mon, 14 Apr 2025, Tom Lane wrote:
>
>>
>>  You should add your patch to the July commitfest [1] to make sure
>>  we don't lose track of it.
>
> I rebased the patch (attached) and created an entry in the commitfest:
>
> https://commitfest.postgresql.org/patch/5809/
>
>
> Thanks!
> Dimitris
>
>
>
Attachments
On Mon, Jun 09, 2025 at 10:09:57PM +0200, Dimitrios Apostolou wrote:
> Fix by avoiding forward seeks for jumps of less than 1MB forward.
> Do instead sequential reads.
> 
> Performance gain can be significant, depending on the size of the dump
> and the I/O subsystem. On my local NVMe drive, read speeds for that
> phase of pg_restore increased from 150MB/s to 3GB/s.

I was curious about what exactly was leading to the performance gains you
are seeing.  This page has an explanation:

    https://www.mjr19.org.uk/IT/fseek.html

I also wrote a couple of test programs to show the difference between
fseeko-ing and fread-ing through a file with various sizes.  On a Linux
machine, I see this:

     log2(n) | fseeko  | fread
    ---------+---------+-------
           1 | 109.288 | 5.528
           2 |  54.881 | 2.848
           3 |   27.65 | 1.504
           4 |  13.953 | 0.834
           5 |     7.1 |  0.49
           6 |   3.665 | 0.322
           7 |   1.944 | 0.244
           8 |   1.085 | 0.201
           9 |   0.658 | 0.185
          10 |   0.443 | 0.175
          11 |   0.253 | 0.171
          12 |   0.102 | 0.162
          13 |   0.075 |  0.13
          14 |   0.061 | 0.114
          15 |   0.054 |   0.1

So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
aren't quite as dramatic, but 4096 bytes is the break-even point there,
too.  I imagine there's a buffer around that size somewhere...

This doesn't fully explain the results you are seeing, but it does seem to
validate the idea.  I'm curious if you see further improvement with even
lower thresholds (e.g., 8KB, 16KB, 32KB). 

-- 
nathan



On Tue, 10 Jun 2025, Nathan Bossart wrote:

> I also wrote a couple of test programs to show the difference between
> fseeko-ing and fread-ing through a file with various sizes.  On a Linux
> machine, I see this:
>
>     log2(n) | fseeko  | fread
>    ---------+---------+-------
>           1 | 109.288 | 5.528
>           2 |  54.881 | 2.848
>           3 |   27.65 | 1.504
>           4 |  13.953 | 0.834
>           5 |     7.1 |  0.49
>           6 |   3.665 | 0.322
>           7 |   1.944 | 0.244
>           8 |   1.085 | 0.201
>           9 |   0.658 | 0.185
>          10 |   0.443 | 0.175
>          11 |   0.253 | 0.171
>          12 |   0.102 | 0.162
>          13 |   0.075 |  0.13
>          14 |   0.061 | 0.114
>          15 |   0.054 |   0.1
>
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

Thank you for benchmarking! Before answering in more depth, I'm curious,
what read-seek pattern do you see on the system call level (as
shown by strace)? In pg_restore it was a constant loop of
read(4K)-lseek(8-16K).

Dimitris



On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
> Thank you for benchmarking! Before answering in more depth, I'm curious,
> what read-seek pattern do you see on the system call level (as shown by
> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).

For fseeko(), sizes less than 4096 produce a repeating pattern of read()
calls followed by approximately (4096 / size) lseek() calls.  For greater
sizes, it's just a stream of lseek().  For fread(), sizes less than 4096
produce a stream of read(fd, "...", 4096), and for greater sizes, the only
difference is that the last argument is the size.

-- 
nathan



On Wed, 11 Jun 2025, Nathan Bossart wrote:

> On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
>> what read-seek pattern do you see on the system call level (as shown by
>> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).
>
> For fseeko(), sizes less than 4096 produce a repeating pattern of read()
> calls followed by approximately (4096 / size) lseek() calls.  For greater
> sizes, it's just a stream of lseek().

This is the opposite of what the link you shared before describes, so
maybe glibc has changed its behaviour to improve performance.

Anyway, the fact that fseeko(>4096) produces a stream of lseek()s means
that most likely no I/O is happening. You need to issue a getc() after
each fseeko(), like pg_restore does.


Dimitris




On Tue, 10 Jun 2025, Nathan Bossart wrote:

> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...
>
> This doesn't fully explain the results you are seeing, but it does seem to
> validate the idea.  I'm curious if you see further improvement with even
> lower thresholds (e.g., 8KB, 16KB, 32KB).

By the way, I might have set the threshold to 1MB in my program, but
lowering it won't show a difference in my test case, since the lseek()s I
was noticing before the patch were mostly 8-16KB forward. Not sure what is
the defining factor for that. Maybe the compression algorithm, or how wide
the table is?


Dimitris




On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
> By the way, I might have set the threshold to 1MB in my program, but
> lowering it won't show a difference in my test case, since the lseek()s I
> was noticing before the patch were mostly 8-16KB forward. Not sure what is
> the defining factor for that. Maybe the compression algorithm, or how wide
> the table is?

I may have missed it, but could you share what the strace looks like with
the patch applied?

-- 
nathan



On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to
control FILE buffering.  I suspect that glibc ignores the size if you
pass NULL for my_buffer, so you'd need to allocate it yourself and it
should probably be aligned on PG_IO_ALIGN_SIZE for best results
(minimising the number of VM pages that must be held/pinned).  Then
you might be able to get better and less OS-dependent results.  I
haven't studied this seek business so I have no opinion on that and
what a good size would be, but interesting sizes might be
rounded to both PG_IO_ALIGN_SIZE and filesystem block size according
to fstat(fileno(stream)).  IDK, just a thought...



On Fri, 13 Jun 2025, Nathan Bossart wrote:

> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096





On Sat, 14 Jun 2025, Dimitrios Apostolou wrote:

> On Fri, 13 Jun 2025, Nathan Bossart wrote:
>
>>  On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>>>  By the way, I might have set the threshold to 1MB in my program, but
>>>  lowering it won't show a difference in my test case, since the lseek()s I
>>>  was noticing before the patch were mostly 8-16KB forward. Not sure what
>>>  is
>>>  the defining factor for that. Maybe the compression algorithm, or how
>>>  wide
>>>  the table is?
>>
>>  I may have missed it, but could you share what the strace looks like with
>>  the patch applied?
>
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096


This was from pg_restoring a zstd-compressed custom format dump.

Out of curiosity I've tried the same with an uncompressed dump
(--compress=none). Surprisingly it seems the blocksize is even smaller.

With my patched pg_restore I only get 4K reads and nothing else on
the strace output.

read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096

The unpatched pg_restore gives me the weirdest output ever:

read(4, "..."..., 4096) = 4096
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
[ ... repeats about 80 times ...]
read(4, "..."..., 4096) = 4096
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
[ ... repeats ... ]



Seeing this, I think we should really consider raising the pg_dump block
size, as Tom suggested in a previous thread.


Dimitris




Hi Nathan, I've noticed you've set yourself as a reviewer of this patch
in the commitfest. I appreciate it, but you might want to combine it
with another simple patch [1] that speeds up the same part of
pg_restore: the initial full scan on TOC-less archives.

[1] https://commitfest.postgresql.org/patch/5817/


On Saturday 2025-06-14 00:04, Nathan Bossart wrote:
>
> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

I hope you've seen my response here, with special focus on the small
block size that both compressed and uncompressed custom format archives
have.

I have had to pg_restore 10TB TOC-less dumps recently, and the full
scan is painful even with both of my patches applied.

Maybe the block size could be a command line option of pg_dump, so that
one could set it to sizes like 100MB, which sounds like a normal block
from the perspective of a 10TB gigantic dump.

Regards,
Dimitris




On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

> Out of curiosity I've tried the same with an uncompressed dump
> (--compress=none). Surprisingly it seems the blocksize is even smaller.
>
> With my patched pg_restore I only get 4K reads and nothing else on the strace
> output.
>
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096


To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on here, since I have better
insight now.

The code in question, in pg_backup_custom.c:


/*
 * Skip data from current file position.
 * Data blocks are formatted as an integer length, followed by data.
 * A zero length indicates the end of the block.
 */
static void
_skipData(ArchiveHandle *AH)
{
	lclContext *ctx = (lclContext *) AH->formatData;
	size_t		blkLen;
	char	   *buf = NULL;
	int			buflen = 0;

	blkLen = ReadInt(AH);
	while (blkLen != 0)
	{
		/*
		 * Sequential access is usually faster, so avoid seeking if the
		 * jump forward is shorter than 1MB.
		 */
		if (ctx->hasSeek && blkLen > 1024 * 1024)
		{
			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
				pg_fatal("error during file seek: %m");
		}
		else
		{
			if (blkLen > buflen)
			{
				free(buf);
				buf = (char *) pg_malloc(blkLen);
				buflen = blkLen;
			}
			if (fread(buf, 1, blkLen, AH->FH) != blkLen)
			{
				if (feof(AH->FH))
					pg_fatal("could not read from input file: end of file");
				else
					pg_fatal("could not read from input file: %m");
			}
		}

		blkLen = ReadInt(AH);
	}

	free(buf);
}


blkLen is almost always a number around 35 to 38, so fread() is called
constantly with reads of ~35 bytes. Then ReadInt() does a few getc()
calls, and the loop repeats.

Libc buffers in 4KB chunks, and that's how we end up seeing so many
4KB reads. This also explains the ~80 lseek() calls between each 4KB
read() in the unpatched version, mentioned in a previous email.

I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.

The question that remains: where does pg_dump decide on this ~35-byte
block length? Is that easy to change without breaking old versions?


Thanks in advance,
Dimitris