Discussion: [PATCH] Better Performance for PostgreSQL with large INSERTs
Sometimes, storing documents (eg. PDFs) in a database
is much easier than using a separate storage (like S3, NFS, etc.).

(Because of issues like backup integrity, availability,
service dependencies, access rights, encryption of data, etc.)


With this patch:

```diff
diff --git i/src/backend/libpq/pqcomm.c w/src/backend/libpq/pqcomm.c
index e517146..936b073 100644
--- i/src/backend/libpq/pqcomm.c
+++ w/src/backend/libpq/pqcomm.c
@@ -117,7 +117,8 @@ static List *sock_paths = NIL;
  */
 
 #define PQ_SEND_BUFFER_SIZE 8192
-#define PQ_RECV_BUFFER_SIZE 8192
+#define PQ_RECV_BUFFER_SIZE 2097152
+
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
```

ie. changing the network receive buffer size from 8KB to 2MB, we got 7% better
INSERT performance when storing BLOBs.

The 2MB value is just what we tried; 128kB or 256kB work as well.
The main point is to reduce the number of syscalls for receiving data
to about half of what it is with 8KB.


Thank you for your consideration!


Regards,

Phil
Hi!

On Mon, 15 Sept 2025 at 18:16, Philipp Marek <philipp@marek.priv.at> wrote:
>
> Sometimes, storing documents (eg. PDFs) in a database
> is much easier than using a separate storage (like S3, NFS, etc.).
>
> (Because of issues like backup integrity, availability,
> service dependencies, access rights, encryption of data, etc.)
>
>
> With this patch:
>
> ```diff
> diff --git i/src/backend/libpq/pqcomm.c w/src/backend/libpq/pqcomm.c
> index e517146..936b073 100644
> --- i/src/backend/libpq/pqcomm.c
> +++ w/src/backend/libpq/pqcomm.c
> @@ -117,7 +117,8 @@ static List *sock_paths = NIL;
>  */
>
> #define PQ_SEND_BUFFER_SIZE 8192
> -#define PQ_RECV_BUFFER_SIZE 8192
> +#define PQ_RECV_BUFFER_SIZE 2097152
> +

Changing this constant will result in an overwhelming increase of
memory consumption for instances that work with a large number of
connections (max_connections ~ 1e4) for zero benefit.

--
Best regards,
Kirill Reshke
Hi Kirill,
> Changing this constant will result in an overwhelming increase of
> memory consumption for instances that work with a large number of
> connections (
> max_connections ~ 1e4) for zero benefit.
No, it's not that bad.
1) With this being allocated in the bss segment,
the RAM will only actually be provided on _first use_ --
so processes that only ever work with small queries will see no
difference. (A small demonstration follows below.)
2) With temp_buffers' default of 8MB, work_mem's default of 4MB, etc.,
using an additional 256kB to 2MB of RAM _when needed_ is a good deal
for about 7% performance improvement.
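To illustrate point 1, here is a small standalone Linux sketch (not PostgreSQL code; the 2MB size just mirrors the value proposed above). The zero-initialized static buffer lands in .bss, and VmRSS only grows once its pages are actually written:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static char recv_buffer[2 * 1024 * 1024];   /* zero-initialized => .bss */

static void
print_rss(const char *label)
{
    char    line[256];
    FILE   *f = fopen("/proc/self/status", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-24s %s", label, line);
    fclose(f);
}

int
main(void)
{
    print_rss("before touching buffer:");

    for (size_t i = 0; i < sizeof(recv_buffer); i += 4096)
        recv_buffer[i] = 1;     /* touch every page once */

    print_rss("after touching buffer:");
    printf("(checksum: %d)\n",
           recv_buffer[0] + recv_buffer[sizeof(recv_buffer) - 1]);
    return 0;
}
```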
Here's the patch again, this time with a 128kB buffer size.

This gives us nearly the same gains (~7%) for the blob INSERTs,
and the additional memory usage (120kB) shouldn't really matter,
with "temp_buffer"s 8MB and "work_mem" 4MB defaults.

Making it configurable would give a much more complex patch --
so I suggest just using this fixed size.

Thanks!


diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 25f739a6..72d092b2 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -117,7 +117,7 @@ static List *sock_paths = NIL;
  */
 
 #define PQ_SEND_BUFFER_SIZE 8192
-#define PQ_RECV_BUFFER_SIZE 8192
+#define PQ_RECV_BUFFER_SIZE (128 * 1024)
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
Hi,
7% is an interesting improvement for these cases. I see it as a nice feature.
What about adding a GUC variable to have the ability to configure PQ_RECV_BUFFER_SIZE based on the user's needs?
-Filip-
On Tue, 30 Sept 2025 at 10:42, Philipp Marek <philipp@marek.priv.at> wrote:
Here's the patch again, this time with a 128kB buffer size.
This gives us nearly the same gains (~7%) for the blob INSERTs,
and the additional memory usage (120kB) shouldn't really matter,
with "temp_buffer"s 8MB and "work_mem" 4MB defaults.
Making it configurable would give a much more complex patch --
so I suggest just using this fixed size.
Thanks!
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 25f739a6..72d092b2 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -117,7 +117,7 @@ static List *sock_paths = NIL;
*/
#define PQ_SEND_BUFFER_SIZE 8192
-#define PQ_RECV_BUFFER_SIZE 8192
+#define PQ_RECV_BUFFER_SIZE (128 * 1024)
static char *PqSendBuffer;
static int PqSendBufferSize; /* Size send buffer */
Hi,
On 2025-09-30 10:42:00 +0200, Philipp Marek wrote:
> Here's the patch again, this time with a 128kB buffer size.
>
> This gives us nearly the same gains (~7%) for the blob INSERTs,
> and the additional memory usage (120kB) shouldn't really matter,
> with "temp_buffer"s 8MB and "work_mem" 4MB defaults.
>
> Making it configurable would give a much more complex patch --
> so I suggest just using this fixed size.
Have you tried to verify that this doesn't cause performance regressions in
other workloads? pq_recvbuf() has this code:
    if (PqRecvPointer > 0)
    {
        if (PqRecvLength > PqRecvPointer)
        {
            /* still some unread data, left-justify it in the buffer */
            memmove(PqRecvBuffer, PqRecvBuffer + PqRecvPointer,
                    PqRecvLength - PqRecvPointer);
            PqRecvLength -= PqRecvPointer;
            PqRecvPointer = 0;
        }
        else
            PqRecvLength = PqRecvPointer = 0;
    }
I do seem to recall that just increasing the buffer size substantially led to
more time being spent inside that memmove() (likely due to exceeding L1/L2).
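To get a feel for when the copy itself starts to matter, a standalone microbenchmark along these lines could be used (just a sketch, not PostgreSQL code; the buffer sizes, the assumption that half the buffer is still unread, and the iteration count are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main(void)
{
    const size_t sizes[] = {8 * 1024, 128 * 1024, 2 * 1024 * 1024};
    const int    iters = 10000;

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
    {
        size_t  size = sizes[i];
        size_t  leftover = size / 2;    /* pretend half the buffer is unread */
        char   *buf = malloc(size);
        struct timespec t0, t1;

        memset(buf, 'x', size);         /* fault the pages in first */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int n = 0; n < iters; n++)
            memmove(buf, buf + size - leftover, leftover);  /* left-justify */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double  ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec);

        printf("%8zu-byte buffer: %10.0f ns per memmove of %zu bytes\n",
               size, ns / iters, leftover);
        free(buf);
    }
    return 0;
}
```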
Greetings,
Andres Freund
Hi Filip,

> 7% is an interesting improvement for these cases. I see it as a nice
> feature,
> What about adding a GUC variable to have the ability to configure
> PQ_RECV_BUFFER_SIZE based on the user's needs?

Well, currently this is just the array size as a preprocessor macro,
so the executable's .bss gets sized differently.

For a GUC, quite a bit more code, including dynamic allocation
_and reallocation_, would be needed -- and that complexity I wanted
to avoid for the first discussion.

If a GUC fits the usage patterns better, I'm okay with it!
Hi Andres,

> Have you tried to verify that this doesn't cause performance
> regressions in
> other workloads? pq_recvbuf() has this code:
> ...
>
> I do seem to recall that just increasing the buffer size substantially
> lead to
> more time being spent inside that memmove() (likely due to exceeding
> L1/L2).

Do you have any pointers to discussions or other data about that?


My (quick) analysis was that clients that send one request,
wait for an answer, then send the next request wouldn't run that code,
as there's nothing behind the individual requests that could be moved.


But yes, Pipeline Mode[1] might/would be affected.

The interesting question is how much data userspace can copy before
that means more load than doing a userspace-kernel-userspace round trip.
(I guess that moving 64kB or 128kB should be quicker, especially since
the various CPU mitigations.)

As long as there are complete requests in the buffer the memmove()
could be avoided; only the initial part of the first incomplete request
might need moving to the beginning.
That patch would be more than +- 1 line again ;)


The documentation says

> Pipelining is less useful, and more complex,
> when a single pipeline contains multiple transactions
> (see Section 32.5.1.3).

are there any benchmarks/usage statistics for pipeline mode?


Regards,

Phil


Ad 1: https://www.postgresql.org/docs/current/libpq-pipeline-mode.html
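A rough sketch of the "only move the incomplete request" idea above (not a patch; the helper names are made up, and it assumes the usual frontend message framing of one type byte followed by a 4-byte network-order length that counts itself):

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Does the unread tail buf[ptr..len) start with a complete message
 * (1 type byte plus a 4-byte length that includes itself)?
 */
static bool
tail_has_complete_message(const char *buf, int ptr, int len)
{
    uint32_t    msglen;

    if (len - ptr < 5)
        return false;
    memcpy(&msglen, buf + ptr + 1, 4);
    msglen = ntohl(msglen);
    return (int64_t) ptr + 1 + msglen <= len;
}

/*
 * Called just before refilling the buffer: if a complete request is still
 * unread there is nothing to do yet; otherwise move only the incomplete
 * prefix to the front, instead of unconditionally left-justifying.
 */
static void
compact_before_refill(char *buf, int *ptr, int *len)
{
    if (tail_has_complete_message(buf, *ptr, *len))
        return;                 /* keep parsing in place, no copy needed */

    if (*len > *ptr)
        memmove(buf, buf + *ptr, *len - *ptr);
    *len -= *ptr;
    *ptr = 0;
}
```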
Hi,

On 2025-10-07 15:03:29 +0200, Philipp Marek wrote:
> > Have you tried to verify that this doesn't cause performance regressions
> > in
> > other workloads? pq_recvbuf() has this code:
> >
> ...
> >
> > I do seem to recall that just increasing the buffer size substantially
> > lead to
> > more time being spent inside that memmove() (likely due to exceeding
> > L1/L2).
>
>
> Do you have any pointers to discussions or other data about that?
>
>
> My (quick) analysis was that clients that send one request,
> wait for an answer, then send the next request wouldn't run that code
> as there's nothing behind the individual requests that could be moved.
>
>
> But yes, Pipeline Mode[1] might/would be affected.
>
> The interesting question is how much data can userspace copy before
> that means more load than doing a userspace-kernel-userspace round trip.
> (I guess that moving 64kB or 128kB should be quicker, especially since
> the various CPU mitigations.)

I unfortunately don't remember the details of where I saw it
happening. Unfortunately I suspect it'll depend a lot on hardware and
operating system details (like the security mitigations you mention) when it
matters too.


> As long as there are complete requests in the buffer the memmove()
> could be avoided; only the initial part of the first incomplete request
> might need moving to the beginning.

Right. I'd be inclined to say that ought to be addressed as part of this patch;
that way we can be pretty sure it's not going to cause regressions.


> The documentation says
>
> > Pipelining is less useful, and more complex,
> > when a single pipeline contains multiple transactions
> > (see Section 32.5.1.3).
>
> are there any benchmarks/usage statistics for pipeline mode?

You can write benchmarks for it using pgbench's pipeline support, with a
custom script.

Greetings,

Andres Freund
-Filip-
On Tue, 7 Oct 2025 at 16:54, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-10-07 15:03:29 +0200, Philipp Marek wrote:
> > Have you tried to verify that this doesn't cause performance regressions
> > in
> > other workloads? pq_recvbuf() has this code:
> >
> ...
> >
> > I do seem to recall that just increasing the buffer size substantially
> > lead to
> > more time being spent inside that memmove() (likely due to exceeding
> > L1/L2).
>
>
> Do you have any pointers to discussions or other data about that?
>
>
> My (quick) analysis was that clients that send one request,
> wait for an answer, then send the next request wouldn't run that code
> as there's nothing behind the individual requests that could be moved.
>
>
> But yes, Pipeline Mode[1] might/would be affected.
>
> The interesting question is how much data can userspace copy before
> that means more load than doing a userspace-kernel-userspace round trip.
> (I guess that moving 64kB or 128kB should be quicker, especially since
> the various CPU mitigations.)
I unfortunately don't remember the details of where I saw it
happening. Unfortunately I suspect it'll depend a lot on hardware and
operating system details (like the security mitigations you mention) when it
matters too.
> As long as there are complete requests in the buffer the memmove()
> could be avoided; only the initial part of the first incomplete request
> might need moving to the beginning.
Right. I'd be inclined to say that ought to be addressed as part of this patch;
that way we can be pretty sure it's not going to cause
regressions.
I tried to benchmark the usage of memmove(), but I wasn't able to hit the
memmove() part of the code. This led me to a deeper investigation, and I
realized that the memmove() call is probably in a dead part of the code:
pq_recvbuf() is called when PqRecvPointer >= PqRecvLength, while memmove()
is called later only if PqRecvLength > PqRecvPointer.
This results in a contradiction.
> The documentation says
>
> > Pipelining is less useful, and more complex,
> > when a single pipeline contains multiple transactions
> > (see Section 32.5.1.3).
>
> are there any benchmarks/usage statistics for pipeline mode?
You can write benchmarks for it using pgbench's pipeline support, with a
custom script.
Greetings,
Andres Freund
I am also proposing the introduction of a new GUC variable for setting PQ_RECV_BUFFER_SIZE in the first patch. And the second patch removes the dead code.
Filip
Attachments
Hi Filip,

> I am also proposing the introduction of a new GUC
> variable for setting PQ_RECV_BUFFER_SIZE
> in the first patch.

Thanks a lot; this allows configuration for all connections to a database,
and so should fit our needs while not changing any behaviour in the
default case.

> And the second patch removes the dead code.

Also always a good idea ;)


Thank you!


Regards,

Phil
On Wed, Nov 26, 2025 at 3:03 PM Filip Janus <fjanus@redhat.com> wrote:
>
>
>
> -Filip-
>
>
> On Tue, 7 Oct 2025 at 16:54, Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2025-10-07 15:03:29 +0200, Philipp Marek wrote:
>> > > Have you tried to verify that this doesn't cause performance regressions
>> > > in
>> > > other workloads? pq_recvbuf() has this code:
>> > >
>> > ...
>> > >
>> > > I do seem to recall that just increasing the buffer size substantially
>> > > lead to
>> > > more time being spent inside that memmove() (likely due to exceeding
>> > > L1/L2).
>> >
>> >
>> > Do you have any pointers to discussions or other data about that?
>> >
>> >
>> > My (quick) analysis was that clients that send one request,
>> > wait for an answer, then send the next request wouldn't run that code
>> > as there's nothing behind the individual requests that could be moved.
>> >
>> >
>> > But yes, Pipeline Mode[1] might/would be affected.
>> >
>> > The interesting question is how much data can userspace copy before
>> > that means more load than doing a userspace-kernel-userspace round trip.
>> > (I guess that moving 64kB or 128kB should be quicker, especially since
>> > the various CPU mitigations.)
>>
>> I unfortunately don't remember the details of where I saw it
>> happening. Unfortunately I suspect it'll depend a lot on hardware and
>> operating system details (like the security mitigations you mention) when it
>> matters too.
>>
>>
>> > As long as there are complete requests in the buffer the memmove()
>> > could be avoided; only the initial part of the first incomplete request
>> > might need moving to the beginning.
>>
>> Right. I'd be inclined to say that ought to be addressed as part of this patch;
>> that way we can be pretty sure it's not going to cause
>> regressions.
>
>
> I tried to benchmark the usage of memmove(), but I wasn't able to hit the
> memmove() part of the code. This led me to a deeper investigation, and I
> realized that the memmove() call is probably in a dead part of the code.
> pq_recvbuf() is called when PqRecvPointer >= PqRecvLength, while memmove()
> is called later only if PqRecvLength > PqRecvPointer.
> This results in a contradiction.
>
>>
>> > The documentation says
>> >
>> > > Pipelining is less useful, and more complex,
>> > > when a single pipeline contains multiple transactions
>> > > (see Section 32.5.1.3).
>> >
>> > are there any benchmarks/usage statistics for pipeline mode?
>>
>> You can write benchmarks for it using pgbench's pipeline support, with a
>> custom script.
>>
>> Greetings,
>>
>> Andres Freund
>>
> I am also proposing the introduction of a new GUC variable for setting
> PQ_RECV_BUFFER_SIZE in the first patch. And the second patch removes the
> dead code.
>
Hi Filip,
Can you please explain how you have verified that it is giving you that perf. increase?
3 tries each, best:
@ pq_recv_buffers = 2MB best of 3:
latency average = 2.594 ms
latency stddev = 0.352 ms
initial connection time = 9.419 ms
tps = 385.431723 (without initial connection time)
@ pq_recv_buffers = default (8kB) best of 3:
latency average = 2.629 ms
latency stddev = 0.929 ms
initial connection time = 9.937 ms
tps = 380.336257 (without initial connection time)
/usr/pgsql19/bin/pgbench -h xxx -U app -f insert.sql -c 1 -P 1 -n -T 5 -M prepared postgres
where insert.sql was:
echo "CREATE TEMPORARY TABLE IF NOT EXISTS file_storage (data
BYTEA STORAGE EXTERNAL) ON COMMIT PRESERVE ROWS;" > insert.sql
echo "INSERT INTO file_storage(data) VALUES ('" >> insert.sql
perl -e 'print "A"x(1024*1024);' >> insert.sql # 1MB
echo "');" >> insert.sql
Some description of the env I had:
- tuned TCP rmem/wmem and set congestion to BBR (to eliminate TCP as
being a bottleneck)
- low RTT (same AWS zone), max 25Gbps total, max 9.6 Gbps single stream TCP
- as TOAST pglz compression was taking most of the CPU in my case, I
changed it to lz4, which also didn't help a lot, so I changed it to avoid
*any* compression
- switched to temporary table to avoid I/O as much as possible,
wal_level=minimal too
- had to use prepared statements as otherwise I was hitting way too
much CPU in parser (yylex routines)
So I'm looking for a way to demonstrate the effect. I've also written
a simple psycopg2-based LO upload benchmark, as pgbench apparently
cannot benchmark this. Sadly, of course, you then cannot (I think)
disable compression and/or load into a TEMPORARY table, so it's far worse
and hits I/O heavily (as it hits pg_largeobject*).
-J.
Hi Jakub,
> Can you please how have you verified it is giving you that some perf.
> increase?
...
> where insert.sql was:
> echo "CREATE TEMPORARY TABLE IF NOT EXISTS file_storage (data
> BYTEA STORAGE EXTERNAL) ON COMMIT PRESERVE ROWS;" > insert.sql
> echo "INSERT INTO file_storage(data) VALUES ('" >> insert.sql
> perl -e 'print "A"x(1024*1024);' >> insert.sql # 1MB
> echo "');" >> insert.sql
This data might be too easily compressed.
In production we upload files (PDFs), for benchmarking
we generated random data.
This might make a big difference on the network side
if you're using TLS with compression, for example.
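One way to generate such a random payload for an insert.sql like the one above (just a sketch, not what we actually used; the file name, the 1MB size and the fixed seed are arbitrary choices). The bytes are written in bytea hex format, which assumes standard_conforming_strings = on (the default) and makes the statement text roughly 2MB of hex for 1MB of data:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Writes insert.sql with one INSERT of 1 MB of pseudo-random bytes in
 * bytea hex format ('\x....'), so neither TOAST nor the wire sees
 * easily compressible data.
 */
int
main(void)
{
    const size_t nbytes = 1024 * 1024;
    FILE        *f = fopen("insert.sql", "w");

    if (!f)
        return 1;

    fputs("CREATE TEMPORARY TABLE IF NOT EXISTS file_storage"
          " (data BYTEA STORAGE EXTERNAL) ON COMMIT PRESERVE ROWS;\n", f);
    fputs("INSERT INTO file_storage(data) VALUES ('\\x", f);

    srand(42);                  /* any fixed seed, for repeatable runs */
    for (size_t i = 0; i < nbytes; i++)
        fprintf(f, "%02x", rand() & 0xff);

    fputs("');\n", f);
    fclose(f);
    return 0;
}
```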
> Some description of the env I had:
> - tuned TCP rmem/wmem and set congestion to BBR (to eliminate TCP as
> being a bottlneck)
Well, if your tcp_rmem buffers are bigger than the data you upload,
that might also soften the impact of the patch.
> - low RTT (same AWS zone), max 25Gbps total, max 9.6 Gbps single stream
> TCP
That might also be a big difference,
we had a few msec inbetween client and server.
> - as TOAST pglz compression way taking most of CPU in my case , I've
> changed it to lz4 also didn't help a lot, so I've changed it to avoid
> *any* compression
We had no compression set on the table.
> - switched to temporary table to avoid I/O as much as possible,
> wal_level=minimal too
We didn't use a temporary table.
If you think that's helpful,
I can try to dig out what we used to benchmark
the receive buffer size impact.
Regards,
Phil