Discussion: Re: libpq: Process buffered SSL read bytes to support records >8kB on async API


Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> I'm the maintainer of ruby-pg the ruby interface to the PostgreSQL
> database. This binding uses the asynchronous API of libpq by default to
> facilitate the ruby IO wait and scheduling mechanisms.
>
> This works well with the vanilla postgresql server, but it leads to
> starvation with other types of servers using the postgresql wire
> protocol 3. This is because the current functioning of the libpq async
> interface depends on a maximum size of SSL records of 8kB.

Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.

I was able to reproduce a hang when all of the following were true:
- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same
record as the split message's final fragment

Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
       + CommandComplete + ReadyForQuery
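As a sanity check, the sizes above are self-consistent: two full 12k records plus a 7430-byte remainder account for the 32006-byte DataRow. A quick sketch (constants taken from the description above):

```python
# Record math from the reproduction above: a 32006-byte DataRow split
# across TLS records capped at 12k of plaintext each.
RECORD_MAX = 12 * 1024   # 12k max TLS record payload
DATAROW_1 = 32006        # size of the first DataRow message

def split_record_sizes(message_size, record_max=RECORD_MAX):
    """Return the plaintext sizes of the records carrying the message."""
    full, rest = divmod(message_size, record_max)
    return [record_max] * full + ([rest] if rest else [])
```

split_record_sizes(DATAROW_1) comes out to [12288, 12288, 7430], with the final 7430-byte fragment sharing its record with DataRow 2, CommandComplete, and ReadyForQuery.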

> To fix this issue the attached patch calls pqReadData() repeatedly in
> PQconsumeInput() until there is no buffered SSL data left to be read.
> Another solution could be to process buffered SSL read bytes in
> PQisBusy() instead of PQconsumeInput() .

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

    https://commitfest.postgresql.org/50/

Thanks,
--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Lars Kanis
Date:
Thank you Jacob for verifying this issue!

> Gory details of the packet sizes, if it's helpful:
> - max TLS record size is 12k, because it made the math easier
> - server sends DataRow of 32006 bytes, followed by DataRow of 806
> bytes, followed by CommandComplete/ReadyForQuery
> - so there are three TLS records on the wire containing
>    1) DataRow 1 fragment 1 (12k bytes)
>    2) DataRow 1 fragment 2 (12k bytes)
>    3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
>         + CommandComplete + ReadyForQuery

How did you verify the issue on the server side - with YugabyteDB or 
with a modified Postgres server? I'd like to verify the GSSAPI part and 
I'm familiar with the Postgres server only.


> I agree that PQconsumeInput() needs to ensure that the transport
> buffers are all drained. But I'm not sure this is a complete solution;
> doesn't GSS have the same problem? And are there any other sites that
> need to make the same guarantee before returning?

Which other sites do you mean? The synchronous transfer already works, 
since the select() is short-circuited when there are pending bytes: [1]


> I need to switch away from this for a bit. Would you mind adding this
> to the next Commitfest as a placeholder?

No problem; registered: https://commitfest.postgresql.org/50/5251/

--

Regards, Lars

[1] https://github.com/postgres/postgres/blob/77761ee5dddc0518235a51c533893e81e5f375b9/src/interfaces/libpq/fe-misc.c#L1070




Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> How did you verify the issue on the server side - with YugabyteDB or
> with a modified Postgres server? I'd like to verify the GSSAPI part and
> I'm familiar with the Postgres server only.

Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.

> > And are there any other sites that
> > need to make the same guarantee before returning?
>
> Which other sites do you mean?

I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.

> > I need to switch away from this for a bit. Would you mind adding this
> > to the next Commitfest as a placeholder?
>
> No problem; registered: https://commitfest.postgresql.org/50/5251/

Thank you!

--Jacob



On Thu, 12 Sept 2024 at 04:30, Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
>
> On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> > How did you verify the issue on the server side - with YugabyteDB or
> > with a modified Postgres server? I'd like to verify the GSSAPI part and
> > I'm familiar with the Postgres server only.
>
> Neither, unfortunately -- I have a protocol testbed that I use for
> this kind of stuff. I'm happy to share once I get it cleaned up, but
> it's unlikely to help you in this case since I haven't implemented
> gssenc support. Patching the Postgres server itself seems like a good
> way to go.
>
> > > And are there any other sites that
> > > need to make the same guarantee before returning?
> >
> > Which other sites do you mean?
>
> I'm mostly worried that other parts of libpq might assume that a
> single call to pqReadData will drain the buffers. If not, great! --
> but I haven't had time to check all the call sites.

@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.

Regards,
Vignesh



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Sun, Mar 16, 2025 at 6:14 AM vignesh C <vignesh21@gmail.com> wrote:
> @Jacob, could you find some time to wrap this up? This will help us
> assess whether it can be refined into a committable state soon.

This is a bug report that likely requires backpatching; feature freeze
should not be an issue here.

Thanks,
--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Tue, Sep 10, 2024 at 11:49 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> I need to switch away from this for a bit.

"a bit"

In an effort to get this unstuck (and ideally solved during this
commitfest) here are my most recent thoughts:

> I agree that PQconsumeInput() needs to ensure that the transport
> buffers are all drained. But I'm not sure this is a complete solution;
> doesn't GSS have the same problem? And are there any other sites that
> need to make the same guarantee before returning?

So, Postgres internals cheat: calls to
pqWait/pqWaitTimed/pqSocketCheck() all reach down into the SSL buffer
to see if there are any pending bytes before polling the socket. I do
not yet understand why this protection is not extended to
GSS-encrypted connections.

In any case, our clients can't make use of that protection either, so
it's unfortunate that we're relying on it ourselves, because it makes
reasoning about a fix more difficult. I'm not about to propose
removing that case in a backport, but as long as it's there covering
up the problem, it's hard to say whether we've truly fixed this bug.

I think the hang could happen in any situation where a third-party
client calls into a function which calls pqReadData(), and then
immediately polls the socket for a read-ready condition. So: what are
the APIs that could be called that way?

Clearly PQconsumeInput() is one. PQconnectPoll() seems like it should
be exposed as well. postgres_fdw's use of PQgetResult() might qualify
(or if it doesn't, I don't understand why not -- there is at least one
path in PQgetResult() where the buffer is not fully consumed). And
nonblocking mode seems like it might sprinkle this hazard into more
places, as pqSendSome() will then start reading data and returning
before a pqWait().

So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?
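A toy model (all names hypothetical, not libpq's actual API) of the failure mode: after a single read pass, plaintext can remain inside the TLS library while the raw socket is empty, so a subsequent socket poll blocks forever. Draining until nothing is pending removes that state:

```python
class FakeTLS:
    """Hypothetical toy transport: a kernel socket delivering whole TLS
    records, plus the decrypted-but-undelivered bytes held inside the
    TLS library. One read() mimics a single pqReadData() pass that can
    return at most `bufsize` plaintext bytes."""
    def __init__(self, records, bufsize=8 * 1024):
        self.records = list(records)  # plaintext payloads still "on the socket"
        self.pending = b""            # plaintext buffered inside the TLS library
        self.bufsize = bufsize

    def socket_readable(self):
        # What poll()/select() on the raw fd would report.
        return bool(self.records)

    def read(self):
        if not self.pending and self.records:
            self.pending = self.records.pop(0)  # decrypt one record
        out = self.pending[:self.bufsize]
        self.pending = self.pending[self.bufsize:]
        return out

def read_once(tls):
    """Eager return: one pass, possibly leaving bytes inside the library."""
    return tls.read()

def read_drain(tls):
    """Drain fully: keep reading until the transport holds nothing, so a
    later poll on the raw socket cannot strand already-received data."""
    out = tls.read()
    while tls.pending:
        out += tls.read()
    return out
```

With a single 12k record and an 8k read buffer, read_once() leaves 4k pending while the socket reports nothing readable (the hang); read_drain() leaves nothing behind.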

--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Tue, Jul 1, 2025 at 1:42 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> I do
> not yet understand why this protection is not extended to
> GSS-encrypted connections.

After repurposing some of my test code for d98cefe11, I'm able to
reproduce the hang with gssencmode when the server uses a
smaller-than-standard (12k) send buffer. Same reproduction case as the
original (32006-byte DataRow, 806-byte DataRow, CommandComplete,
ReadyForQuery).

So a complete patch for this has to consider GSS as well.

> So, to fix this once and for all, do we have to make sure that
> pqReadData() always fully drains the SSL/GSS transport buffer instead
> of eagerly returning?

I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().

--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 2, 2025 at 4:12 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> I'll work on proving that code paths other than PQconsumeInput() are
> affected. If they are, I'll start a patch for pqReadData().

Here's one way for a server implementation to hang the client during
PQconnectPoll(): accept the SSLRequest and immediately send a protocol
2.0 error message of (16k + 1) bytes, but split the message across two
TLS records of 8k and (8k + 1) bytes each, and wait for the client to
close the connection.

This case works fine with synchronous connect, but it fails with
async. libpq will fill the first 8k of its 16k input buffer with the
first record. It fills the second half with all but one byte of the
second record, and since it doesn't find a null terminator anywhere in
the buffer, it tells the client to poll on the socket again. Since the
server in this case is waiting for the client to close first, no
additional socket activity is coming, and the connection deadlocks
with one byte pending in the SSL buffer.
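The byte accounting in that scenario works out to exactly one stranded byte (all constants from the text above):

```python
# Byte accounting for the PQconnectPoll() deadlock described above.
INPUT_BUF = 16 * 1024                 # libpq's 16k input buffer
records = [8 * 1024, 8 * 1024 + 1]    # TLS records carrying the (16k + 1)-byte error

total = sum(records)                  # 16385 bytes of plaintext sent
buffered = min(INPUT_BUF, total)      # what fits in the input buffer
stranded = total - buffered           # left undelivered inside the SSL library
```

buffered fills the entire 16k input buffer without a null terminator in sight, and stranded is the single byte that deadlocks the connection.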

I use 2.0 as an example; libpq avoids getting stuck in 3.0 by
enlarging the input buffer to the size of the incoming message. But
that's supposed to be a performance optimization, not anti-deadlock
protection. Even if we were to enshrine the buffer sizes we use in the
protocol specification itself [1], someone's still likely to trip over
this hazard if they start adjusting those performance optimizations.
(Even our use of a 16k initial buffer is itself just an optimization.)

So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.

What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:

1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes

Thanks,
--Jacob

[1] We should not do that.



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Merlin Moncure
Date:
On Tue, Jul 15, 2025 at 4:31 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> Otherwise I think we'll continue to chase weird corner cases.

Agreed. Here's a little more detail on the case I noticed:

* postgres backend thread managing several libpq connections, with a polling is_busy loop
* when the client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
* checking the backend server log, which relogs the client-raised logs via dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.

merlin

Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Andres Freund
Date:
Hi,

On 2025-07-15 15:31:25 -0700, Jacob Champion wrote:
> So I think pqReadData() needs to make the same guarantees for SSL/GSS
> that it does for plain TCP: when it returns, either 1) there are no
> bytes left pending or 2) the socket itself is marked readable.
> Otherwise I think we'll continue to chase weird corner cases.

It's not really the same issue, as at least my reproducer requires modifying
libpq (and thus clearly isn't a bug), but I thought it might still be
interesting. For one, I'm not sure this can only be triggered with the
modifications.


If one modifies libpq to use openssl readahead (which does result in speedups,
because otherwise openssl thinks it's useful to do lots of 5 byte reads from
the socket), I see occasional hangs in libpq.

diff --git i/src/interfaces/libpq/fe-secure-openssl.c w/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..a8deb28594c 100644
--- i/src/interfaces/libpq/fe-secure-openssl.c
+++ w/src/interfaces/libpq/fe-secure-openssl.c
@@ -864,6 +864,9 @@ initialize_SSL(PGconn *conn)
      */
     SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);

+    SSL_CTX_set_read_ahead(SSL_context, 1);
+    SSL_CTX_set_default_read_buffer_len(SSL_context, 64*1024);
+
     /*
      * If the root cert file exists, load it so we can perform certificate
      * verification. If sslmode is "verify-full" we will also do further


ninja src/bin/psql/psql && src/bin/psql/psql -Xq -h localhost -c '\copy pg_class to stdout'

hangs reliably for me (if it's going to a file it's less reliable, so there's
a timing aspect here).


At that point psql is not interruptible, which isn't great either...

Backtrace:

#0  __internal_syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=-1, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7) at ./nptl/cancellation.c:44
#1  0x00007f1e25e3f6ad in __syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=<optimized out>, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7) at ./nptl/cancellation.c:75
#2  0x00007f1e25eb39c6 in __GI___poll (fds=fds@entry=0x7ffd7ed2edb8, nfds=nfds@entry=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#3  0x00007f1e26248dea in poll (__fds=0x7ffd7ed2edb8, __nfds=1, __timeout=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:44
#4  PQsocketPoll (sock=<optimized out>, forRead=<optimized out>, forWrite=<optimized out>, end_time=<optimized out>) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1175
#5  0x00007f1e262491a8 in pqSocketCheck (conn=conn@entry=0x55f69b7e68f0, forRead=1, forWrite=0, end_time=-1) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1114
#6  0x00007f1e262492b3 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x55f69b7e68f0, end_time=<optimized out>) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1039
#7  0x00007f1e262492f1 in pqWait (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x55f69b7e68f0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1021
#8  0x00007f1e26234c46 in pqGetCopyData3 (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-protocol3.c:1846
#9  0x00007f1e26243d98 in PQgetCopyData (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:2840
#10 0x000055f65d28f9c6 in handleCopyOut (conn=0x55f69b7e68f0, copystream=copystream@entry=0x7f1e25f985c0 <_IO_2_1_stdout_>, res=res@entry=0x7ffd7ed2f008) at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:442
#11 0x000055f65d27b07d in HandleCopyResult (resultp=resultp@entry=0x7ffd7ed2f158, copystream=0x7f1e25f985c0 <_IO_2_1_stdout_>) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:950
#12 0x000055f65d27dab8 in ExecQueryAndProcessResults (query=query@entry=0x55f69b8b3470 "COPY  pg_attribute TO STDOUT ", elapsed_msec=elapsed_msec@entry=0x7ffd7ed2f1b8, svpt_gone_p=svpt_gone_p@entry=0x7ffd7ed2f1b7, is_watch=is_watch@entry=false, min_rows=min_rows@entry=0, opt=opt@entry=0x0, printQueryFout=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1934
#13 0x000055f65d27cdc9 in SendQuery (query=0x55f69b8b3470 "COPY  pg_attribute TO STDOUT ") at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1212
#14 0x000055f65d28f6dd in do_copy (args=args@entry=0x55f69b8b1f10 "pg_attribute to stdout") at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:370
#15 0x000055f65d299f24 in exec_command_copy (scan_state=scan_state@entry=0x55f69b8a71f0, active_branch=active_branch@entry=true) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:951
#16 0x000055f65d2a2065 in exec_command (cmd=cmd@entry=0x55f69b8b3310 "copy", scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:338
#17 0x000055f65d2a2b59 in HandleSlashCmds (scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:241
#18 0x000055f65d29305f in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../home/andres/src/postgresql/src/bin/psql/startup.c:409


It does seem ... weird ... that pqGetCopyData3() just processes whatever is
already in the libpq buffer and then waits for the socket to become readable,
before ever calling pqReadData(). What guarantees, even without the change
above, that a) there's no pending data inside openssl b) that we're not
waiting for a write?

If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
if it returns 1, the hang vanishes.


The same
            if (pqWait(true, false, conn) ||
                pqReadData(conn) < 0)
                return -2;

pattern (without a preceding pqReadData()) exists in several other places :(
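The ordering difference can be sketched abstractly (hypothetical helper names, not libpq's API): poll-first hangs whenever the only remaining bytes already sit inside the TLS library, while read-first never does.

```python
def wait_then_read(socket_bytes, tls_pending):
    """Poll-first, like the pattern quoted above: if every remaining byte
    is already decrypted inside the TLS library, the raw socket never
    becomes readable and the wait blocks forever."""
    if socket_bytes == 0 and tls_pending > 0:
        return "hang"
    return "ok"

def read_then_wait(socket_bytes, tls_pending):
    """Drain-first: a pqReadData()-style call before the wait picks up
    any pending TLS bytes, so we only wait when data is truly absent."""
    if tls_pending > 0:
        return "ok"          # the read returns the buffered data
    return wait_then_read(socket_bytes, 0)
```

wait_then_read(0, 1) models the observed hang; read_then_wait(0, 1) does not block, matching the experiment above of calling pqReadData() before the pqWait().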



> What I'm not sure about is unforeseen consequences -- could fixing
> this expose other latent bugs? If we're concerned about that, we might
> have to tier the backports to make sure everything remains stable.
> Here are the three bugs discussed so far, with the fix for 3 possibly
> subsuming 1 and 2 but also affecting more code:
>
> 1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
> 2) PQconsumeInput() does not drain all pending bytes
> 3) pqReadData() does not drain all pending bytes

What are the limits for the maximum amount of data this could make us buffer
in addition to what we are buffering right now?  It's not entirely obvious to
me that a loop around pqReadData() as long as there is pending data couldn't
make us buffer a lot of data.

Do you have a WIP patch?

Greetings,

Andres Freund



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 7:36 AM Merlin Moncure <mmoncure@gmail.com> wrote:
> Agreed. Here's a little more detail on the case I noticed:
>
> * postgres backend thread managing several libpq connections, with a polling is_busy loop
> * when the client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
> * checking the backend server log, which relogs the client-raised logs via dblink, each pause happened after roughly 16k bytes
> * reducing the log verbosity made the problem go away.

Thanks for this! I'm not sure I'll be able to prove that this is
related, as opposed to some other potential problem, but it's good to
have the information.

--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> If one modifies libpq to use openssl readahead (which does result in speedups,
> because otherwise openssl thinks it's useful to do lots of 5 byte reads from
> the socket), I see occasional hangs in libpq.

Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them. It would even hide those bytes from the pending
count:

> Therefore, it is possible for no more bytes to be readable from the
> underlying BIO (because OpenSSL has already read them) and for
> SSL_pending() to return 0, even though readable application data bytes
> are available (because the data is in unprocessed buffered records).

That seems like the polar opposite of what we want. If clients are
polling on the raw socket, those pending bytes must be accounted for.

> It does seem ... weird ... that pqGetCopyData3() just processes whatever is
> already in the libpq buffer and then waits for the socket to become readable,
> before ever calling pqReadData(). What guarantees, even without the change
> above, that a) there's no pending data inside openssl b) that we're not
> waiting for a write?

(a) is the bug discussed here, AFAICT? I have not considered (b) yet,
but that seems like a separate problem.

> If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
> if it returns 1, the hang vanishes.

Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.

> What are the limits for the maximum amount of data this could make us buffer
> in addition to what we are buffering right now?

I still need to make sure, but I believe it should be an additional
16k due to the current limits on GSS token sizes and TLS record sizes.
In normal operation, I think we basically immediately double the input
buffer sizes (which start at 16k) as soon as a long message comes in.
So that order of magnitude didn't immediately seem worrisome to me.

> It's not entirely obvious to
> me that a loop around pqReadData() as long as there is pending data couldn't
> make us buffer a lot of data.

FWIW, I'm not planning to loop around it; I'm planning to add a
pgsecure_* primitive that drains whatever the transport is holding
onto.

> Do you have a WIP patch?

I'm working on one now.

--Jacob

[1] https://docs.openssl.org/3.5/man3/SSL_pending/#synopsis



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Andres Freund
Date:
Hi,

On 2025-07-16 11:50:46 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> > If one modifies libpq to use openssl readahead (which does result in speedups,
> > because otherwise openssl thinks it's useful to do lots of 5 byte reads from
> > the socket), I see occasional hangs in libpq.
>
> Now that is a very interesting coincidence, because the patch I'm
> working on is about to add an assertion that readahead is turned
> *off*. :D

Heh.


> Based on my understanding of [1], readahead makes this overall problem
> much worse by opportunistically slurping bytes off the wire and doing
> absolutely nothing with them until you call SSL_read() enough times to
> finally get to them.

Right - but it also substantially reduces the number of syscalls :(. I'm not
sure that it's wise to close that door permanently. Clearly it's not something
we need to care about in the back branches.

Without it openssl will do one read for the record length, and another read
for the record contents. I.e. it'll often double the number of syscalls.


Without readahead:

perf stat -e raw_syscalls:sys_enter,cycles,context-switches src/bin/pgbench/pgbench -c $c -j$j -M prepared -n -P 1 -f <(echo 'select 1') -h localhost -t1000

             5,561      raw_syscalls:sys_enter
        89,125,976      cycles
             1,009      context-switches


With readahead:

             4,556      raw_syscalls:sys_enter
        83,582,571      cycles
             1,007      context-switches
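The delta in those runs is roughly one syscall per transaction, which is consistent with the record-length read being elided:

```python
# Syscall counts from the perf runs above (1000 pgbench transactions each).
without_readahead = 5561
with_readahead = 4556
saved = without_readahead - with_readahead  # ~one syscall per transaction
```

saved comes out to 1005; with one response record per transaction, that matches openssl otherwise issuing a separate small read for each record header.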



> It would even hide those bytes from the pending
> count:
>
> > Therefore, it is possible for no more bytes to be readable from the
> > underlying BIO (because OpenSSL has already read them) and for
> > SSL_pending() to return 0, even though readable application data bytes
> > are available (because the data is in unprocessed buffered records).

It's rather depressing; like, how did this API behaviour come to be?


Luckily there seems to be SSL_has_pending():

> SSL_has_pending() returns 1 if s has buffered data (whether processed or
> unprocessed) and 0 otherwise

And it seems old enough:

> The SSL_has_pending() function was added in OpenSSL 1.1.0.


> > It does seem ... weird ... that pqGetCopyData3() just processes whatever is
> > already in the libpq buffer and then waits for the socket to become readable,
> > before ever calling pqReadData(). What guarantees, even without the change
> > above, that a) there's no pending data inside openssl b) that we're not
> > waiting for a write?
>
> (a) is the bug discussed here, AFAICT?

I'm not entirely sure, tbh.


> > If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
> > if it returns 1, the hang vanishes.
>
> Sure, but that's cheating. I think I'd expect that adding extra
> "unnecessary" reads would help cover up the problem that readahead has
> caused.

To me the pattern in our code seems bogus, even without readahead, no?

getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required data may
be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
forever.  That's a problem unrelated to readahead, no?

I.e. the extra read actually ensures that we're not doing a pqWait() without
knowing whether we need one.

Greetings,

Andres Freund



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > Based on my understanding of [1], readahead makes this overall problem
> > much worse by opportunistically slurping bytes off the wire and doing
> > absolutely nothing with them until you call SSL_read() enough times to
> > finally get to them.
>
> Right - but it also substantially reduces the number of syscalls :(. I'm not
> sure that it's wise to close that door permanently.

I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?

> Without it openssl will do one read for the record length, and another read
> for the record contents. I.e. it'll often double the number of syscalls.

That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.

> Luckily there seems to be SSL_has_pending():
>
> > SSL_has_pending() returns 1 if s has buffered data (whether processed or
> > unprocessed) and 0 otherwise

IIUC, we can't use SSL_has_pending() either. I need to know exactly how
many bytes to drain out of the userspace buffer without introducing more
bytes into it.

> To me the pattern in our code seems bogus, even without readahead, no?
>
> getCopyDataMessage() can return 0 because our internal buffer is empty,
> without ever reading from the socket or openssl. But if the SSL record
> contained two messages (or parts of two messages), all the required data may
> be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
> forever.

I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)

> I.e. the extra read actually ensures that we're not doing a pqWait() without
> knowing whether we need one.

I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data. And with my WIP patch, you don't need the extra call,
because the previous call to pqReadData() will have ensured that
there's no leftover SSL buffer to begin with. Am I missing something?

Thanks!
--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Andres Freund
Date:
Hi,

On 2025-07-16 15:25:14 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > Based on my understanding of [1], readahead makes this overall problem
> > > much worse by opportunistically slurping bytes off the wire and doing
> > > absolutely nothing with them until you call SSL_read() enough times to
> > > finally get to them.
> >
> > Right - but it also substantially reduces the number of syscalls :(. I'm not
> > sure that it's wise to close that door permanently.
>
> I think it's very likely that OpenSSL's implementation of readahead is
> fundamentally incompatible with our async implementation. For one, the
> documented way to call SSL_read() without accidentally hitting the
> socket is via SSL_pending(), which readahead is documented to break.

Why do we care about not hitting the socket? We always operate the socket in
non-blocking mode anyway?


FWIW, I have seen too many folks move away from encrypted connections for
backups due to openssl being the bottleneck to just accept that we'll never
go beyond the worst performance.  Any actually decently performing networking
will need to divorce when socket IO happens from when openssl sees the data
much more heavily. Just relying more and more on extremely tight coupling is a
dead end.


> For two, if you're worried about how much data we could potentially
> have to hold during a "drain all pending" operation, I think readahead
> changes the upper bound from "the size of one TLS record" to
> "SO_RCVBUF", doesn't it?

It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up to a
whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.


> > Without it openssl will do one read for the record length, and another read
> > for the record contents. I.e. it'll often double the number of syscalls.
>
> That is unfortunate... but if we're talking about a
> correctness/performance tradeoff then I'm somewhat less sympathetic to
> the performance side.

I'm obviously not advocating for incorrectness.


> > Luckily there seems to be SSL_has_pending():
> >
> > > SSL_has_pending() returns 1 if s has buffered data (whether processed or
> > > unprocessed) and 0 otherwise
>
IIUC, we can't use SSL_has_pending() either. I need to know exactly how
many bytes to drain out of the userspace buffer without
introducing more bytes into it.

Why?  The only case where we ought to care about whether pending data exists
inside openssl is if our internal buffer is either empty or doesn't contain
the entire message we're trying to consume. In either of those two cases we
can just consume some data from openssl, without caring how much it precisely
is.


> > To me the pattern in our code seems bogus, even without readahead, no?
> >
> > getCopyDataMessage() can return 0 because our internal buffer is empty,
> > without ever reading from the socket or openssl. But if the SSL record
> > contained two messages (or parts of two messages), all the required data may
> > be in openssl. In which case the pqWait() that pqGetCopyData3() does will wait
> > forever.
>
> I think the "cheating" in pqSocketCheck() that I mentioned upthread
> ensures that pqWait() will see the OpenSSL data and return
> immediately. (That's a major architectural smell IMO, but it is there.
> Just not for GSS, which might explain some hangs discussed on the list
> a while back.)

I guess so, but it's hard to know without a patch.


> > I.e. the extra read actually ensures that we're not doing a pqWait() without
> > knowing whether we need one.
>
> I don't think so, because (without my WIP patch) you still can't
> guarantee that the single call to pqReadData() got all of the
> readahead data.

It wouldn't need to get all the data in one pqReadData() in this case, as we'd
just loop around and do the whole thing again, until the entire message is
read in (regardless of whether there's remaining data in a partially consumed
record, or in the "unprocessed buffer").

Greetings,

Andres Freund



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we care about not hitting the socket? We always operate the socket in
> non-blocking mode anyway?

IIUC, that would change pqReadData() from a bounded read to an
unbounded one. Then we have to somehow protect against a server that
can send data faster than we can process it.

> FWIW, I have seen too many folks move away from encrypted connections for
> backups due to openssl being the bottlenecks to just accept that we'll never
> go beyond the worst performance.  Any actually decently performing networking
> will need to divorce when socket IO happens and when openssl sees the data
> much more heavily.

Can't we make our custom BIO do that?

> Just relying more and more on extremely tight coupling is a
> dead end.

Why? This abstraction _must_ leak. Because of PQsocket() at minimum,
but also there's a bunch of code complexity around the read/write
behavior and EOF differences for a TLS connection. When I look at the
code in pqReadData() I am very sad.

TLS does not behave like raw TCP. So if there's no way to get rid of
the coupling, IMO we might as well use it.

> > For two, if you're worried about how much data we could potentially
> > have to hold during a "drain all pending" operation, I think readahead
> > changes the upper bound from "the size of one TLS record" to
> > "SO_RCVBUF", doesn't it?
>
> It seems to be a configurable limit, with
> SSL_set_default_read_buffer_len(). But the default is to just read up to a
> whole frame's worth of data, instead of reading the length
> separately. I.e. an irrelevant amount of memory.

Hmm, okay.

> > IIUC, we can't use SSL_has_pending() either. I need to know exactly
> > how many bytes to drain out of the userspace buffer without
> > introducing more bytes into it.
>
> Why?  The only case where we ought to care about whether pending data exists
> inside openssl is if our internal buffer is either empty or doesn't contain
> the entire message we're trying to consume. In either of those two cases we
> can just consume some data from openssl, without caring how much it precisely
> is.

I can't tell if you're discussing the fix for this bug or a
hypothetical future architecture that makes readahead safe. I don't
think PQconnectPoll() behaves the way you're describing; there is no
retry-read concept.

The order of operations is:
- read some available data; don't care how much
- parse the message(s) if we have enough data
- otherwise tell the caller to poll for read

That's an architecture that's already very tightly coupled to the
behavior of un-userspace-buffered TCP/UNIX sockets. Which is why I'm
trying to have pqReadData() make compatible guarantees for SSL/GSS.

--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 11:50 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> > Do you have a WIP patch?
>
> I'm working on one now.

The attached still needs some documentation work, and closer
inspection of the new assertions and OpenSSL error handling, but I
think it's good enough to get my intent across.

0001 is a fix for pqWait+gssencmode, which is hopefully the easier of
the two to review/backport. The current behavior for GSS is, IMO, an
obvious oversight.

0002 adds drain semantics at the end of pqReadData(). If we were
writing this from scratch, it wouldn't look like this: there's no
point in having the transport layer split data off into an undersized
conn->inBuffer if we're just going to come right back and drain it all
anyway. But I'm hoping this is easier to reason about and verify,
compared to a bigger rewrite.

I've renamed the existing pgtls_read_pending() to
pgtls_read_is_pending(), to try to disambiguate it from the new
*_drain_pending() APIs, but I'm not particularly attached to my choice
of names.

Thanks,
--Jacob

Attachments

Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> The current behavior for GSS is, IMO, an
> obvious oversight.

(A better way to word this would have been "clearly an oversight"; I
didn't mean to imply that the correct behavior is obvious. The
opposite, in fact.)

--Jacob



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Andres Freund
Date:
Hi,


On 2025-07-17 09:48:29 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> > Why do we care about not hitting the socket? We always operate the socket in
> > non-blocking mode anyway?
>
> IIUC, that would change pqReadData() from a bounded read to an
> unbounded one. Then we have to somehow protect against a server that
> can send data faster than we can process it.

I don't see why it would have to. If a connection has sufficient data in its
buffer on the libpq level, we obviously don't need to read more.


> > FWIW, I have seen too many folks move away from encrypted connections for
> > backups due to openssl being the bottlenecks to just accept that we'll never
> > go beyond the worst performance.  Any actually decently performing networking
> > will need to divorce when socket IO happens and when openssl sees the data
> > much more heavily.
>
> Can't we make our custom BIO do that?

We can, but that actually just makes this issue *more* acute, not less. We
would obviously have to take the data buffered on the BIO level into account
before signalling to the libpq user that no data is ready.

FWIW, even with BIO level buffering, it turns out that openssl readahead is
*still* a decent performance boon; going into the BIO twice as many times
isn't free, even if the BIO does buffering.


> > Just relying more and more on extremely tight coupling is a dead end.
>
> Why? This abstraction _must_ leak.

I don't really buy this. We have buffering in libpq, even without TLS. Most of
the issues we need to solve already exist for non-TLS connections (except for
the annoying thing of introducing writes when doing reads and vice versa).

It's just that our implementation is bad.


> TLS does not behave like raw TCP. So if there's no way to get rid of
> the coupling, IMO we might as well use it.

I agree that they're not the same.  I don't see what that has to do with
introducing a hard requirement for no buffering within openssl or layers below
(a future BIO that does caching).


> > > IIUC, we can't use SSL_has_pending() either. I need to know how
> > > exactly many bytes to drain out of the userspace buffer without
> > > introducing more bytes into it.
> >
> > Why?  The only case where we ought to care about whether pending data exists
> > inside openssl is if our internal buffer is either empty or doesn't contain
> > the entire message we're trying to consume. In either of those two cases we
> > can just consume some data from openssl, without caring how much it precisely
> > is.
>
> I can't tell if you're discussing the fix for this bug or a hypothetical
> future architecture that makes readahead safe.

I don't know what your fix really looks like - afaict you haven't shared it.
So it's hard to comment on its specifics.

I am not saying that a to-be-backpatched fix needs to make openssl readahead
work, that'd be absurd. But I am concerned with more fundamentally entrenching
the idea that there is never any buffering below libpq's buffering; it'll
cause trouble down the line.



> I don't think PQconnectPoll() behaves the way you're describing; there is no
> retry-read concept.

FWIW, I don't care about what we do during connection establishment. Doing
lots of tiny reads or such will never matter for efficiency, compared to all
the other costs.

To me the relevant cases are use of libpq in established connections, either in a
blocking way, or in a non-blocking way.

For blocking use we have that gross SSL hack in place (i.e. the
pgtls_read_pending() check in pqSocketCheck()), but we obviously *don't* for
the non-blocking case - the problem of this thread.


Our APIs around the non-blocking use of libpq are pretty
bonkers:

- PQconsumeInput() doesn't tell you whether it actually consumed input

- We kind of document PQconsumeInput() to consume all the available input, but
  it does no such thing, it just fills the internal buffer. Which might be
  big, or it might not be, depending on what previously happened on the
  connection.

- PQisBusy() returns true even if there's unconsumed data in the socket

There's really no sane way to write an event loop with this that doesn't do
unnecessary poll()s, because PQconsumeInput() might have consumed data, but
PQisBusy() might *still* return true, because not enough input has been
consumed.

I suspect that the most minimal way to fix this, without breaking the
external API, would be for PQisBusy() to try to read more data. But that's
also pretty gross.

Greetings,

Andres Freund



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

I think we're talking past each other, so let me try to focus on just
a few items here. I'm happy to go back and respond point-by-point if
needed.

> I don't know your fix really looks like - afaict you haven't shared it. So
> it's hard to actually comment with specifics to it.

Just upthread [1]. Emails probably crossed while you were typing.

> I am not saying that a to-be-backpatched fix needs to make openssl readahead
> work, that'd be absurd. But I am concerned with more fundamentally entrenching
> the idea that there is never any buffering below libpq's buffering; it'll
> cause trouble down the line.

And I'm not saying that I'm fundamentally opposed to a future
architecture that allows readahead. But I do want to pin behavior
that's required for safety in the _current_ architecture. We can unpin
it if the order of operations is changed; assertions that have been
added can always be deleted.

> FWIW, I don't care about what we do during connection establishment.

I have to care, because upthread I've shown that we can deadlock there
too. My goal in this thread is to fix the deadlock generally, on all
branches.

--Jacob

[1] https://postgr.es/m/CAOYmi%2BmD6EbDEYfwZON0FCUAvGO%2B2%3DjR2V4KQYx%2Bd%2Bg0ap0Amg%40mail.gmail.com



Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

From:
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> The attached still needs some documentation work

v2 does a bunch of commit message work, but I imagine it needs a good
bit of copy-editing for clarity.

I'm not in any hurry to smash this in. I think we still need
- independent verification of the architectural issue, to make sure
it's not any deeper or shallower than pqReadData()
- independent verification that this fixes the bugs that have been described
- measurement of the performance characteristics of the new code
- verification of the maximum amount of additional buffer memory that
can be consumed during the drain
- consensus that we want to maintain this new behavior
- discussion of what we want this code to look like going forward

Andres, does this patch help clarify my thoughts upthread? Ideally the
additional code isn't getting in the way of any future
rearchitectures, since it only pins the new requirement in the code
that needs it.

Thanks,
--Jacob

Attachments