Discussion: Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> I'm the maintainer of ruby-pg, the ruby interface to the PostgreSQL
> database. This binding uses the asynchronous API of libpq by default to
> facilitate the ruby IO wait and scheduling mechanisms.
>
> This works well with the vanilla postgresql server, but it leads to
> starvation with other types of servers using the postgresql wire
> protocol 3. This is because the current functioning of the libpq async
> interface depends on a maximum size of SSL records of 8kB.

Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with Python/psycopg2.
I was able to reproduce a hang when all of the following were true:

- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on socket
  read events when the connection is busy (I used
  psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same record
  as the split message's final fragment

Gory details of the packet sizes, if it's helpful:

- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
  bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
     + CommandComplete + ReadyForQuery

> To fix this issue the attached patch calls pqReadData() repeatedly in
> PQconsumeInput() until there is no buffered SSL data left to be read.
> Another solution could be to process buffered SSL read bytes in
> PQisBusy() instead of PQconsumeInput().

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?
https://commitfest.postgresql.org/50/

Thanks,
--Jacob
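The async polling pattern at issue, reduced to libpq C calls (a minimal sketch of what psycopg2.extras.wait_select() effectively does; the initial PQsendQuery() and error reporting are omitted):

    #include <sys/select.h>
    #include <libpq-fe.h>

    /* Drive a query to completion with the async API. Assumes conn is
     * an established, non-blocking PGconn with a query already sent. */
    static PGresult *
    wait_for_result(PGconn *conn)
    {
        while (PQisBusy(conn))
        {
            int     sock = PQsocket(conn);
            fd_set  readfds;

            FD_ZERO(&readfds);
            FD_SET(sock, &readfds);

            /* The hang happens here: the kernel socket is empty, so
             * select() blocks forever, even though the rest of the
             * server's response is already buffered inside OpenSSL. */
            if (select(sock + 1, &readfds, NULL, NULL, NULL) < 0)
                return NULL;

            if (!PQconsumeInput(conn))
                return NULL;
        }
        return PQgetResult(conn);
    }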
Thank you Jacob for verifying this issue!

> Gory details of the packet sizes, if it's helpful:
> - max TLS record size is 12k, because it made the math easier
> - server sends DataRow of 32006 bytes, followed by DataRow of 806
>   bytes, followed by CommandComplete/ReadyForQuery
> - so there are three TLS records on the wire containing
>   1) DataRow 1 fragment 1 (12k bytes)
>   2) DataRow 1 fragment 2 (12k bytes)
>   3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
>      + CommandComplete + ReadyForQuery

How did you verify the issue on the server side - with YugabyteDB or
with a modified Postgres server? I'd like to verify the GSSAPI part,
and I'm familiar with the Postgres server only.

> I agree that PQconsumeInput() needs to ensure that the transport
> buffers are all drained. But I'm not sure this is a complete solution;
> doesn't GSS have the same problem? And are there any other sites that
> need to make the same guarantee before returning?

Which other sites do you mean? The synchronous transfer already works,
since the select() is short-circuited in case of pending bytes: [1]

> I need to switch away from this for a bit. Would you mind adding this
> to the next Commitfest as a placeholder?

No problem; registered: https://commitfest.postgresql.org/50/5251/

--
Regards, Lars

[1] https://github.com/postgres/postgres/blob/77761ee5dddc0518235a51c533893e81e5f375b9/src/interfaces/libpq/fe-misc.c#L1070
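The shape of the proposed change, for illustration (a sketch only, not the attached patch; the real PQconsumeInput() also flushes pending writes and parses input, which is omitted here):

    int
    PQconsumeInput(PGconn *conn)
    {
        if (!conn)
            return 0;

        /* Keep reading until the SSL layer reports no buffered bytes,
         * so a complete message can't be stranded inside OpenSSL while
         * the caller waits on the raw socket. pgtls_read_pending() is
         * libpq's internal check for such buffered bytes. */
        do
        {
            if (pqReadData(conn) < 0)
                return 0;       /* error; see PQerrorMessage() */
        } while (conn->ssl_in_use && pgtls_read_pending(conn));

        return 1;
    }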
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> How did you verify the issue on the server side - with YugabyteDB or
> with a modified Postgres server? I'd like to verify the GSSAPI part and
> I'm familiar with the Postgres server only.

Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.

> > And are there any other sites that
> > need to make the same guarantee before returning?
>
> Which other sites do you mean?

I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.

> > I need to switch away from this for a bit. Would you mind adding this
> > to the next Commitfest as a placeholder?
>
> No problem; registered: https://commitfest.postgresql.org/50/5251/

Thank you!

--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
vignesh C
Date:
On Thu, 12 Sept 2024 at 04:30, Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>
> On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis <lars@greiz-reinsdorf.de> wrote:
> > How did you verify the issue on the server side - with YugabyteDB or
> > with a modified Postgres server? I'd like to verify the GSSAPI part and
> > I'm familiar with the Postgres server only.
>
> Neither, unfortunately -- I have a protocol testbed that I use for
> this kind of stuff. I'm happy to share once I get it cleaned up, but
> it's unlikely to help you in this case since I haven't implemented
> gssenc support. Patching the Postgres server itself seems like a good
> way to go.
>
> > > And are there any other sites that
> > > need to make the same guarantee before returning?
> >
> > Which other sites do you mean?
>
> I'm mostly worried that other parts of libpq might assume that a
> single call to pqReadData will drain the buffers. If not, great! --
> but I haven't had time to check all the call sites.

@Jacob, could you find some time to wrap this up? This will help us
assess whether it can be refined into a committable state soon.

Regards,
Vignesh
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Sun, Mar 16, 2025 at 6:14 AM vignesh C <vignesh21@gmail.com> wrote:
> @Jacob, could you find some time to wrap this up? This will help us
> assess whether it can be refined into a committable state soon.

This is a bug report that likely requires backpatching; feature freeze
should not be an issue here.

Thanks,
--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Tue, Sep 10, 2024 at 11:49 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> I need to switch away from this for a bit.

"a bit"

In an effort to get this unstuck (and ideally solved during this
commitfest), here are my most recent thoughts:

> I agree that PQconsumeInput() needs to ensure that the transport
> buffers are all drained. But I'm not sure this is a complete solution;
> doesn't GSS have the same problem? And are there any other sites that
> need to make the same guarantee before returning?

So, Postgres internals cheat: calls to pqWait/pqWaitTimed/pqSocketCheck()
all reach down into the SSL buffer to see if there are any pending
bytes before polling the socket. I do not yet understand why this
protection is not extended to GSS-encrypted connections.

In any case, our clients can't make use of that protection either, so
it's unfortunate that we're relying on it ourselves, because it makes
reasoning about a fix more difficult. I'm not about to propose removing
that case in a backport, but as long as it's there covering up the
problem, it's hard to say whether we've truly fixed this bug.

I think the hang could happen in any situation where a third-party
client calls into a function which calls pqReadData(), and then
immediately polls the socket for a read-ready condition. So: what are
the APIs that could be called that way?

Clearly PQconsumeInput() is one. PQconnectPoll() seems like it should
be exposed as well. postgres_fdw's use of PQgetResult() might qualify
(or if it doesn't, I don't understand why not -- there is at least one
path in PQgetResult() where the buffer is not fully consumed). And
nonblocking mode seems like it might sprinkle this hazard into more
places, as pqSendSome() will then start reading data and returning
before a pqWait().

So, to fix this once and for all, do we have to make sure that
pqReadData() always fully drains the SSL/GSS transport buffer instead
of eagerly returning?

--Jacob
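The short-circuit in question, paraphrased from fe-misc.c (not the verbatim source; signatures simplified):

    static int
    pqSocketCheck(PGconn *conn, int forRead, int forWrite, time_t end_time)
    {
        if (!conn || conn->sock == PGINVALID_SOCKET)
            return -1;

        /* Bytes already decrypted and buffered inside OpenSSL count as
         * readable, even though the kernel socket may be empty... */
        if (forRead && conn->ssl_in_use && pgtls_read_pending(conn))
            return 1;

        /* ...but there is no equivalent check for GSS-encrypted
         * connections, so polling here can block on data that has in
         * fact already arrived. */
        return pqSocketPoll(conn->sock, forRead, forWrite, end_time);
    }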
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Tue, Jul 1, 2025 at 1:42 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> I do not yet understand why this protection is not extended to
> GSS-encrypted connections.

After repurposing some of my test code for d98cefe11, I'm able to
reproduce the hang with gssencmode when the server uses a
smaller-than-standard (12k) send buffer. Same reproduction case as the
original (32006-byte DataRow, 806-byte DataRow, CommandComplete,
ReadyForQuery). So a complete patch for this has to consider GSS as
well.

> So, to fix this once and for all, do we have to make sure that
> pqReadData() always fully drains the SSL/GSS transport buffer instead
> of eagerly returning?

I'll work on proving that code paths other than PQconsumeInput() are
affected. If they are, I'll start a patch for pqReadData().

--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 2, 2025 at 4:12 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> I'll work on proving that code paths other than PQconsumeInput() are
> affected. If they are, I'll start a patch for pqReadData().

Here's one way for a server implementation to hang the client during
PQconnectPoll(): accept the SSLRequest and immediately send a protocol
2.0 error message of (16k + 1) bytes, but split the message across two
TLS records of 8k and (8k + 1) bytes each, and wait for the client to
close the connection. This case works fine with synchronous connect,
but it fails with async.

libpq will fill the first 8k of its 16k input buffer with the first
record. It fills the second half with all but one byte of the second
record, and since it doesn't find a null terminator anywhere in the
buffer, it tells the client to poll on the socket again. Since the
server in this case is waiting for the client to close first, no
additional socket activity is coming, and the connection deadlocks
with one byte pending in the SSL buffer. (The arithmetic is sketched
below.)

I use 2.0 as an example; libpq avoids getting stuck in 3.0 by enlarging
the input buffer to the size of the incoming message. But that's
supposed to be a performance optimization, not anti-deadlock
protection. Even if we were to enshrine the buffer sizes we use in the
protocol specification itself [1], someone's still likely to trip over
this hazard if they start adjusting those performance optimizations.
(Even our use of a 16k initial buffer is itself just an optimization.)

So I think pqReadData() needs to make the same guarantees for SSL/GSS
that it does for plain TCP: when it returns, either 1) there are no
bytes left pending or 2) the socket itself is marked readable.
Otherwise I think we'll continue to chase weird corner cases.

What I'm not sure about is unforeseen consequences -- could fixing
this expose other latent bugs? If we're concerned about that, we might
have to tier the backports to make sure everything remains stable.
Here are the three bugs discussed so far, with the fix for 3 possibly
subsuming 1 and 2 but also affecting more code:

1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
2) PQconsumeInput() does not drain all pending bytes
3) pqReadData() does not drain all pending bytes

Thanks,
--Jacob

[1] We should not do that.
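The buffer arithmetic of that deadlock, for illustration (assuming libpq's initial 16k input buffer; the numbers come from the scenario above):

    #include <stdio.h>

    int
    main(void)
    {
        int inBufSize = 16 * 1024;      /* libpq's initial input buffer */
        int record1 = 8 * 1024;         /* first TLS record */
        int record2 = 8 * 1024 + 1;     /* second TLS record */

        int buffered = record1;             /* first pqReadData() */
        int space = inBufSize - buffered;   /* 8192 bytes still free */

        buffered += space;                  /* second pqReadData() fills it */
        printf("buffered: %d, stranded in OpenSSL: %d\n",
               buffered, record2 - space);  /* prints 16384 and 1 */

        /* With no NUL terminator in those 16384 bytes, libpq polls the
         * socket -- which stays silent, since the final byte has
         * already been read into OpenSSL. Deadlock. */
        return 0;
    }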
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Merlin Moncure
Date:
On Tue, Jul 15, 2025 at 4:31 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> Otherwise I think we'll continue to chase weird corner cases.
Agreed. Here's a little more detail on the case I noticed:
* postgres backend thread managing several libpq connections, with a polling is_busy loop
* when a client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
* checking the backend server log, which relogs the client-raised logs via dblink, each pause happened after roughly 16k bytes
* reducing the log verbosity made the problem go away.
merlin
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Andres Freund
Date:
Hi,

On 2025-07-15 15:31:25 -0700, Jacob Champion wrote:
> So I think pqReadData() needs to make the same guarantees for SSL/GSS
> that it does for plain TCP: when it returns, either 1) there are no
> bytes left pending or 2) the socket itself is marked readable.
> Otherwise I think we'll continue to chase weird corner cases.

It's not really the same issue, as at least my reproducer requires
modifying libpq (and thus clearly isn't a bug), but I thought it might
still be interesting. For one, I'm not sure this can only be triggered
with the modifications.

If one modifies libpq to use openssl readahead (which does result in
speedups, because otherwise openssl thinks it's useful to do lots of
5-byte reads from the socket), I see occasional hangs in libpq.

diff --git i/src/interfaces/libpq/fe-secure-openssl.c w/src/interfaces/libpq/fe-secure-openssl.c
index 51dd7b9fec0..a8deb28594c 100644
--- i/src/interfaces/libpq/fe-secure-openssl.c
+++ w/src/interfaces/libpq/fe-secure-openssl.c
@@ -864,6 +864,9 @@ initialize_SSL(PGconn *conn)
 	 */
 	SSL_CTX_set_mode(SSL_context, SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);

+	SSL_CTX_set_read_ahead(SSL_context, 1);
+	SSL_CTX_set_default_read_buffer_len(SSL_context, 64*1024);
+
 	/*
 	 * If the root cert file exists, load it so we can perform certificate
 	 * verification. If sslmode is "verify-full" we will also do further

    ninja src/bin/psql/psql && src/bin/psql/psql -Xq -h localhost -c '\copy pg_class to stdout'

hangs reliably for me (if it's going to a file it's less reliable, so
there's a timing aspect here). At that point psql is not interruptible,
which isn't great either...

Backtrace:

#0  __internal_syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=-1, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7) at ./nptl/cancellation.c:44
#1  0x00007f1e25e3f6ad in __syscall_cancel (a1=a1@entry=140726731206072, a2=a2@entry=1, a3=<optimized out>, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=7) at ./nptl/cancellation.c:75
#2  0x00007f1e25eb39c6 in __GI___poll (fds=fds@entry=0x7ffd7ed2edb8, nfds=nfds@entry=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#3  0x00007f1e26248dea in poll (__fds=0x7ffd7ed2edb8, __nfds=1, __timeout=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:44
#4  PQsocketPoll (sock=<optimized out>, forRead=<optimized out>, forWrite=<optimized out>, end_time=<optimized out>) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1175
#5  0x00007f1e262491a8 in pqSocketCheck (conn=conn@entry=0x55f69b7e68f0, forRead=1, forWrite=0, end_time=-1) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1114
#6  0x00007f1e262492b3 in pqWaitTimed (forRead=<optimized out>, forWrite=<optimized out>, conn=0x55f69b7e68f0, end_time=<optimized out>) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1039
#7  0x00007f1e262492f1 in pqWait (forRead=forRead@entry=1, forWrite=forWrite@entry=0, conn=conn@entry=0x55f69b7e68f0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1021
#8  0x00007f1e26234c46 in pqGetCopyData3 (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-protocol3.c:1846
#9  0x00007f1e26243d98 in PQgetCopyData (conn=conn@entry=0x55f69b7e68f0, buffer=buffer@entry=0x7ffd7ed2efb8, async=async@entry=0) at ../../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:2840
#10 0x000055f65d28f9c6 in handleCopyOut (conn=0x55f69b7e68f0, copystream=copystream@entry=0x7f1e25f985c0 <_IO_2_1_stdout_>, res=res@entry=0x7ffd7ed2f008) at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:442
#11 0x000055f65d27b07d in HandleCopyResult (resultp=resultp@entry=0x7ffd7ed2f158, copystream=0x7f1e25f985c0 <_IO_2_1_stdout_>) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:950
#12 0x000055f65d27dab8 in ExecQueryAndProcessResults (query=query@entry=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ", elapsed_msec=elapsed_msec@entry=0x7ffd7ed2f1b8, svpt_gone_p=svpt_gone_p@entry=0x7ffd7ed2f1b7, is_watch=is_watch@entry=false, min_rows=min_rows@entry=0, opt=opt@entry=0x0, printQueryFout=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1934
#13 0x000055f65d27cdc9 in SendQuery (query=0x55f69b8b3470 "COPY pg_attribute TO STDOUT ") at ../../../../../home/andres/src/postgresql/src/bin/psql/common.c:1212
#14 0x000055f65d28f6dd in do_copy (args=args@entry=0x55f69b8b1f10 "pg_attribute to stdout") at ../../../../../home/andres/src/postgresql/src/bin/psql/copy.c:370
#15 0x000055f65d299f24 in exec_command_copy (scan_state=scan_state@entry=0x55f69b8a71f0, active_branch=active_branch@entry=true) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:951
#16 0x000055f65d2a2065 in exec_command (cmd=cmd@entry=0x55f69b8b3310 "copy", scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:338
#17 0x000055f65d2a2b59 in HandleSlashCmds (scan_state=scan_state@entry=0x55f69b8a71f0, cstack=cstack@entry=0x55f69b887ba0, query_buf=query_buf@entry=0x0, previous_buf=previous_buf@entry=0x0) at ../../../../../home/andres/src/postgresql/src/bin/psql/command.c:241
#18 0x000055f65d29305f in main (argc=<optimized out>, argv=<optimized out>) at ../../../../../home/andres/src/postgresql/src/bin/psql/startup.c:409

It does seem ... weird ... that pqGetCopyData3() just processes
whatever is already in the libpq buffer and then waits for the socket
to become readable, before ever calling pqReadData(). What guarantees,
even without the change above, that a) there's no pending data inside
openssl, and b) that we're not waiting for a write?

If I make pqGetCopyData3() call pqReadData() before the pqWait() and
continue if it returns 1, the hang vanishes.

The same

    if (pqWait(true, false, conn) ||
        pqReadData(conn) < 0)
        return -2;

pattern (without a preceding pqReadData()) exists in several other
places :(

> What I'm not sure about is unforeseen consequences -- could fixing
> this expose other latent bugs? If we're concerned about that, we might
> have to tier the backports to make sure everything remains stable.
> Here are the three bugs discussed so far, with the fix for 3 possibly
> subsuming 1 and 2 but also affecting more code:
>
> 1) pqSocketCheck() checks for pending SSL bytes but not GSS bytes
> 2) PQconsumeInput() does not drain all pending bytes
> 3) pqReadData() does not drain all pending bytes

What are the limits for the maximum amount of data this could make us
buffer in addition to what we are buffering right now? It's not
entirely obvious to me that a loop around pqReadData() as long as there
is pending data couldn't make us buffer a lot of data.

Do you have a WIP patch?

Greetings,

Andres Freund
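The workaround Andres describes, sketched against pqGetCopyData3()'s retry loop (illustration only, not an actual diff):

    if (msgLength == 0)
    {
        int     n;

        if (async)
            return 0;

        /* Drain the transport first; anything already buffered inside
         * openssl surfaces here without touching the socket. */
        n = pqReadData(conn);
        if (n < 0)
            return -2;
        if (n > 0)
            continue;           /* got data; re-check the buffer */

        /* Nothing pending anywhere: now it's safe to wait. */
        if (pqWait(true, false, conn) || pqReadData(conn) < 0)
            return -2;
        continue;
    }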
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 7:36 AM Merlin Moncure <mmoncure@gmail.com> wrote:
> Agreed. Here's a little more detail on the case I noticed:
>
> * postgres backend thread managing several libpq connections, with a polling is_busy loop
> * when a client pushed a lot of log messages (say, with 'RAISE NOTICE'), the server would stall for significant periods of time, sometimes minutes -- during that time, there would be no activity in the log (the server doesn't do anything interesting outside of the polling loop)
> * checking the backend server log, which relogs the client-raised logs via dblink, each pause happened after roughly 16k bytes
> * reducing the log verbosity made the problem go away.

Thanks for this! I'm not sure I'll be able to prove that this is
related, as opposed to some other potential problem, but it's good to
have the information.

--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> If one modifies libpq to use openssl readahead (which does result in
> speedups, because otherwise openssl thinks it's useful to do lots of
> 5-byte reads from the socket), I see occasional hangs in libpq.

Now that is a very interesting coincidence, because the patch I'm
working on is about to add an assertion that readahead is turned
*off*. :D

Based on my understanding of [1], readahead makes this overall problem
much worse by opportunistically slurping bytes off the wire and doing
absolutely nothing with them until you call SSL_read() enough times to
finally get to them. It would even hide those bytes from the pending
count:

> Therefore, it is possible for no more bytes to be readable from the
> underlying BIO (because OpenSSL has already read them) and for
> SSL_pending() to return 0, even though readable application data bytes
> are available (because the data is in unprocessed buffered records).

That seems like the polar opposite of what we want. If clients are
polling on the raw socket, those pending bytes must be accounted for.

> It does seem ... weird ... that pqGetCopyData3() just processes whatever is
> already in the libpq buffer and then waits for the socket to become readable,
> before ever calling pqReadData(). What guarantees, even without the change
> above, that a) there's no pending data inside openssl b) that we're not
> waiting for a write?

(a) is the bug discussed here, AFAICT? I have not considered (b) yet,
but that seems like a separate problem.

> If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
> if it returns 1, the hang vanishes.

Sure, but that's cheating. I think I'd expect that adding extra
"unnecessary" reads would help cover up the problem that readahead has
caused.

> What are the limits for the maximum amount of data this could make us buffer
> in addition to what we are buffering right now?

I still need to make sure, but I believe it should be an additional
16k, due to the current limits on GSS token sizes and TLS record sizes.
In normal operation, I think we basically immediately double the input
buffer sizes (which start at 16k) as soon as a long message comes in.
So that order of magnitude didn't immediately seem worrisome to me.

> It's not entirely obvious to
> me that a loop around pqReadData() as long as there is pending data couldn't
> make us buffer a lot of data.

FWIW, I'm not planning to loop around it; I'm planning to add a
pgsecure_* primitive that drains whatever the transport is holding
onto.

> Do you have a WIP patch?

I'm working on one now.

--Jacob

[1] https://docs.openssl.org/3.5/man3/SSL_pending/#synopsis
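One possible shape for such a primitive (hypothetical sketch; the function name, the transport_has_pending() helper, and the exact buffer-growth policy are assumptions, not the actual patch):

    /* Move any bytes the transport has already decrypted/buffered into
     * libpq's inBuffer, without touching the raw socket. */
    static int
    pqsecure_drain_pending(PGconn *conn)
    {
        while (transport_has_pending(conn))   /* e.g. SSL_pending() > 0 */
        {
            ssize_t n;

            /* Grow inBuffer so the pending bytes have somewhere to go. */
            if (pqCheckInBufferSpace(conn->inEnd + 1024, conn))
                return -1;

            n = pqsecure_read(conn, conn->inBuffer + conn->inEnd,
                              conn->inBufSize - conn->inEnd);
            if (n < 0)
                return -1;
            conn->inEnd += n;
        }
        return 0;
    }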
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Andres Freund
Date:
Hi,

On 2025-07-16 11:50:46 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> > If one modifies libpq to use openssl readahead (which does result in
> > speedups, because otherwise openssl thinks it's useful to do lots of
> > 5-byte reads from the socket), I see occasional hangs in libpq.
>
> Now that is a very interesting coincidence, because the patch I'm
> working on is about to add an assertion that readahead is turned
> *off*. :D

Heh.

> Based on my understanding of [1], readahead makes this overall problem
> much worse by opportunistically slurping bytes off the wire and doing
> absolutely nothing with them until you call SSL_read() enough times to
> finally get to them.

Right - but it also substantially reduces the number of syscalls :(.
I'm not sure that it's wise to close that door permanently. Clearly
it's not something we need to care about in the back branches.

Without it, openssl will do one read for the record length and another
read for the record contents, i.e. it'll often double the number of
syscalls.

Without readahead:

    perf stat -e raw_syscalls:sys_enter,cycles,context-switches \
      src/bin/pgbench/pgbench -c $c -j$j -M prepared -n -P 1 -f <(echo 'select 1') -h localhost -t1000

         5,561   raw_syscalls:sys_enter
    89,125,976   cycles
         1,009   context-switches

With readahead:

         4,556   raw_syscalls:sys_enter
    83,582,571   cycles
         1,007   context-switches

> It would even hide those bytes from the pending count:
>
> > Therefore, it is possible for no more bytes to be readable from the
> > underlying BIO (because OpenSSL has already read them) and for
> > SSL_pending() to return 0, even though readable application data bytes
> > are available (because the data is in unprocessed buffered records).

It's rather depressing - like, how did this API behaviour come to be?
Luckily there seems to be SSL_has_pending():

> SSL_has_pending() returns 1 if s has buffered data (whether processed
> or unprocessed) and 0 otherwise

And it seems old enough:

> The SSL_has_pending() function was added in OpenSSL 1.1.0.

> > It does seem ... weird ... that pqGetCopyData3() just processes whatever is
> > already in the libpq buffer and then waits for the socket to become readable,
> > before ever calling pqReadData(). What guarantees, even without the change
> > above, that a) there's no pending data inside openssl b) that we're not
> > waiting for a write?
>
> (a) is the bug discussed here, AFAICT?

I'm not entirely sure, tbh.

> > If I make pqGetCopyData3() call pqReadData() before the pqWait() and continue
> > if it returns 1, the hang vanishes.
>
> Sure, but that's cheating. I think I'd expect that adding extra
> "unnecessary" reads would help cover up the problem that readahead has
> caused.

To me the pattern in our code seems bogus, even without readahead, no?

getCopyDataMessage() can return 0 because our internal buffer is empty,
without ever reading from the socket or openssl. But if the SSL record
contained two messages (or parts of two messages), all the required
data may be in openssl. In which case the pqWait() that
pqGetCopyData3() does will wait forever. That's a problem unrelated to
readahead, no?

I.e. the extra read actually ensures that we're not doing a pqWait()
without knowing whether we need one.

Greetings,

Andres Freund
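For illustration, the difference between the two calls when deciding whether it's safe to sleep on the raw socket (a standalone sketch, not libpq code):

    #include <stdbool.h>
    #include <openssl/ssl.h>

    static bool
    safe_to_poll_socket(SSL *ssl)
    {
        /* SSL_pending() counts only already-processed bytes in the
         * current record; with read-ahead enabled it can return 0
         * while whole unprocessed records sit in OpenSSL's buffer. */
        if (SSL_pending(ssl) > 0)
            return false;

        /* SSL_has_pending() (OpenSSL >= 1.1.0) also reports
         * unprocessed buffered data, so it stays correct under
         * read-ahead. */
        if (SSL_has_pending(ssl))
            return false;

        return true;            /* nothing buffered; polling is safe */
    }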
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > Based on my understanding of [1], readahead makes this overall problem
> > much worse by opportunistically slurping bytes off the wire and doing
> > absolutely nothing with them until you call SSL_read() enough times to
> > finally get to them.
>
> Right - but it also substantially reduces the number of syscalls :(. I'm not
> sure that it's wise to close that door permanently.

I think it's very likely that OpenSSL's implementation of readahead is
fundamentally incompatible with our async implementation. For one, the
documented way to call SSL_read() without accidentally hitting the
socket is via SSL_pending(), which readahead is documented to break.
For two, if you're worried about how much data we could potentially
have to hold during a "drain all pending" operation, I think readahead
changes the upper bound from "the size of one TLS record" to
"SO_RCVBUF", doesn't it?

> Without it, openssl will do one read for the record length and another
> read for the record contents, i.e. it'll often double the number of
> syscalls.

That is unfortunate... but if we're talking about a
correctness/performance tradeoff then I'm somewhat less sympathetic to
the performance side.

> Luckily there seems to be SSL_has_pending():
>
> > SSL_has_pending() returns 1 if s has buffered data (whether processed
> > or unprocessed) and 0 otherwise

IIUC, we can't use SSL_has_pending() either. I need to know exactly
how many bytes to drain out of the userspace buffer without
introducing more bytes into it.

> To me the pattern in our code seems bogus, even without readahead, no?
>
> getCopyDataMessage() can return 0 because our internal buffer is empty,
> without ever reading from the socket or openssl. But if the SSL record
> contained two messages (or parts of two messages), all the required
> data may be in openssl. In which case the pqWait() that
> pqGetCopyData3() does will wait forever.

I think the "cheating" in pqSocketCheck() that I mentioned upthread
ensures that pqWait() will see the OpenSSL data and return
immediately. (That's a major architectural smell IMO, but it is there.
Just not for GSS, which might explain some hangs discussed on the list
a while back.)

> I.e. the extra read actually ensures that we're not doing a pqWait()
> without knowing whether we need one.

I don't think so, because (without my WIP patch) you still can't
guarantee that the single call to pqReadData() got all of the
readahead data. And with my WIP patch, you don't need the extra call,
because the previous call to pqReadData() will have ensured that
there's no leftover SSL buffer to begin with.

Am I missing something?

Thanks!
--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Andres Freund
Date:
Hi,

On 2025-07-16 15:25:14 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
> > > Based on my understanding of [1], readahead makes this overall problem
> > > much worse by opportunistically slurping bytes off the wire and doing
> > > absolutely nothing with them until you call SSL_read() enough times to
> > > finally get to them.
> >
> > Right - but it also substantially reduces the number of syscalls :(. I'm not
> > sure that it's wise to close that door permanently.
>
> I think it's very likely that OpenSSL's implementation of readahead is
> fundamentally incompatible with our async implementation. For one, the
> documented way to call SSL_read() without accidentally hitting the
> socket is via SSL_pending(), which readahead is documented to break.

Why do we care about not hitting the socket? We always operate the
socket in non-blocking mode anyway?

FWIW, I have seen too many folks move away from encrypted connections
for backups, due to openssl being the bottleneck, to just accept that
we'll never go beyond the worst performance. Any decently performing
networking will need to divorce when socket IO happens and when
openssl sees the data much more heavily. Just relying more and more on
extremely tight coupling is a dead end.

> For two, if you're worried about how much data we could potentially
> have to hold during a "drain all pending" operation, I think readahead
> changes the upper bound from "the size of one TLS record" to
> "SO_RCVBUF", doesn't it?

It seems to be a configurable limit, with
SSL_set_default_read_buffer_len(). But the default is to just read up
to a whole frame's worth of data, instead of reading the length
separately. I.e. an irrelevant amount of memory.

> > Without it, openssl will do one read for the record length, and another
> > read for the record contents. I.e. it'll often double the number of
> > syscalls.
>
> That is unfortunate... but if we're talking about a
> correctness/performance tradeoff then I'm somewhat less sympathetic to
> the performance side.

I'm obviously not advocating for incorrectness.

> > Luckily there seems to be SSL_has_pending():
> >
> > > SSL_has_pending() returns 1 if s has buffered data (whether processed
> > > or unprocessed) and 0 otherwise
>
> IIUC, we can't use SSL_has_pending() either. I need to know exactly
> how many bytes to drain out of the userspace buffer without
> introducing more bytes into it.

Why? The only case where we ought to care about whether pending data
exists inside openssl is if our internal buffer is either empty or
doesn't contain the entire message we're trying to consume. In either
of those two cases we can just consume some data from openssl, without
caring how much it precisely is.

> > To me the pattern in our code seems bogus, even without readahead, no?
> >
> > getCopyDataMessage() can return 0 because our internal buffer is empty,
> > without ever reading from the socket or openssl. But if the SSL record
> > contained two messages (or parts of two messages), all the required
> > data may be in openssl. In which case the pqWait() that
> > pqGetCopyData3() does will wait forever.
>
> I think the "cheating" in pqSocketCheck() that I mentioned upthread
> ensures that pqWait() will see the OpenSSL data and return
> immediately. (That's a major architectural smell IMO, but it is there.
> Just not for GSS, which might explain some hangs discussed on the list
> a while back.)

I guess so, but it's hard to know without a patch.

> > I.e. the extra read actually ensures that we're not doing a pqWait()
> > without knowing whether we need one.
>
> I don't think so, because (without my WIP patch) you still can't
> guarantee that the single call to pqReadData() got all of the
> readahead data.

It wouldn't need to get all the data in one pqReadData() in this case,
as we'd just loop around and do the whole thing again, until the entire
message is read in (regardless of whether there's remaining data in a
partially consumed record, or in the "unprocessed buffer").

Greetings,

Andres Freund
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we care about not hitting the socket? We always operate the
> socket in non-blocking mode anyway?

IIUC, that would change pqReadData() from a bounded read to an
unbounded one. Then we have to somehow protect against a server that
can send data faster than we can process it.

> FWIW, I have seen too many folks move away from encrypted connections
> for backups, due to openssl being the bottleneck, to just accept that
> we'll never go beyond the worst performance. Any decently performing
> networking will need to divorce when socket IO happens and when
> openssl sees the data much more heavily.

Can't we make our custom BIO do that?

> Just relying more and more on extremely tight coupling is a dead end.

Why? This abstraction _must_ leak. Because of PQsocket() at minimum,
but also there's a bunch of code complexity around the read/write
behavior and EOF differences for a TLS connection. When I look at the
code in pqReadData() I am very sad. TLS does not behave like raw TCP.
So if there's no way to get rid of the coupling, IMO we might as well
use it.

> > For two, if you're worried about how much data we could potentially
> > have to hold during a "drain all pending" operation, I think readahead
> > changes the upper bound from "the size of one TLS record" to
> > "SO_RCVBUF", doesn't it?
>
> It seems to be a configurable limit, with
> SSL_set_default_read_buffer_len(). But the default is to just read up
> to a whole frame's worth of data, instead of reading the length
> separately. I.e. an irrelevant amount of memory.

Hmm, okay.

> > IIUC, we can't use SSL_has_pending() either. I need to know exactly
> > how many bytes to drain out of the userspace buffer without
> > introducing more bytes into it.
>
> Why? The only case where we ought to care about whether pending data
> exists inside openssl is if our internal buffer is either empty or
> doesn't contain the entire message we're trying to consume. In either
> of those two cases we can just consume some data from openssl, without
> caring how much it precisely is.

I can't tell if you're discussing the fix for this bug or a
hypothetical future architecture that makes readahead safe. I don't
think PQconnectPoll() behaves the way you're describing; there is no
retry-read concept. The order of operations is:

- read some available data; don't care how much
- parse the message(s) if we have enough data
- otherwise tell the caller to poll for read

That's an architecture that's already very tightly coupled to the
behavior of un-userspace-buffered TCP/UNIX sockets. Which is why I'm
trying to have pqReadData() make compatible guarantees for SSL/GSS.

--Jacob
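That order of operations, as a rough C sketch (not the actual PQconnectPoll() source; the two helpers are hypothetical stand-ins for its parsing logic):

    static PostgresPollingStatusType
    connect_poll_step(PGconn *conn)
    {
        /* Read whatever is available right now; don't care how much. */
        if (pqReadData(conn) < 0)
            return PGRES_POLLING_FAILED;

        /* Parse the message(s) if we have enough data... */
        if (buffer_has_complete_message(conn))  /* hypothetical */
            return process_message(conn);       /* hypothetical */

        /* ...otherwise tell the caller to poll for read. If bytes are
         * stranded in the transport, that poll never fires. */
        return PGRES_POLLING_READING;
    }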
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Wed, Jul 16, 2025 at 11:50 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> On Wed, Jul 16, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> > Do you have a WIP patch?
>
> I'm working on one now.

The attached still needs some documentation work, and closer inspection
of the new assertions and OpenSSL error handling, but I think it's good
enough to get my intent across.

0001 is a fix for pqWait+gssencmode, which is hopefully the easier of
the two to review/backport. The current behavior for GSS is, IMO, an
obvious oversight.

0002 adds drain semantics at the end of pqReadData(). If we were
writing this from scratch, it wouldn't look like this: there's no point
in having the transport layer split data off into an undersized
conn->inBuffer if we're just going to come right back and drain it all
anyway. But I'm hoping this is easier to reason about and verify,
compared to a bigger rewrite.

I've renamed the existing pgtls_read_pending() to
pgtls_read_is_pending(), to try to disambiguate it from the new
*_drain_pending() APIs, but I'm not particularly attached to my choice
of names.

Thanks,
--Jacob
Attachments
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> The current behavior for GSS is, IMO, an obvious oversight.

(A better way to word this would have been "clearly an oversight"; I
didn't mean to imply that the correct behavior is obvious. The
opposite, in fact.)

--Jacob
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Andres Freund
Date:
Hi,

On 2025-07-17 09:48:29 -0700, Jacob Champion wrote:
> On Wed, Jul 16, 2025 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> > Why do we care about not hitting the socket? We always operate the
> > socket in non-blocking mode anyway?
>
> IIUC, that would change pqReadData() from a bounded read to an
> unbounded one. Then we have to somehow protect against a server that
> can send data faster than we can process it.

I don't see why it would have to. If a connection has sufficient data
in its buffer on the libpq level, we obviously don't need to read more.

> > FWIW, I have seen too many folks move away from encrypted connections
> > for backups, due to openssl being the bottleneck, to just accept that
> > we'll never go beyond the worst performance. Any decently performing
> > networking will need to divorce when socket IO happens and when
> > openssl sees the data much more heavily.
>
> Can't we make our custom BIO do that?

We can, but that actually just makes this issue *more* acute, not less.
We would obviously have to take the data buffered on the BIO level into
account before signalling to the libpq user that no data is ready.

FWIW, even with BIO-level buffering, it turns out that openssl
readahead is *still* a decent performance boon; going twice as many
times into the BIO isn't free, even if the BIO does buffering.

> > Just relying more and more on extremely tight coupling is a dead end.
>
> Why? This abstraction _must_ leak.

I don't really buy this. We have buffering in libpq, even without TLS.
Most of the issues we need to solve already exist for non-TLS
connections (except for the annoying thing of introducing writes when
doing reads and vice versa). It's just that our implementation is bad.

> TLS does not behave like raw TCP. So if there's no way to get rid of
> the coupling, IMO we might as well use it.

I agree that they're not the same. I don't see what that has to do with
introducing a hard requirement for no buffering within openssl or
layers below (a future BIO that does caching).

> > > IIUC, we can't use SSL_has_pending() either. I need to know exactly
> > > how many bytes to drain out of the userspace buffer without
> > > introducing more bytes into it.
> >
> > Why? The only case where we ought to care about whether pending data
> > exists inside openssl is if our internal buffer is either empty or
> > doesn't contain the entire message we're trying to consume. In either
> > of those two cases we can just consume some data from openssl, without
> > caring how much it precisely is.
>
> I can't tell if you're discussing the fix for this bug or a
> hypothetical future architecture that makes readahead safe.

I don't know what your fix really looks like - afaict you haven't
shared it. So it's hard to actually comment with specifics. I am not
saying that a to-be-backpatched fix needs to make openssl readahead
work, that'd be absurd. But I am concerned with more fundamentally
entrenching the idea that there never is any buffering below libpq's
buffering; it'll cause trouble down the line.

> I don't think PQconnectPoll() behaves the way you're describing; there
> is no retry-read concept.

FWIW, I don't care about what we do during connection establishment.
Doing lots of tiny reads or such will never matter for efficiency,
compared to all the other costs. To me the relevant cases are use of
libpq in established connections, either in a blocking way or in a
non-blocking way.

For blocking use we have that gross SSL hack in place (i.e. the
pgtls_read_pending() check in pqSocketCheck()), but we obviously
*don't* for the non-blocking case - the problem of this thread.

Our APIs around the non-blocking use of libpq are pretty bonkers:

- PQconsumeInput() doesn't tell you whether it actually consumed input
- We kind of document PQconsumeInput() to consume all the available
  input, but it does no such thing, it just fills the internal buffer.
  Which might be big, or it might not be, depending on what previously
  happened on the connection.
- PQisBusy() returns true even if there's unconsumed data in the socket

There's really no sane way to write an event loop with this that
doesn't do unnecessary poll()s, because PQconsumeInput() might have
consumed data, but PQisBusy() might *still* return true, because not
enough input has been consumed.

I suspect that the most minimal way to fix this, without breaking the
external API, would be for PQisBusy() to try to read more data. But
that's also pretty gross.

Greetings,

Andres Freund
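The PQisBusy() idea, sketched (illustration only, not a proposal; parseInput() is libpq's internal message parser in fe-exec.c):

    int
    PQisBusy(PGconn *conn)
    {
        if (!conn)
            return 0;

        parseInput(conn);

        /* Still busy? Try to pull in data the transport has already
         * buffered; each successful read may complete the pending
         * message without the caller ever polling the socket. */
        while (conn->asyncStatus == PGASYNC_BUSY && pqReadData(conn) > 0)
            parseInput(conn);

        return conn->asyncStatus == PGASYNC_BUSY;
    }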
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,

I think we're talking past each other, so let me try to focus on just
a few items here. I'm happy to go back and respond point-by-point if
needed.

> I don't know what your fix really looks like - afaict you haven't
> shared it. So it's hard to actually comment with specifics.

Just upthread [1]. Emails probably crossed while you were typing.

> I am not saying that a to-be-backpatched fix needs to make openssl
> readahead work, that'd be absurd. But I am concerned with more
> fundamentally entrenching the idea that there never is any buffering
> below libpq's buffering; it'll cause trouble down the line.

And I'm not saying that I'm fundamentally opposed to a future
architecture that allows readahead. But I do want to pin behavior
that's required for safety in the _current_ architecture. We can unpin
it if the order of operations is changed; assertions that have been
added can always be deleted.

> FWIW, I don't care about what we do during connection establishment.

I have to care, because upthread I've shown that we can deadlock there
too. My goal in this thread is to fix the deadlock generally, on all
branches.

--Jacob

[1] https://postgr.es/m/CAOYmi%2BmD6EbDEYfwZON0FCUAvGO%2B2%3DjR2V4KQYx%2Bd%2Bg0ap0Amg%40mail.gmail.com
Re: libpq: Process buffered SSL read bytes to support records >8kB on async API
From
Jacob Champion
Date:
On Fri, Jul 18, 2025 at 11:11 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> The attached still needs some documentation work

v2 does a bunch of commit message work, but I imagine it needs a good
bit of copy-editing for clarity.

I'm not in any hurry to smash this in. I think we still need

- independent verification of the architectural issue, to make sure
  it's not any deeper or shallower than pqReadData()
- independent verification that this fixes the bugs that have been
  described
- measurement of the performance characteristics of the new code
- verification of the maximum amount of additional buffer memory that
  can be consumed during the drain
- consensus that we want to maintain this new behavior
- discussion of what we want this code to look like going forward

Andres, does this patch help clarify my thoughts upthread? Ideally the
additional code isn't getting in the way of any future rearchitectures,
since it only pins the new requirement in the code that needs it.

Thanks,
--Jacob