BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

Поиск
Список
Период
Сортировка
От PG Bug reporting form
Тема BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection
Дата
Msg-id 18210-4c6d0b14627f2eb8@postgresql.org
обсуждение исходный текст
Ответы Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18210
Logged by:          Lars Kanis
Email address:      lars@greiz-reinsdorf.de
PostgreSQL version: 16.1
Operating system:   Ubuntu-23.10
Description:

I'm the maintainer of the ruby binding of libpq and this issue happens in
Ruby, but it isn't related to the Ruby environment. It is only that the ruby
binding always runs with the libpq non-blocking mode enabled internally, so
that this issue happens.

The error "GSSAPI caller failed to retransmit all data needing to be
retried" is raised here:

https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
It happens only in non-blocking mode over GSSAPI encrypted connections. It
isn't reliable and depends on the network timing. When sending a 7MB file in
alternating pieces of 27KB and 180 Byte per PQputCopyData() there is a 50%
chance to get the failure over the local network. It doesn't happen if TLS
is used instead.

I added some debug prints into libpq to show the amounts of data
transferred. It also shows the gdb backtrace of where the error is raised.
It like so:
```
[...lots of data blocks successfully transferred...]
PQputCopyData() size=186
PQputCopyData() size=27714
pqPutMsgEnd() conn->outCount=30162 toSend=24576
pg_GSS_write() len=24576  PqGSSSendConsumed=0
PQputCopyData() size=189
PQputCopyData() size=27405
pqPutMsgEnd() conn->outCount=33190 toSend=32768
pg_GSS_write() len=32768  PqGSSSendConsumed=0
PQputCopyData() size=185
PQputCopyData() size=27481
pqPutMsgEnd() conn->outCount=28098 toSend=24576
pg_GSS_write() len=24576  PqGSSSendConsumed=0
PQputCopyData() size=182
pqPutMsgEnd() conn->outCount=11965 toSend=8192
pg_GSS_write() len=8192  PqGSSSendConsumed=8256

Breakpoint 1, 0x00007ffff20fe811 in appendPQExpBufferStr ()
   from /home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
(gdb) bt
#0  0x00007ffff20fe811 in appendPQExpBufferStr () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#1  0x00007ffff2103545 in pg_GSS_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#2  0x00007ffff20fbbad in pqsecure_write () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#3  0x00007ffff20f569d in pqSendSome () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#4  0x00007ffff20f517c in pqPutMsgEnd () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#5  0x00007ffff20f07ba in PQputCopyData () from
/home/kanis/ruby-pg/ports/x86_64-linux/lib/libpq.so.5
#6  0x00007ffff1f259e2 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#7  0x00007ffff7c67002 in rb_nogvl () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#8  0x00007ffff1f261ac in gvl_PQputCopyData ()
   from
/home//kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#9  0x00007ffff1f2af90 in ?? () from
/home/kanis/.rvm/gems/ruby-3.2.1/gems/pg-1.5.4-x86_64-linux/lib/3.2/pg_ext.so
#10 0x00007ffff7c9ebff in vm_call_cfunc_with_frame ()
   from /home//kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#11 0x00007ffff7caefa0 in vm_exec_core () from
/home/kanis/.rvm/rubies/ruby-3.2.1/bin/../lib/libruby.so.3.2
#12 0x00007ffff7cb4a23 in rb_vm_exec () from /home/kanis/.rvm/rub
[...more ruby stack...]
```

The problem occurs when pqPutMsgEnd() sends some data per pqSendSome(), if
there is more that 8192 Bytes in the output buffer. The data size is
truncated to the next lower 8KB boundary. This happens here:
https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-misc.c#L544
Due to the truncation it can happen that the boundary is before the boundary
of the previous call of pqSendSome(). This in turn is raised by
pg_GSS_write() which can not transfer any data in this case.

So the error message "GSSAPI caller failed to retransmit all data needing to
be retried" is not about the application calling libpq. In fact the 7MB file
always fits into libpq's internal buffer memory, so that no retransmits are
done on the application level or at the level of the ruby binding. 

The issue is fixed if I change libpq at either pqPutMsgEnd() or at
pg_GSS_write().
So either avoid the truncation of send size to next 8KB boundary in
pqPutMsgEnd() like so:
```
    if (conn->outCount >= 8192)
    {
        int            toSend = conn->outCount; // - (conn->outCount % 8192);

        if (pqSendSome(conn, toSend) < 0)
            return EOF;
```

... or just returning 0 Bytes in pg_GSS_write() instead of raising an
error:
```
    if (len < PqGSSSendConsumed)
    {
        return 0;
        appendPQExpBufferStr(&conn->errorMessage,
                             "GSSAPI caller failed to retransmit all data needing to be
retried\n");
```

Unfortunately so far I'm unable to provide a repro script. Maybe the above
information is enough to understand the problem.

Regards,
Lars Kanis


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Richard Guo
Дата:
Сообщение: Re: BUG #18187: Unexpected error: "variable not found in subplan target lists" triggered by JOIN
Следующее
От: Tom Lane
Дата:
Сообщение: Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection