Обсуждение: BUG #17948: libpq seems to misbehave in a pipelining corner case

Поиск
Список
Период
Сортировка

BUG #17948: libpq seems to misbehave in a pipelining corner case

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17948
Logged by:          Ivan Trofimov
Email address:      i.trofimow@yandex.ru
PostgreSQL version: 15.3
Operating system:   Ubuntu 20.04
Description:

As far as i understand, there is an invariant in libpq, that if
PQpipelineSync call finished successfully,
PQresultStatus will eventually return PGRES_PIPELINE_SYNC, or the connection
will be in CONNECTION_BAD state.
 
This is highlighted in several places in the docs:
1. "PGRES_PIPELINE_SYNC is reported exactly once for each PQpipelineSync at
the corresponding point in the pipeline."
2. "From the client's perspective, after PQresultStatus returns
PGRES_FATAL_ERROR, the pipeline is flagged as aborted. 
PQresultStatus will report a PGRES_PIPELINE_ABORTED result for each
remaining queued operation in an aborted pipeline. 
The result for PQpipelineSync is reported as PGRES_PIPELINE_SYNC to signal
the end of the aborted pipeline and 
resumption of normal result processing. The client must process results with
PQgetResult during error recovery."
 
So i expect a code like this
 
if (!PQpipelineSync(conn)) exit(1);
 
while (PQstatus(conn) != CONNECTION_BAD) {
   res = PQgetResult(conn);
   if (PQresultStatus(res) == PGRES_PIPELINE_SYNC) {
       break;
   }
}
 
to eventually exit the loop.
 
However, if instead of expected "ReadyToQuery" response server sends an
error and closes the connection (
say, backend terminated by administrator or with a proxy in place and
upstream PG server being dead) this loop
gets stuck in busy-loop.
 
My understanding is that transitions in pipelining state-machine go like
this:
1. Initial PQgetResult call gets here

https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2081
2. parseInput gets an error the server send

https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-protocol3.c#L216

and switches the state into PGASYNC_READY
3. PQgetResult continues and here

https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2126
advances the queue, so the Sync entry is gone, and switches into
PGASYNC_PIPELINE_IDLE later.
4. Next PQgetResult call gets here

https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#LL2110C4-L2110C26,
and pqPipelineProcessQueue switches into PGASYNC_IDLE here

https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L3084
 
Now we are stuck in the position where libpq considers the connection being
in CONNECTION_OK state 
(because no reads over half-closed socket have been issues), 
asyncStatus being PGASYNC_IDLE, pipeline being in PQ_PIPELINE_ABORTED state,

and the expected PGRES_PIPELINE_SYNC never came and never will (because
PGASYNC_IDLE, so PQgetResult returns NULL right away).

I am able to reproduce this behavior reliably via
https://pastebin.com/raw/4j3v3QzC,
linking against libpq5=15.3 and running the program against PostgreSQL
15.2.

Who is at fault here, is it libpq or me misunderstanding/misusing libpq?


Re: BUG #17948: libpq seems to misbehave in a pipelining corner case

От
Alvaro Herrera
Дата:
On 2023-Aug-05, Трофимов Иван wrote:

>    We've managed to reproduce client <-> server de-synchronization triggered
>    by this, which we are seeing in production.
>    Libpq considers '< 2TDCEZ' a sufficient response to '> BDESS', when
>    according to specification one more 'Z' is expected.
>     
>    A bit more context and a MRE:
>    https://github.com/itrofimow/libpq_protocol_desync

Does this patch fix the problems?  I think the sync loss is because
we're consuming elements from the command queue even when we shouldn't,
because we don't know for an ERROR whether we should do so or not.  What
this patch does is a bit of a hack, I think, but it does solve your test
case (and it doesn't break anything else in the tree) AFAICS.  I may
have to play some more with it before gaining confidence, though.

Thanks

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)

Вложения

Re: BUG #17948: libpq seems to misbehave in a pipelining corner case

От
Ivan Trofimov
Дата:
Thank you for looking into this, much appreciated.

Yes, I can confirm that the patch fixes both the busy-loop and 
client<->server desynchronization problem for me.



Re: BUG #17948: libpq seems to misbehave in a pipelining corner case

От
Alvaro Herrera
Дата:
On 2023-Dec-02, Ivan Trofimov wrote:

> Thank you for looking into this, much appreciated.
> 
> Yes, I can confirm that the patch fixes both the busy-loop and
> client<->server desynchronization problem for me.

Great, thanks for the confirmation.

Mulling it over while writing better comments for it, I realized that it
could still be somewhat brittle, and it's also confusing that part of
the logic is in PQgetResult and the rest in pqCommandQueueAdvance; so I
moved it all to the latter, which also lets me document it more clearly
in the comments.  I addressed the brittleness by changing what to test:
instead of just "it's an error, then do nothing if the queue has a
SYNC", we can verify that we actually have a matching SYNC in both
places.  This seems much better.

Patch v2 attached.  I'll be pushing this to 14+ tomorrow, unless
something ugly comes up.

Thanks for reporting this problem!

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)

Вложения