Re: postgres_fdw test timeouts

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: postgres_fdw test timeouts
Дата
Msg-id 8051993b-e7ae-42fc-4b57-ef4912d50035@gmail.com
обсуждение исходный текст
Ответ на Re: postgres_fdw test timeouts  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: postgres_fdw test timeouts  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
Hello Thomas,

03.12.2023 09:00, Thomas Munro wrote:
> On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've managed to reproduce the failure locally when running postgres_fdw_x/
>> regress in parallel (--num-processes 10). It reproduced for me on
>> on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
>> passed).
>>
>> I'm going to investigate this case within days. Maybe we could find a
>> better fix for the issue.
> Thanks.  One thing I can recommend to anyone trying to understand the
> change is that you view it with:

I think, I've found out what's going on here.
The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of
ExecAppendAsyncEventWait().
For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one also processed inside, through a recursive call to
ExecAppendAsyncEventWait():
  -> ExecAsyncNotify -> produce_tuple_asynchronously
-> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
-> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
-> ExecAppendAsyncEventWait.
Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset it's callback_pending yet),
and it leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():
     AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

This WaitEvent seemingly misses an event that we should get for that socket.
It's not that important to get noccured > 1 in
ExecAppendAsyncEventWait() to see the failure, it's enough to call
WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket
(I tried to exit from the WaitEventSetWaitBlock's new loop prematurely,
without touching occurred_events, returned_events on a second iteration of
the loop).

So it looks like we have the same issue with multiple event handles
associated with a single socket here.
And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).

[1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

Best regards,
Alexander



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Junwang Zhao
Дата:
Сообщение: Re: Make COPY format extendable: Extract COPY TO format implementations
Следующее
От: shveta malik
Дата:
Сообщение: Re: Synchronizing slots from primary to standby