Re: postgres_fdw test timeouts
От | Alexander Lakhin |
---|---|
Тема | Re: postgres_fdw test timeouts |
Дата | |
Msg-id | 8051993b-e7ae-42fc-4b57-ef4912d50035@gmail.com обсуждение исходный текст |
Ответ на | Re: postgres_fdw test timeouts (Thomas Munro <thomas.munro@gmail.com>) |
Ответы |
Re: postgres_fdw test timeouts
|
Список | pgsql-hackers |
Hello Thomas, 03.12.2023 09:00, Thomas Munro wrote: > On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> I've managed to reproduce the failure locally when running postgres_fdw_x/ >> regress in parallel (--num-processes 10). It reproduced for me on >> on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations >> passed). >> >> I'm going to investigate this case within days. Maybe we could find a >> better fix for the issue. > Thanks. One thing I can recommend to anyone trying to understand the > change is that you view it with: I think, I've found out what's going on here. The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of ExecAppendAsyncEventWait(). For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop, where ExecAsyncNotify() is called for the first AsyncRequest, but the second one also processed inside, through a recursive call to ExecAppendAsyncEventWait(): -> ExecAsyncNotify -> produce_tuple_asynchronously -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend -> ExecAppendAsyncEventWait. Here we get into the first loop and call ExecAsyncConfigureWait() for the second AsyncRequest (because we haven't reset it's callback_pending yet), and it leads to creating another WaitEvent for the second socket inside postgresForeignAsyncConfigureWait(): AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ... This WaitEvent seemingly misses an event that we should get for that socket. It's not that important to get noccured > 1 in ExecAppendAsyncEventWait() to see the failure, it's enough to call WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket (I tried to exit from the WaitEventSetWaitBlock's new loop prematurely, without touching occurred_events, returned_events on a second iteration of the loop). So it looks like we have the same issue with multiple event handles associated with a single socket here. And v2-0003-Redesign-Windows-socket-event-management.patch from [1] "surprisingly" helps in this case as well (I could not see a failure for 100 iterations of 10 tests in parallel). [1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com Best regards, Alexander
В списке pgsql-hackers по дате отправления: