Discussion: Converting pqsignal to void return
Nathan,
I just ran across this change when a bunch of CI's I run barfed.

https://github.com/postgres/postgres/commit/d4a43b283751b23d32bbfa1ecc2cad2d16e3dde9
The problem we are having in extension land is that we often run functions in external libraries that take a long time to return, and we would like to ensure that PgSQL users can cancel their queries, even when control has passed into those functions.
The way we have done it, historically, has been to take the return value of pqsignal(SIGINT, extension_signint_handler) and remember it, and then, inside extension_signint_handler, call the pgsql handler once we have done our own business.
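In code, the pattern has looked roughly like this (a simplified sketch; the flag name is just illustrative):

#include "postgres.h"
#include <signal.h>

static pqsigfunc pgsql_interrupt_handler = NULL;
static volatile sig_atomic_t extension_interrupt_requested = 0;

static void
extension_signint_handler(SIGNAL_ARGS)
{
	/* do our own business first... */
	extension_interrupt_requested = 1;
	/* ...then chain to the handler postgres had installed */
	if (pgsql_interrupt_handler)
		pgsql_interrupt_handler(postgres_signal_arg);
}

void
_PG_init(void)
{
	/* remember postgres' SIGINT handler, then install ours in front of it */
	pgsql_interrupt_handler = pqsignal(SIGINT, extension_signint_handler);
}

With pqsignal() now returning void, the assignment in _PG_init() no longer compiles, which is how the CI runs broke.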
It is possible we have been Doing It Wrong all this time, and would love some pointers on the right way to do this.
Alternatively, if we have a valid use case, it would be nice to have the return value from pqsignal() back.
Thanks!
Paul
On Wed, Jan 22, 2025 at 07:57:52AM -0800, Paul Ramsey wrote:
> I just ran across this change when a bunch of CI's I run barfed.

Oops, sorry.

> The problem we are having in extension land is that we often run
> functions in external libraries that take a long time to return, and we
> would like to ensure that PgSQL users can cancel their queries, even when
> control has passed into those functions.
>
> The way we have done it, historically, has been to take the return value
> of pqsignal(SIGINT, extension_signint_handler) and remember it, and then,
> inside extension_signint_handler, call the pgsql handler once we have
> done our own business.

I see a couple other projects that might be doing something similar [0].

> It is possible we have been Doing It Wrong all this time, and would love
> some pointers on the right way to do this.
>
> Alternatively, if we have a valid use case, it would be nice to have the
> return value from pqsignal() back.

I don't think you were doing it wrong before, but commit 3b00fdb did add
some weird race conditions to the return value, but only in rare cases that
I don't think apply to the case you shared.

I was tempted to suggest that you try using sigaction() with the second
"act" argument set to NULL to retrieve the original signal handler, but I
think there's a good chance you'd get our wrapper_handler() function (we
store the actual handlers in a separate array), which is no good.  So, I
should probably just revert d4a43b2.

An alternative approach could be to provide our own way to retrieve the
current handler, but that comes with its own race conditions, although
perhaps those are more obvious than the pqsignal() ones.  WDYT?

[0] https://codesearch.debian.net/search?q=%3D+pqsignal&literal=1

-- 
nathan
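For reference, the sigaction() query described above is the standard POSIX idiom of passing NULL for the new action, which only reads back the currently installed handler. A minimal sketch (in a backend this would typically report postgres' wrapper_handler() rather than an extension's original handler, as noted above):

#include <signal.h>
#include <stdio.h>

static void
show_current_sigint_handler(void)
{
	struct sigaction cur;

	/* act == NULL: change nothing, just read the currently installed action */
	if (sigaction(SIGINT, NULL, &cur) == 0)
		printf("current SIGINT handler: %p\n", (void *) cur.sa_handler);
}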
Hi,

On 2025-01-22 07:57:52 -0800, Paul Ramsey wrote:
> I just ran across this change when a bunch of CI's I run barfed.
>
> https://github.com/postgres/postgres/commit/d4a43b283751b23d32bbfa1ecc2cad2d16e3dde9
>
> The problem we are having in extension land is that we often run functions
> in external libraries that take a long time to return, and we would like to
> ensure that PgSQL users can cancel their queries, even when control has
> passed into those functions.
>
> The way we have done it, historically, has been to take the return value of
> pqsignal(SIGINT, extension_signint_handler) and remember it, and then,
> inside extension_signint_handler, call the pgsql handler once we have done
> our own business.

I think that's a really bad idea. For one, postgres might use signals in
different process types for different things. We are also working on not
using plain signals in a bunch of places.

It's also really hard to write correct signal handlers. E.g.:

> https://github.com/pramsey/pgsql-http/blob/master/http.c#L345

That signal handler is profoundly unsafe:

http_interrupt_handler(int sig)
{
	/* Handle the signal here */
	elog(DEBUG2, "http_interrupt_handler: sig=%d", sig);
	http_interrupt_requested = sig;
	pgsql_interrupt_handler(sig);
	return;
}

You're definitely not allowed to elog in a signal handler unless you take
*extraordinary* precautions (basically converting the normal execution path
to only perform async-signal-safe work). This signal handler can e.g.
corrupt the memory context used by elog.c, deadlock in the libc memory
allocator, break the FE/BE protocol if any client ever set
client_min_messages=debug2, jump out of a signal handler in case of an
allocation failure, and many other things.

> It is possible we have been Doing It Wrong all this time, and would love
> some pointers on the right way to do this.

All your interrupt handler is doing "for you" is setting
http_interrupt_requested, right? Why do you need that variable? Seems you
could just check postgres' QueryCancelPending? And perhaps ProcDiePending,
if you want to handle termination too.

Greetings,

Andres Freund
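A minimal sketch of what Andres suggests, checking postgres' own interrupt flags from a long-running extension loop rather than installing a SIGINT handler (process_chunk() is a made-up stand-in for the expensive library work):

#include "postgres.h"
#include "miscadmin.h"

/* stand-in for one unit of expensive library work */
static void
process_chunk(int i)
{
	(void) i;
}

static void
process_all_chunks(int nchunks)
{
	for (int i = 0; i < nchunks; i++)
	{
		/*
		 * Both flags are set by postgres' own signal handling: query
		 * cancellation sets QueryCancelPending, backend termination sets
		 * ProcDiePending.  Bail out and let the next CHECK_FOR_INTERRUPTS()
		 * in normal backend code raise the actual error.
		 */
		if (QueryCancelPending || ProcDiePending)
			break;

		process_chunk(i);
	}
}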
Hi,

On 2025-01-22 11:50:30 -0500, Andres Freund wrote:
> On 2025-01-22 07:57:52 -0800, Paul Ramsey wrote:
> > It is possible we have been Doing It Wrong all this time, and would love
> > some pointers on the right way to do this.
>
> All your interrupt handler is doing "for you" is setting
> http_interrupt_requested, right? Why do you need that variable? Seems you
> could just check postgres' QueryCancelPending? And perhaps ProcDiePending,
> if you want to handle termination too.

FWIW, the even better approach in this case probably would be to use curl in
a non-blocking manner. Then you can do all the blocking in a loop around
WaitLatchOrSocket (or, for slightly improved efficiency, create a wait event
and wait with WaitEventSetWait()) and react to interrupts with
CHECK_FOR_INTERRUPTS() like normal postgres code.

Greetings,

Andres Freund
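A sketch of that event-loop shape using curl's multi interface. For brevity it waits on the latch with a short timeout instead of waiting on curl's socket with WaitLatchOrSocket(), and it omits cleanup of the curl handles if CHECK_FOR_INTERRUPTS() errors out; the essential point is that interrupts are serviced between waits:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

#include <curl/curl.h>

/* Drive an already-configured CURLM handle to completion, honoring interrupts. */
static CURLMcode
perform_with_interrupts(CURLM *multi)
{
	int			running = 1;

	while (running)
	{
		CURLMcode	rc = curl_multi_perform(multi, &running);

		if (rc != CURLM_OK || !running)
			return rc;

		/* Sleep until the latch is set or a short timeout expires. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 100L,	/* milliseconds */
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);

		/* Query cancel and backend termination are serviced here. */
		CHECK_FOR_INTERRUPTS();
	}

	return CURLM_OK;
}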
> On Jan 22, 2025, at 8:50 AM, Andres Freund <andres@anarazel.de> wrote:
>
>> It is possible we have been Doing It Wrong all this time, and would love
>> some pointers on the right way to do this.
>
> All your interrupt handler is doing "for you" is setting
> http_interrupt_requested, right? Why do you need that variable? Seems you
> could just check postgres' QueryCancelPending? And perhaps ProcDiePending,
> if you want to handle termination too.

This is the Knowledge that was sitting right in front of me, but I could not
see.

From the road to Damascus,
Paul
Hi,

On 2025-01-22 10:22:45 -0600, Nathan Bossart wrote:
> On Wed, Jan 22, 2025 at 07:57:52AM -0800, Paul Ramsey wrote:
> > The problem we are having in extension land is that we often run
> > functions in external libraries that take a long time to return, and we
> > would like to ensure that PgSQL users can cancel their queries, even when
> > control has passed into those functions.
> >
> > The way we have done it, historically, has been to take the return value
> > of pqsignal(SIGINT, extension_signint_handler) and remember it, and then,
> > inside extension_signint_handler, call the pgsql handler once we have
> > done our own business.
>
> I see a couple other projects that might be doing something similar [0].
>
> [0] https://codesearch.debian.net/search?q=%3D+pqsignal&literal=1

Grouping by package I see three packages. All of these seem not to handle
backend termination, which certainly doesn't seem optimal - it e.g. means
that a fast shutdown won't really work.

1) postgis

For at least some of the cases it doesn't look trivial to convert to
checking QueryCancelPending/ProcDiePending because the signal handler calls
into GEOS and lwgeom.

2) pgsql-http

Also doesn't handle termination. Looks like it could trivially be converted
to checking QueryCancelPending/ProcDiePending.

3) timescaledb

This looks to be test-only code where the signal handler only prints out a
message. I'd assume it's not critical to log that message.

IOW, the only worrisome case here is postgis.

Given that we are working towards *not* relying on signals for a lot of
this, I wonder if we ought to support registering callbacks to be called on
receipt of query cancellation and backend termination. That then could e.g.
call GEOS_interruptRequest() as postgis does. That'd have a chance of
working in a threaded postgres too - unfortunately it looks like neither
lwgeom's nor GEOS's interrupt mechanisms are thread safe at this point.

It's worth noting that postgis ends up with a bunch of postgres-internals
knowledge due to its desire to handle signals:
https://github.com/postgis/postgis/blob/master/postgis/postgis_module.c#L51-L56

#ifdef WIN32
static void interruptCallback() {
	if (UNBLOCKED_SIGNAL_QUEUE())
		pgwin32_dispatch_queued_signals();
}
#endif

Which seems really rather undesirable.

But given how windows signals are currently delivered via the equivalent
code in CHECK_FOR_INTERRUPTS(), I don't really see an alternative at this
point :(.

Greetings,

Andres Freund
> On Jan 22, 2025, at 9:36 AM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-01-22 10:22:45 -0600, Nathan Bossart wrote:
>> On Wed, Jan 22, 2025 at 07:57:52AM -0800, Paul Ramsey wrote:
>>> The problem we are having in extension land is that we often run
>>> functions in external libraries that take a long time to return, and we
>>> would like to ensure that PgSQL users can cancel their queries, even when
>>> control has passed into those functions.
>>>
>>> The way we have done it, historically, has been to take the return value
>>> of pqsignal(SIGINT, extension_signint_handler) and remember it, and then,
>>> inside extension_signint_handler, call the pgsql handler once we have
>>> done our own business.
>>
>> I see a couple other projects that might be doing something similar [0].
>>
>> [0] https://codesearch.debian.net/search?q=%3D+pqsignal&literal=1
>
> Grouping by package I see three packages.

So many things for me to atone for! :)

> All of these seem not to handle backend termination, which certainly
> doesn't seem optimal - it e.g. means that a fast shutdown won't really
> work.
>
> 1) postgis
>
> For at least some of the cases it doesn't look trivial to convert to
> checking QueryCancelPending/ProcDiePending because the signal handler
> calls into GEOS and lwgeom.

We control both those libraries, so we just may need to change them up a
little to maybe pass in a callback for their internal check-for-interrupt
to call.

> 2) pgsql-http
>
> Also doesn't handle termination.
> Looks like it could trivially be converted to checking
> QueryCancelPending/ProcDiePending.

Going to fix this one first.

> 3) timescaledb

Not me!

> IOW, the only worrisome case here is postgis.
> Given that we are working towards *not* relying on signals for a lot of
> this, I wonder if we ought to support registering callbacks to be called
> on receipt of query cancellation and backend termination. That then could
> e.g. call GEOS_interruptRequest() as postgis does. That'd have a chance of
> working in a threaded postgres too - unfortunately it looks like neither
> lwgeom's nor GEOS's interrupt mechanisms are thread safe at this point.

I think callbacks are a good thing. Two of the libraries of interest to me
(curl, GDAL) I end up using the progress meter callback to wedge myself in
and force an interrupt as necessary. This seems like a common feature of
libraries with long-running work.

ATB,
P

> It's worth noting that postgis ends up with a bunch of postgres-internals
> knowledge due to its desire to handle signals:
> https://github.com/postgis/postgis/blob/master/postgis/postgis_module.c#L51-L56
>
> #ifdef WIN32
> static void interruptCallback() {
>	if (UNBLOCKED_SIGNAL_QUEUE())
>		pgwin32_dispatch_queued_signals();
> }
> #endif
>
> Which seems really rather undesirable.
>
> But given how windows signals are currently delivered via the equivalent
> code in CHECK_FOR_INTERRUPTS(), I don't really see an alternative at this
> point :(.
>
> Greetings,
>
> Andres Freund
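For curl, the progress-meter wedge Paul describes, reworked to check postgres' flags directly rather than a handler-set variable, might look roughly like this (illustrative names, not pgsql-http's actual code):

#include "postgres.h"
#include "miscadmin.h"

#include <curl/curl.h>

/* Returning non-zero makes curl abort with CURLE_ABORTED_BY_CALLBACK. */
static int
http_progress_callback(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
					   curl_off_t ultotal, curl_off_t ulnow)
{
	if (QueryCancelPending || ProcDiePending)
		return 1;
	return 0;
}

static void
setup_interruptible_transfer(CURL *curl)
{
	/* progress callbacks only fire if NOPROGRESS is turned off */
	curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);
	curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, http_progress_callback);
}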
Hi,

On 2025-01-22 09:50:18 -0800, Paul Ramsey wrote:
> > On Jan 22, 2025, at 9:36 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2025-01-22 10:22:45 -0600, Nathan Bossart wrote:
> >> On Wed, Jan 22, 2025 at 07:57:52AM -0800, Paul Ramsey wrote:
> >>> The problem we are having in extension land is that we often run
> >>> functions in external libraries that take a long time to return, and we
> >>> would like to ensure that PgSQL users can cancel their queries, even when
> >>> control has passed into those functions.
> >>>
> >>> The way we have done it, historically, has been to take the return value
> >>> of pqsignal(SIGINT, extension_signint_handler) and remember it, and then,
> >>> inside extension_signint_handler, call the pgsql handler once we have
> >>> done our own business.
> >>
> >> I see a couple other projects that might be doing something similar [0].
> >>
> >> [0] https://codesearch.debian.net/search?q=%3D+pqsignal&literal=1
> >
> > Grouping by package I see three packages.
>
> So many things for me to atone for! :)

Turns out creating good stuff almost invariably leads to creating not so
great parts of those good things :)

> > All of these seem not to handle backend termination, which certainly
> > doesn't seem optimal - it e.g. means that a fast shutdown won't really
> > work.
> >
> > 1) postgis
> >
> > For at least some of the cases it doesn't look trivial to convert to
> > checking QueryCancelPending/ProcDiePending because the signal handler
> > calls into GEOS and lwgeom.
>
> We control both those libraries, so we just may need to change them up a
> little to maybe pass in a callback for their internal check-for-interrupt
> to call.

Yea, makes sense. Looking at the callsites for the interrupt checking, it
doesn't look like a function call during interrupt checking will have a
meaningful performance impact.

> > 2) pgsql-http
> >
> > Also doesn't handle termination.
> > Looks like it could trivially be converted to checking
> > QueryCancelPending/ProcDiePending.
>
> Going to fix this one first.

Makes sense.

> > 3) timescaledb
>
> Not me!

:)

> > IOW, the only worrisome case here is postgis.
> > Given that we are working towards *not* relying on signals for a lot of
> > this, I wonder if we ought to support registering callbacks to be called
> > on receipt of query cancellation and backend termination. That then could
> > e.g. call GEOS_interruptRequest() as postgis does. That'd have a chance of
> > working in a threaded postgres too - unfortunately it looks like neither
> > lwgeom's nor GEOS's interrupt mechanisms are thread safe at this point.
>
> I think callbacks are a good thing. Two of the libraries of interest to me
> (curl, GDAL) I end up using the progress meter callback to wedge myself in
> and force an interrupt as necessary. This seems like a common feature of
> libraries with long-running work.

As I mentioned earlier, for something like curl, where afaict you'd want to
accept interrupts while waiting for network IO, I think it's better to
convert to non-blocking use of such APIs and to check for interrupts in the
event loop. It's been a while, but I have done that with curl, and it
wasn't particularly hard.

That approach imo is already cleaner today, and I think it's what we'll
eventually require for at least a bunch of architectural improvements:

- faster query execution (so that execution can switch to another part of
  the query tree while blocked on network IO)

- connection scalability (if there's no 1:1 mapping between sessions and
  processes/threads, threads have to switch between executing queries for
  different sessions)

- to benefit from asynchronous completion-based FE/BE network handling.

But of course all of these are quite a bit away.

Obviously event-based loops are not a viable approach for wanting to check
for interrupts in what I assume to be CPU-bound cases like GDAL/lwgeom. If
you have a "progress meter callback", I don't think there's really a benefit
in postgres providing a way to register a callback when receiving a
cancellation / termination event?

Greetings,

Andres Freund