Обсуждение: Re: convert libpgport's pqsignal() to a void function
Nathan Bossart <nathandbossart@gmail.com> writes: > Thanks to commit 9a45a89, legacy-pqsignal.c now has its own dedicated > extern for pqsignal(), which decouples it enough that we can follow through > with changing libpqport's pqsignal() to a void function. > Thoughts? LGTM, although I don't know enough about Windows to know if the "== SIG_ERR" test in that path is correct. regards, tom lane
On Tue, Jan 14, 2025 at 10:02:46PM -0500, Tom Lane wrote: > LGTM, although I don't know enough about Windows to know if the > "== SIG_ERR" test in that path is correct. It's apparently not [0]. :( My guess is that this has something to do with redefining SIG_ERR in win32_port.h. We might be able to use push_macro/pop_macro to keep the old value around, but at the moment I'm leaning towards just removing the assertion in that path. [0] https://cirrus-ci.com/task/6237809813487616 -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > On Tue, Jan 14, 2025 at 10:02:46PM -0500, Tom Lane wrote: >> LGTM, although I don't know enough about Windows to know if the >> "== SIG_ERR" test in that path is correct. > It's apparently not [0]. :( Bleah. > My guess is that this has something to do with redefining SIG_ERR in > win32_port.h. We might be able to use push_macro/pop_macro to keep the old > value around, but at the moment I'm leaning towards just removing the > assertion in that path. I wonder why we redefine those values? But I tend to agree that just removing the test is sufficient for now. Given the lack of failure checks in the existing code, and the lack of trouble reports suggesting any problem, it's hard to muster enthusiasm for spending a lot of effort on this. regards, tom lane
On Thu, Jan 16, 2025 at 8:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Tue, Jan 14, 2025 at 11:08:05PM -0500, Tom Lane wrote: > > Nathan Bossart <nathandbossart@gmail.com> writes: > >> My guess is that this has something to do with redefining SIG_ERR in > >> win32_port.h. We might be able to use push_macro/pop_macro to keep the old > >> value around, but at the moment I'm leaning towards just removing the > >> assertion in that path. > > > > I wonder why we redefine those values? > > I wondered the same. Those redefines have been there since commit 5049196, > but I haven't been able to find any real discussion in the archives about > it. Maybe I will bug Magnus about it sometime, in case he happens to > remember the reason. My guess would be: perhaps some ancient version of MinGW didn't define them? They're defined by MinGW and native signal.h now and they have the same values, so we should remove them I think. Assertion failed: 0, file ../src/port/pqsignal.c, line 147 Could be due to calling native signal() with a signal number other than the 6 values required to work by the C standard?
On Thu, Jan 16, 2025 at 08:21:08AM +1300, Thomas Munro wrote: > On Thu, Jan 16, 2025 at 8:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> On Tue, Jan 14, 2025 at 11:08:05PM -0500, Tom Lane wrote: >> > I wonder why we redefine those values? >> >> I wondered the same. Those redefines have been there since commit 5049196, >> but I haven't been able to find any real discussion in the archives about >> it. Maybe I will bug Magnus about it sometime, in case he happens to >> remember the reason. > > My guess would be: perhaps some ancient version of MinGW didn't define > them? They're defined by MinGW and native signal.h now and they have > the same values, so we should remove them I think. Okay. If nothing else, it'd be interesting to see what the buildfarm thinks. > Assertion failed: 0, file ../src/port/pqsignal.c, line 147 > > Could be due to calling native signal() with a signal number other > than the 6 values required to work by the C standard? Looking closer, that probably makes more sense than my SIG_ERR redefinition theory. If that assertion is getting hit, that means signal() _is_ returning SIG_ERR (either the system one or our redefined version), and it looks like it's pretty common to use -1 for SIG_ERR. That'd only affect Windows frontend programs, but it still sounds scary. I'll try getting more details about the error with some custom cfbot runs. -- nathan
On Wed, Jan 15, 2025 at 01:47:18PM -0600, Nathan Bossart wrote: > On Thu, Jan 16, 2025 at 08:21:08AM +1300, Thomas Munro wrote: >> Could be due to calling native signal() with a signal number other >> than the 6 values required to work by the C standard? > > Looking closer, that probably makes more sense than my SIG_ERR redefinition > theory. If that assertion is getting hit, that means signal() _is_ > returning SIG_ERR (either the system one or our redefined version), and it > looks like it's pretty common to use -1 for SIG_ERR. That'd only affect > Windows frontend programs, but it still sounds scary. I'll try getting > more details about the error with some custom cfbot runs. I think this is what's happening. cfbot's test failures are caused by initdb's setup_signals(). The call to pqsignal(SIGHUP, trapsig) seems to fail because SIGHUP isn't a real signal number, just something that's made up in win32_port.h. This SIGHUP definition was added by commit 12c9423 in May 2003, then the pqsignal(SIGHUP, ...) call was added in initdb by commit 279598b in November 2003, but it might not have been broken at that time because it doesn't look like initdb.c included the SIGHUP #define. In any case, I think this has been broken since at least commit ed9b360 (November 2017), if not earlier. Perhaps we should surround all those extra signal #defines in win32_port.h with an #ifndef FRONTEND. initdb seems to be good about avoiding pqsignal() calls if the signal doesn't exist, but I wouldn't be surprised if there are other frontend programs that are not so cautious. I'll give it a try on cfbot and see what breaks... -- nathan
On Thu, Jan 16, 2025 at 8:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Thu, Jan 16, 2025 at 08:21:08AM +1300, Thomas Munro wrote: > > Could be due to calling native signal() with a signal number other > > than the 6 values required to work by the C standard? > > Looking closer, that probably makes more sense than my SIG_ERR redefinition > theory. If that assertion is getting hit, that means signal() _is_ > returning SIG_ERR (either the system one or our redefined version), and it > looks like it's pretty common to use -1 for SIG_ERR. That'd only affect > Windows frontend programs, but it still sounds scary. I'll try getting > more details about the error with some custom cfbot runs. Windows barely has signals, so I don't think it's too scary... SIGINT works but surprisingly runs the handler in another thread when you press ^C, SIGABRT, SIGFPE, SIGSEGV presumably have the obvious synchronous/exception-trapping implementations but we aren't even trying to catch those*, and SIGTERM, SIGILL are pro forma, never generated by the system. We've basically just invented more pro forma ones that will also never be generated except in the backend's entirely separate fake signal system that I hope to remove. +1 for your idea of not defining them at all outside the backend, it's just confusing noise. The second surprising thing, unless you're an armchair Unix archeologist, is that all but SIGFPE revert to SIG_DFL when they fire, like POSIX's SA_RESETHAND mode. I don't know if it also has SA_NODEFER behaviour (a combination of behaviours seen in some old Unixen of yestermillennium: your handler had to keep reinstalling itself, cf initdb.c:trapsig, but for a brief window the default process-terminating-core-dumping-nasal-daemon behaviour was installed and there was nothing you could do about that race; the BSD crew fixed that mistake a long time ago, everyone does it that way now, and POSIX sigaction() made those policies explicit and is the recommended replacement, but to this day the C standard has just signal() with undefined semantics and no requirement that any of it even work). *Even in the backend we don't catch Windows' native SIGFPE AFAICS, so I guess those must exit instead of being converted to an ERROR by FloatExceptionHandler. I wonder if there is a way to reach that condition from a SQL query.
On Thu, Jan 16, 2025 at 12:47 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Thu, Jan 16, 2025 at 11:07:41AM +1300, Thomas Munro wrote: > > +1 for your idea of not defining them at all outside the backend, it's > > just confusing noise. > > I tried that, but these extra signals are needed even in the frontend for > pgkill(), etc. Right, of course. (In an unconf in Vancouver, a group of us were plotting that there should be a more explicit control socket to talk to the postmaster, on all OSes, to get current status, recovery progress, etc, and also support shutdown commands more explicitly. If you separately turn all the backend IPC signals into interrupts (CF 5118 and further work like that), and these cluster-level ones you just mentioned into commands in the hypothetical new cluster control protocol, then you don't need fake signals for Windows anymore, but apparently I got ahead of myself with that comment :-)) > My next thought was to simply ignore signal() errors for > the extra signals, but I don't think that's very nice because it just masks > broken code. Finally, I settled on modifying initdb's setup_signals() to > use WIN32 checks instead of checking for the signals themselves. This is > how it's done elsewhere, and this is apparently all that's needed to make > cfbot's Windows run happy. Cool. > I've also attached a 0002 that removes the redefinitions of SIG_ERR and > friends. That looks good on cfbot, but we'll see what the buildfarm has to > say... Ditto.
I've now committed all of this. I ended up finding a couple other frontend programs that called pqsignal() with an invalid signal number on Windows, so I fixed those as well. AFAICT the reason I didn't catch them in my earlier testing is because they aren't tested! I'll keep an eye on the buildfarm for any problems with the SIG_* redefinition removal, too. Thanks for reviewing and for sharing lots of context about this stuff. -- nathan