Обсуждение: EINTR in ftruncate()
Nicola Contu reported two years ago to pgsql-general[1] that they were having sporadic query failures, because EINTR is reported on some system call. I have been told that the problem persists, though it is very infrequent. I propose the attached patch. Kyotaro proposed a slightly different patch which also protects write(), but I think that's not necessary. Thomas M. produced some more obscure theories for other things that could fail, but I think we should patch this problem first, which seems the most obvious one, and deal with others if and when they are reported. [1] https://www.postgresql.org/message-id/CAMTZZh2V%2B0wJVgSqTVvXUAVMduF57Uxubvvw58%3DkbOae%2B53%2BQQ%40mail.gmail.com -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Use it up, wear it out, make it do, or do without"
Вложения
Hi, On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote: > Nicola Contu reported two years ago to pgsql-general[1] that they were > having sporadic query failures, because EINTR is reported on some system > call. I have been told that the problem persists, though it is very > infrequent. I propose the attached patch. Kyotaro proposed a slightly > different patch which also protects write(), but I think that's not > necessary. What is the reason for the || ProcDiePending || QueryCancelPending bit? What if there's dsm operations intentionally done while QueryCancelPending? Greetings, Andres Freund
On 2022-Jul-01, Andres Freund wrote: > On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote: > > Nicola Contu reported two years ago to pgsql-general[1] that they were > > having sporadic query failures, because EINTR is reported on some system > > call. I have been told that the problem persists, though it is very > > infrequent. I propose the attached patch. Kyotaro proposed a slightly > > different patch which also protects write(), but I think that's not > > necessary. > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What > if there's dsm operations intentionally done while QueryCancelPending? That mirrors the test for the other block in that function, which was added by 63efab4ca139, whose commit message explains: Allow DSM allocation to be interrupted. Chris Travers reported that the startup process can repeatedly try to cancel a backend that is in a posix_fallocate()/EINTR loop and cause it to loop forever. Teach the retry loop to give up if an interrupt is pending. Don't actually check for interrupts in that loop though, because a non-local exit would skip some clean-up code in the caller. Thanks for looking! -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Hi, On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > On 2022-Jul-01, Andres Freund wrote: > > > On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote: > > > Nicola Contu reported two years ago to pgsql-general[1] that they were > > > having sporadic query failures, because EINTR is reported on some system > > > call. I have been told that the problem persists, though it is very > > > infrequent. I propose the attached patch. Kyotaro proposed a slightly > > > different patch which also protects write(), but I think that's not > > > necessary. > > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What > > if there's dsm operations intentionally done while QueryCancelPending? > > That mirrors the test for the other block in that function, which was > added by 63efab4ca139, whose commit message explains: > > Allow DSM allocation to be interrupted. > > Chris Travers reported that the startup process can repeatedly try to > cancel a backend that is in a posix_fallocate()/EINTR loop and cause it > to loop forever. Teach the retry loop to give up if an interrupt is > pending. Don't actually check for interrupts in that loop though, > because a non-local exit would skip some clean-up code in the caller. That whole approach seems quite wrong to me. At the absolute very least the code needs to check if interrupts are being processed in the current context before just giving up due to ProcDiePending || QueryCancelPending. I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather than the startup process signalling. There is an argument for allowing more things to be cancelled, but we'd need a retry loop for the !INTERRUPTS_CAN_BE_PROCESSED() case. Greetings, Andres Freund
Hi Chris, On 2022-07-01 13:29:44 -0700, Andres Freund wrote: > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > On 2022-Jul-01, Andres Freund wrote: > > > > > On 2022-07-01 17:41:05 +0200, Alvaro Herrera wrote: > > > > Nicola Contu reported two years ago to pgsql-general[1] that they were > > > > having sporadic query failures, because EINTR is reported on some system > > > > call. I have been told that the problem persists, though it is very > > > > infrequent. I propose the attached patch. Kyotaro proposed a slightly > > > > different patch which also protects write(), but I think that's not > > > > necessary. > > > > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What > > > if there's dsm operations intentionally done while QueryCancelPending? > > > > That mirrors the test for the other block in that function, which was > > added by 63efab4ca139, whose commit message explains: > > > > Allow DSM allocation to be interrupted. > > > > Chris Travers reported that the startup process can repeatedly try to > > cancel a backend that is in a posix_fallocate()/EINTR loop and cause it > > to loop forever. Teach the retry loop to give up if an interrupt is > > pending. Don't actually check for interrupts in that loop though, > > because a non-local exit would skip some clean-up code in the caller. > > That whole approach seems quite wrong to me. At the absolute very least the > code needs to check if interrupts are being processed in the current context > before just giving up due to ProcDiePending || QueryCancelPending. > > I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather > than the startup process signalling. > > There is an argument for allowing more things to be cancelled, but we'd need a > retry loop for the !INTERRUPTS_CAN_BE_PROCESSED() case. Chris, do you have any additional details about the machine that lead to this change? OS version, whether it might have been swapping, etc? I wonder if what happened is that posix_fallocate() used glibc's fallback implementation because the kernel was old enough to not support fallocate() for tmpfs. Looks like support for fallocate() for tmpfs was added in 3.5 ([1]). So e.g. a rhel 6 wouldn't have had that. Greetings, Andres Freund [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e2d12e22c59ce714008aa5266d769f8568d74eac
On Sat, Jul 2, 2022 at 9:06 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-07-01 13:29:44 -0700, Andres Freund wrote: > > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > > Allow DSM allocation to be interrupted. > > > > > > Chris Travers reported that the startup process can repeatedly try to > > > cancel a backend that is in a posix_fallocate()/EINTR loop and cause it > > > to loop forever. Teach the retry loop to give up if an interrupt is > > > pending. Don't actually check for interrupts in that loop though, > > > because a non-local exit would skip some clean-up code in the caller. > > > > That whole approach seems quite wrong to me. At the absolute very least the > > code needs to check if interrupts are being processed in the current context > > before just giving up due to ProcDiePending || QueryCancelPending. > > > > I'm very unconvinced this ought to be fixed in dsm_impl_posix_resize(), rather > > than the startup process signalling. I agree it's not great. It was a back-patchable bandaid in need of a better solution. > Chris, do you have any additional details about the machine that lead to this > change? OS version, whether it might have been swapping, etc? > > I wonder if what happened is that posix_fallocate() used glibc's fallback > implementation because the kernel was old enough to not support fallocate() > for tmpfs. Looks like support for fallocate() for tmpfs was added in 3.5 > ([1]). So e.g. a rhel 6 wouldn't have had that. With a quick test program on my Linux 5.10 kernel I see that an SA_RESTART signal handler definitely causes posix_fallocate() to return EINTR (can post trivial program). A drive-by look at the current/modern kernel source supports this: shmem_fallocate returns -EINTR directly (not -ERESTARTSYS, which seems to be the Linux-y way to say you want EINTR or restart as appropriate?), and it also undoes all partial progress too (not too surprising), which would explain why a perfectly timed machine gun stream of signals from our recovery conflict system can make an fallocate retry loop never terminate, for large enough sizes.
Hi, On 2022-07-02 09:52:33 +1200, Thomas Munro wrote: > On Sat, Jul 2, 2022 at 9:06 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-07-01 13:29:44 -0700, Andres Freund wrote: > > Chris, do you have any additional details about the machine that lead to this > > change? OS version, whether it might have been swapping, etc? > > > > I wonder if what happened is that posix_fallocate() used glibc's fallback > > implementation because the kernel was old enough to not support fallocate() > > for tmpfs. Looks like support for fallocate() for tmpfs was added in 3.5 > > ([1]). So e.g. a rhel 6 wouldn't have had that. > > With a quick test program on my Linux 5.10 kernel I see that an > SA_RESTART signal handler definitely causes posix_fallocate() to > return EINTR (can post trivial program). > > A drive-by look at the current/modern kernel source supports this: > shmem_fallocate returns -EINTR directly (not -ERESTARTSYS, which seems > to be the Linux-y way to say you want EINTR or restart as > appropriate?), and it also undoes all partial progress too (not too > surprising), which would explain why a perfectly timed machine gun > stream of signals from our recovery conflict system can make an > fallocate retry loop never terminate, for large enough sizes. Yea :( And even if we fix recovery to not do douse other processes in signals quite that badly, there are plenty other sources of signals that can arrive at a steady clip. So I think we need to do something to defuse this another way. Ideas: 1) do the fallocate in smaller chunks, thereby making it much more likely to complete between two signal deliveries 2) block signals while calling posix_fallocate(). That won't work for everything (e.g. rapid SIGSTOP/SIGCONT), but that's not something we'd send ourselves, so whatever. 3) 1+2 4) ? Greetings, Andres Freund
On 2022-Jul-01, Andres Freund wrote: > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > On 2022-Jul-01, Andres Freund wrote: > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What > > > if there's dsm operations intentionally done while QueryCancelPending? > > > > That mirrors the test for the other block in that function, which was > > added by 63efab4ca139, whose commit message explains: > That whole approach seems quite wrong to me. At the absolute very least the > code needs to check if interrupts are being processed in the current context > before just giving up due to ProcDiePending || QueryCancelPending. For the time being, I can just push the addition of the EINTR retry without testing ProcDiePending || QueryCancelPending. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El sudor es la mejor cura para un pensamiento enfermo" (Bardia)
Вложения
Hi, On 2022-07-04 13:07:50 +0200, Alvaro Herrera wrote: > On 2022-Jul-01, Andres Freund wrote: > > > On 2022-07-01 19:55:16 +0200, Alvaro Herrera wrote: > > > On 2022-Jul-01, Andres Freund wrote: > > > > > What is the reason for the || ProcDiePending || QueryCancelPending bit? What > > > > if there's dsm operations intentionally done while QueryCancelPending? > > > > > > That mirrors the test for the other block in that function, which was > > > added by 63efab4ca139, whose commit message explains: > > > That whole approach seems quite wrong to me. At the absolute very least the > > code needs to check if interrupts are being processed in the current context > > before just giving up due to ProcDiePending || QueryCancelPending. > > For the time being, I can just push the addition of the EINTR retry > without testing ProcDiePending || QueryCancelPending. I think we'd be better off disabling at least some signals during dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another variation of these problems. I haven't checked the source of ftruncate, but what Thomas dug up for fallocate makes it pretty clear that our current approach of just retrying again and again isn't good enough. It's a bit more obvious that it's a problem for fallocate, but I don't think it's worth having different solutions for the two. Greetings, Andres Freund
On 2022-Jul-05, Andres Freund wrote: > I think we'd be better off disabling at least some signals during > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another > variation of these problems. I haven't checked the source of ftruncate, but > what Thomas dug up for fallocate makes it pretty clear that our current > approach of just retrying again and again isn't good enough. It's a bit more > obvious that it's a problem for fallocate, but I don't think it's worth having > different solutions for the two. So what if we move the retry loop one level up? As in the attached. Here, if we get EINTR then we retry both syscalls. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No hay hombre que no aspire a la plenitud, es decir, la suma de experiencias de que un hombre es capaz"
Вложения
Hi, On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote: > On 2022-Jul-05, Andres Freund wrote: > > > I think we'd be better off disabling at least some signals during > > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another > > variation of these problems. I haven't checked the source of ftruncate, but > > what Thomas dug up for fallocate makes it pretty clear that our current > > approach of just retrying again and again isn't good enough. It's a bit more > > obvious that it's a problem for fallocate, but I don't think it's worth having > > different solutions for the two. > > So what if we move the retry loop one level up? As in the attached. > Here, if we get EINTR then we retry both syscalls. Doesn't really seem to address the problem to me. posix_fallocate() takes some time (~1s for 3GB roughly), so if we signal at a higher rate, we'll just get stuck. I hacked a bit on a test program from Thomas, and it's pretty clearly that with a 5ms timer interval you'll pretty much not make progress. It's much easier to get fallocate() to get interrupted than ftruncate(), but the latter gets interrupted e.g. when you do a strace in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in situations that are retried otherwise). So I think we need: 1) block most signals, 2) a retry loop *without* interrupt checks. Greetings, Andres Freund
On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote: > > On 2022-Jul-05, Andres Freund wrote: > > > > > I think we'd be better off disabling at least some signals during > > > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another > > > variation of these problems. I haven't checked the source of ftruncate, but > > > what Thomas dug up for fallocate makes it pretty clear that our current > > > approach of just retrying again and again isn't good enough. It's a bit more > > > obvious that it's a problem for fallocate, but I don't think it's worth having > > > different solutions for the two. > > > > So what if we move the retry loop one level up? As in the attached. > > Here, if we get EINTR then we retry both syscalls. > > Doesn't really seem to address the problem to me. posix_fallocate() > takes some time (~1s for 3GB roughly), so if we signal at a higher rate, > we'll just get stuck. > > I hacked a bit on a test program from Thomas, and it's pretty clearly > that with a 5ms timer interval you'll pretty much not make > progress. It's much easier to get fallocate() to get interrupted than > ftruncate(), but the latter gets interrupted e.g. when you do a strace > in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in > situations that are retried otherwise). > > So I think we need: 1) block most signals, 2) a retry loop *without* > interrupt checks. Yeah. I was also wondering about wrapping the whole function in PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the while (rc == EINTR) loop there (without the check for *Pending variables), only because otherwise when you attach a debugger and continue you'll get a spurious EINTR and it'll interfere with program execution. All blockable signals would be blocked *except* SIGQUIT, which means that fast shutdown/crash will still work. It seems nice to leave that way to interrupt it without resorting to SIGKILL.
Hi, On 2022-07-07 08:56:33 +1200, Thomas Munro wrote: > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote: > > So I think we need: 1) block most signals, 2) a retry loop *without* > > interrupt checks. > > Yeah. I was also wondering about wrapping the whole function in > PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the > while (rc == EINTR) loop there (without the check for *Pending > variables), only because otherwise when you attach a debugger and > continue you'll get a spurious EINTR and it'll interfere with program > execution. All blockable signals would be blocked *except* SIGQUIT, > which means that fast shutdown/crash will still work. It seems nice > to leave that way to interrupt it without resorting to SIGKILL. Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think it's fine to allow immediate shutdowns, but I don't think we should allow fast shutdowns to interrupt it. Greetings, Andres Freund
On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote: > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote: > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote: > > > So I think we need: 1) block most signals, 2) a retry loop *without* > > > interrupt checks. > > > > Yeah. I was also wondering about wrapping the whole function in > > PG_SETMASK(&BlockSig), PG_SETMASK(&UnBlockSig), but also leaving the > > while (rc == EINTR) loop there (without the check for *Pending > > variables), only because otherwise when you attach a debugger and > > continue you'll get a spurious EINTR and it'll interfere with program > > execution. All blockable signals would be blocked *except* SIGQUIT, > > which means that fast shutdown/crash will still work. It seems nice > > to leave that way to interrupt it without resorting to SIGKILL. > > Fast shutdown shouldn't use SIGQUIT - did you mean immediate? I think > it's fine to allow immediate shutdowns, but I don't think we should > allow fast shutdowns to interrupt it. Err, yeah, that one.
On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote: > > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote: > > > > So I think we need: 1) block most signals, 2) a retry loop *without* > > > > interrupt checks. Here's a draft patch that tries to explain all this in the commit message and comments. Even if we go with this approach now, I think it's plausible that we might want to reconsider this yet again one day, perhaps allocating memory with some future asynchronous infrastructure while still processing interrupts. It's not very nice to hold up recovery or ProcSignalBarrier for long operations. I'm a little unclear about ftruncate() here. I don't expect it to report EINTR in other places where we use it (ie to make a local file on a non-"slow device" smaller), because I expect that to be like read(), write() etc which we don't wrap in EINTR loops. Here you've observed EINTR when messing around with a debugger*. It seems inconsistent to put posix_fallocate() in an EINTR retry loop for the benefit of debuggers, but not ftruncate(). But perhaps that's good enough, on the theory that posix_fallocate(1GB) is a very large target and you have a decent chance of hitting it. Another observation while staring at that ftruncate(): It's entirely redundant on Linux, because we only ever call dsm_impl_posix_resize() to make the segment bigger. Before commit 3c60d0fa (v12) it was possible to resize a segment to be smaller with dsm_resize(), so you needed one or t'other depending on the requested size and we just called both, but dsm_resize() wasn't ever used AFAIK and didn't even work on all DSM implementations, among other problems, so we ripped it out. So... on master at least, we could also change the #ifdef to be either-or. While refactoring like that, I think we might as well also rearrange the code so that the wait event is reported also for other OSes, just in case it takes a long time. See 0002 patch. *It's funny that ftruncate() apparently doesn't automatically restart for ptrace SIGCONT on Linux according to your report, while poll() does according to my experiments, even though the latter is one of the never-restart functions (it doesn't on other OSes I hack on, and you feel the difference when debugging missing wakeup type bugs...). Random implementation details...
Вложения
On 2022-Jul-07, Thomas Munro wrote: > On Thu, Jul 7, 2022 at 9:05 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Jul 7, 2022 at 9:03 AM Andres Freund <andres@anarazel.de> wrote: > > > On 2022-07-07 08:56:33 +1200, Thomas Munro wrote: > > > > On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <andres@anarazel.de> wrote: > > > > > So I think we need: 1) block most signals, 2) a retry loop *without* > > > > > interrupt checks. > > Here's a draft patch that tries to explain all this in the commit > message and comments. I gave 0001 a try. I agree with the approach, and it seems to work as intended; or at least I couldn't break it under GDB. I didn't look at 0002, but I wish that the pgstat_report_wait calls were moved to cover both syscalls in a separate commit, just so we still have them even if we decide not to do 0002. Do you intend to get it pushed before the next minors? I have an interest in that happening. Thanks for working on this. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Here's a general engineering tip: if the non-fun part is too complex for you to figure out, that might indicate the fun part is too ambitious." (John Naylor) https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com
Hi, On 2022-07-07 17:58:10 +1200, Thomas Munro wrote: > Even if we go with this approach now, I think it's plausible that we > might want to reconsider this yet again one day, perhaps allocating > memory with some future asynchronous infrastructure while still > processing interrupts. It's not very nice to hold up recovery or > ProcSignalBarrier for long operations. I think the next improvement would be to do the fallocate in smaller chunks, and accept interrupts inbetween. > I'm a little unclear about ftruncate() here. I don't expect it to > report EINTR in other places where we use it (ie to make a local file > on a non-"slow device" smaller), because I expect that to be like > read(), write() etc which we don't wrap in EINTR loops. Here you've > observed EINTR when messing around with a debugger*. It seems > inconsistent to put posix_fallocate() in an EINTR retry loop for the > benefit of debuggers, but not ftruncate(). But perhaps that's good > enough, on the theory that posix_fallocate(1GB) is a very large target > and you have a decent chance of hitting it. > *It's funny that ftruncate() apparently doesn't automatically restart > for ptrace SIGCONT on Linux according to your report, while poll() > does according to my experiments, even though the latter is one of the > never-restart functions (it doesn't on other OSes I hack on, and you > feel the difference when debugging missing wakeup type bugs...). > Random implementation details... My test was basically while (true); {if (!ftruncate()) bleat(); if (!fallocate()) bleat();} with a SIGALRM triggering regularly in the background. The ftruncate failed, rarely, when attaching / detaching strace -p. Rarely enough that I had already written you an IM saying that I couldn't make it fail... So it's hard to be confident this can't otherwise be hit. With that caveat: I didn't hit it with a "real" file on a "real" filesystem in a few minutes of trying. But unsurprisingly it triggers when putting the file on a tmpfs. > @@ -362,6 +355,14 @@ dsm_impl_posix_resize(int fd, off_t size) > { > int rc; > > + /* > + * Block all blockable signals, except SIGQUIT. posix_fallocate() can run > + * for quite a long time, and is an all-or-nothing operation. If we > + * allowed SIGUSR1 to interrupt us repeatedly (for example, due to recovery > + * conflicts), the retry loop might never succeed. > + */ > + PG_SETMASK(&BlockSig); > + > /* Truncate (or extend) the file to the requested size. */ > rc = ftruncate(fd, size); > Hm - given that we've observed ftruncate failing with strace, and that stracing to find problems isn't insane, shouldn't we retry the ftruncate too? It's kind of obsoleted by your next patch, but not really, because it's not unconceivable that other OSs behave similarly? And IIUC you're planning to not backpatch 0002? > + pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE); (not new in this patch, just moved around) - FILL_ZERO_WRITE is imo an odd description of what's happening... In my understanding this isn't doing any writing, just allocating. But whatever... Greetings, Andres Freund
On Tue, Jul 12, 2022 at 5:45 AM Andres Freund <andres@anarazel.de> wrote: > Hm - given that we've observed ftruncate failing with strace, and that > stracing to find problems isn't insane, shouldn't we retry the ftruncate too? > It's kind of obsoleted by your next patch, but not really, because it's not > unconceivable that other OSs behave similarly? And IIUC you're planning to not > backpatch 0002? Yeah. Done, and pushed. 0002 not back-patched. > > + pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE); > > (not new in this patch, just moved around) - FILL_ZERO_WRITE is imo an odd > description of what's happening... In my understanding this isn't doing any > writing, just allocating. But whatever... True. Fixed in master.
On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Yeah. Done, and pushed. 0002 not back-patched. Hmm, there were a couple of hard to understand build farm failures. My first thought is that the signal mask stuff should only be done if IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask when reached from dsm_postmaster_startup(). Looking into that.
On Fri, Jul 15, 2022 at 1:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jul 15, 2022 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Yeah. Done, and pushed. 0002 not back-patched. > > Hmm, there were a couple of hard to understand build farm failures. > My first thought is that the signal mask stuff should only be done if > IsUnderPostmaster, otherwise it clobbers the postmaster's signal mask > when reached from dsm_postmaster_startup(). Looking into that. I pushed that change, and I hope that clears up the problems seen on eg curculio. It does raise the more general question of why it's safe to assume the signal mask is UnBlockSig on entry in regular backends. I expect it to be in released branches, but things get more complicated as we use DSM in more ways and it's not ideal to bet on that staying true. I checked that this throw-away assertion doesn't fail currently: if (IsUnderPostmaster) + { + sigset_t old; + sigprocmask(SIG_SETMASK, NULL, &old); + Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0); PG_SETMASK(&BlockSig); + } ... but now I'm wondering if we should be more defensive and possibly even save/restore the mask. Originally I discounted that because I thought I had to go through PG_SETMASK for portability reasons, but on closer inspection, I don't see any reason not to use sigprocmask directly in Unix-only code.
On 2022-Jul-15, Thomas Munro wrote: > I checked that this throw-away assertion doesn't fail currently: > > if (IsUnderPostmaster) > + { > + sigset_t old; > + sigprocmask(SIG_SETMASK, NULL, &old); > + Assert(memcmp(&old, &UnBlockSig, sizeof(UnBlockSig)) == 0); > PG_SETMASK(&BlockSig); > + } > > ... but now I'm wondering if we should be more defensive and possibly > even save/restore the mask. Yeah, that sounds better to me. > Originally I discounted that because I thought I had to go through > PG_SETMASK for portability reasons, but on closer inspection, I don't > see any reason not to use sigprocmask directly in Unix-only code. ISTM it would be cleaner to patch PG_SETMASK to have a second argument and to return the original mask if that's not NULL. This is more invasive, but there aren't that many callsites of that macro. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Thomas Munro <thomas.munro@gmail.com> writes: > ... but now I'm wondering if we should be more defensive and possibly > even save/restore the mask. +1, sounds like a more future-proof solution. > Originally I discounted that because I > thought I had to go through PG_SETMASK for portability reasons, but on > closer inspection, I don't see any reason not to use sigprocmask > directly in Unix-only code. Seems like the thing to do is to add a suitable operation to the pqsignal.h API. We could leave it unimplemented for now on Windows, and then worry what to do if we ever need it in that context. regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > ISTM it would be cleaner to patch PG_SETMASK to have a second argument > and to return the original mask if that's not NULL. This is more > invasive, but there aren't that many callsites of that macro. [ shoulda read your message before replying ] Given that this needs back-patched, I think changing PG_SETMASK is a bad idea: there might be outside callers. However, we could add another macro with the additional argument. PG_GET_AND_SET_MASK? regards, tom lane
On Fri, Jul 15, 2022 at 3:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > ISTM it would be cleaner to patch PG_SETMASK to have a second argument > > and to return the original mask if that's not NULL. This is more > > invasive, but there aren't that many callsites of that macro. > > [ shoulda read your message before replying ] > > Given that this needs back-patched, I think changing PG_SETMASK > is a bad idea: there might be outside callers. However, we could > add another macro with the additional argument. PG_GET_AND_SET_MASK? It's funny though, the reason we had PG_SETMASK in the first place is not for Windows. Ancient commit 47937403676 added that for long gone pre-POSIX systems like NeXTSTEP which only had single-argument sigsetmask(), not sigprocmask(). In general on Windows we're emulating POSIX signal interfaces with normal names like sigemptyset() etc, it just so happens that we chose to emulate that pre-standard sigsetmask() interface (as you complained about in the commit message for a65e0864). So why would I add another wrapper like PG_SETMASK and leave it unimplemented for now on Windows, when I could just use sigprocmask() directly and leave it unimplemented for now on Windows? The only reason I can think of for a wrapper is to provide a place to check the return code and ereport (panic?). That seems to be of limited value (how can it fail ... bad "how" value, or a sandbox denying some system calls, ...?). I did make sure to preserve the errno though; even though we're assuming these calls can't fail by long standing tradition, I didn't feel like additionally assuming that successful calls don't clobber errno. I guess, coded like this, it should also be safe to do it in the postmaster, but I think maybe we should leave it conditional, rather than relying on BlockSig being initialised and sane during early postmaster initialisation.
Вложения
Thomas Munro <thomas.munro@gmail.com> writes: > So why would I add another wrapper like PG_SETMASK and leave it > unimplemented for now on Windows, when I could just use sigprocmask() > directly and leave it unimplemented for now on Windows? Fair enough, I guess. No objection to this patch. (Someday we oughta go ahead and make our Windows signal API look more like POSIX, as I suggested back in 2015. I'm still not taking point on that, though.) regards, tom lane
On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > (Someday we oughta go ahead and make our Windows signal API look more > like POSIX, as I suggested back in 2015. I'm still not taking > point on that, though.) For the sigprocmask() part, here's a patch that passes CI. Only the SIG_SETMASK case is actually exercised by our current code, though. One weird thing about our PG_SETMASK() macro is that you couldn't have used its return value portably: on Windows we were returning the old mask (like sigsetmask(), which has no way to report errors), and on Unix we were returning 0/-1 (from setprocmask(), ie the error we never checked).
Вложения
Thomas Munro <thomas.munro@gmail.com> writes: > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> (Someday we oughta go ahead and make our Windows signal API look more >> like POSIX, as I suggested back in 2015. I'm still not taking >> point on that, though.) > For the sigprocmask() part, here's a patch that passes CI. Only the > SIG_SETMASK case is actually exercised by our current code, though. Passes an eyeball check, but I can't actually test it. regards, tom lane
On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> (Someday we oughta go ahead and make our Windows signal API look more > >> like POSIX, as I suggested back in 2015. I'm still not taking > >> point on that, though.) > > > For the sigprocmask() part, here's a patch that passes CI. Only the > > SIG_SETMASK case is actually exercised by our current code, though. > > Passes an eyeball check, but I can't actually test it. Thanks. Pushed. I'm not brave enough to try to write a replacement sigaction() yet, but it does appear that we could rip more ugliness and inconsistencies that way, eg sa_mask.
On Sat, Jul 16, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> (Someday we oughta go ahead and make our Windows signal API look more > > >> like POSIX, as I suggested back in 2015. I'm still not taking > > >> point on that, though.) > > > > > For the sigprocmask() part, here's a patch that passes CI. Only the > > > SIG_SETMASK case is actually exercised by our current code, though. > > > > Passes an eyeball check, but I can't actually test it. > > Thanks. Pushed. > > I'm not brave enough to try to write a replacement sigaction() yet, > but it does appear that we could rip more ugliness and inconsistencies > that way, eg sa_mask. Here's a draft patch that adds a minimal sigaction() implementation for Windows. It doesn't implement stuff we're not using, for sample sa_sigaction functions, but it has the sa_mask feature so we can harmonize the stuff that I believe you were talking about.
Вложения
On Wed, Aug 17, 2022 at 07:51:34AM +1200, Thomas Munro wrote: > Here's a draft patch that adds a minimal sigaction() implementation > for Windows. It doesn't implement stuff we're not using, for sample > sa_sigaction functions, but it has the sa_mask feature so we can > harmonize the stuff that I believe you were talking about. Did you see that this paniced ? https://cirrus-ci.com/task/4975957546106880 https://api.cirrus-ci.com/v1/artifact/task/4975957546106880/testrun/build/testrun/recovery/027_stream_regress/log/027_stream_regress_standby_1.log 2022-09-30 09:13:03.496 GMT [7312][startup] PANIC: hash_xlog_split_allocate_page: failed to acquire cleanup lock 2022-09-30 09:13:03.496 GMT [7312][startup] CONTEXT: WAL redo at 0/7AF6FA8 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket 63,meta_page_masks_updated F, issplitpoint_changed F; blkref #0: rel 1663/16384/23784, blk 45; blkref #1: rel 1663/16384/23784,blk 78; blkref #2: rel 1663/16384/23784, blk 0
Hi, On 2022-09-30 13:53:45 -0500, Justin Pryzby wrote: > On Wed, Aug 17, 2022 at 07:51:34AM +1200, Thomas Munro wrote: > > Here's a draft patch that adds a minimal sigaction() implementation > > for Windows. It doesn't implement stuff we're not using, for sample > > sa_sigaction functions, but it has the sa_mask feature so we can > > harmonize the stuff that I believe you were talking about. > > Did you see that this paniced ? > > https://cirrus-ci.com/task/4975957546106880 > https://api.cirrus-ci.com/v1/artifact/task/4975957546106880/testrun/build/testrun/recovery/027_stream_regress/log/027_stream_regress_standby_1.log > > 2022-09-30 09:13:03.496 GMT [7312][startup] PANIC: hash_xlog_split_allocate_page: failed to acquire cleanup lock > 2022-09-30 09:13:03.496 GMT [7312][startup] CONTEXT: WAL redo at 0/7AF6FA8 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket 63,meta_page_masks_updated F, issplitpoint_changed F; blkref #0: rel 1663/16384/23784, blk 45; blkref #1: rel 1663/16384/23784,blk 78; blkref #2: rel 1663/16384/23784, blk 0 This looks like broken code in hash, independent of any recent changes: https://www.postgresql.org/message-id/20220817193032.z35vdjhpzkgldrd3%40awork3.anarazel.de Greetings, Andres Freund
On Wed, Aug 17, 2022 at 7:51 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Jul 16, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Jul 16, 2022 at 1:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Thomas Munro <thomas.munro@gmail.com> writes: > > > > On Fri, Jul 15, 2022 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >> (Someday we oughta go ahead and make our Windows signal API look more > > > >> like POSIX, as I suggested back in 2015. I'm still not taking > > > >> point on that, though.) > > > > > > > For the sigprocmask() part, here's a patch that passes CI. Only the > > > > SIG_SETMASK case is actually exercised by our current code, though. > > > > > > Passes an eyeball check, but I can't actually test it. > > > > Thanks. Pushed. > > > > I'm not brave enough to try to write a replacement sigaction() yet, > > but it does appear that we could rip more ugliness and inconsistencies > > that way, eg sa_mask. > > Here's a draft patch that adds a minimal sigaction() implementation > for Windows. It doesn't implement stuff we're not using, for sample > sa_sigaction functions, but it has the sa_mask feature so we can > harmonize the stuff that I believe you were talking about. Pushed. As discussed before, a much better idea would probably be to use latch-based wakeup instead of putting postmaster logic in signal handlers, but in the meantime this gets rid of the extra Windows-specific noise.