Discussion: Problem after removal of exec(), help
Since the removal of exec(), Thomas has seen, and I have confirmed, that if a backend crashes and the postmaster must reset the shared memory, no backends can connect anymore. One way to reproduce it is to run the regression tests, whose last test crashes for an unrelated reason. After that, the postmaster will not let you start any more backends.

The error it gets is:

    Failed Assertion("!((((unsigned long)nextElem) > ShmemBase)):", File: "shmqueue.c", Line: 83)
    !((((unsigned long)nextElem) > ShmemBase)) (0) [No such file or directory]

In this case nextElem = ShmemBase, so it is not greater. Removing the Assert() still does not make things work, so there must be something else.

Now, the problem is probably not at that exact spot, but somewhere deeper. There are two differences between the old exec() behavior and the new no-exec behavior. First, in the old setup each backend had all its global variables freshly initialized, while in the new no-exec case the backends take their global variable values from the postmaster. Second, the old setup had each backend attaching to the shared memory, while the new setup has them inheriting the shared memory from the fork().

My guess is that there is something buggy about the reset code in postmaster.c that was not resetting completely, and either the initialization of the global variables in the backend was masking the bug, or the attach() operation did some extra work that we now need to do when resetting the shared memory:

    static void
    reset_shared(short port)
    {
        ipc_key = port * 1000 + shmem_seq * 100;
        CreateSharedMemoryAndSemaphores(ipc_key);
        ActiveBackends = FALSE;
        shmem_seq += 1;
        if (shmem_seq >= 10)
            shmem_seq -= 10;
    }

I am stumped on this.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

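The first difference described above (a fork()-only backend starts with the postmaster's current global values rather than the compiled-in initializers) can be seen in a few lines of C. This is only a minimal sketch: `cached_state` and `child_view_after_fork()` are hypothetical names invented for the illustration and do not appear in the PostgreSQL source.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a global that the postmaster has modified. */
static int cached_state = 0;

/*
 * Set the global in the parent, fork(), and let the child report the value
 * it sees through its exit status. Returns the child's view (0..255),
 * or -1 on error.
 */
int
child_view_after_fork(int parent_value)
{
    pid_t pid;
    int   status;

    cached_state = parent_value;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* Child: sees the inherited copy, not the static initializer 0. */
        _exit(cached_state);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `child_view_after_fork(7)` yields 7. Had the child gone on to exec() a fresh binary, that program would have started again from its own initializers, which is the behavior the old setup relied on.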
> [...]
> I am stumped on this.

No help here, but a request:

Could we have an option to do the fork()/exec() the old way as well as the new sleek fork()-only way? I want to do some performance testing under gprof and want to be able to replace my postgres binary with a shell script that saves the gmon.out file, e.g.:

    #!/bin/sh
    postgres.bin $*
    mv gmon.out gmon.$$

This won't work unless an exec() is done.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

> No help here, but a request:
>
> Could we have an option to do the fork()/exec() the old way as well as the
> new sleek fork() only. I want to do some performance testing under gprof and
> want to be able to replace my postgres binary with a shell script to save
> the gmon.out file eg:
>
>     #!/bin/sh
>     postgres.bin $*
>     mv gmon.out gmon.$$
>
> This won't work unless an exec() is done.

I am confused. What doesn't work without the exec()?

--
Bruce Momjian

Bruce Momjian wrote:
> Since the removal of exec(), Thomas has seen, and I have confirmed that
> if a backend crashes, and the postmaster must reset the shared memory,
> no backends can connect anymore.
> [...]

Bruce,
I have not looked into the specifics yet, but I suggest looking into what is done when the child process exits. This (the pg_exit() et al.) caused some bugs when we introduced Unix domain sockets, and it is not the first place one looks. :-(

regards,
--
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

> Bruce,
> I have not looked into the specifics yet,
> but I suggest looking into what is done when
> the child process exits.
> This (the pg_exit() et al.) caused some bugs
> when we introduced unix domain sockets and
> it is not the first place one looks. :-(

Are you suggesting that there is some problem because one of the backends did not exit cleanly? Because the postmaster is resetting all shared memory at that point, I am not sure that is the area.

I have been thinking about it, and my guess is that one of the initialization functions (lock?) just appends to the lock queue on restart instead of clearing it first. A backend that does exec() starts out with clean global variables, which backends now do not.

--
Bruce Momjian

> > This won't work unless an exec() is done.
>
> I am confused. What doesn't work without the exec()?

Replacing the postgres binary with a shell script that executes the real postgres binary and then moves the gmon.out file out of the way:

    $ mv postgres postgres.bin
    $ cat > postgres
    #!/bin/sh
    postgres.bin $*
    mv gmon.out gmon.$$
    ^D
    $ postmaster ...
    $ psql template1

-dg

David Gould

> Replacing the postgres binary with a shell script that executes the real
> postgres binary and then moves the gmon.out file out of the way.
>
>     $ mv postgres postgres.bin
>     $ cat > postgres
>     #!/bin/sh
>     postgres.bin $*
>     mv gmon.out gmon.$$
>     ^D
>     $ postmaster ...
>     $ psql template1

Ah, I see. Re-enabling exec() is not a trivial job. Perhaps you can put a system("mv ...") call in the postmaster's backend cleanup code.

--
Bruce Momjian

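A rough sketch of that suggestion follows, using rename() in place of system("mv ...") so that no shell is involved. The function name `save_profile_output()` is hypothetical, and the sketch assumes that the caller's working directory is the one where the exited backend wrote its gmon.out; where to call it in the postmaster's cleanup path is left open.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Move the profile left behind by an exited backend to gmon.<pid>, so the
 * next backend does not overwrite it. Intended to be called by the parent
 * after it has reaped the child with the given pid.
 * Returns 0 on success and -1 on failure (for example, no gmon.out exists).
 */
int
save_profile_output(pid_t pid)
{
    char target[64];

    /* Build the per-backend file name, e.g. "gmon.12345". */
    snprintf(target, sizeof(target), "gmon.%ld", (long) pid);

    return rename("gmon.out", target);
}
```

The call has to happen only after the child has fully exited, because gprof-instrumented programs write gmon.out during normal process termination.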
> Since the removal of exec(), Thomas has seen, and I have confirmed that
> if a backend crashes, and the postmaster must reset the shared memory,
> no backends can connect anymore.
> [...]

I have fixed the problem. InitMultiLevelLocks() was not re-initializing the LockTable, which was still pointing to the old shared memory lock structures, not the new ones in the new shared memory segment.

I had to change InitMultiLevelLocks() so it always resets the memory, and force LockTableInit() to set Numtables in lock.c to 1 on startup, so it re-creates the LOCKTAB entries and they no longer point to the old shared memory.

I also replaced on_exitpg() with the new on_proc_exit() and on_shmem_exit() to clarify when these are being run, and removed quasi_exit().

--
Bruce Momjian
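The class of bug described in this fix (a process-local pointer into shared memory that is set up once and then survives a reset of the segment) can be reproduced in miniature. The sketch below is not the lock.c code: every name in it (`DemoLockTable`, `demo_reset_shared()`, and so on) is hypothetical, and two static structs stand in for successive shared memory segments.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a lock table living in shared memory. */
typedef struct
{
    int ntables;
} DemoLockTable;

static DemoLockTable  demo_segments[2];        /* two segment "generations" */
static int            demo_current = 0;        /* index of the live segment */
static DemoLockTable *demo_lock_table = NULL;  /* cached, process-local pointer */

/* Simulate the postmaster creating a fresh segment after a backend crash. */
void
demo_reset_shared(void)
{
    demo_current = 1 - demo_current;
    memset(&demo_segments[demo_current], 0, sizeof(DemoLockTable));
}

/* Buggy pattern: initialize only the first time, so the pointer goes stale. */
void
demo_init_locks_once(void)
{
    if (demo_lock_table == NULL)
        demo_lock_table = &demo_segments[demo_current];
}

/* Fixed pattern: always rebind to the live segment on (re)initialization. */
void
demo_init_locks_always(void)
{
    demo_lock_table = &demo_segments[demo_current];
}

/* Returns 1 if the cached pointer refers to the live segment, else 0. */
int
demo_lock_table_is_current(void)
{
    return demo_lock_table == &demo_segments[demo_current];
}
```

After `demo_reset_shared()`, `demo_init_locks_once()` leaves the pointer on the old segment, while `demo_init_locks_always()` rebinds it. With exec(), each new backend started with a NULL pointer, so the init-once pattern went unnoticed; with fork() only, every child inherits whatever the postmaster holds.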