Thread: Problem after removal of exec(), help


Problem after removal of exec(), help

From
Bruce Momjian
Date:
Since the removal of exec(), Thomas has seen, and I have confirmed, that
if a backend crashes and the postmaster must reset the shared memory,
no backends can connect anymore.  One way to reproduce it is to run the
regression tests, which crash on their last test for an unrelated
reason.  After that, the postmaster will not start any more backends.

The error it gets is:

Failed Assertion("!((((unsigned long)nextElem) > ShmemBase)):", File: "shmqueue.c", Line: 83)
!((((unsigned long)nextElem) > ShmemBase)) (0) [No such file or directory]

In this case nextElem = ShmemBase, so it is not greater.  Removing the
Assert() still does not make things work, so there must be something
else.

Now, the problem is probably not at that exact spot, but somewhere
deeper.  There are two differences between the old non-exec() behavior
and new behavior.  In the old setup, the backend had all its global
variables initialized, while in the new no-exec case, they take the
global variable values from the postmaster.  Second, the old setup had
each backend attaching to the shared memory, while the new setup has
them inheriting the shared memory from the fork().

My guess is that there is something buggy about the reset code in
postmaster.c that was not resetting completely, but the initialization
of the global variables in the backend was masking the bug, or the
attach() operation did some extra work that we now need to do when
resetting the shared memory:

    static void
    reset_shared(short port)
    {
        ipc_key = port * 1000 + shmem_seq * 100;
        CreateSharedMemoryAndSemaphores(ipc_key);
        ActiveBackends = FALSE;
        shmem_seq += 1;
        if (shmem_seq >= 10)
            shmem_seq -= 10;
    }


I am stumped on this.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Problem after removal of exec(), help

From
dg@illustra.com (David Gould)
Date:
>
> Since the removal of exec(), Thomas has seen, and I have confirmed that
> if a backend crashes, and the postmaster must reset the shared memory,
> no backends can connect anymore.  One way to reproduce it is to run the
> regression tests, which on their last test will crash for an un-related
> reason.  However, it will not allow you to restart any more backends.
>
> The error it gets is:
>
> Failed Assertion("!((((unsigned long)nextElem) > ShmemBase)):", File: "shmqueue.c", Line: 83)
> !((((unsigned long)nextElem) > ShmemBase)) (0) [No such file or directory]
>
> In this case nextElem = ShmemBase, so it is not greater.  Removing the
> Assert() still does not make things work, so there must be something
> else.
>
> Now, the problem is probably not at that exact spot, but somewhere
> deeper.  There are two differences between the old non-exec() behavior
> and new behavior.  In the old setup, the backend had all its global
> variables initialized, while in the new no-exec case, they take the
> global variable values from the postmaster.  Second, the old setup had
> each backend attaching to the shared memory, while the new setup has
> them inheriting the shared memory from the fork().
>
> My guess is that there is something buggy about the reset code in
> postmaster.c that was not resetting completely, but the initialization
> of the global variables in the backend was masking the bug, or the
> attach() operation did some extra work that we now need to do when
> resetting the shared memory:
>
>     static void
>     reset_shared(short port)
>     {
>         ipc_key = port * 1000 + shmem_seq * 100;
>         CreateSharedMemoryAndSemaphores(ipc_key);
>         ActiveBackends = FALSE;
>         shmem_seq += 1;
>         if (shmem_seq >= 10)
>             shmem_seq -= 10;
>     }
>
>
> I am stumped on this.

No help here, but a request:

Could we have an option to do the fork()/exec() the old way as well as the
new sleek fork()-only way?  I want to do some performance testing under gprof
and want to be able to replace my postgres binary with a shell script that
saves the gmon.out file, e.g.:

#!/bin/sh
postgres.bin $*
mv gmon.out gmon.$$

This won't work unless an exec() is done.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

Re: [HACKERS] Problem after removal of exec(), help

From
Bruce Momjian
Date:
> No help here, but a request:
>
> Could we have an option to do the fork()/exec() the old way as well as the
> new sleek fork() only. I want to do some performance testing under gprof and
> want to be able to replace my postgres binary with a shell script to save
> the gmon.out file eg:
>
> #!/bin/sh
> postgres.bin $*
> mv gmon.out gmon.$$
>
> This won't work unless and exec() is done.

I am confused.  What doesn't work without the exec()?


Re: [HACKERS] Problem after removal of exec(), help

From
Goran Thyni
Date:
Bruce Momjian wrote:
>
> Since the removal of exec(), Thomas has seen, and I have confirmed that
> if a backend crashes, and the postmaster must reset the shared memory,
> no backends can connect anymore.  One way to reproduce it is to run the
> regression tests, which on their last test will crash for an un-related
> reason.  However, it will not allow you to restart any more backends.
>
> The error it gets is:
>
> Failed Assertion("!((((unsigned long)nextElem) > ShmemBase)):", File: "shmqueue.c", Line: 83)
> !((((unsigned long)nextElem) > ShmemBase)) (0) [No such file or directory]
>
> In this case nextElem = ShmemBase, so it is not greater.  Removing the
> Assert() still does not make things work, so there must be something
> else.
>
> Now, the problem is probably not at that exact spot, but somewhere
> deeper.  There are two differences between the old non-exec() behavior
> and new behavior.  In the old setup, the backend had all its global
> variables initialized, while in the new no-exec case, they take the
> global variable values from the postmaster.  Second, the old setup had
> each backend attaching to the shared memory, while the new setup has
> them inheriting the shared memory from the fork().

Bruce,
I have not looked into the specifics yet,
but I suggest looking into what is done when
the child process exits.
This (the pg_exit() et al.) caused some bugs
when we introduced unix domain sockets and
it is not the first place one looks. :-(

    regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

Re: [HACKERS] Problem after removal of exec(), help

From
Bruce Momjian
Date:
> Bruce,
> I have not look into it the specifics yet,
> but I suggest looking into what is done when
> the child process exits.
> This (the pg_exit() et al.) caused some bugs
> when we introduced unix domain sockets and
> it is not the first place one looks. :-(

Are you suggesting that there is some problem because one of the
backends did not exit cleanly?

Because the postmaster is resetting all shared memory at that point, I
am not sure that is the area.  I have been thinking about it, and my
guess is that one of the initialization functions (lock?) just appends
to the lock queue on restart instead of clearing it first.  A backend
that did exec() started out with clean global variables, which backends
now do not.


Re: [HACKERS] Problem after removal of exec(), help

From
dg@illustra.com (David Gould)
Date:
>
> > No help here, but a request:
> >
> > Could we have an option to do the fork()/exec() the old way as well as the
> > new sleek fork() only. I want to do some performance testing under gprof and
> > want to be able to replace my postgres binary with a shell script to save
> > the gmon.out file eg:
> >
> > #!/bin/sh
> > postgres.bin $*
> > mv gmon.out gmon.$$
> >
> > This won't work unless and exec() is done.
>
> I am confused.  What doesn't work without the exec()?

Replacing the postgres binary with a shell script that executes the real
postgres binary and then moves the gmon.out file out of the way.

    $ mv postgres postgres.bin
    $ cat > postgres
    #!/bin/sh
    postgres.bin $*
    mv gmon.out gmon.$$
    ^D
    $ postmaster ...
    $ psql template1

-dg

David Gould           dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software                      300 Lakeside Drive   Oakland, CA 94612
 - A child of five could understand this!  Fetch me a child of five.

Re: [HACKERS] Problem after removal of exec(), help

From
Bruce Momjian
Date:
> Replacing the postgres binary with a shell script that executes the real
> postgres binary and then moves the gmon.out file out of the way.
>
>     $ mv postgres postgres.bin
>     $cat > postgres
>     #!/bin/sh
>     postgres.bin $*
>     mv gmon.out gmon.$$
>     ^D
>     $ postmaster ...
>     $ psql template1
>

Ah, I see.  Re-enabling exec() is not a trivial job.  Perhaps you can put a
system("mv ...") call in the postmaster backend cleanup code.


Re: [HACKERS] Problem after removal of exec(), help

From
Bruce Momjian
Date:
>
> Since the removal of exec(), Thomas has seen, and I have confirmed that
> if a backend crashes, and the postmaster must reset the shared memory,
> no backends can connect anymore.  One way to reproduce it is to run the
> regression tests, which on their last test will crash for an un-related
> reason.  However, it will not allow you to restart any more backends.
>
> The error it gets is:
>
> Failed Assertion("!((((unsigned long)nextElem) > ShmemBase)):", File: "shmqueue.c", Line: 83)
> !((((unsigned long)nextElem) > ShmemBase)) (0) [No such file or directory]
>
> In this case nextElem = ShmemBase, so it is not greater.  Removing the
> Assert() still does not make things work, so there must be something
> else.
>
> Now, the problem is probably not at that exact spot, but somewhere
> deeper.  There are two differences between the old non-exec() behavior
> and new behavior.  In the old setup, the backend had all its global
> variables initialized, while in the new no-exec case, they take the
> global variable values from the postmaster.  Second, the old setup had
> each backend attaching to the shared memory, while the new setup has
> them inheriting the shared memory from the fork().

I have fixed the problem.  The problem was that InitMultiLevelLocks()
was not re-initializing the LockTable, which was still pointing to the
old shared memory lock structures, not the new ones in the new shared
memory segment.

I had to change InitMultiLevelLocks() so it always resets the memory, and
force LockTableInit() to set Numtables in lock.c to 1 on startup, so it
re-creates the LOCKTAB entries and they no longer point into the old
shared memory segment.

I also replaced on_exitpg() with the new on_proc_exit() and on_shmem_exit()
to clarify when these are being run, and removed quasi_exit().
