Thread: Re: Backend core dump, Please help, Urgent!

Re: Backend core dump, Please help, Urgent!

From
Tom Lane
Date:
[ I'm redirecting this to pg-hackers since it doesn't look like an
interfaces problem ... ]

Matthew Hagerty <matthew@venux.net> writes:
> The app is written in PHP3-3.0.12 compiled as an Apache-1.3.6 module.  The
> OS is FreeBSD-3.1-Release with GCC-2.7.2.1 and a PostgreSQL-6.5.1 backend.

You should probably update to 6.5.3 for starters.  I'm not all that
hopeful that any of the bugfixes in 6.5.3 will fix this, but it'd be
pretty silly not to try it before investing a lot of work running down
the problem.

> The app went online on August 30, 1999 and has run without incident until
> yesterday.  At about 10am Dec, 13th, 1999 one of the programmers noticed
> that none of the forum messages would come up.  I went to the console of
> the server and saw this message about 10 or 15 times:

> Dec 13 10:35:56 redbox /kernel: pid 13856 (postgres), uid 1002: exited on
> signal 11 (core dumped)

> A ps -xa revealed about 15 or so postgres processes!  I did not think
> postgres made any child processes?!?!  So I stopped the web server and
> killed the main postgres process which seemed to kill all the other
> postgres processes.  I then tried to restart postgres and got an error
> message that was something like:

> IpcSemaphore??? - Key=54321234 Max

You could probably have recovered from this with "ipcclean" instead of a
reboot; it sounds like the postmaster failed to release the shared
semaphores before exiting.  Which it should have, unless maybe you used
kill -9 on it...
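
To illustrate why a separate cleanup step can be needed: System V semaphore sets belong to the kernel, not to the process that created them, so a set remains allocated after its creator exits unless something removes it explicitly. The following is a minimal sketch of that create/remove cycle, not PostgreSQL's own code; the helper names `create_sem_set` and `remove_sem_set` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Hypothetical helper: allocate a private set of nsems System V semaphores.
 * Returns the kernel's identifier for the set, or -1 on failure.
 */
static int create_sem_set(int nsems)
{
    assert(nsems > 0);
    return semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);
}

/*
 * Hypothetical helper: remove a semaphore set from the kernel.
 * If a process dies without doing this (for example after SIGKILL),
 * the set stays allocated until something else removes it.
 * Returns 0 on success, -1 on error.
 */
static int remove_sem_set(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

A process killed with `kill -9` gets no chance to run the equivalent of `remove_sem_set`, which would be consistent with the leftover semaphores described above.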

> At 9:36am on the 14th it happened again.  Again I was unable to recover the
> data and had to rebuild the data directory.  I did not delete the data
> directory this time, I just moved it to another directory so I would have
> it.  I also have the core dumps.  The only file I had to delete was the
> pg_log in the data directory.  What is this file?  It had grown to 700Meg
> in under 24 hours!!  Also, the core dump for the main app grew from 2.7Meg
> to over 80Meg while I was trying to dump the data.

Sure sounds like a corrupted-data problem.  Can you use gdb on the
corefiles to get a backtrace of what they were doing?

> My biggest hang-up is why all of a sudden?

Good question.  We'll probably know the answer when we find the problem.
        regards, tom lane


Re: [HACKERS] Re: Backend core dump, Please help, Urgent!

From
Tatsuo Ishii
Date:
> Sure sounds like a corrupted-data problem.  Can you use gdb on the
> corefiles to get a backtrace of what they were doing?
> 
> > My biggest hang-up is why all of a sudden?
> 
> Good question.  We'll probably know the answer when we find the problem.

Besides the possibility Tom has pointed out, there is a
known problem with 6.5.x on FreeBSD. It could be important here,
since it also results in a core dump. The problem occurs while a
backend is waiting to acquire a lock, so it tends to happen under
relatively heavy load (I observed the problem starting with 4
concurrent transactions). As far as I know, Linux does not have the
problem at all, but FreeBSD does. I'm not sure about other
platforms. Solaris does not seem to be affected.

You could try the following patch. It was made for 6.5.3, but you could
apply it to 6.5.1 or 6.5.2 as well. Current has already been fixed
with a more complex, longer-term solution, but I would prefer to
minimize the impact on existing releases. With that in mind, I have
made the patch as simple as possible.
--
Tatsuo Ishii

---------------------------- cut here -----------------------------
*** postgresql-6.5.3/src/backend/storage/lmgr/lock.c~    Sat May 29 15:14:42 1999
--- postgresql-6.5.3/src/backend/storage/lmgr/lock.c    Mon Dec 13 16:45:47 1999
***************
*** 940,946 ****
  {
      PROC_QUEUE *waitQueue = &(lock->waitProcs);
      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
!     char        old_status[64],
                  new_status[64];
  
      Assert(lockmethod < NumLockMethods);
--- 940,946 ----
  {
      PROC_QUEUE *waitQueue = &(lock->waitProcs);
      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
!     static char        old_status[64],
                  new_status[64];
  
      Assert(lockmethod < NumLockMethods);
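
The message does not say why adding `static` helps. In C, the only thing the change does is give the two buffers static storage duration, so they live for the whole life of the process instead of only for the duration of the call. One situation where that matters, shown here purely as an assumption and not as the actual PostgreSQL code path, is when other code keeps a pointer to the buffer after the function has returned. All names in this sketch (`set_status`, `begin_wait`, `status_is`) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for code that remembers a pointer to the status text. */
static const char *current_status = NULL;

static void set_status(const char *text)
{
    current_status = text;      /* stores the pointer, not a copy */
}

/*
 * Formats a status string and publishes a pointer to it.
 * Because new_status is static, the pointer is still valid after
 * begin_wait() returns. With a plain automatic array it would dangle.
 */
static void begin_wait(const char *lock_name)
{
    static char new_status[64];

    assert(lock_name != NULL);
    snprintf(new_status, sizeof(new_status), "%s waiting", lock_name);
    set_status(new_status);
}

/* Returns 1 if the published status equals expected, 0 otherwise. */
static int status_is(const char *expected)
{
    return current_status != NULL && strcmp(current_status, expected) == 0;
}
```

The trade-off is that a static buffer is shared by every call to the function, so the approach only suits code that is not re-entered while the buffer is in use.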


Re: [HACKERS] Re: Backend core dump, Please help, Urgent!

From
Matthew Hagerty
Date:
Thanks for the patch. I think I'm going to upgrade to FreeBSD-3.3 and 
PG-6.5.3 tonight. Will I still need the patch with 6.5.3? I'm also going 
to do a connection test on another offline server to see if it is indeed a 
load problem. I'll post the results if anyone is interested.
Thank you for the help, 
Matthew


At 08:43 PM 12/15/99 +0900, Tatsuo Ishii wrote:
>> Sure sounds like a corrupted-data problem.  Can you use gdb on the
>> corefiles to get a backtrace of what they were doing?
>> 
>> > My biggest hang-up is why all of a sudden?
>> 
>> Good question.  We'll probably know the answer when we find the problem.
>
>Besides the possibility Tom has pointed out, there is a
>known problem with 6.5.x on FreeBSD. It could be important here,
>since it also results in a core dump. The problem occurs while a
>backend is waiting to acquire a lock, so it tends to happen under
>relatively heavy load (I observed the problem starting with 4
>concurrent transactions). As far as I know, Linux does not have the
>problem at all, but FreeBSD does. I'm not sure about other
>platforms. Solaris does not seem to be affected.
>
>You could try the following patch. It was made for 6.5.3, but you could
>apply it to 6.5.1 or 6.5.2 as well. Current has already been fixed
>with a more complex, longer-term solution, but I would prefer to
>minimize the impact on existing releases. With that in mind, I have
>made the patch as simple as possible.
>--
>Tatsuo Ishii
>
>---------------------------- cut here -----------------------------
>*** postgresql-6.5.3/src/backend/storage/lmgr/lock.c~    Sat May 29 15:14:42 1999
>--- postgresql-6.5.3/src/backend/storage/lmgr/lock.c    Mon Dec 13 16:45:47 1999
>***************
>*** 940,946 ****
>  {
>      PROC_QUEUE *waitQueue = &(lock->waitProcs);
>      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
>!     char        old_status[64],
>                  new_status[64];
>  
>      Assert(lockmethod < NumLockMethods);
>--- 940,946 ----
>  {
>      PROC_QUEUE *waitQueue = &(lock->waitProcs);
>      LOCKMETHODTABLE *lockMethodTable = LockMethodTable[lockmethod];
>!     static char        old_status[64],
>                  new_status[64];
>  
>      Assert(lockmethod < NumLockMethods);