Re: Core dump
From | Tom Lane |
---|---|
Subject | Re: Core dump |
Date | |
Msg-id | 27214.971381455@sss.pgh.pa.us |
In reply to | Core dump (Dan Moschuk <dan@freebsd.org>) |
Responses | Re: Core dump |
List | pgsql-hackers |
Dan Moschuk <dan@freebsd.org> writes:
> Sparc solaris 2.7 with postgres 7.0.2
> It seems to be reproducable, the server crashes on us at a rate of about
> every few hours.

That's a very bizarre backtrace. Why the multiple levels of recursive
entry to the quickdie() signal handler? I wonder if you aren't looking
at some kind of Solaris bug --- perhaps it's not able to cope with a
signal handler turning around and issuing new kernel calls.

The core file you are looking at is probably *not* from the original
failure, whatever that is. The sequence is probably

1. Some backend crashes for unknown reason, dumping core.

2. Postmaster observes messy death of a child, decides that mass suicide
   followed by restart is called for. Postmaster sends SIGUSR1 to all
   remaining backends to make them commit hara-kiri.

3. One or more other backends crash trying to obey postmaster's command.
   The corefile left for you to examine comes from whichever crashed last.

So there are at least two problems here, but we only have evidence of
the second one.

Since the problem is fairly reproducible, I'd suggest you temporarily
dike out the elog(NOTICE) call in quickdie() (in
src/backend/tcop/postgres.c), which will probably allow the backends to
honor SIGUSR1 without dumping core. Then you have a shot at seeing the
core from the original failure.

Assuming that this works (ie, you find a core that's not got anything to
do with quickdie()), I'd suggest an inquiry to Sun about whether their
signal handler logic hasn't got a problem with write() issued from
inside a signal handler. Meanwhile let us know what the new backtrace
shows.

regards, tom lane
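
As a rough illustration of the workaround suggested above, here is a minimal sketch in C of a SIGUSR1 handler with the notice removed, so that it does nothing except call _exit(), which POSIX lists as async-signal-safe. This is not the code in src/backend/tcop/postgres.c: the names quickdie_sketch, QUICKDIE_EXIT_CODE and demo_quickdie_exit_status are hypothetical, and the parent process here only stands in for the postmaster signalling one backend.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical exit code; chosen only so the demo can recognize a clean exit. */
#define QUICKDIE_EXIT_CODE 2

/*
 * Hypothetical stand-in for quickdie() with the elog(NOTICE) call diked out.
 * It issues no logging and no write(); _exit() is async-signal-safe and
 * skips atexit handlers and stdio flushing.
 */
static void quickdie_sketch(int signo)
{
    (void) signo;
    _exit(QUICKDIE_EXIT_CODE);
}

/*
 * Fork a child that plays the backend, send it SIGUSR1 the way the
 * postmaster would, and return the child's exit code.
 * Returns -1 if the child died from a signal or a system call failed.
 */
int demo_quickdie_exit_status(void)
{
    sigset_t block, old;
    pid_t pid;
    int status;

    /* Block SIGUSR1 before fork() so the signal cannot arrive before the
     * child has installed its handler; the child inherits this mask. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    if (pid == 0) {
        /* Child: install the handler, then atomically unblock and wait. */
        struct sigaction sa;
        sigset_t wait_mask = old;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = quickdie_sketch;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, NULL) != 0)
            _exit(127);

        sigdelset(&wait_mask, SIGUSR1);
        for (;;)
            sigsuspend(&wait_mask);
    }

    /* Parent: restore the original mask, signal the child, collect it. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (kill(pid, SIGUSR1) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling demo_quickdie_exit_status() from a small main() should return QUICKDIE_EXIT_CODE when the child honors the signal cleanly. A return of -1 would mean the child was itself killed by a signal, which is the kind of secondary crash described in step 3, or that a system call failed.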