Discussion: postgres 9.0 crash when bringing up hot standby
Hello,
OS level = AIX 5.3 ML-8
Postgres version = 9.0 beta-4
I’m testing “hot standby” using “streaming WAL records”. On trying to bring up the hot standby, I see the following error in the log:
LOG: database system was interrupted; last known up at 2010-08-05 14:46:36 EDT
LOG: entering standby mode
LOG: restored log file "000000010000000000000007" from archive
LOG: redo starts at 0/7000020
LOG: consistent recovery state reached at 0/8000000
LOG: database system is ready to accept read only connections
cp: /pgarclog/pg1/000000010000000000000008: A file or directory in the path name does not exist.
LOG: WAL receiver process (PID 1073206) was terminated by signal 11
LOG: terminating any other active server processes
There is a core dump. The debugger indicates the crash sequence as follows:
(dbx) where
_alloc_initial_pthread(??) at 0x90000000049567c
__pth_init(??) at 0x900000000493ba4
uload(??, ??, ??, ??, ??, ??, ??, ??) at 0x9fffffff0001954
load_64.load(??, ??, ??) at 0x90000000004686c
loadAndInit() at 0x90000000047bd7c
dlopen(??, ??) at 0x90000000011cc4c
internal_load_library(libname = "/apps/pg_9.0_b4/lib/postgresql/libpqwalreceiver.so"), line 234 in "dfmgr.c"
load_file(filename = "libpqwalreceiver", restricted = '\0'), line 156 in "dfmgr.c"
WalReceiverMain(), line 248 in "walreceiver.c"
AuxiliaryProcessMain(argc = 2, argv = 0x0fffffffffffa8b8), line 428 in "bootstrap.c"
StartChildProcess(type = WalReceiverProcess), line 4405 in "postmaster.c"
sigusr1_handler(postgres_signal_arg = 30), line 4227 in "postmaster.c"
__fd_select(??, ??, ??, ??, ??) at 0x90000000011805c
postmaster.select(__fds = 5, __readlist = 0x0fffffffffffd0a8, __writelist = (nil), __exceptlist = (nil), __timeout = 0x0ffffffffffff0c0), line 229 in "time.h"
unnamed block in ServerLoop(), line 1391 in "postmaster.c"
unnamed block in ServerLoop(), line 1391 in "postmaster.c"
ServerLoop(), line 1391 in "postmaster.c"
PostmasterMain(argc = 1, argv = 0x00000001102aa4b0), line 1092 in "postmaster.c"
main(argc = 1, argv = 0x00000001102aa4b0), line 188 in "main.c"
Any pointers on how to resolve the issue will be much appreciated.
Thanks.
Alanoly Andrews (alanolya@invera.com)
Senior Software Engineer
Invera Inc.
Montreal, QC
This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately.
Ce courriel est confidentiel et protégé. L'expéditeur ne renonce pas aux droits et obligations qui s'y rapportent. Toute diffusion, utilisation ou copie de ce message ou des renseignements qu'il contient par une personne autre que le (les) destinataire(s) désigné(s) est interdite. Si vous recevez ce courriel par erreur, veuillez m'en aviser immédiatement, par retour de courriel ou par un autre moyen.
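The `cp` message in the log above names the WAL segment that follows the one just restored. Here is a minimal sketch of how a segment file name relates to a log position, assuming the default 16 MB WAL segment size and the 9.0-era naming scheme (timeline, log id and segment number, each written as 8 hex digits). The function name `wal_file_name` is illustrative and is not a PostgreSQL API.

```python
# Assumption: default 16 MB WAL segments; a build with a non-default size would differ.
WAL_SEG_SIZE = 16 * 1024 * 1024


def wal_file_name(timeline: int, lsn: str) -> str:
    """Map an LSN such as '0/7000020' to the 24-character WAL segment file name."""
    log_id_hex, offset_hex = lsn.split("/")
    log_id = int(log_id_hex, 16)
    # The segment number is the byte offset within the log divided by the segment size.
    segment = int(offset_hex, 16) // WAL_SEG_SIZE
    return f"{timeline:08X}{log_id:08X}{segment:08X}"


if __name__ == "__main__":
    print(wal_file_name(1, "0/7000020"))  # position where redo starts
    print(wal_file_name(1, "0/8000000"))  # position of the consistent recovery state
```

With the values from the log, the first call gives the restored file 000000010000000000000007 and the second gives 000000010000000000000008, the file `cp` could not find. That would be consistent with the standby having replayed everything available in the archive and then starting the WAL receiver, which is the process that crashed.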
On Fri, Aug 6, 2010 at 10:10 PM, Alanoly Andrews <alanolya@invera.com> wrote:
> I’m testing “hot standby” using “streaming WAL records”. On trying to bring
> up the hot standby, I see the following error in the log:

Thanks for the report!

> [server log and dbx backtrace snipped]
>
> Any pointers on how to resolve the issue will be much appreciated.

Sorry, I have no idea what's wrong :(
Is the simple LOAD command successful on your AIX?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Thanks. Yes, the LOAD command does work, on another database cluster on the same AIX machine.

-----Original Message-----
From: Fujii Masao [mailto:masao.fujii@gmail.com]
Sent: Friday, August 06, 2010 10:31 AM
To: Alanoly Andrews
Cc: pgsql-admin@postgresql.org; PostgreSQL-development
Subject: Re: [ADMIN] postgres 9.0 crash when bringing up hot standby

> [quoted message snipped]
On 06/08/10 17:31, Fujii Masao wrote:
> On Fri, Aug 6, 2010 at 10:10 PM, Alanoly Andrews <alanolya@invera.com> wrote:
>> I’m testing “hot standby” using “streaming WAL records”. On trying to bring
>> [dbx backtrace snipped]
>>
>> Any pointers on how to resolve the issue will be much appreciated.

So, loading the libpqwalreceiver library crashes. It looks like it might be pthread-related. Perhaps something is wrong with our makefiles, causing libpqwalreceiver to be built with the wrong flags? Does contrib/dblink work? If you look at the build log, what is the command line used to compile libpqwalreceiver, and what is the command line used to build other libraries, like contrib/dblink?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
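Once both command lines have been copied out of the build log, the comparison asked for here can be done mechanically. The following is a rough sketch only: the helper name `flag_diff` and the two sample command lines are hypothetical, and the real lines from the AIX build would have to be pasted in.

```python
import shlex


def flag_diff(cmd_a: str, cmd_b: str) -> tuple[list[str], list[str]]:
    """Return (options only in cmd_a, options only in cmd_b), each sorted."""
    def options(cmd: str) -> set[str]:
        # Keep only tokens that look like options; the compiler name and
        # file names are ignored.
        return {tok for tok in shlex.split(cmd) if tok.startswith("-")}

    a, b = options(cmd_a), options(cmd_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Hypothetical command lines, for illustration only.
    walrcv = "cc -O2 -g -D_THREAD_SAFE -I../../include -c libpqwalreceiver.c"
    dblink = "cc -O2 -g -I../../include -c dblink.c"
    only_walrcv, only_dblink = flag_diff(walrcv, dblink)
    print("only in libpqwalreceiver:", only_walrcv)
    print("only in dblink:", only_dblink)
```

Two empty lists would mean the option sets match, which would point away from the makefiles. The sketch treats each token independently, so an option that takes a separate argument (such as `-o file`) is compared by the option name alone.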
On Fri, Aug 6, 2010 at 3:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> So, loading libpqwalreceiver library crashes. It looks like it might be
> pthread-related. Perhaps something wrong with our makefiles, causing
> libpqwalreceiver to be built with wrong flags? Does contrib/dblink work? If
> you look at the build log, what is the command line used to compile
> libpqwalreceiver, and what is the command line used to build other
> libraries, like contrib/dblink?

I haven't seen any response to this from the OP, but it seems worrisome. Has anyone else tested a Hot Standby configuration - successfully or otherwise - on AIX?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Aug 11, 2010 at 10:20 AM, Alanoly Andrews <alanolya@invera.com> wrote:
> Ok..in response to the questions from Heikki,
>
> 1. Yes, "contrib/dblink" does work. Here's the output from the command used to "make" dblink:
> postgres:thimar> /usr/bin/gmake -C contrib/dblink install
> gmake: Entering directory `/dinabkp/faouzis/postgresql-9.0beta1/contrib/dblink'
> /bin/sh ../../config/install-sh -c -d '/dinabkp/faouzis/local2/pgsql/lib'
> /bin/sh ../../config/install-sh -c -d '/dinabkp/faouzis/local2/pgsql/share/contrib'
> /bin/sh ../../config/install-sh -c -m 755 dblink.so '/dinabkp/faouzis/local2/pgsql/lib/dblink.so'
> /bin/sh ../../config/install-sh -c -m 644 ./uninstall_dblink.sql '/dinabkp/faouzis/local2/pgsql/share/contrib'
> /bin/sh ../../config/install-sh -c -m 644 dblink.sql '/dinabkp/faouzis/local2/pgsql/share/contrib'
> gmake: Leaving directory `/dinabkp/faouzis/postgresql-9.0beta1/contrib/dblink'

Unfortunately that only shows the install, not the link - it must have been built earlier. Can you do "make clean" in just that one directory, and then "make install" again?

> 2. I don't have records of the build logs for the regular postgres executables (which contains the libpqwalreceiver) but can do a new compile/make if that is required. But they were compiled and installed using the regular make files supplied along with the postgres source code. The following flags were added during the compilation:
>
> --without-readline --without-zlib --enable-debug --enable-cassert --enable-thread-safety

It'd be nice to see the whole build log, if it's not too much trouble to regenerate it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Ok..in response to the questions from Heikki,

1. Yes, "contrib/dblink" does work. Here's the output from the command used to "make" dblink:

postgres:thimar> /usr/bin/gmake -C contrib/dblink install
gmake: Entering directory `/dinabkp/faouzis/postgresql-9.0beta1/contrib/dblink'
/bin/sh ../../config/install-sh -c -d '/dinabkp/faouzis/local2/pgsql/lib'
/bin/sh ../../config/install-sh -c -d '/dinabkp/faouzis/local2/pgsql/share/contrib'
/bin/sh ../../config/install-sh -c -m 755 dblink.so '/dinabkp/faouzis/local2/pgsql/lib/dblink.so'
/bin/sh ../../config/install-sh -c -m 644 ./uninstall_dblink.sql '/dinabkp/faouzis/local2/pgsql/share/contrib'
/bin/sh ../../config/install-sh -c -m 644 dblink.sql '/dinabkp/faouzis/local2/pgsql/share/contrib'
gmake: Leaving directory `/dinabkp/faouzis/postgresql-9.0beta1/contrib/dblink'

2. I don't have records of the build logs for the regular postgres executables (which contain the libpqwalreceiver) but can do a new compile/make if that is required. But they were compiled and installed using the regular make files supplied along with the postgres source code. The following flags were added during the compilation:

--without-readline --without-zlib --enable-debug --enable-cassert --enable-thread-safety

Thanks.
Alanoly.

-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Wednesday, August 11, 2010 10:13 AM
To: Heikki Linnakangas
Cc: Alanoly Andrews; pgsql-admin@postgresql.org; PostgreSQL-development
Subject: Re: [HACKERS] [ADMIN] postgres 9.0 crash when bringing up hot standby

> [quoted message snipped]
On Wed, Aug 11, 2010 at 10:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> [request for more information]

Per off-list discussion with Alanoly, we've determined the following:

dblink was compiled with the same flags as libpqwalreceiver
dblink works
libpqwalreceiver crashes

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> Per off-list discussion with Alanoly, we've determined the following:
> dblink was compiled with the same flags as libpqwalreceiver
> dblink works
> libpqwalreceiver crashes

I wonder if the problem is not so much libpqwalreceiver as the walreceiver process. Maybe an ordinary backend process does some prerequisite initialization that walreceiver is missing. Hard to guess what, though ... I can't think of anything dlopen() depends on that should be under our control.

regards, tom lane
I wrote:
> I wonder if the problem is not so much libpqwalreceiver as the
> walreceiver process. Maybe an ordinary backend process does some
> prerequisite initialization that walreceiver is missing. Hard to
> guess what, though ... I can't think of anything dlopen() depends on
> that should be under our control.

Actually, that idea is easily tested: try doing
	LOAD 'libpqwalreceiver';
in a regular backend process.

If that still crashes, it might be useful to truss or strace the backend while it runs the command, and compare that to the trace of
	LOAD 'dblink';

regards, tom lane
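If both traces are captured to files, a first-pass comparison could look like the sketch below. It assumes each trace line has the shape `name(args) = result`, optionally preceded by a `1234:` or `[pid 1234]` prefix. Real truss or strace output on a given system may need a different pattern, and the helper names are illustrative.

```python
import re
import sys
from collections import Counter

# Assumed line shape: optional pid prefix, then "name(" at the start of the call.
CALL_RE = re.compile(r"^\s*(?:\[pid\s+\d+\]\s*|\d+:\s*)?([A-Za-z_]\w*)\(")


def call_counts(trace_text: str) -> Counter:
    """Count how often each call name appears in a trace."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = CALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def calls_only_in_first(trace_a: str, trace_b: str) -> list[str]:
    """Names of calls that appear in trace_a but never in trace_b."""
    a, b = call_counts(trace_a), call_counts(trace_b)
    return sorted(name for name in a if name not in b)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # Arguments: trace of LOAD 'libpqwalreceiver', then trace of LOAD 'dblink'.
        with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
            print(calls_only_in_first(first.read(), second.read()))
    else:
        print("usage: compare_traces.py WALRECEIVER_TRACE DBLINK_TRACE")
```

Comparing only call names is deliberately coarse. It shows whether one load takes a path the other never enters (thread initialization, for example), after which the full traces can be read side by side around that point.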
On Thu, Aug 12, 2010 at 5:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I wonder if the problem is not so much libpqwalreceiver as the
>> walreceiver process. Maybe an ordinary backend process does some
>> prerequisite initialization that walreceiver is missing. Hard to
>> guess what, though ... I can't think of anything dlopen() depends on
>> that should be under our control.
>
> Actually, that idea is easily tested: try doing
> LOAD 'libpqwalreceiver';
> in a regular backend process.

Alanoly, is this something you can try?

> If that still crashes, it might be useful to truss or strace the backend
> while it runs the command, and compare that to the trace of
> LOAD 'dblink';

And if necessary, this too?

Thanks for your help debugging this....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company