Discussion: unable to fail over to warm standby server
Hello --

We are using PostgreSQL 8.3.8 with a warm standby (PITR) setup. Recently we experienced a failure on our primary database server, and when we attempted to fail over to the standby server it would not come up. This configuration had been tested previously (we've successfully transferred from primary to secondary at least 3 times in the past). Our system is up and running with the warm standby setup reconfigured and running again. I'm reporting this issue as a bug to describe the circumstances of this failed recovery, in the event it does indicate a bug in this very critical feature of PostgreSQL.

Both servers are identically configured as follows:

- Dual Quad-Core Intel Xeon-Harpertown 5430 @ 2.66GHz
- Red Hat EL 5.1-64
- 64GB RAM
- 250GB - 2-disk RAID 0 for / partition
- 1.1TB - 8-disk RAID 10 configuration for /data
- 72GB - 2-disk RAID 0 for /pg_xlog

It is also probably worth mentioning that Simon Riggs (copied) consulted on the configuration of our warm standby setup, so we are fairly confident it was set up correctly. ;-)

Archive configuration on the primary server (prod-db-1) was:

archive_mode = on
archive_command = 'rsync -az --rsh=ssh %p prod-db-2.vpn:wal_archive/%f </dev/null'
archive_timeout = 120

Restore configuration (in recovery.conf) on the secondary server (prod-db-2) was:

restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'

Timeline of events:

1.)
At 20:58 UTC -- received an alert that the primary server was not responding to pings. FYI, the last lines in the postgresql.log file on the primary server (before it crashed) were:

2010-01-18 20:40:56 UTC (10.16.74.194)LOG: unexpected EOF on client connection
2010-01-18 20:40:59 UTC ()LOG: checkpoint starting: time
2010-01-18 20:45:56 UTC (10.16.74.200)LOG: unexpected EOF on client connection
2010-01-18 20:50:57 UTC (10.12.58.142)LOG: unexpected EOF on client connection
2010-01-18 20:51:45 UTC ()LOG: checkpoint complete: wrote 45066 buffers (8.6%); 0 transaction log file(s) added, 0 removed, 44 recycled; write=455.622 s, sync=190.549 s, total=646.443 s

A sysadmin confirmed the primary was offline and started the manual process to transition to the standby server, while another sysadmin worked simultaneously on trying to bring the primary server back online.

2.) Running `ps aufx` on the standby (prod-db-2) included this output:

postgres 16323 0.0 0.1 4578680 98180 ? S 2009 0:00 /usr/local/pgsql-8.3.8/bin/postgres -D /data/pgdata-8.3
postgres 16324 0.0 0.0 71884 716 ? Ss 2009 0:09 \_ postgres: logger process
postgres 16325 3.4 6.5 4581480 4284404 ? Ss 2009 3902:25 \_ postgres: startup process waiting for 0000000200003C82000000D8
postgres 9164 0.0 0.0 63772 996 ? S 20:56 0:00 \_ sh -c pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D
postgres 9165 0.0 0.0 19168 616 ? S 20:56 0:00 \_ pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D8

Notice the standby server is waiting on WAL file 0000000200003C82000000D8.

3.) Running `ls -l /home/postgres/wal_archive` showed that the most recent files transferred were:

-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D5
-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D6
-rw------- 1 postgres postgres 16777216 Jan 18 20:56 0000000200003C82000000D7

4.)
The sysadmin ran the following command to make the secondary server exit recovery mode and come online:

sudo touch /tmp/pgsql.trigger.5432

NOTE: The use of 'sudo' here is a potential cause of, or contributor to, the failure, because the file pgsql.trigger.5432 will be owned by root, yet the postmaster process runs as user 'postgres'. However, a permission issue is inconsistent with what we saw next.

5.) After a few seconds the postgres process stopped but did not come back up. The last few lines of the postgresql.log file were:

2010-01-18 20:54:28 UTC ()LOG: restored log file "0000000200003C82000000D4" from archive
2010-01-18 20:55:03 UTC ()LOG: restored log file "0000000200003C82000000D5" from archive
2010-01-18 20:55:37 UTC ()LOG: restored log file "0000000200003C82000000D6" from archive
2010-01-18 20:56:20 UTC ()LOG: restored log file "0000000200003C82000000D7" from archive
2010-01-18 21:08:31 UTC ()FATAL: could not restore file "0000000200003C82000000D8" from archive: return code 65280
2010-01-18 21:08:31 UTC ()LOG: startup process (PID 16325) exited with exit code 1
2010-01-18 21:08:31 UTC ()LOG: aborting startup due to startup process failure

Looking in the /home/postgres/wal_archive directory again confirmed that it did not contain any file named 0000000200003C82000000D8. This was the most mysterious part of the puzzle, as the database apparently aborted with a FATAL error while trying to restore a file that didn't exist. We've searched Google and postgresql.org to determine what return code 65280 means, but have come up empty.

6.) The sysadmin attempted to restart the server by hand using `pg_ctl -D /data/pgdata-8.3/ start` -- this command returned a "server starting" message, but the server never appeared in the process list.

7.)
The contents of the newly created postgresql.log were:

tail postgresql-2010-01-18_210946.log
2010-01-18 21:09:46 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:09:46 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-01-18 21:09:46 UTC ()LOG: starting archive recovery
2010-01-18 21:09:46 UTC ()LOG: restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'
2010-01-18 21:09:46 UTC ()FATAL: could not restore file "00000002.history" from archive: return code 65280
2010-01-18 21:09:46 UTC ()LOG: startup process (PID 9284) exited with exit code 1
2010-01-18 21:09:46 UTC ()LOG: aborting startup due to startup process failure

8.) The sysadmin noticed the 'recovery.conf' file was still in the db root directory (it had not been renamed to recovery.done). The recovery.conf file was renamed to recovery.conf.old, and `pg_ctl -D /data/pgdata-8.3/ start` was attempted again.

9.) The server did not come up (again). This time the contents of the new postgresql.log file were:

[postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure

10.) At this point a lot of time was spent on the #postgresql IRC channel looking for help diagnosing this problem. Before the issue was resolved, the primary server was brought back online. Soon after, we decided to create a new base backup on the standby server and abandoned the attempt to recover.

FYI: The last few lines of pg_standby.log were:

[postgres@prod-db-2 pgdata-8.3]$ tail -n 300 pg_standby.log
trigger file found
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
trigger file found
ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permittedtrigger file found
ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted

This file was not looked at until after the attempt to recover was aborted.
Clearly the permissions on /tmp/pgsql.trigger.5432 were a problem, but we don't see how that would explain the error messages, which seem to indicate that data on the standby server was corrupted.

If you need additional information, we have saved console logs and log files from both the primary and secondary servers.

Thanks for reading this far, and for any help in tracking down the cause of this unexpected failure.

cheers,
- Mason
Mason Hale wrote:
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> permittedtrigger file found
>
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
>
> This file was not looked at until after the attempt to recover was
> aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> problem, but we don't see how that would explain the error messages,
> which seem to indicate that data on the standby server was corrupted.

Yes, that permission problem seems to be the root cause of the troubles. If pg_standby fails to remove the trigger file, it exit()s with whatever return code the unlink() call returned:

> 	/*
> 	 * If trigger file found, we *must* delete it. Here's why: When
> 	 * recovery completes, we will be asked again for the same file from
> 	 * the archive using pg_standby so must remove trigger file so we can
> 	 * reload file again and come up correctly.
> 	 */
> 	rc = unlink(triggerPath);
> 	if (rc != 0)
> 	{
> 		fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> 		fflush(stderr);
> 		exit(rc);
> 	}

unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of the range of normal return codes: the exit status is truncated to eight bits, so exit(-1) becomes exit status 255, which system() reports in wait(2) encoding as 255 << 8 = 65280, the mysterious code you saw in the logs. The server treats that as a fatal error, and dies.

That seems like a bug in pg_standby, but I'm not sure what it should do if the unlink() fails. It could exit with some other exit code, so that the server wouldn't die, but the lingering trigger file could cause problems, as the comment explains. If it should indeed cause FATAL, it should do so in a more robust way than the exit(rc) call above.

BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to delete the trigger file (so that problematic block of code is gone), but there's a new recovery_end_command option in recovery.conf instead, where you're supposed to put 'rm <triggerfile>'.
I think in that configuration, the standby would've started up, even though removal of the trigger file would've still failed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Hello Heikki --

Thank you for investigating this issue and clearing up this mystery.

I do not believe it is obvious that the postgres process needs to be able to remove the trigger file. My naive assumption was that the trigger file was merely a flag to signal that recovery mode needed to be stopped. If I were to guess what those steps would be, I would assume the following:

- detect the presence of the trigger file
- stop the postgres process safely (e.g. pg_ctl ... stop)
- rename recovery.conf to recovery.done
- restart the postgres process (e.g. pg_ctl ... start)

It is not obvious that the trigger file needs to be removed. And if permissions prevent it from being removed, the last thing that should happen is for the database to become corrupted.

At minimum the pg_standby documentation should make this requirement clear. I suggest language to the effect of the following:

> Note it is critical the trigger file be created with permissions that allow
> the postgres process to remove the file. Generally this is best done by
> creating the file from the postgres user account. Data corruption may result
> if the trigger file permissions prevent deletion of the trigger file.

Of course the best solution is to avoid this issue entirely. Something as easy to miss as file permissions should not cause data corruption, especially in the process meant to fail over from a crashing primary database.

thanks,
Mason Hale
http://www.onespot.com
direct +1 800.618.0768 ext 701

On Thu, Jan 28, 2010 at 3:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Mason Hale wrote:
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> > permittedtrigger file found
> >
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
> >
> > This file was not looked at until after the attempt to recover was
> > aborted.
> > Clearly the permissions on /tmp/pgsql.trigger.5432 were a problem,
> > but we don't see how that would explain the error messages, which seem
> > to indicate that data on the standby server was corrupted.
>
> Yes, that permission problem seems to be the root cause of the troubles.
> If pg_standby fails to remove the trigger file, it exit()s with whatever
> return code the unlink() call returned:
>
> > 	/*
> > 	 * If trigger file found, we *must* delete it. Here's why: When
> > 	 * recovery completes, we will be asked again for the same file from
> > 	 * the archive using pg_standby so must remove trigger file so we can
> > 	 * reload file again and come up correctly.
> > 	 */
> > 	rc = unlink(triggerPath);
> > 	if (rc != 0)
> > 	{
> > 		fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> > 		fflush(stderr);
> > 		exit(rc);
> > 	}
>
> unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
> the range of normal return codes, and apparently gets mangled into the
> mysterious 65280 code you saw in the logs. The server treats that as a
> fatal error, and dies.
>
> That seems like a bug in pg_standby, but I'm not sure what it should do
> if the unlink() fails. It could exit with some other exit code, so that
> the server wouldn't die, but the lingering trigger file could cause
> problems, as the comment explains. If it should indeed cause FATAL, it
> should do so in a more robust way than the exit(rc) call above.
>
> BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
> delete the trigger file (so that problematic block of code is gone), but
> there's a new recovery_end_command option in recovery.conf instead, where
> you're supposed to put 'rm <triggerfile>'. I think in that
> configuration, the standby would've started up, even though removal of
> the trigger file would've still failed.
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
On Fri, Jan 29, 2010 at 12:03 AM, Mason Hale <mason@onespot.com> wrote:
> Of course the best solution is to avoid this issue entirely. Something as
> easy to miss as file permissions should not cause data corruption,
> especially in the process meant to fail over from a crashing primary
> database.

I believe that such a file permission problem does nothing but shut down the standby with a FATAL error, and wouldn't cause data corruption. So if you remove the wrongly-permissioned trigger file after the shutdown, you can resume recovery cleanly by just starting the standby postgres.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Jan 29, 2010 at 11:49 PM, Mason Hale <mason@onespot.com> wrote:
> While I did not remove the trigger file, I did rename recovery.conf to
> recovery.conf.old. That file contained the restore_command configuration
> that identified the trigger file. So that rename should have eliminated
> the problem. But it didn't. Even after making this change and taking the
> trigger file out of the equation, my database failed to come online.

Renaming recovery.conf doesn't resolve the problem at all. Instead, the sysadmin needed only to remove the trigger file with the wrong permissions and restart postgres.

>> 9.) The server did not come up (again). This time the contents of the
>> new postgresql.log file were:
>>
>> [postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
>> 2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
>> 2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
>> 2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
>> 2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
>> 2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
>> 2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
>> 2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure

You seem to focus on the above trouble. I think this happened because recovery.conf was deleted and restore_command was not given.
In fact, the WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery could not be restored from the archive because restore_command was not supplied, so recovery failed.

If the sysadmin had left recovery.conf in place and removed the trigger file, pg_standby (via restore_command) would have restored all the WAL files required for recovery, and recovery would have proceeded normally.

Hope this helps.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> On Fri, Jan 29, 2010 at 12:03 AM, Mason Hale <mason@onespot.com> wrote:
>> Of course the best solution is to avoid this issue entirely. Something as
>> easy to miss as file permissions should not cause data corruption,
>> especially in the process meant to fail over from a crashing primary
>> database.
>
> I believe that such a file permission problem does nothing but
> shut down the standby by a FATAL error, and wouldn't cause data
> corruption. So if you remove the trigger file with a wrong
> permission after the shutdown, you can restart a recovery well
> by just starting the standby postgres.

Perhaps my wording of "data corruption" was too harsh?

While I did not remove the trigger file, I did rename recovery.conf to recovery.conf.old. That file contained the restore_command configuration that identified the trigger file. So that rename should have eliminated the problem. But it didn't. Even after making this change and taking the trigger file out of the equation, my database failed to come online.

Maybe that wasn't data corruption. Maybe the issue was repairable. I just know that with my 3+ years of experience working with Postgres and the help of the #postgresql IRC channel, I was not able to revive the database at a time when I desperately needed it to work. The failover process failed for me at the worst possible time.

I will surely be careful about trigger file permissions in the future. I just shared my experience so that future DBAs who might make the same mistake in a similar situation don't have to deal with the same unexpected results.

- Mason
Hello Fujii --

Thanks for the clarification. It's clear my understanding of the recovery process is lacking. My naive assumption was that Postgres would recover using whatever files were available, and when it ran out of files it would stop there and come up. And that if recovery.conf were renamed, it would stop copying files from the wal_archive into pg_xlog. Thus, without the recovery.conf file, the database would just come up, without expecting or waiting on additional files.

I see my assumption was wrong, but I think you can agree that it is not surprising someone could expect things to work this way if they aren't directly familiar with the code. I think you can also see how seeing the message "If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target" in the log would lead me to believe my database was corrupted.

It is good to know that if I had left recovery.conf in place and just removed the trigger file, the issue would have resolved itself. I'm happy to hear the database was not, in fact, corrupted by this error.

Perhaps it's best to chalk this up to a scenario that creates a confusing, hard-to-diagnose issue -- one that easily looks like corruption, but thankfully is not. Hopefully if anyone tuning into this thread experiences or hears of similar fail-over problems in the future (say on IRC), they'll remember to check the permissions on the trigger file.

Thanks again,
Mason

On Fri, Jan 29, 2010 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jan 29, 2010 at 11:49 PM, Mason Hale <mason@onespot.com> wrote:
>> While I did not remove the trigger file, I did rename recovery.conf to
>> recovery.conf.old. That file contained the restore_command configuration
>> that identified the trigger file. So that rename should have eliminated
>> the problem. But it didn't.
>> Even after making this change and taking the trigger file out of the
>> equation my database failed to come online.
>
> Renaming of the recovery.conf doesn't resolve the problem at all. Instead,
> the sysadmin had to remove only the trigger file with a wrong permission
> and just restart postgres.
>
>>> 9.) The server did not come up (again). This time the contents of the
>>> new postgresql.log file were:
>>>
>>> [postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
>>> 2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
>>> 2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
>>> 2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
>>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
>>> 2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
>>> 2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
>>> 2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
>>> 2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure
>
> You seem to focus on the above trouble. I think that this happened because
> recovery.conf was deleted and restore_command was not given. In fact, the
> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
> was unable to be restored from the archive because restore_command was
> not supplied. Then recovery failed.
> > If the sysadmin had left the recovery.conf and removed the trigger file, > pg_standby in restore_command would have restored all WAL files required > for recovery, and recovery would advance well. > > Hope this helps. > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center >
On Fri, Jan 29, 2010 at 11:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> You seem to focus on the above trouble. I think that this happened because
> recovery.conf was deleted and restore_command was not given. In fact, the
> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
> was unable to be restored from the archive because restore_command was
> not supplied. Then recovery failed.
>
> If the sysadmin had left the recovery.conf and removed the trigger file,
> pg_standby in restore_command would have restored all WAL files required
> for recovery, and recovery would advance well.

That may be true, but it certainly seems unfortunate that we don't handle this case a bit more gracefully.

...Robert
Robert Haas wrote:
> On Fri, Jan 29, 2010 at 11:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> You seem to focus on the above trouble. I think that this happened because
>> recovery.conf was deleted and restore_command was not given. In fact, the
>> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
>> was unable to be restored from the archive because restore_command was
>> not supplied. Then recovery failed.
>>
>> If the sysadmin had left the recovery.conf and removed the trigger file,
>> pg_standby in restore_command would have restored all WAL files required
>> for recovery, and recovery would advance well.
>
> That may be true, but it certainly seems unfortunate that we don't
> handle this case a bit more gracefully.

Yeah. I don't think we can do much better in versions <= 8.3, though we should change that unlink() call to cause the FATAL error in a more explicit way; it seems accidental and possibly non-portable as it is.

Adding the note to the docs that Mason suggested is a good idea.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
>>> If the sysadmin had left the recovery.conf and removed the trigger file,
>>> pg_standby in restore_command would have restored all WAL files required
>>> for recovery, and recovery would advance well.
>>
>> That may be true, but it certainly seems unfortunate that we don't
>> handle this case a bit more gracefully.
>
> Yeah. I don't think we can do much better in versions <= 8.3, though we
> should change that unlink() call to cause the FATAL error in a more
> explicit way; it seems accidental and possibly non-portable as it is.
>
> Adding the note to the docs that Mason suggested is a good idea.

Given that this situation did NOT actually cause corruption, rather the error message was mangled such that it suggested corruption, I offer this revised suggestion for an update to the documentation:

> Important note: It is critical the trigger file be created with permissions
> allowing the postgres process to remove the file. Generally this is best
> done by creating the file from the postgres user account. Failure to do so
> will prevent completion of WAL file recovery and prevent the server from
> coming back online successfully.

Best regards,
Mason
Actually, I think there's a tiny harmless bug in the server too. When it prints the error message:

2010-01-18 21:08:31 UTC ()FATAL: could not restore file "0000000200003C82000000D8" from archive: return code 65280

that return code is not the return code that came from the restore_command. That is, if you do exit(200) in the command, you won't see "return code 200"; you see the return value from system(). system() doesn't return the exit code directly, but an encoded integer that includes the exit code. You're supposed to call WEXITSTATUS() on it to get the real return code.

That only affects the error message and is harmless otherwise, but I thought I'd mention it. I'll fix it, unless someone wants to argue that it's more useful to print the raw return value of system(), because it might contain more information, like which signal killed the process, that you could extract from the cryptic error code using e.g. the WTERMSIG() macro.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Actually, I think there's a tiny harmless bug in the server too. When it
> prints the error message:
>
> 2010-01-18 21:08:31 UTC ()FATAL: could not restore file
> "0000000200003C82000000D8" from archive: return code 65280
>
> That return code is not the return code that came from the
> restore_command. Ie if you do exit(200) in the command, you won't see
> "return code 200", but you see the return value from system(). system()
> doesn't return the exit code directly, but an encoded integer that
> includes the exit code. You're supposed to call WEXITSTATUS() on it to
> get the real return code.
>
> That only affects the error message and is harmless otherwise, but I
> thought I'd mention it. I'll fix it, unless someone wants to argue that
> it's more useful to print the raw return value of system(), because it
> might contain more information like which signal killed the process,
> that you could extract from the cryptic error code using e.g. the
> WTERMSIG() macro.

An "if" statement would seem to be in order, so that you can print out either the exit code or the signal number as appropriate.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
>> That only affects the error message and is harmless otherwise, but I
>> thought I'd mention it. I'll fix it, unless someone wants to argue that
>> it's more useful to print the raw return value of system(), because it
>> might contain more information like which signal killed the process,
>> that you could extract from the cryptic error code using e.g. the
>> WTERMSIG() macro.

> An "if" statement would seem to be in order, so that you can print out
> either the exit code or the signal number as appropriate.

Yes. Please see the existing code in the postmaster that prints subprocess exit codes, and duplicate it (or perhaps refactor so you can avoid code duplication; though translatability of messages might limit what you can do there).

			regards, tom lane
Mason Hale wrote:
> Given that this situation did NOT actually cause corruption, rather the
> error message was mangled such that it suggested corruption, I offer this
> revised suggestion for an update to the documentation:
>
>> Important note: It is critical the trigger file be created with permissions
>> allowing the postgres process to remove the file. Generally this is best
>> done by creating the file from the postgres user account. Failure to do so
>> will prevent completion of WAL file recovery and prevent the server from
>> coming back online successfully.

Ok, I've added this note to the 8.3 docs now.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
>>> That only affects the error message and is harmless otherwise, but I
>>> thought I'd mention it. I'll fix it, unless someone wants to argue that
>>> it's more useful to print the raw return value of system(), because it
>>> might contain more information, like which signal killed the process,
>>> that you could extract from the cryptic error code using e.g. the
>>> WTERMSIG() macro.
>
>> An "if" statement would seem to be in order, so that you can print out
>> either the exit code or the signal number as appropriate.
>
> Yes. Please see the existing code in the postmaster that prints
> subprocess exit codes, and duplicate it (or perhaps refactor so you can
> avoid code duplication; though translatability of messages might limit
> what you can do there).

Here's what I came up with. Translatability indeed makes it pretty hard;
I ended up copy-pasting.

BTW, I don't think I'm going to bother or risk back-patching this. It
was harmless, and for forensic purposes all the information was there in
the old message too, just in a cryptic format. And the new messages
would need translating too.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 546,551 **** static void pg_start_backup_callback(int code, Datum arg);
--- 546,552 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void rm_redo_error_callback(void *arg);
  static int get_sync_bit(int method);
+ static int errdetail_childexit(const char *procname, int exitstatus);

  /*
***************
*** 2959,2966 **** RestoreArchivedFile(char *path, const char *xlogfname,
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : DEBUG2,
! 		(errmsg("could not restore file \"%s\" from archive: return code %d",
! 				xlogfname, rc)));

  not_available:

  	/*
--- 2960,2967 ----
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : DEBUG2,
! 		(errmsg("could not restore file \"%s\" from archive", xlogfname),
! 		 errdetail_childexit("restore_command", rc)));

  not_available:

  	/*
***************
*** 3077,3088 **** ExecuteRecoveryEndCommand(void)
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : WARNING,
! 		(errmsg("recovery_end_command \"%s\": return code %d",
! 				xlogRecoveryEndCmd, rc)));
  	}
  }

  /*
   * Preallocate log files beyond the specified log endpoint.
   *
   * XXX this is currently extremely conservative, since it forces only one
--- 3078,3149 ----
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : WARNING,
! 		(errmsg("recovery end command \"%s\" failed", xlogRecoveryEndCmd),
! 		 errdetail_childexit("recovery_end_command", rc)));
  	}
  }

  /*
+  * Add detail information (and possibly a hint) describing why a child
+  * process exited to the error currently being constructed. This is for
+  * use inside ereport, e.g:
+  *
+  * ereport(DEBUG1,
+  *		(errmsg(...),
+  *		 (errdetail_childexit("mycommand", rc))));
+  *
+  * 'procname' is a name of the exited command, and 'exitstatus' is a return
+  * code as returned by e.g wait(2).
+  *
+  * The logic is copied from LogChildExit() in postmaster/postmaster.c. If
+  * you change this, make sure you keep LogChildExit in sync.
+  */
+ static int
+ errdetail_childexit(const char *procname, int exitstatus)
+ {
+ 	if (WIFEXITED(exitstatus))
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s exited with exit code %d",
+ 				  procname, WEXITSTATUS(exitstatus));
+ 	else if (WIFSIGNALED(exitstatus))
+ 	{
+ #if defined(WIN32)
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by exception 0x%X",
+ 				  procname, WTERMSIG(exitstatus));
+ 		errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value.");
+ #elif defined(HAVE_DECL_SYS_SIGLIST) && HAVE_DECL_SYS_SIGLIST
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by signal %d: %s",
+ 				  procname, WTERMSIG(exitstatus),
+ 				  WTERMSIG(exitstatus) < NSIG ?
+ 				  sys_siglist[WTERMSIG(exitstatus)] : "(unknown)");
+ #else
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by signal %d",
+ 				  procname, WTERMSIG(exitstatus));
+ #endif
+ 	}
+ 	else
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s exited with unrecognized status %d",
+ 				  procname, exitstatus);
+ 
+ 	/* return dummy value, so that this can be used inside ereport(...) */
+ 	return 1;
+ }
+ 
+ /*
   * Preallocate log files beyond the specified log endpoint.
   *
   * XXX this is currently extremely conservative, since it forces only one
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 2770,2775 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 2770,2779 ----

  /*
   * Log the death of a child process.
+  *
+  * errdetail_childexit() function in xlog.c uses the same logic for
+  * constructing the message. If you change this function, remember to
+  * keep errdetail_childexit() in sync.
   */
  static void
  LogChildExit(int lev, const char *procname, int pid, int exitstatus)
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Yes. Please see the existing code in the postmaster that prints
>> subprocess exit codes, and duplicate it (or perhaps refactor so you can
>> avoid code duplication; though translatability of messages might limit
>> what you can do there).

> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.

Looks reasonable to me. One possible objection is that someone running
in "terse" logging mode would not get any information about the child
process's exit code, since errdetail would be suppressed in that case.
But we can probably live with that --- as far as I can tell, few people
use terse-mode logging.

			regards, tom lane
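For context on Tom's objection: whether DETAIL (and HINT) lines reach the server log at all is controlled by the log_error_verbosity setting, e.g. in postgresql.conf:

# postgresql.conf
# "terse" omits DETAIL, HINT, QUERY, and CONTEXT lines, so the new
# errdetail_childexit() output would be lost; "default" keeps them.
log_error_verbosity = default    # terse, default, or verbose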
Heikki Linnakangas wrote:
> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.

Looks sane to me too; msgmerge segfaults, though, so I couldn't test.

Two minor comments:

> +		/*------
> +		  translator: %s is a noun phrase describing a child process, such as
> +		  "restore_command" */
> +		errdetail("%s was terminated by signal %d: %s",

You probably want "the first %s is ... and the second %s is the signal
name".

And here:

> +				  sys_siglist[WTERMSIG(exitstatus)] : "(unknown)");

Should probably be _("(unknown)")

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 10, 2010 at 4:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.
>
> BTW, I don't think I'm going to bother or risk back-patching this. It
> was harmless, and for forensic purposes all the information was there in
> the old message too, just in a cryptic format. And the new messages
> would need translating too.

This looks applicable also to archive_command. Are you going to apply it
there as well?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center