Discussion: unable to fail over to warm standby server
Hello --

We are using PostgreSQL 8.3.8 with a warm standby (PITR) setup. Recently we experienced a failure on our primary database server, and when we attempted to fail over to the standby server it would not come up. This configuration had been tested previously (we've successfully transferred from primary to secondary at least 3 times in the past). Our system is up and running with the warm standby setup reconfigured and running again. I'm reporting this issue as a bug to describe the circumstances of this failed recovery, in the event it does indicate a bug in this very critical feature of PostgreSQL.

Both servers are identically configured as follows:

- Dual Quad-Core Intel Xeon-Harpertown 5430 @ 2.66GHz
- Red Hat EL 5.1-64
- 64GB RAM
- 250GB - 2-disk RAID 0 for / partition
- 1.1TB - 8-disk RAID 10 configuration for /data
- 72GB - 2-disk RAID 0 for /pg_xlog

It is also probably worth mentioning that Simon Riggs (copied) consulted on the configuration of our warm standby setup, so we are fairly confident it was set up correctly. ;-)

Archive configuration on the primary server (prod-db-1) was:

archive_mode = on
archive_command = 'rsync -az --rsh=ssh %p prod-db-2.vpn:wal_archive/%f </dev/null'
archive_timeout = 120

Restore configuration (in recovery.conf) on the secondary server (prod-db-2) was:

restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'

Timeline of events:

1.)
At 20:58 UTC -- received an alert that the primary server was not responding to pings. FYI, the last lines in the postgresql.log file on the primary server (before it crashed) were:

2010-01-18 20:40:56 UTC (10.16.74.194)LOG: unexpected EOF on client connection
2010-01-18 20:40:59 UTC ()LOG: checkpoint starting: time
2010-01-18 20:45:56 UTC (10.16.74.200)LOG: unexpected EOF on client connection
2010-01-18 20:50:57 UTC (10.12.58.142)LOG: unexpected EOF on client connection
2010-01-18 20:51:45 UTC ()LOG: checkpoint complete: wrote 45066 buffers (8.6%); 0 transaction log file(s) added, 0 removed, 44 recycled; write=455.622 s, sync=190.549 s, total=646.443 s

A sysadmin confirmed the primary was offline and started the manual process to transition to the standby server, while another sysadmin worked simultaneously on trying to bring the primary server back online.

2.) Running `ps aufx` on the standby (prod-db-2) included this output:

postgres 16323 0.0 0.1 4578680 98180 ? S 2009 0:00 /usr/local/pgsql-8.3.8/bin/postgres -D /data/pgdata-8.3
postgres 16324 0.0 0.0 71884 716 ? Ss 2009 0:09 \_ postgres: logger process
postgres 16325 3.4 6.5 4581480 4284404 ? Ss 2009 3902:25 \_ postgres: startup process waiting for 0000000200003C82000000D8
postgres 9164 0.0 0.0 63772 996 ? S 20:56 0:00 \_ sh -c pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D
postgres 9165 0.0 0.0 19168 616 ? S 20:56 0:00 \_ pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D8

Notice the standby server is waiting on WAL file 0000000200003C82000000D8.

3.) Running `ls -l /home/postgres/wal_archive` showed that the most recent files transferred were:

-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D5
-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D6
-rw------- 1 postgres postgres 16777216 Jan 18 20:56 0000000200003C82000000D7

4.)
The sysadmin ran the following command to make the secondary server exit recovery mode and come online:

sudo touch /tmp/pgsql.trigger.5432

NOTE: The use of 'sudo' here is a potential cause of, or contributor to, the failure, because the file pgsql.trigger.5432 will be owned by root, yet the postmaster process runs as user 'postgres'. However, a permission issue is inconsistent with what we saw next.

5.) After a few seconds the postgres process stopped but did not come back up. The last few lines of the postgresql.log file were:

2010-01-18 20:54:28 UTC ()LOG: restored log file "0000000200003C82000000D4" from archive
2010-01-18 20:55:03 UTC ()LOG: restored log file "0000000200003C82000000D5" from archive
2010-01-18 20:55:37 UTC ()LOG: restored log file "0000000200003C82000000D6" from archive
2010-01-18 20:56:20 UTC ()LOG: restored log file "0000000200003C82000000D7" from archive
2010-01-18 21:08:31 UTC ()FATAL: could not restore file "0000000200003C82000000D8" from archive: return code 65280
2010-01-18 21:08:31 UTC ()LOG: startup process (PID 16325) exited with exit code 1
2010-01-18 21:08:31 UTC ()LOG: aborting startup due to startup process failure

Looking in the /home/postgres/wal_archive directory again confirmed that it did not contain any file named 0000000200003C82000000D8. This was the most mysterious part of the puzzle, as the database apparently aborted with a FATAL error while trying to restore a file that didn't exist. We've searched Google and postgresql.org to determine what return code 65280 means, but have come up empty.

6.) The sysadmin attempted to restart the server by hand using `pg_ctl -D /data/pgdata-8.3/ start` -- this command returned a "server starting" message, but the server never appeared in the process list.

7.)
The contents of the newly created postgresql.log were:

tail postgresql-2010-01-18_210946.log
2010-01-18 21:09:46 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:09:46 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-01-18 21:09:46 UTC ()LOG: starting archive recovery
2010-01-18 21:09:46 UTC ()LOG: restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'
2010-01-18 21:09:46 UTC ()FATAL: could not restore file "00000002.history" from archive: return code 65280
2010-01-18 21:09:46 UTC ()LOG: startup process (PID 9284) exited with exit code 1
2010-01-18 21:09:46 UTC ()LOG: aborting startup due to startup process failure

8.) The sysadmin noticed the 'recovery.conf' file was still in the db root directory (it had not been renamed to recovery.done). The recovery.conf file was renamed to recovery.conf.old, and `pg_ctl -D /data/pgdata-8.3/ start` was attempted again.

9.) The server did not come up (again). This time the contents of the new postgresql.log file were:

[postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure

10.) At this point a lot of time was spent on the #postgresql IRC channel looking for help diagnosing this problem. Before the issue was resolved, the primary server was brought back online. Soon after, we decided to create a new base backup on the standby server and abandoned the attempt to recover.

FYI: The last few lines of pg_standby.log were:

[postgres@prod-db-2 pgdata-8.3]$ tail -n 300 pg_standby.log
trigger file found
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
trigger file found
ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permittedtrigger file found
ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted

This file was not looked at until after the attempt to recover was aborted.
Clearly the permissions on /tmp/pgsql.trigger.5432 were a problem, but we don't see how that would explain the error messages, which seem to indicate that data on the standby server was corrupted.

If you need additional information, we have saved console logs and log files from both the primary and secondary servers.

Thanks for reading this far, and for any help in tracking down the cause of this unexpected failure.

cheers,
- Mason
Mason Hale wrote:
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> permittedtrigger file found
>
> ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
>
> This file was not looked at until after the attempt to recover was
> aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
> problem, but we don't see how that would explain the error messages,
> which seem to indicate that data on the standby server was corrupted.

Yes, that permission problem seems to be the root cause of the troubles. If pg_standby fails to remove the trigger file, it exit()s with whatever return code the unlink() call returned:

> 	/*
> 	 * If trigger file found, we *must* delete it. Here's why: When
> 	 * recovery completes, we will be asked again for the same file from
> 	 * the archive using pg_standby so must remove trigger file so we can
> 	 * reload file again and come up correctly.
> 	 */
> 	rc = unlink(triggerPath);
> 	if (rc != 0)
> 	{
> 		fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> 		fflush(stderr);
> 		exit(rc);
> 	}

unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of the range of normal return codes: the exit status is truncated to eight bits, so exit(-1) becomes exit status 255, which system() reports in wait(2) encoding as 255 << 8 = 65280, the mysterious code you saw in the logs. The server treats that as a fatal error, and dies.

That seems like a bug in pg_standby, but I'm not sure what it should do if the unlink() fails. It could exit with some other exit code, so that the server wouldn't die, but the lingering trigger file could cause problems, as the comment explains. If it should indeed cause FATAL, it should do so in a more robust way than the exit(rc) call above.

BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to delete the trigger file (so that problematic block of code is gone), but there's a new recovery_end_command option in recovery.conf instead, where you're supposed to put 'rm <triggerfile>'.
I think in that configuration, the standby would've started up, even though removal of the trigger file would've still failed.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Hello Heikki --

Thank you for investigating this issue and clearing up this mystery.

I do not believe it is obvious that the postgres process needs to be able to remove the trigger file. My naive assumption was that the trigger file was merely a flag to signal that recovery mode needed to be stopped. If I were to guess what those steps would be, I would assume the following:

- detect the presence of the trigger file
- stop the postgres process safely (e.g. pg_ctl ... stop)
- rename recovery.conf to recovery.done
- restart the postgres process (e.g. pg_ctl ... start)

It is not obvious that the trigger file needs to be removed. And if permissions prevent it from being removed, the last thing that should happen is for the database to become corrupted.

At minimum the pg_standby documentation should make this requirement clear. I suggest language to the effect of the following:

> Note it is critical the trigger file be created with permissions that allow
> the postgres process to remove the file. Generally this is best done by
> creating the file from the postgres user account. Data corruption may result
> if the trigger file permissions prevent deletion of the trigger file.

Of course the best solution is to avoid this issue entirely. Something as easy to miss as file permissions should not cause data corruption, especially in the process meant to fail over from a crashing primary database.

thanks,
Mason Hale
http://www.onespot.com
direct +1 800.618.0768 ext 701

On Thu, Jan 28, 2010 at 3:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Mason Hale wrote:
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
> > permittedtrigger file found
> >
> > ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted
> >
> > This file was not looked at until after the attempt to recover was
> > aborted.
> > Clearly the permissions on /tmp/pgsql.trigger.5432 were a problem,
> > but we don't see how that would explain the error messages, which seem
> > to indicate that data on the standby server was corrupted.
>
> Yes, that permission problem seems to be the root cause of the troubles.
> If pg_standby fails to remove the trigger file, it exit()s with whatever
> return code the unlink() call returned:
>
> > 	/*
> > 	 * If trigger file found, we *must* delete it. Here's why: When
> > 	 * recovery completes, we will be asked again for the same file from
> > 	 * the archive using pg_standby so must remove trigger file so we can
> > 	 * reload file again and come up correctly.
> > 	 */
> > 	rc = unlink(triggerPath);
> > 	if (rc != 0)
> > 	{
> > 		fprintf(stderr, "\n ERROR: could not remove \"%s\": %s", triggerPath, strerror(errno));
> > 		fflush(stderr);
> > 		exit(rc);
> > 	}
>
> unlink() returns -1 on error, so pg_standby calls exit(-1). -1 is out of
> the range of normal return codes, and apparently gets mangled into the
> mysterious 65280 code you saw in the logs. The server treats that as a
> fatal error, and dies.
>
> That seems like a bug in pg_standby, but I'm not sure what it should do
> if the unlink() fails. It could exit with some other exit code, so that
> the server wouldn't die, but the lingering trigger file could cause
> problems, as the comment explains. If it should indeed cause FATAL, it
> should do so in a more robust way than the exit(rc) call above.
>
> BTW, this changed in PostgreSQL 8.4; pg_standby no longer tries to
> delete the trigger file (so that problematic block of code is gone), but
> there's a new recovery_end_command option in recovery.conf instead, where
> you're supposed to put 'rm <triggerfile>'. I think in that
> configuration, the standby would've started up, even though removal of
> the trigger file would've still failed.
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
On Fri, Jan 29, 2010 at 12:03 AM, Mason Hale <mason@onespot.com> wrote:
> Of course the best solution is to avoid this issue entirely. Something as
> easy to miss as file permissions should not cause data corruption,
> especially in the process meant to fail over from a crashing primary
> database.

I believe that such a file permission problem does nothing but shut down the standby with a FATAL error, and wouldn't cause data corruption. So if you remove the wrongly-permissioned trigger file after the shutdown, you can resume recovery cleanly by just starting the standby postgres.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Jan 29, 2010 at 11:49 PM, Mason Hale <mason@onespot.com> wrote:
> While I did not remove the trigger file, I did rename recovery.conf to
> recovery.conf.old. That file contained the restore_command configuration
> that identified the trigger file. So that rename should have eliminated
> the problem. But it didn't. Even after making this change and taking the
> trigger file out of the equation, my database failed to come online.

Renaming recovery.conf doesn't resolve the problem at all. Instead, the sysadmin needed only to remove the trigger file with the wrong permissions and restart postgres.

>> 9.) The server did not come up (again). This time the contents of the
>> new postgresql.log file were:
>>
>> [postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
>> 2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
>> 2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
>> 2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
>> 2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
>> 2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
>> 2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
>> 2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure

You seem to focus on the above trouble. I think this happened because recovery.conf was deleted and restore_command was not given.
In fact, the WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery could not be restored from the archive because restore_command was not supplied, so recovery failed.

If the sysadmin had left recovery.conf in place and removed the trigger file, pg_standby (via restore_command) would have restored all the WAL files required for recovery, and recovery would have proceeded normally.

Hope this helps.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> On Fri, Jan 29, 2010 at 12:03 AM, Mason Hale <mason@onespot.com> wrote:
>> Of course the best solution is to avoid this issue entirely. Something as
>> easy to miss as file permissions should not cause data corruption,
>> especially in the process meant to fail over from a crashing primary
>> database.
>
> I believe that such a file permission problem does nothing but
> shut down the standby by a FATAL error, and wouldn't cause data
> corruption. So if you remove the trigger file with a wrong
> permission after the shutdown, you can restart a recovery well
> by just starting the standby postgres.

Perhaps my wording of "data corruption" was too harsh?

While I did not remove the trigger file, I did rename recovery.conf to recovery.conf.old. That file contained the restore_command configuration that identified the trigger file. So that rename should have eliminated the problem. But it didn't. Even after making this change and taking the trigger file out of the equation, my database failed to come online.

Maybe that wasn't data corruption. Maybe the issue was repairable. I just know that with my 3+ years of experience working with Postgres and the help of the #postgresql IRC channel, I was not able to revive the database at a time when I desperately needed it to work. The failover process failed for me at the worst possible time.

I will surely be careful about trigger file permissions in the future. I just shared my experience so that future DBAs who might make the same mistake in a similar situation don't have to deal with the same unexpected results.

- Mason
Hello Fujii --

Thanks for the clarification. It's clear my understanding of the recovery process is lacking. My naive assumption was that Postgres would recover using whatever files were available, and when it ran out of files it would stop there and come up. And that if recovery.conf were renamed, it would stop copying files from the wal_archive into pg_xlog. Thus, without the recovery.conf file, the database would just come up, without expecting or waiting on additional files.

I see my assumption was wrong, but I think you can agree that it is not surprising someone could expect things to work this way if they aren't directly familiar with the code. I think you can also see how seeing the message "If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target" in the log would lead me to believe my database was corrupted.

It is good to know that if I had left recovery.conf in place and just removed the trigger file, the issue would have resolved itself. I'm happy to hear the database was not, in fact, corrupted by this error.

Perhaps it's best to chalk this up to a scenario that creates a confusing, hard-to-diagnose issue -- one that easily looks like corruption, but thankfully is not. Hopefully if anyone tuning into this thread experiences or hears of similar fail-over problems in the future (say on IRC), they'll remember to check the permissions on the trigger file.

Thanks again,
Mason

On Fri, Jan 29, 2010 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jan 29, 2010 at 11:49 PM, Mason Hale <mason@onespot.com> wrote:
>> While I did not remove the trigger file, I did rename recovery.conf to
>> recovery.conf.old. That file contained the restore_command configuration
>> that identified the trigger file. So that rename should have eliminated
>> the problem. But it didn't.
>> Even after making this change and taking the trigger file out of the
>> equation my database failed to come online.
>
> Renaming of the recovery.conf doesn't resolve the problem at all. Instead,
> the sysadmin had to remove only the trigger file with a wrong permission
> and just restart postgres.
>
>>> 9.) The server did not come up (again). This time the contents of the
>>> new postgresql.log file were:
>>>
>>> [postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
>>> 2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
>>> 2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
>>> 2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
>>> 2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
>>> 2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
>>> 2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
>>> 2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
>>> 2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure
>
> You seem to focus on the above trouble. I think that this happened because
> recovery.conf was deleted and restore_command was not given. In fact, the
> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
> was unable to be restored from the archive because restore_command was
> not supplied. Then recovery failed.
> > If the sysadmin had left the recovery.conf and removed the trigger file, > pg_standby in restore_command would have restored all WAL files required > for recovery, and recovery would advance well. > > Hope this helps. > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center >
On Fri, Jan 29, 2010 at 11:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> You seem to focus on the above trouble. I think that this happened because
> recovery.conf was deleted and restore_command was not given. In fact, the
> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
> was unable to be restored from the archive because restore_command was
> not supplied. Then recovery failed.
>
> If the sysadmin had left the recovery.conf and removed the trigger file,
> pg_standby in restore_command would have restored all WAL files required
> for recovery, and recovery would advance well.

That may be true, but it certainly seems unfortunate that we don't handle this case a bit more gracefully.

...Robert
Robert Haas wrote:
> On Fri, Jan 29, 2010 at 11:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> You seem to focus on the above trouble. I think that this happened because
>> recovery.conf was deleted and restore_command was not given. In fact, the
>> WAL file (e.g., pg_xlog/0000000200003C82000000A3) required for recovery
>> was unable to be restored from the archive because restore_command was
>> not supplied. Then recovery failed.
>>
>> If the sysadmin had left the recovery.conf and removed the trigger file,
>> pg_standby in restore_command would have restored all WAL files required
>> for recovery, and recovery would advance well.
>
> That may be true, but it certainly seems unfortunate that we don't
> handle this case a bit more gracefully.

Yeah. I don't think we can do much better in versions <= 8.3, though we should change that unlink() call to cause the FATAL error in a more explicit way; it seems accidental and possibly non-portable as it is.

Adding the note to the docs that Mason suggested is a good idea.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
>>> If the sysadmin had left the recovery.conf and removed the trigger file,
>>> pg_standby in restore_command would have restored all WAL files required
>>> for recovery, and recovery would advance well.
>>
>> That may be true, but it certainly seems unfortunate that we don't
>> handle this case a bit more gracefully.
>
> Yeah. I don't think we can do much better in versions <= 8.3, though we
> should change that unlink() call to cause the FATAL error in a more
> explicit way; it seems accidental and possibly non-portable as it is.
>
> Adding the note to the docs that Mason suggested is a good idea.

Given that this situation did NOT actually cause corruption, rather the error message was mangled such that it suggested corruption, I offer this revised suggestion for an update to the documentation:

> Important note: It is critical the trigger file be created with permissions
> allowing the postgres process to remove the file. Generally this is best
> done by creating the file from the postgres user account. Failure to do so
> will prevent completion of WAL file recovery and prevent the server from
> coming back online successfully.

Best regards,
Mason
Actually, I think there's a tiny harmless bug in the server too. When it prints the error message:

2010-01-18 21:08:31 UTC ()FATAL: could not restore file "0000000200003C82000000D8" from archive: return code 65280

that return code is not the return code that came from the restore_command. That is, if you do exit(200) in the command, you won't see "return code 200"; you see the return value from system(). system() doesn't return the exit code directly, but an encoded integer that includes the exit code. You're supposed to call WEXITSTATUS() on it to get the real return code.

That only affects the error message and is harmless otherwise, but I thought I'd mention it. I'll fix it, unless someone wants to argue that it's more useful to print the raw return value of system(), because it might contain more information, like which signal killed the process, that you could extract from the cryptic error code using e.g. the WTERMSIG() macro.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Actually, I think there's a tiny harmless bug in the server too. When it
> prints the error message:
>
> 2010-01-18 21:08:31 UTC ()FATAL: could not restore file
> "0000000200003C82000000D8" from archive: return code 65280
>
> That return code is not the return code that came from the
> restore_command. Ie if you do exit(200) in the command, you won't see
> "return code 200", but you see the return value from system(). system()
> doesn't return the exit code directly, but an encoded integer that
> includes the exit code. You're supposed to call WEXITSTATUS() on it to
> get the real return code.
>
> That only affects the error message and is harmless otherwise, but I
> thought I'd mention it. I'll fix it, unless someone wants to argue that
> it's more useful to print the raw return value of system(), because it
> might contain more information like which signal killed the process,
> that you could extract from the cryptic error code using e.g. the
> WTERMSIG() macro.

An "if" statement would seem to be in order, so that you can print out either the exit code or the signal number as appropriate.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
>> That only affects the error message and is harmless otherwise, but I
>> thought I'd mention it. I'll fix it, unless someone wants to argue that
>> it's more useful to print the raw return value of system(), because it
>> might contain more information like which signal killed the process,
>> that you could extract from the cryptic error code using e.g. the
>> WTERMSIG() macro.

> An "if" statement would seem to be in order, so that you can print out
> either the exit code or the signal number as appropriate.

Yes. Please see the existing code in the postmaster that prints subprocess exit codes, and duplicate it (or perhaps refactor so you can avoid code duplication; though translatability of messages might limit what you can do there).

			regards, tom lane
Mason Hale wrote:
> Given that this situation did NOT actually cause corruption, rather the
> error message was mangled such that it suggested corruption, I offer this
> revised suggestion for an update to the documentation:
>
>> Important note: It is critical the trigger file be created with permissions
>> allowing the postgres process to remove the file. Generally this is best
>> done by creating the file from the postgres user account. Failure to do so
>> will prevent completion of WAL file recovery and prevent the server from
>> coming back online successfully.

Ok, I've added this note to the 8.3 docs now.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jan 29, 2010 at 3:32 PM, Heikki Linnakangas
>>> That only affects the error message and is harmless otherwise, but I
>>> thought I'd mention it. I'll fix it, unless someone wants to argue that
>>> it's more useful to print the raw return value of system(), because it
>>> might contain more information, like which signal killed the process,
>>> that you could extract from the cryptic error code using e.g. the
>>> WTERMSIG() macro.
>
>> An "if" statement would seem to be in order, so that you can print out
>> either the exit code or the signal number as appropriate.
>
> Yes. Please see the existing code in the postmaster that prints
> subprocess exit codes, and duplicate it (or perhaps refactor so you can
> avoid code duplication; though translatability of messages might limit
> what you can do there).

Here's what I came up with. Translatability indeed makes it pretty hard;
I ended up copy-pasting.

BTW, I don't think I'm going to bother or risk back-patching this. It
was harmless, and for forensic purposes all the information was there in
the old message too, just in a cryptic format. And the new messages
would need translating too.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 546,551 **** static void pg_start_backup_callback(int code, Datum arg);
--- 546,552 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void rm_redo_error_callback(void *arg);
  static int get_sync_bit(int method);
+ static int errdetail_childexit(const char *procname, int exitstatus);

  /*
***************
*** 2959,2966 **** RestoreArchivedFile(char *path, const char *xlogfname,
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : DEBUG2,
! 		(errmsg("could not restore file \"%s\" from archive: return code %d",
! 				xlogfname, rc)));

  not_available:

  	/*
--- 2960,2967 ----
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : DEBUG2,
! 		(errmsg("could not restore file \"%s\" from archive", xlogfname),
! 		 errdetail_childexit("restore_command", rc)));

  not_available:

  	/*
***************
*** 3077,3088 **** ExecuteRecoveryEndCommand(void)
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : WARNING,
! 		(errmsg("recovery_end_command \"%s\": return code %d",
! 				xlogRecoveryEndCmd, rc)));
  	}
  }

  /*
   * Preallocate log files beyond the specified log endpoint.
   *
   * XXX this is currently extremely conservative, since it forces only one
--- 3078,3149 ----
  	signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

  	ereport(signaled ? FATAL : WARNING,
! 		(errmsg("recovery end command \"%s\" failed", xlogRecoveryEndCmd),
! 		 errdetail_childexit("recovery_end_command", rc)));
  	}
  }

  /*
+  * Add detail information (and possibly a hint) describing why a child
+  * process exited to the error currently being constructed. This is for
+  * use inside ereport, e.g:
+  *
+  * ereport(DEBUG1,
+  *		(errmsg(...),
+  *		 (errdetail_childexit("mycommand", rc))));
+  *
+  * 'procname' is a name of the exited command, and 'exitstatus' is a return
+  * code as returned by e.g wait(2).
+  *
+  * The logic is copied from LogChildExit() in postmaster/postmaster.c. If
+  * you change this, make sure you keep LogChildExit in sync.
+  */
+ static int
+ errdetail_childexit(const char *procname, int exitstatus)
+ {
+ 	if (WIFEXITED(exitstatus))
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s exited with exit code %d",
+ 				  procname, WEXITSTATUS(exitstatus));
+ 	else if (WIFSIGNALED(exitstatus))
+ 	{
+ #if defined(WIN32)
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by exception 0x%X",
+ 				  procname, WTERMSIG(exitstatus));
+ 		errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value.");
+ #elif defined(HAVE_DECL_SYS_SIGLIST) && HAVE_DECL_SYS_SIGLIST
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by signal %d: %s",
+ 				  procname, WTERMSIG(exitstatus),
+ 				  WTERMSIG(exitstatus) < NSIG ?
+ 				  sys_siglist[WTERMSIG(exitstatus)] : "(unknown)");
+ #else
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s was terminated by signal %d",
+ 				  procname, WTERMSIG(exitstatus));
+ #endif
+ 	}
+ 	else
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "restore_command" */
+ 		errdetail("%s exited with unrecognized status %d",
+ 				  procname, exitstatus);
+ 
+ 	/* return dummy value, so that this can be used inside ereport(...) */
+ 	return 1;
+ }
+ 
+ /*
   * Preallocate log files beyond the specified log endpoint.
   *
   * XXX this is currently extremely conservative, since it forces only one
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 2770,2775 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 2770,2779 ----

  /*
   * Log the death of a child process.
+  *
+  * errdetail_childexit() function in xlog.c uses the same logic for
+  * constructing the message. If you change this function, remember to
+  * keep errdetail_childexit() in sync.
   */
  static void
  LogChildExit(int lev, const char *procname, int pid, int exitstatus)
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Yes. Please see the existing code in the postmaster that prints
>> subprocess exit codes, and duplicate it (or perhaps refactor so you can
>> avoid code duplication; though translatability of messages might limit
>> what you can do there).

> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.

Looks reasonable to me. One possible objection is that someone running
in "terse" logging mode would not get any information about the child
process's exit code, since errdetail would be suppressed in that case.
But we can probably live with that --- as far as I can tell, few people
use terse-mode logging.

			regards, tom lane
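For context on Tom's objection: whether DETAIL (and HINT) lines reach the server log at all is controlled by the log_error_verbosity setting, e.g. in postgresql.conf:

# postgresql.conf
# "terse" omits DETAIL, HINT, QUERY, and CONTEXT lines, so the new
# errdetail_childexit() output would be lost; "default" keeps them.
log_error_verbosity = default    # terse, default, or verbose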
Heikki Linnakangas wrote:
> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.

Looks sane to me too; msgmerge segfaults, though, so I couldn't test.

Two minor comments:

> +		/*------
> +		  translator: %s is a noun phrase describing a child process, such as
> +		  "restore_command" */
> +		errdetail("%s was terminated by signal %d: %s",

You probably want "the first %s is ... and the second %s is the signal
name".

And here:

> +				  sys_siglist[WTERMSIG(exitstatus)] : "(unknown)");

Should probably be _("(unknown)")

--
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 10, 2010 at 4:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's what I came up with. Translatability indeed makes it pretty hard,
> I ended up copy-pasting.
>
> BTW, I don't think I'm going to bother or risk back-patching this. It
> was harmless, and for forensic purposes all the information was there in
> the old message too, just in a cryptic format. And the new messages
> would need translating too.

This looks applicable also to archive_command. Are you going to apply it
there as well?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center