Обсуждение: testing cvs HEAD - HS/SR - cannot stat
Hello, Testing 9.0devel HS/SR: I used cvs HEAD (today) with the new_smart_shutdown_20100201.patch of Fujii Masao. Replication works well. The slave can be stopped and restarted. However, whenever (re-)starting the slave the I get messages like: cp: cannot stat `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': No such file or directory At this point, /var/data1/pg_stuff/dump/replication_archive: -rw------- 1 xxxxxxxx xxxxxxxx 240 Feb 3 22:35 000000010000000000000001.00000020.backup -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb 3 22:35 000000010000000000000001 -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb 3 22:35 000000010000000000000000 Maybe the message is not really a bug: everything works fine otherwise. thanks, Erik Rijkers # primary ./data/postgresql.conf data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data' max_connections = 100 shared_buffers = 32MB checkpoint_segments = 50 archive_mode = 'on' archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f' max_wal_senders = 5 # standby ./data/postgresql.conf data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_slavery/data' port = 6566 max_connections = 100 shared_buffers = 32MB checkpoint_segments = 50 archive_mode = 'on' archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f' max_wal_senders = 5 # standby ./data/recovery.conf standby_mode = 'on' primary_conninfo = 'host=/tmp port=6565 user=aardvark' restore_command = 'cp -i /var/data1/pg_stuff/dump/replication_archive/%f %p </dev/null' Centos, Linux 2.6.18-164.el5 x86_64 ./configure \--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary \--with-pgport=6565 --quiet --enable-cassert--enable-debug \--enable-depend --with-openssl --with-perl --with-libxml \--with-libxslt
On Thu, Feb 4, 2010 at 6:39 AM, Erik Rijkers <er@xs4all.nl> wrote: > However, whenever (re-)starting the slave the I get > messages like: > > cp: cannot stat `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': No such > file or directory > > At this point, /var/data1/pg_stuff/dump/replication_archive: > > -rw------- 1 xxxxxxxx xxxxxxxx 240 Feb 3 22:35 000000010000000000000001.00000020.backup > -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb 3 22:35 000000010000000000000001 > -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb 3 22:35 000000010000000000000000 > > > Maybe the message is not really a bug: everything works fine otherwise. Yeah, this is not a bug. At first, the standby performs an archive recovery until an invalid WAL record is found. Then it starts replication and tries to receive the missing WAL records from the primary. So such an error message would be logged whenever an invalid record is found and replication is started. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> Yeah, this is not a bug. > > At first, the standby performs an archive recovery until an invalid > WAL record is found. Then it starts replication and tries to receive > the missing WAL records from the primary. So such an error message > would be logged whenever an invalid record is found and replication > is started. Can we improve the error message? Right now it's alarming people. Such as: cannot stat `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': End of Log Of course, it would be even better to supress this error message entirely unless the user is at DEBUG1 or better. --Josh Berkus
Josh Berkus wrote: > Can we improve the error message? Right now it's alarming people. Such as: > > cannot stat > `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': > End of Log Not really, it's coming from 'cp'. Not sure if we could capture the stderr and somehow decorate the message to make it less alarming. This isn't really a new problem, PITR has always done that, but I agree it's always been a bit clumsy to have those messages in the log. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2010/2/4 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > Josh Berkus wrote: >> Can we improve the error message? Right now it's alarming people. Such as: >> >> cannot stat >> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': >> End of Log > > Not really, it's coming from 'cp'. Not sure if we could capture the > stderr and somehow decorate the message to make it less alarming. > > This isn't really a new problem, PITR has always done that, but I agree > it's always been a bit clumsy to have those messages in the log. Can we follow it up with a well-phrased message that we switched to streaming mode right after it, so people realize it's not a problem? -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
On Thu, February 4, 2010 19:29, Heikki Linnakangas wrote: > Josh Berkus wrote: >> Can we improve the error message? Right now it's alarming people. Such as: >> >> cannot stat >> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': >> End of Log > > Not really, it's coming from 'cp'. Not sure if we could capture the > stderr and somehow decorate the message to make it less alarming. > > This isn't really a new problem, PITR has always done that, but I agree > it's always been a bit clumsy to have those messages in the log. It was (eventually) clear to me that it was not really a problem, but only because I was testing anyway :) I think most people would do what I did: search the documentation for the string 'No such file or directory' or 'cannot stat'. (But that produces no mention of this (non-)problem.) ISTM mentioning in the documentation is good enough. Erik Rijkers
Magnus Hagander wrote: > 2010/2/4 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: >> Josh Berkus wrote: >>> Can we improve the error message? Right now it's alarming people. Such as: >>> >>> cannot stat >>> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': >>> End of Log >> Not really, it's coming from 'cp'. Not sure if we could capture the >> stderr and somehow decorate the message to make it less alarming. >> >> This isn't really a new problem, PITR has always done that, but I agree >> it's always been a bit clumsy to have those messages in the log. > > Can we follow it up with a well-phrased message that we switched to > streaming mode right after it, so people realize it's not a problem? Yeah, that would be a good idea. We should give the messages we emit some holistic thinking. There's quite a few new messages printed thanks to both Hot Standby and Streaming replication, and it seems we still need a few more. Which ones are really required, and which ones are just noise? Here's what I see in a test standby server: > LOG: database system was interrupted while in recovery at log time 2010-02-04 20:45:40 EET > HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recoverytarget. Can we avoid printing this? It's a bit alarming to talk about corrupted data. > LOG: starting archive recovery > LOG: restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p' > LOG: standby_mode = 'true' > LOG: primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass' > LOG: trigger_file = '/tmp/standby-trigger' Do we really need to echo all the lines in recovery.conf? That might be interesting information, but perhaps it could be condensed and worded more nicely. > cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory This is the noise line Josh started this thread with. > LOG: automatic recovery in progress Ok, so what? > LOG: initializing recovery connections Seems like unnecessary noise, doesn't mean anything to a DBA. > LOG: redo starts at 0/700140C I guess this could be useful debug information sometimes. > LOG: consistent recovery state reached at 0/700142C It's nice to know that it has reached consistency, but was there any way to know before this line that it hadn't been reached yet? Perhaps the "redo starts" line should mention that consistency hasn't been reached yet. > LOG: database system is ready to accept read only connections This is an important piece of information with Hot Standby. Now, should we print a line when we connect to the master successfully? Seems like useful information. Then there's the LOG lines whenever a file is restored successfully from archive; are the necessary anymore, now that you can connect to the standby and use pg_last_xlog_replay_location() to poll its status? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Feb 5, 2010 at 3:58 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> LOG: database system was interrupted while in recovery at log time 2010-02-04 20:45:40 EET >> HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recoverytarget. > > Can we avoid printing this? It's a bit alarming to talk about corrupted > data. I think this is useful information as a hint of data corruption. But it's odd for this to be reported even when restarting the standby server that has been previously shut down normally (i.e., in fact shutdown mode). How about adding new system status like DB_SHUTDOWNED_IN_RECOVERY, setting the status to it when the shutdown is performed during recovery, and reporting the suitable message when starting up the server from it? >> LOG: starting archive recovery This is reported even if restore_command is not given and the WAL files are never restored from the archive. We should get rid of this in that case? >> LOG: restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p' >> LOG: standby_mode = 'true' >> LOG: primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass' >> LOG: trigger_file = '/tmp/standby-trigger' > > Do we really need to echo all the lines in recovery.conf? That might be > interesting information, but perhaps it could be condensed and worded > more nicely. It's OK for me to move them from LOG to DEBUG. >> cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory > > This is the noise line Josh started this thread with. Agreed. But the messages other than ENOENT that restore_command emits might be useful. >> LOG: automatic recovery in progress > > Ok, so what? Seems unnecessary. >> LOG: initializing recovery connections > > Seems like unnecessary noise, doesn't mean anything to a DBA. Agreed. >> LOG: redo starts at 0/700140C > > I guess this could be useful debug information sometimes. Agreed. >> LOG: consistent recovery state reached at 0/700142C > > It's nice to know that it has reached consistency, but was there any way > to know before this line that it hadn't been reached yet? Perhaps the > "redo starts" line should mention that consistency hasn't been reached yet. But redo might restart from the consistent database if the standby server was shut down after it reached the consistent status. >> LOG: database system is ready to accept read only connections > > This is an important piece of information with Hot Standby. Agreed. > Now, should we print a line when we connect to the master successfully? > Seems like useful information. Agreed. > Then there's the LOG lines whenever a file is restored successfully from > archive; are the necessary anymore, now that you can connect to the > standby and use pg_last_xlog_replay_location() to poll its status? How about moving those messages from LOG to DEBUG? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Ok, I committed a patch to reduce the chatter a bit: Fujii Masao wrote: > On Fri, Feb 5, 2010 at 3:58 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> LOG: starting archive recovery > > This is reported even if restore_command is not given and the WAL files are > never restored from the archive. We should get rid of this in that case? Yeah, removed. >>> LOG: restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p' >>> LOG: standby_mode = 'true' >>> LOG: primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass' >>> LOG: trigger_file = '/tmp/standby-trigger' >> Do we really need to echo all the lines in recovery.conf? That might be >> interesting information, but perhaps it could be condensed and worded >> more nicely. > > It's OK for me to move them from LOG to DEBUG. Done. >>> cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory >> This is the noise line Josh started this thread with. > > Agreed. But the messages other than ENOENT that restore_command emits > might be useful. > >>> LOG: automatic recovery in progress >> Ok, so what? > > Seems unnecessary. I replaced this with a more informative message that says either: "entering standby mode", or "starting point-in-time recovery to XID %u", or "starting archive recovery" >>> LOG: initializing recovery connections >> Seems like unnecessary noise, doesn't mean anything to a DBA. > > Agreed. Demoted to DEBUG1. >>> LOG: redo starts at 0/700140C >> I guess this could be useful debug information sometimes. > > Agreed. I left this as it was. >>> LOG: consistent recovery state reached at 0/700142C >> It's nice to know that it has reached consistency, but was there any way >> to know before this line that it hadn't been reached yet? Perhaps the >> "redo starts" line should mention that consistency hasn't been reached yet. > > But redo might restart from the consistent database if the standby > server was shut down after it reached the consistent status. Yeah, it would be nice to say whether the database is already consistent or not, but I've left this alone for now. >> Now, should we print a line when we connect to the master successfully? >> Seems like useful information. > > Agreed. > >> Then there's the LOG lines whenever a file is restored successfully from >> archive; are the necessary anymore, now that you can connect to the >> standby and use pg_last_xlog_replay_location() to poll its status? > > How about moving those messages from LOG to DEBUG? Didn't touch these yet. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com