Thread: Recovery_target_time misinterpreted?

Recovery_target_time misinterpreted?

From: Klaus Ita
Date:
Hello PG Experts!

I have restored a Database Cluster with a recovery_target_time set to

recovery_target_time =  '2013-07-27 21:20:17.127664+00'
recovery_target_inclusive = false


Now it seems the restore actually restored to some earlier point in time (the 18th rather than the 27th). Is there an explanation for this huge gap? Is that the last 'consistent state'?
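For reference, a minimal recovery.conf for a 9.1 point-in-time recovery along these lines would look roughly like the sketch below (the restore_command path is just a placeholder, and pause_at_recovery_target is optional):

restore_command = 'cp /path/to/wal_archive/%f %p'        # placeholder path
recovery_target_time = '2013-07-27 21:20:17.127664+00'
recovery_target_inclusive = 'false'
pause_at_recovery_target = 'true'                        # optional in 9.1: pause instead of promoting once the target is reached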

pg version:
xaxos_ch=> select version();
                                           version                                           
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)


That's probably somewhat irrelevant, though, as my question is more general.


thank you for your help!

klaus

Re: Recovery_target_time misinterpreted?

From: Albe Laurenz
Date:
Klaus Ita wrote:
> I have restored a Database Cluster with a recovery_target_time set to
> 
> recovery_target_time =  '2013-07-27 21:20:17.127664+00'
> recovery_target_inclusive = false
> 
> 
> 
> Now it seems the restore actually restored to some earlier point in time (the 18th rather than the
> 27th). Is there an explanation for this huge gap? Is that the last 'consistent state'?

Maybe the log entries created during restore can answer the question.

Yours,
Laurenz Albe

Re: Recovery_target_time misinterpreted?

From: Klaus Ita
Date:
2013-07-30 11:15:15 UTC <%> LOG:  starting point-in-time recovery to 2013-07-27 21:20:17.127664+00
2013-07-30 11:15:15 UTC <%> LOG:  restored log file "00000001000002300000005C" from archive
2013-07-30 11:15:15 UTC <%> LOG:  restored log file "00000001000002300000005A" from archive
2013-07-30 11:15:15 UTC <%> LOG:  redo starts at 230/5ACD7CC0
...
...
...
2013-07-30 14:28:45 UTC <%> LOG:  restored log file "000000010000026400000002" from archive
2013-07-30 14:28:45 UTC <%> LOG:  unexpected pageaddr 263/C706C000 in log file 612, segment 2, offset 442368
2013-07-30 14:28:45 UTC <%> LOG:  redo done at 264/20698A8
2013-07-30 14:28:45 UTC <%> LOG:  last completed transaction was at log time 2013-07-18 11:42:22.121512+00
2013-07-30 14:28:45 UTC <%> LOG:  restored log file "000000010000026400000002" from archive
cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000002.history*': No such file or directory
mv: cannot stat `/tmp/00000002.history': No such file or directory
2013-07-30 14:28:45 UTC <%> LOG:  selected new timeline ID: 2
cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000001.history*': No such file or directory
mv: cannot stat `/tmp/00000001.history': No such file or directory
2013-07-30 14:28:45 UTC <%> LOG:  archive recovery complete
2013-07-30 14:29:09 UTC <%> LOG:  autovacuum launcher started
2013-07-30 14:29:09 UTC <%> LOG:  database system is ready to accept connections


Well, that does not indicate anything to me.



On Wed, Jul 31, 2013 at 9:37 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
Klaus Ita wrote:
> I have restored a Database Cluster with a recovery_target_time set to
>
> recovery_target_time =  '2013-07-27 21:20:17.127664+00'
> recovery_target_inclusive = false
>
>
>
> Now it seems the restore actually restored to some earlier point in time (the 18th rather than the
> 27th). Is there an explanation for this huge gap? Is that the last 'consistent state'?

Maybe the log entries created during restore can answer the question.

Yours,
Laurenz Albe

Re: Recovery_target_time misinterpreted?

From: Albe Laurenz
Date:
Klaus Ita wrote:
>>> I have restored a Database Cluster with a recovery_target_time set to
>>>
>>> recovery_target_time =  '2013-07-27 21:20:17.127664+00'
>>> recovery_target_inclusive = false
>>>
>>>
>>>
>>> Now it seems the restore actually restored to some earlier point in time (the 18th rather than the
>>> 27th). Is there an explanation for this huge gap? Is that the last 'consistent state'?
>> 
>> 
>> Maybe the log entries created during restore can answer the question.

> 2013-07-30 11:15:15 UTC <%> LOG:  starting point-in-time recovery to 2013-07-27 21:20:17.127664+00
> 2013-07-30 11:15:15 UTC <%> LOG:  restored log file "00000001000002300000005C" from archive
> 2013-07-30 11:15:15 UTC <%> LOG:  restored log file "00000001000002300000005A" from archive
> 2013-07-30 11:15:15 UTC <%> LOG:  redo starts at 230/5ACD7CC0
> ...
> ...
> ...
> 2013-07-30 14:28:45 UTC <%> LOG:  restored log file "000000010000026400000002" from archive
> 2013-07-30 14:28:45 UTC <%> LOG:  unexpected pageaddr 263/C706C000 in log file 612, segment 2, offset
> 442368
> 2013-07-30 14:28:45 UTC <%> LOG:  redo done at 264/20698A8
> 2013-07-30 14:28:45 UTC <%> LOG:  last completed transaction was at log time 2013-07-18
> 11:42:22.121512+00
> 2013-07-30 14:28:45 UTC <%> LOG:  restored log file "000000010000026400000002" from archive
> cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000002.history*': No such file or
> directory
> mv: cannot stat `/tmp/00000002.history': No such file or directory
> 2013-07-30 14:28:45 UTC <%> LOG:  selected new timeline ID: 2
> cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000001.history*': No such file or
> directory
> mv: cannot stat `/tmp/00000001.history': No such file or directory
> 2013-07-30 14:28:45 UTC <%> LOG:  archive recovery complete
> 2013-07-30 14:29:09 UTC <%> LOG:  autovacuum launcher started
> 2013-07-30 14:29:09 UTC <%> LOG:  database system is ready to accept connections
> 
> well, that does not indicate anything for me.

To me it indicates that log file 000000010000026400000002 might be corrupt.

PostgreSQL stops replaying WAL after it detects a corrupt WAL record.

Do you have a second copy of the WAL file?
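If you do, a quick sanity check would be to compare checksums of that segment across the copies before retrying, along these lines (the archive paths here are only placeholders):

sha1sum /path/to/archive_copy_1/000000010000026400000002 \
        /path/to/archive_copy_2/000000010000026400000002

If the checksums differ, one of the copies may still replay past that point.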

Yours,
Laurenz Albe

Re: Recovery_target_time misinterpreted?

From: Klaus Ita
Date:
No, it's super frustrating. While I do the recovery, it says it reaches a consistent recovery state, and I just cannot find a way to convince pg to stop at that state:



2013-08-02 09:23:25 GMT DEBUG:  postgres: PostmasterMain: initial environment dump:
2013-08-02 09:23:25 GMT DEBUG:  -----------------------------------------
2013-08-02 09:23:25 GMT DEBUG:          PG_GRANDPARENT_PID=9077
2013-08-02 09:23:25 GMT DEBUG:          PGLOCALEDIR=/usr/share/locale
2013-08-02 09:23:25 GMT DEBUG:          PGSYSCONFDIR=/etc/postgresql-common
2013-08-02 09:23:25 GMT DEBUG:          LANG=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG:          PWD=/var/lib/postgresql
2013-08-02 09:23:25 GMT DEBUG:          PGDATA=/var/lib/postgresql/9.1/main
2013-08-02 09:23:25 GMT DEBUG:          LC_COLLATE=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG:          LC_CTYPE=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG:          LC_MESSAGES=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG:          LC_MONETARY=C
2013-08-02 09:23:25 GMT DEBUG:          LC_NUMERIC=C
2013-08-02 09:23:25 GMT DEBUG:          LC_TIME=C
2013-08-02 09:23:25 GMT DEBUG:  -----------------------------------------


2013-08-02 11:23:26 CEST DEBUG:  invoking IpcMemoryCreate(size=32399360)
2013-08-02 11:23:26 CEST DEBUG:  removing file "pg_notify/0000"
2013-08-02 11:23:26 CEST DEBUG:  max_safe_fds = 982, usable_fds = 1000, already_open = 8
2013-08-02 11:23:26 CEST LOG:  database system was interrupted while in recovery at log time 2013-07-29 11:45:24 CEST
2013-08-02 11:23:26 CEST HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-08-02 11:23:26 CEST DEBUG:  restore_command = 'cp /home/validad-pg-backups/pgmaster/wal_files/%f.gz /tmp/%f.gz && gunzip /tmp/%f.gz && mv /tmp/%f %p'
2013-08-02 11:23:26 CEST DEBUG:  trigger_file = '/var/lib/postgresql/9.1/main/stop_replication_trigger'
2013-08-02 11:23:26 CEST LOG:  starting archive recovery
2013-08-02 11:23:26 CEST DEBUG:  executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002C.gz /tmp/000000010000027A0000002C.gz && gunzip /tmp/000000010000027A0000002C.gz && mv /tmp/000000010000027A0000002C pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:26 CEST DEBUG:  forked new backend, pid=9090 socket=9
2013-08-02 11:23:26 CEST LOG:  incomplete startup packet
2013-08-02 11:23:26 CEST DEBUG:  shmem_exit(0): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG:  proc_exit(0): 1 callbacks to make
2013-08-02 11:23:26 CEST DEBUG:  exit(0)
2013-08-02 11:23:26 CEST DEBUG:  shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG:  proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG:  reaping dead processes
2013-08-02 11:23:26 CEST DEBUG:  server process (PID 9090) exited with exit code 0
2013-08-02 11:23:26 CEST LOG:  restored log file "000000010000027A0000002C" from archive
2013-08-02 11:23:26 CEST DEBUG:  got WAL segment from archive
2013-08-02 11:23:26 CEST DEBUG:  checkpoint record is at 27A/2CB77750
2013-08-02 11:23:26 CEST DEBUG:  redo record is at 27A/2CB77750; shutdown TRUE
2013-08-02 11:23:26 CEST DEBUG:  next transaction ID: 0/381985248; next OID: 1201662
2013-08-02 11:23:26 CEST DEBUG:  next MultiXactId: 130079; next MultiXactOffset: 272843
2013-08-02 11:23:26 CEST DEBUG:  oldest unfrozen transaction ID: 197713560, in database 331065
2013-08-02 11:23:26 CEST DEBUG:  transaction ID wrap limit is 2345197207, limited by database with OID 331065
2013-08-02 11:23:26 CEST DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-08-02 11:23:26 CEST LOG:  redo starts at 27A/2CB777A8
2013-08-02 11:23:26 CEST DEBUG:  executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002D.gz /tmp/000000010000027A0000002D.gz && gunzip /tmp/000000010000027A0000002D.gz && mv /tmp/000000010000027A0000002D pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:27 CEST DEBUG:  forked new backend, pid=9098 socket=9
2013-08-02 11:23:27 CEST FATAL:  the database system is starting up
2013-08-02 11:23:27 CEST DEBUG:  shmem_exit(1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  proc_exit(1): 1 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  exit(1)
2013-08-02 11:23:27 CEST DEBUG:  shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  reaping dead processes
2013-08-02 11:23:27 CEST DEBUG:  server process (PID 9098) exited with exit code 1
2013-08-02 11:23:27 CEST LOG:  restored log file "000000010000027A0000002D" from archive
2013-08-02 11:23:27 CEST DEBUG:  got WAL segment from archive
2013-08-02 11:23:27 CEST DEBUG:  executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002E.gz /tmp/000000010000027A0000002E.gz && gunzip /tmp/000000010000027A0000002E.gz && mv /tmp/000000010000027A0000002E pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:27 CEST DEBUG:  forked new backend, pid=9105 socket=9
2013-08-02 11:23:27 CEST FATAL:  the database system is starting up
2013-08-02 11:23:27 CEST DEBUG:  shmem_exit(1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  proc_exit(1): 1 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  exit(1)
2013-08-02 11:23:27 CEST DEBUG:  shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG:  reaping dead processes
2013-08-02 11:23:27 CEST DEBUG:  server process (PID 9105) exited with exit code 1
2013-08-02 11:23:28 CEST LOG:  restored log file "000000010000027A0000002E" from archive
2013-08-02 11:23:28 CEST DEBUG:  got WAL segment from archive
2013-08-02 11:23:28 CEST LOG:  consistent recovery state reached at 27A/2E3F42E8
2013-08-02 11:23:28 CEST PANIC:  _bt_restore_page: cannot add item to page
2013-08-02 11:23:28 CEST CONTEXT:  xlog redo split_r: rel 1663/16405/797541 left 4743, right 18008, next 9681, level 0, firstright 194
2013-08-02 11:23:28 CEST DEBUG:  reaping dead processes
2013-08-02 11:23:28 CEST LOG:  startup process (PID 9086) was terminated by signal 6: Aborted
2013-08-02 11:23:28 CEST LOG:  terminating any other active server processes
2013-08-02 11:23:28 CEST DEBUG:  sending SIGQUIT to process 9092
2013-08-02 11:23:28 CEST DEBUG:  shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG:  proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG:  reaping dead processes
2013-08-02 11:23:28 CEST DEBUG:  shmem_exit(1): 3 callbacks to make
2013-08-02 11:23:28 CEST DEBUG:  proc_exit(1): 3 callbacks to make
2013-08-02 11:23:28 CEST DEBUG:  exit(1)
2013-08-02 11:23:28 CEST DEBUG:  shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG:  proc_exit(-1): 0 callbacks to make


Now with the debug5 log setting.

I have something like three different copies of the WAL files, but they were all generated by pg in this way: the backup script always takes the original and uses it as the source to copy to three different locations, so the sha1sum of the files is always the same :(

I don't mind that so much; we have recovered some backups and replayed from another standby system. But I need to understand why I cannot reactivate this cluster at this state:

'2013-08-02 11:23:28 CEST LOG:  consistent recovery state reached at 27A/2E3F42E8'
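As a sketch of how to see where that consistency point comes from, pg_controldata on the restored data directory should report the minimum recovery ending location (binary path assumed for the Debian 9.1 package; the data directory is the one from the environment dump above):

/usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main | grep -i 'minimum recovery'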

I am posting to the bugs list, too, as I am starting to think this is one.

Regards, k