Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio
От | Heikki Linnakangas |
---|---|
Тема | Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio |
Дата | |
Msg-id | 535F538E.9020206@vmware.com обсуждение исходный текст |
Ответ на | BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio (skeefe@rdx.com) |
Список | pgsql-bugs |
On 04/25/2014 08:43 PM, skeefe@rdx.com wrote: > The issues that we are experiencing is with Postgres 9.2.8 Cascading WAL > Replication. If the master goes down during a massive transaction and we > promote the first slave then next slave looks for a WAL log that never > existed, New timeline before the split of timelines. I can't reproduce this. Would it be possible to create a self-contained script that reproduces the whole scenario? Something like the attached (which I used to try to reproduce this). > Below is how to recreate the issue: > > 1. Create M using postgresql.conf_M. Start M. > CREATE TABLE t_test (id int4); > > 2. Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used > rsync). Start S1 > > 3. Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used > rsync). Start S2 > > 4. Insert data in t_test table in M > INSERT INTO t_test SELECT * FROM generate_series(1, 250000) ; > 5. Important: Do not shutdown M. If you want you can crash M by killing > pids. I just let it run and immediately proceeded to next step. The idea > here is to promote S1 before M transmits the last WAL which has the COMMIT > of the above INSERT. > > 6. Promote S1. S1 will change its timeline. > > 7. S2 will not recognize the new timeline of its master S1. Yeah, that's expected behavior, or a known issue if you will, which was fixed in 9.3. However, S1 should automatically terminate the connection, with a message in the log like this: LOG: terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect That should allow S2 to find the new timeline, without restarting, as long as you have a WAL archive set up. > PGSTOP S2 and > then PGSTART. S2 will now change its timeline. However, as you see in the > pg_log, it will wait for a WAL that will never arrive. It will look for WALs > from previous timeline in new timeline file naming format. E.g it will wait > for 0000000A00000026000000F1. You will see that such log exists in the name > 0000000900000026000000F1. So it will wait forever and if you try to connect > to S2 you will see error “FATAL: the database system is starting up” This seems to be the crux of this bug report. I just tested this and didn't see this behavior. S2 tries restoring files from the archive first, but then it connects to S1 and catches up. > Recovery.conf for S1: > restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' > recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' > > recovery_target_timeline = 'latest' > > recovery.conf for S2: > restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' > recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' > > recovery_target_timeline = 'latest' There are no primary_conninfo lines here, so you're either not showing us the full recovery.conf files used, or you haven't in fact set up cascading replication. - Heikki
Вложения
В списке pgsql-bugs по дате отправления: