Re: streaming replication breaks horribly if master crashes

From	Stefan Kaltenbrunner
Subject	Re: streaming replication breaks horribly if master crashes
Date	June 16, 2010, 17:07:43
Msg-id	4C192EFE.10204@kaltenbrunner.cc
In response to	streaming replication breaks horribly if master crashes (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: streaming replication breaks horribly if master crashes
List	pgsql-hackers

On 06/16/2010 09:47 PM, Robert Haas wrote:
> On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>>> But that change would cause the problem that Robert pointed out.
>>> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php
>>
>> Presumably this means that if synchronous_commit = off on primary that
>> SR in 9.0 will no longer work correctly if the primary crashes?
>
> I spent some time investigating this today and have come to the
> conclusion that streaming replication is really, really broken in the
> face of potential crashes on the master.  Using a copy of VMware
> parallels provided by $EMPLOYER, I set up two Fedora 12 virtual
> machines on my MacBook in a master/slave configuration.  Then I
> crashed the master repeatedly using 'echo b>  /proc/sysrq-trigger',
> which causes an immediate reboot (without syncing the disks, closing
> network connections, etc.) while running pgbench or other stuff
> against it.
>
> The first problem I noticed is that the slave never seems to realize
> that the master has gone away.  Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.

Well, this is likely caused by the OS not noticing that the connection
went away (Linux has really long timeouts here). Maybe we should
unconditionally enable TCP keepalives for replication connections on
systems that support them (if that is possible in the current design anyway).
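
To illustrate what enabling keepalive means at the socket level, here is a minimal sketch in Python. It is not the walreceiver code; the function name enable_keepalive and the idle/interval/count values are made up for the example, and the TCP_KEEP* tuning options are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    The idle/interval/count defaults are illustrative assumptions,
    not values taken from PostgreSQL.
    """
    # SO_KEEPALIVE is the portable switch that enables the probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket tuning knobs are platform specific, so set each
    # one only if this platform's socket module defines it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With probes like these, a receiver blocked on a dead peer would get an error from the OS after roughly idle + interval * count seconds instead of waiting indefinitely.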


>
> More seriously, I was able to demonstrate that the problem linked in
> the thread above is real: if the master crashes after streaming WAL
> that it hasn't yet fsync'd, then on recovery the slave's xlog position
> is ahead of the master.  So far I've only been able to reproduce this
> with fsync=off, but I believe it's possible anyway, and this just
> makes it more likely.  After the most recent crash, the master thought
> pg_current_xlog_location() was 1/86CD4000; the slave thought
> pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
> the master, the slave then thought that
> pg_last_xlog_receive_location() was 1/87000000.  The slave didn't
> think this was a problem yet, though.  When I then restarted a pgbench
> run against the master, the slave pretty quickly started spewing an
> endless stream of messages complaining of "LOG: invalid record length
> at 1/8733A828".

This is obviously bad, but with fsync=off (or synchronous_commit=off?) it is
probably impossible to prevent...
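
For what it's worth, the condition Robert describes can at least be detected by comparing the two reported locations. A rough sketch in Python, assuming the "logid/offset" hexadecimal text format shown in the quoted output; the helper names parse_xlog_location and standby_is_ahead are hypothetical and not part of PostgreSQL:

```python
def parse_xlog_location(text):
    """Split a location such as '1/86CD4000' into (log_id, offset) integers.

    Both halves are hexadecimal, as in the output quoted above.
    """
    log_id, offset = text.strip().split("/")
    return int(log_id, 16), int(offset, 16)


def standby_is_ahead(master_location, standby_location):
    """Return True if the standby has received WAL past the master's position.

    Tuples compare element by element, so the log id is compared first
    and the offset only breaks ties.
    """
    return parse_xlog_location(standby_location) > parse_xlog_location(master_location)


# Values reported in the quoted message after the crash:
print(standby_is_ahead("1/86CD4000", "1/8733C000"))  # True
```

This only flags the divergence; it does not say anything about how to recover from it.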



Stefan
