Loss of replication after simple misconfiguration

Поиск

Список

Период

Сортировка

От	hubert depesz lubaczewski
Тема	Loss of replication after simple misconfiguration
Дата	9 апреля 2020 г. 14:08:36
Msg-id	20200409140836.GA5990@depesz.com обсуждение исходный текст
Ответы	Re: Loss of replication after simple misconfiguration
Список	pgsql-bugs

Дерево обсуждения

Hi,

I was trying to get it to replicate, but I couldn't in controlled
environment, so I have no repeatable proof of concept, but we've seen
the problem at least 4 times.

Configuration:

PostgreSQL 9.5.15 on Ubuntu bionic.
Streaming replication from master to slave, working without problem.
max_worker_processes set to 8 on both servers.

Preparation:
At some point in time, to accommodate upcoming (yes, really) upgrade to
pg12, we pushed change that write "max_worker_processes = 50" to all
configs, but didn't restart Pgs.

Fail scenario:
Master died (underlying AWS EC2 instance got hosed, was restarted).
Master got back up, but loaded new configuration from conf file.

Immediately it crashed slave with message:
2020-04-09 15:55:45.653 CEST  1304738   LOG:  started streaming WAL from primary at 3/D000000 on timeline 1
2020-04-09 15:55:45.679 CEST  505710   FATAL:  hot standby is not possible because max_worker_processes = 8 is a lower
settingthan on the master server (its value was 50)
 
2020-04-09 15:55:45.679 CEST  505710   CONTEXT:  xlog redo XLOG/PARAMETER_CHANGE: max_connections=100
max_worker_processes=50max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off
track_commit_timestamp=off
(this message is from my test environment, but the idea is clear).

Given that we have max_worker_processes = 50 in config on slave, we
tried to restart only to be greeted by:
2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG:  restored log file "000000030001779200000061"
fromarchive
 
2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL:  could not access status of transaction
4275781146
2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL:  Could not read from file
"pg_commit_ts/27D4B"at offset 245760: Success.
 
2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT:  xlog redo Transaction/COMMIT: 2020-04-07
02:40:10.065859+00
2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG:  startup process (PID 20491) exited with exit
code1
 
2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG:  terminating any other active server processes

We had similar case (could not access status of transaction) on at least
4 separate pg installations, so it looks to be related to the
restart of master while bad max_worker_processes are set.

We have since rolled back the change to max_worker_processes, and will
be doing it in much saner/safer way, but the fact that slave can't start
after such master restart is troubling.

Unfortunately we don't have any instances broken in this way anymore,
and we don't have logs either - we had to rebuild the Pg setup
(master/slave) ASAP, and thus the evidence is lost.

Is there anything we could have done to fix it once it happened?

Is it a known problem?

Or maybe the max_worker_processes mismatch is redherring and the problem
was simply because master died while there was heavy write traffic with
lots of transactions?

Best regards,

depesz

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Loss of replication after simple misconfiguration