pg ignores wal files in pg_wal, and instead tries to load them from archive/primary


From	hubert depesz lubaczewski
Subject	pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
Date	29 September 2022, 15:51:02
Msg-id	YzW+5v/VwbguW+XU@depesz.com
Replies	Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
List	pgsql-bugs


Hi,
we have the following situation:

1. primary on 14.5 that is *not* archiving (this is a temporary situation
   related to an ongoing upgrade process from pg 12) - all on ubuntu focal.
2. on a new replica we run (via a wrapper, but this doesn't seem to be
   related):
   pg_basebackup -D /var/lib/postgresql/14/main -c fast -v -P -U some-user -h sourcedb.hostname
3. after it is done, if the datadir was large enough, pg on the replica
   doesn't replicate/catch up. From the logs:
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,1,,2022-09-29 14:59:26 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,2,,2022-09-29 14:59:26 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
4. if there is a restore_command configured, it tries to read data from the
   archive too, but the archive is nonexistent.
5. the "missing" file is there, in pg_wal (I would assume that
   pg_basebackup copied it there):
   root@host# /bin/ls -c1 0* | wc -l
   1068
   root@host# /bin/ls -c1 0* | sort -V | head -n 1
   00000001000007E4000000A0
   root@host# /bin/ls -c1 0* | sort -V | tail -n 1
   00000001000007E800000092
   root@host# /bin/ls -c1 0* | sort -V | grep -n 00000001000007E800000067
   1043:00000001000007E800000067
   root@host# /bin/ls -c1 0* | sort -V | grep -n -C5 00000001000007E800000067
   1038-00000001000007E800000062
   1039-00000001000007E800000063
   1040-00000001000007E800000064
   1041-00000001000007E800000065
   1042-00000001000007E800000066
   1043:00000001000007E800000067
   1044-00000001000007E800000068
   1045-00000001000007E800000069
   1046-00000001000007E800000070
   1047-00000001000007E800000071
   1048-00000001000007E800000072
6. What's more - I straced the startup process, and it:
   a. opens the wal file (the problematic one)
   b. reads 8k from it
   c. closes it
   d. checks for the existence of the finish.recovery trigger file (it doesn't exist)
   e. starts the restore program (which fails)
   f. rinse and repeat
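
For reference, the segment name in the FATAL message follows directly from the LSN in the "started streaming" line. A minimal sketch, assuming the default 16 MB wal_segment_size; `wal_segment_name` is a hypothetical helper written for this illustration, not a PostgreSQL tool:

```shell
#!/usr/bin/env bash
# Sketch: map an LSN such as 7E8/67000000 to the name of the WAL segment
# file that contains it. ASSUMPTION: default 16 MB wal_segment_size.

wal_segment_name() {
    timeline=$1            # timeline as a decimal number, e.g. 1
    lsn=$2                 # LSN in the usual HI/LO hex notation
    hi=${lsn%%/*}          # part before the slash: high 32 bits
    lo=${lsn##*/}          # part after the slash: low 32 bits
    seg_size=$((16 * 1024 * 1024))
    # File name = timeline, high 32 bits, segment index within the low
    # 32 bits, each printed as 8 uppercase hex digits.
    printf '%08X%08X%08X\n' "$timeline" "$((0x$hi))" "$(( 0x$lo / seg_size ))"
}

wal_segment_name 1 7E8/67000000
# prints 00000001000007E800000067
```

That is the same file that grep finds in pg_wal in step 5. One caveat about the listing itself: `sort -V` appears to compare runs of digits numerically, so hex names are not listed in segment order (note ...69 followed directly by ...70 above). Since the names are fixed-width uppercase hex, `LC_ALL=C sort` should give true segment order.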

What am I missing? What is wrong? How can I fix it? The problem is not fixing
*this server*, because we are in the process of upgrading LOTS and LOTS of servers,
and I need to know what is broken/how to work around it.


Currently our go-to fix is:
1. increase wal_keep_size to ~ 200GB
2. stand up the replica
3. once it catches up, decrease wal_keep_size to the standard (for us) 16GB

but it is not a really nice solution.
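
The workaround steps can be sketched roughly as below. This is only an illustration: `wal_keep_sql` is a hypothetical helper, the psql connection options in the comments are assumptions, and the snippet only prints SQL rather than touching a server. `ALTER SYSTEM` and `pg_reload_conf()` are standard PostgreSQL, and wal_keep_size can be changed with a reload.

```shell
#!/usr/bin/env bash
# Sketch: generate the SQL that changes wal_keep_size and reloads the config.
# wal_keep_sql is a hypothetical helper; it prints SQL and runs nothing.

wal_keep_sql() {
    # $1 is the new value, with a unit PostgreSQL accepts (e.g. 200GB).
    printf "ALTER SYSTEM SET wal_keep_size = '%s';\n" "$1"
    printf 'SELECT pg_reload_conf();\n'
}

# Show what would be sent to the primary before standing up the replica.
wal_keep_sql 200GB

# ASSUMED usage on the primary (user and connection options will differ):
#   wal_keep_sql 200GB | psql -X -U postgres   # before pg_basebackup
#   wal_keep_sql 16GB  | psql -X -U postgres   # after the replica caught up
```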

Best regards,

depesz
