pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
From | hubert depesz lubaczewski |
---|---|
Subject | pg ignores wal files in pg_wal, and instead tries to load them from archive/primary |
Date | |
Msg-id | YzW+5v/VwbguW+XU@depesz.com |
Responses |
Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary |
List | pgsql-bugs |
Hi,

we have the following situation:

1. The primary is on 14.5 and is *not* archiving (this is a temporary situation related to an ongoing upgrade from pg 12). Everything runs on Ubuntu Focal.

2. On the new replica we run (via a wrapper, but this doesn't seem to be related):

    pg_basebackup -D /var/lib/postgresql/14/main -c fast -v -P -U some-user -h sourcedb.hostname

3. After it is done, if the datadir was large enough, pg on the replica doesn't replicate/catch up because, from the logs:

    2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,1,,2022-09-29 14:59:26 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
    2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,2,,2022-09-29 14:59:26 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0

4. If restore_command is configured, it tries to read data from the archive too, but the archive is non-existent.

5. The "missing" file is there, in pg_wal (I would assume that pg_basebackup copied it there):

    root@host# /bin/ls -c1 0* | wc -l
    1068
    root@host# /bin/ls -c1 0* | sort -V | head -n 1
    00000001000007E4000000A0
    root@host# /bin/ls -c1 0* | sort -V | tail -n 1
    00000001000007E800000092
    root@host# /bin/ls -c1 0* | sort -V | grep -n 00000001000007E800000067
    1043:00000001000007E800000067
    root@host# /bin/ls -c1 0* | sort -V | grep -n -C5 00000001000007E800000067
    1038-00000001000007E800000062
    1039-00000001000007E800000063
    1040-00000001000007E800000064
    1041-00000001000007E800000065
    1042-00000001000007E800000066
    1043:00000001000007E800000067
    1044-00000001000007E800000068
    1045-00000001000007E800000069
    1046-00000001000007E800000070
    1047-00000001000007E800000071
    1048-00000001000007E800000072

6. What's more, I straced the startup process, and it:
   a. opens the WAL file (the problematic one)
   b. reads 8k from it
   c. closes it
   d. checks for the existence of the finish.recovery trigger file (it doesn't exist)
   e. starts the restore program (which fails)
   f. rinse and repeat

What am I missing? What is wrong? How can I fix it?

The problem is not fixing *this server*: we are in the process of upgrading LOTS and LOTS of servers, and I need to know what is broken and how to work around it.

Currently our go-to fix is:

1. increase wal_keep_size to ~200GB
2. stand up the replica
3. once it catches up, decrease wal_keep_size to our standard 16GB

but it is not a really nice solution.

Best regards,

depesz
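
For reference, the walreceiver log reports streaming from 7E8/67000000 on timeline 1, and the segment it then complains about is 00000001000007E800000067. The sketch below shows how such a file name can be derived from a timeline and an LSN, which may help when checking whether a given position is covered by the files in pg_wal. It assumes the default 16 MB WAL segment size (the post does not state the cluster's wal_segment_size), and the function name `wal_segment_name` is hypothetical, chosen only for this illustration.

```python
# Assumption: the cluster uses the default WAL segment size of 16 MB.
DEFAULT_WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(timeline: int, lsn: str,
                     segment_size: int = DEFAULT_WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name that contains the given LSN.

    `lsn` uses the textual form shown in the logs, e.g. "7E8/67000000".
    """
    high_hex, low_hex = lsn.split("/")
    # An LSN is a 64-bit byte position written as two 32-bit hex halves.
    position = (int(high_hex, 16) << 32) | int(low_hex, 16)

    # Which segment (counted from the start of the WAL) holds this byte.
    segment_number = position // segment_size

    # Number of segments that fit into one 4 GB "xlog id".
    segments_per_xlog_id = 0x100000000 // segment_size

    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlog_id,
        segment_number % segments_per_xlog_id,
    )


if __name__ == "__main__":
    # Position and timeline taken from the walreceiver log line in the post.
    print(wal_segment_name(1, "7E8/67000000"))  # 00000001000007E800000067
```

The result matches the segment named in the FATAL message, so the file listed at position 1043 in the pg_wal listing is indeed the one the replica asks the primary for.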