Re: speed up a logical replica setup

From	Euler Taveira
Subject	Re: speed up a logical replica setup
Date	March 26, 2024, 20:17:15
Msg-id	bbbeed26-41f9-468c-ac24-ad99f429ccd7@app.fastmail.com
In response to	Re: speed up a logical replica setup (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses	Re: speed up a logical replica setup Re: speed up a logical replica setup
List	pgsql-hackers

On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:

> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?

It was an attempt to decouple a failure of the connection that keeps streaming
the WAL from the recovery timeout. NUM_CONN_ATTEMPTS guarantees that if the
primary is gone during the standby recovery process, there is a way to bail
out. The recovery-timeout is 0 (infinite) by default, so you have an infinite
wait without this check. The idea behind this implementation is to avoid
exiting in this critical code path. If it times out here, you might have to
rebuild the standby and start again. Amit suggested [1] that we use a value as
recovery-timeout, but how high is a good value? I've already seen some long
recovery processes using a pglogical equivalent that timed out after hundreds
of minutes. Maybe I'm too worried about a small percentage of cases and we
should use 1h as the default, for example. It would reduce the complexity,
since the recovery process lacks some progress indicators (LSN is not
sufficient in this case and there isn't a function to provide the current
state -- stop applying WAL, reach target, new timeline, etc).
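
To make the two bail-out conditions concrete, here is a minimal Python sketch
of the loop logic described above. It is not the actual pg_createsubscriber
code (which is C); the callback parameters, the return strings, the value 10
for NUM_CONN_ATTEMPTS, and the choice to reset the counter whenever the
connection is back are all assumptions made for illustration.

```python
from typing import Callable

NUM_CONN_ATTEMPTS = 10  # illustrative value, not taken from this thread


def wait_for_end_recovery(
    in_recovery: Callable[[], bool],
    walreceiver_connected: Callable[[], bool],
    recovery_timeout: int = 0,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> str:
    """Poll once per tick until recovery ends or a bail-out condition hits.

    Returns "done", "primary-gone", or "timeout".
    recovery_timeout == 0 means "wait forever", as described in the thread.
    The default sleep is a no-op so the sketch runs instantly; pass
    time.sleep to poll once per real second.
    """
    elapsed = 0
    failed_attempts = 0
    while True:
        if not in_recovery():
            return "done"
        # Hypothetical stand-in for the pg_stat_wal_receiver check:
        # count consecutive polls without a walreceiver connection.
        if walreceiver_connected():
            failed_attempts = 0
        else:
            failed_attempts += 1
            if failed_attempts >= NUM_CONN_ATTEMPTS:
                return "primary-gone"
        # The timeout only applies when it is non-zero.
        if recovery_timeout > 0 and elapsed >= recovery_timeout:
            return "timeout"
        sleep(1)
        elapsed += 1
```

With recovery_timeout left at 0, the connection counter is the only way out of
the loop when the primary disappears. Dropping that check would leave the
timeout as the only exit, and with the default of 0 there would be none.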

If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
by default; otherwise we will have some reports saying the tool is hanging when
in reality the primary is gone and WAL still needs to be streamed.

> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
> pg_createsubscriber, and that should do the trick.

That's a good idea. Tests are not exercising the recovery-timeout option.

> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
> things like ubsan/valgrind already use to deal with exactly this kind of
> timeout problem.
>
> Or is there a deeper problem with deciding if the system is in recovery?

As I said, with some recovery progress indicators it would be easier to make
decisions such as waiting a few more seconds because the WAL has already been
applied and the server is creating a new timeline. The recovery timeout
decision is a shot in the dark because we might be aborting pg_createsubscriber
when the target server is about to set RECOVERY_STATE_DONE.

[1] https://www.postgresql.org/message-id/CAA4eK1JRgnhv_ySzuFjN7UaX9qxz5Hqcwew7Fk%3D%2BAbJbu0Kd9w%40mail.gmail.com

Euler Taveira

EDB https://www.enterprisedb.com/
