Re: speed up a logical replica setup

Поиск
Список
Период
Сортировка
От Euler Taveira
Тема Re: speed up a logical replica setup
Дата
Msg-id bbbeed26-41f9-468c-ac24-ad99f429ccd7@app.fastmail.com
обсуждение исходный текст
Ответ на Re: speed up a logical replica setup  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: speed up a logical replica setup  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: speed up a logical replica setup  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
waited long enough?

It was an attempt to decoupled a connection failure (that keeps streaming the
WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the primary
is gone during the standby recovery process, there is a way to bail out. The
recovery-timeout is 0 (infinite) by default so you have an infinite wait without
this check. The idea behind this implementation is to avoid exiting in this
critical code path. If it times out here you might have to rebuild the standby
and start again. Amit suggested [1] that we use a value as recovery-timeout but
how high is a good value? I've already saw some long recovery process using
pglogical equivalent that timeout out after hundreds of minutes. Maybe I'm too
worried about a small percentage of cases and we should use 1h as default, for
example. It would reduce the complexity since the recovery process lacks some
progress indicators (LSN is not sufficient in this case and there isn't a
function to provide the current state -- stop applying WAL, reach target, new
timeline, etc).

If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
by default otherwise we will have some reports saying the tool is hanging when
in reality the primary has gone and WAL should be streamed.

IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick.

That's a good idea. Tests are not exercising the recovery-timeout option.

Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
things like ubsan/valgrind already use to deal with exactly this kind of
timeout problem.

Or is there a deeper problem with deciding if the system is in recovery?

As I said with some recovery progress indicators it would be easier to make some
decisions like wait a few seconds because the WAL has already been applied and
it is creating a new timeline. The recovery timeout decision is a shot in the
dark because we might be aborting pg_createsubscriber when the target server is
about to set RECOVERY_STATE_DONE.


--
Euler Taveira

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Remove some redundant set_cheapest() calls
Следующее
От: Dmitry Dolgov
Дата:
Сообщение: Re: pg_stat_statements and "IN" conditions