RE: speed up a logical replica setup

From Hayato Kuroda (Fujitsu)
Subject RE: speed up a logical replica setup
Date
Msg-id TYCPR01MB12077907C052EC329C45AE198F5362@TYCPR01MB12077.jpnprd01.prod.outlook.com
In response to Re: speed up a logical replica setup  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: speed up a logical replica setup  (vignesh C <vignesh21@gmail.com>)
Re: speed up a logical replica setup  ("Euler Taveira" <euler@eulerto.com>)
List pgsql-hackers
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> 
> Looks like sanitizer failures.  There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs.  So maybe something else.

It seems that there are several types of failures; see [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between the walreceiver finishing and
pg_is_in_recovery() starting to return false.

According to the output [1], the tool failed in wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, the lines
"redo done at..." and "terminating walreceiver process due to administrator command"
indicate that the walreceiver was asked to shut down by XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). SharedRecoveryState
is changed to RECOVERY_STATE_DONE (which means that pg_is_in_recovery() returns
false) only in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and the state change, the check
in wait_for_end_recovery() fails during that window. Since we allow the walreceiver
to be missing 10 times and check once per second, the failure occurs if
the time lag is longer than 10 seconds.
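
To make the timing argument concrete, here is a small simulation of the polling
logic described above. It is only a simplified model, not the actual
pg_createsubscriber code: the function signature, the one-second tick and the
exact comparison against the limit are assumptions made for illustration. Only
the name NUM_CONN_ATTEMPTS and the value 10 come from this message.

```python
NUM_CONN_ATTEMPTS = 10  # "we allow to miss the walreceiver 10 times"


def wait_for_end_recovery(walreceiver_stops_at, recovery_done_at,
                          max_attempts=NUM_CONN_ATTEMPTS):
    """Model of the wait loop; one iteration per simulated second.

    walreceiver_stops_at: second at which the walreceiver is shut down.
    recovery_done_at: second at which pg_is_in_recovery() starts
                      returning false.
    Both parameters are hypothetical inputs of this model.
    """
    missed = 0  # consecutive checks that found no walreceiver
    t = 0
    while True:
        in_recovery = t < recovery_done_at
        if not in_recovery:
            return "recovery ended"

        walreceiver_running = t < walreceiver_stops_at
        if walreceiver_running:
            missed = 0
        else:
            missed += 1
            if missed > max_attempts:
                return "standby server disconnected from the primary"

        t += 1  # stands in for the one-second sleep between checks
```

In this model, a 5-second gap between the walreceiver shutdown and the state
change (for example `wait_for_end_recovery(3, 8)`) ends normally, while a
15-second gap (`wait_for_end_recovery(3, 18)`) hits the error path.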

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS larger,
but that is not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot specified as primary_slot_name
was not in use by a walsender process. At that time, no walsender existed yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name =
'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
...
```

Currently the standby must be stopped before the command is run, and the current
code does not block to ensure that replication has started. So it is possible
that the check runs before the walsender is launched.

One possible approach is to wait until replication starts. An alternative is
to relax the condition.
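
The first option could look roughly like the sketch below: poll until the slot
is reported as active, and give up after a timeout. This is an illustration
only, not a proposed patch. The function name, the `slot_is_active` callback
(which would run something like the query shown in the log above) and the
timeout values are all hypothetical.

```python
import time


def wait_for_active_slot(slot_is_active, timeout_s=60.0, interval_s=1.0,
                         sleep=time.sleep):
    """Poll until slot_is_active() returns True or timeout_s elapses.

    slot_is_active: hypothetical callable that returns True once the
                    physical slot is in use by a walsender.
    sleep: injectable so the loop can be exercised without real waiting.
    Returns True if the slot became active, False on timeout.
    """
    waited = 0.0
    while True:
        if slot_is_active():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

With such a wait in place, the "got 0 rows, expected 1 row" check would only be
reported as an error after the timeout, rather than on the first attempt.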

What do you think?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-25%2013%3A03%3A07
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 
