RE: Synchronizing slots from primary to standby

From: Zhijie Hou (Fujitsu)
Subject: RE: Synchronizing slots from primary to standby
Date:
Message-ID: OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com
In reply to: Re: Synchronizing slots from primary to standby  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Responses: Re: Synchronizing slots from primary to standby  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
    Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
    Re: Synchronizing slots from primary to standby  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List: pgsql-hackers

Hi,

While analyzing a BF failure[1], we found an issue in slotsync: because we don't
perform logical decoding for synced slots when syncing a slot's lsn/xmin,
no logical snapshot is serialized to disk on the standby. So, when a user starts
to use these synced slots after promotion, decoding has to rebuild the
consistent snapshot from restart_lsn if the WAL record (xl_running_xacts) at
restart_lsn indicates that there are running transactions. This, however, can
cause changes made before the consistent point to be missed[2].

This issue doesn't exist on the primary because the snapshot at restart_lsn
should have been serialized to disk (SnapBuildProcessRunningXacts ->
SnapBuildSerialize), so even if logical decoding restarts, it can find a
consistent snapshot immediately at restart_lsn.

To fix this, we could use fast-forward logical decoding to advance the synced
slot's lsn/xmin when syncing these values, instead of updating the slot's info
directly. This way, the snapshot will be serialized to disk during decoding.
If we cannot reach the consistent point at the remote restart_lsn, the slot is
marked as temporary and is persisted once it reaches the consistent point. I am
still analyzing the fix and will share it once it is ready.


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-19%2010%3A03%3A06
[2] Steps to reproduce the missing-data issue on a primary->standby setup:

Note: we need to set LOG_SNAPSHOT_INTERVAL_MS to a larger value (1500000) to
prevent concurrent LogStandbySnapshot() calls, and enable sync_replication_slots
on the standby.
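
For anyone trying to reproduce this, the standby-side settings mentioned in the note could look roughly like the sketch below. This is only an illustration: 'sb1_slot' is a hypothetical physical slot name that would have to exist on the primary already, and as far as I understand the slot sync requirements, primary_conninfo on the standby also needs to include a dbname. LOG_SNAPSHOT_INTERVAL_MS is a compile-time constant, so it is not covered here.

```sql
-- Run on the standby. 'sb1_slot' is a hypothetical name, not from the
-- original steps; it must match a physical slot created on the primary.
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET primary_slot_name = 'sb1_slot';
ALTER SYSTEM SET sync_replication_slots = on;

-- All three settings above can be picked up with a reload.
SELECT pg_reload_conf();
```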

1. Create a failover logical slot on the primary.
SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 'test_decoding', false, false, true);

2. Use the following steps to advance the restart_lsn of the failover slot to a
position where the xl_running_xacts record indicates that there is a running
transaction.

TXN1
BEGIN;
create table dummy1(a int);

    TXN2
    SELECT pg_log_standby_snapshot();

TXN1
COMMIT;

TXN1
BEGIN;
create table dummy2(a int);

    TXN2
    SELECT pg_log_standby_snapshot();

TXN1
COMMIT;

-- The restart_lsn will be advanced to a position where there was 1 running
-- transaction. We then need to wait for the restart_lsn to be synced to the
-- standby.
SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn());

-- Insert some data here, before the next pg_log_standby_snapshot() call.
-- (reptable is not created in these steps; it is assumed to exist already.)
INSERT INTO reptable VALUES(999);
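
One way to check that the advanced restart_lsn has reached the standby before moving on to the promotion is to compare the slot's positions on both nodes. This is just a sketch of such a check, not part of the original steps; it relies on the synced and temporary columns of pg_replication_slots.

```sql
-- Run on both the primary and the standby and compare restart_lsn.
-- On the standby, synced should be true and temporary should be false
-- once the slot has been synced and persisted.
SELECT slot_name, restart_lsn, confirmed_flush_lsn, synced, temporary
FROM pg_replication_slots
WHERE slot_name = 'logicalslot';
```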

3. Promote the standby and try to consume the change (999) from the synced slot
on the standby. No change is returned.

select * from pg_logical_slot_get_changes('logicalslot', NULL, NULL);
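
For completeness, step 3 could be sketched as follows when the promotion is done from SQL rather than with pg_ctl promote. This is an assumption about how one might run it, not what was used for the report. Peeking instead of getting the changes leaves the slot position untouched, which makes the check repeatable.

```sql
-- Run on the standby: promote it and wait for the promotion to complete.
SELECT pg_promote();

-- Look at the pending changes without consuming them. With the issue
-- described above, the INSERT of 999 is expected to be absent here.
SELECT * FROM pg_logical_slot_peek_changes('logicalslot', NULL, NULL);
```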

Best Regards,
Hou zj


