RE: Synchronizing slots from primary to standby
| From | Zhijie Hou (Fujitsu) |
|---|---|
| Subject | RE: Synchronizing slots from primary to standby |
| Date | |
| Msg-id | OS0PR01MB571665359F2F5DCD3ADABC9F94002@OS0PR01MB5716.jpnprd01.prod.outlook.com |
| In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
| Responses | Re: Synchronizing slots from primary to standby |
| List | pgsql-hackers |
On Monday, April 8, 2024 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 8, 2024 at 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Saturday, April 6, 2024 12:43 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
> > > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > Yeah, that could be the first step. We can probably add an injection
> > > point to control the bgwrite behavior and then add tests involving
> > > walsender performing the decoding. But I think it is important to
> > > have sufficient tests in this area as I see they are quite helpful in
> > > uncovering the issues.
> >
> > Here is the patch to drop the subscription in the beginning so that
> > the restart_lsn of the lsub1_slot won't be advanced due to concurrent
> > xl_running_xacts from bgwriter. The subscription will be re-created
> > after all the slots are sync-ready. I think maybe we can use this to
> > stabilize the test as a first step and then think about how to make
> > use of injection point to add more tests if it's worth it.
> >
>
> Pushed.

Thanks for pushing.

I checked the BF status and noticed one BF failure, which I think is related
to an omission in the test code.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27

From the following log, I can see the sync failed because the standby is
lagging behind the failover slot.

-----
# No postmaster PID for node "cascading_standby"
error running SQL: 'psql:<stdin>:1: ERROR: skipping slot synchronization as the received slot sync LSN 0/4000148 for slot "snap_test_slot" is ahead of the standby position 0/4000114'
while running 'psql -XAtq -d port=50074 host=/tmp/t4HQFlrDmI dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT pg_sync_replication_slots();' at /home/bf/bf-build/adder/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 2042.
# Postmaster PID for node "publisher" is 3715298
-----

I think it's because we missed calling wait_for_replay_catchup before syncing
slots.

-----
$primary->safe_psql('postgres',
	"SELECT pg_create_logical_replication_slot('snap_test_slot', 'test_decoding', false, false, true);"
);
# <- missed the wait here
$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
-----

While testing, I noticed another place where we were calling
wait_for_replay_catchup before doing pg_replication_slot_advance, which also
has a small chance of leaving the failover slot ahead of the standby if some
WAL is written between these two steps. So, I adjusted both places together.

Here is a small patch to improve the test.

Best Regards,
Hou zj
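
The error in the log above comes down to a comparison of two LSNs. As a minimal sketch of that comparison (not the server's implementation), the snippet below parses the textual `hi/lo` hexadecimal LSN form into a 64-bit integer and checks whether the slot's LSN is ahead of the standby's replay position. The function names `parse_lsn` and `slot_is_ahead` are hypothetical and exist only for this illustration; the two LSN values are taken from the buildfarm log.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/4000148' into a 64-bit integer.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slot_is_ahead(slot_lsn: str, standby_lsn: str) -> bool:
    """Return True when the slot's LSN is beyond the standby position.

    Hypothetical helper: it mirrors the condition reported in the error
    message, not the actual check inside PostgreSQL.
    """
    return parse_lsn(slot_lsn) > parse_lsn(standby_lsn)


# Values copied from the buildfarm log.
slot_lsn = "0/4000148"
standby_lsn = "0/4000114"

gap = parse_lsn(slot_lsn) - parse_lsn(standby_lsn)
print(slot_is_ahead(slot_lsn, standby_lsn), gap)  # prints: True 52
```

With these values the slot is 52 bytes of WAL ahead of the standby, which is consistent with the standby simply not having replayed the slot creation yet when pg_sync_replication_slots() ran.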