RE: speed up a logical replica setup
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: speed up a logical replica setup |
Date | |
Msg-id | TYAPR01MB5692D87C0E1FD86713525F60F5A32@TYAPR01MB5692.jpnprd01.prod.outlook.com |
In response to | Re: speed up a logical replica setup ("Euler Taveira" <euler@eulerto.com>) |
Responses | Re: speed up a logical replica setup |
List | pgsql-hackers |
Dear Alexander, Euler, Amit,

I also analyzed this failure, so let me share my findings.

I think the events occurred in the following order:

1. Backend created a publication on $db2,
2. BGWriter generated a RUNNING_XACT record, then
3. Backend created a replication slot on $db2.

In this case, the recovery_target_lsn is ahead of the RUNNING_XACT record generated at step 3.
Also, since both bgwriter and slot creation mark the record as *UNIMPORTANT*, bgwriter
won't log another snapshot even after LOG_SNAPSHOT_INTERVAL_MS has passed. The rule is written in
BackgroundWriterMain():

```
			/*
			 * Only log if enough time has passed and interesting records have
			 * been inserted since the last snapshot.  Have to compare with <=
			 * instead of < because GetLastImportantRecPtr() points at the
			 * start of a record, whereas last_snapshot_lsn points just past
			 * the end of the record.
			 */
			if (now >= timeout &&
				last_snapshot_lsn <= GetLastImportantRecPtr())
			{
				last_snapshot_lsn = LogStandbySnapshot();
				last_snapshot_ts = now;
			}
```

Therefore, pg_createsubscriber waited until a new record was replicated, but no
activity was recorded, which caused the timeout. Since this is a timing issue, Alexander
could reproduce the failure with a shorter time duration and parallel runs.

IIUC, the root cause is that pg_create_logical_replication_slot() returns an LSN
which has not been generated yet. So, I think both my approach [1] and Euler's approach [2]
can solve the issue. My proposal was to add an extra WAL record after the final
slot creation, and Euler's was to use the restart_lsn as the recovery_target_lsn.
On a primary server, restart_lsn is set to the latest WAL insert position, and
the RUNNING_XACT record is generated after that.

What do you think?

[1]: https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32@OSBPR01MB2552.jpnprd01.prod.outlook.com
[2]: https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED