Re: BUG #18433: Logical replication timeout

From: Shlok Kyal
Subject: Re: BUG #18433: Logical replication timeout
Date:
Msg-id: CANhcyEVs-F7G+bf3j2sym-GHi1hJ0Eo+LmoRV3ONzbjEFW0W4Q@mail.gmail.com
In reply to: Re: BUG #18433: Logical replication timeout (Костянтин Томах <tomahkvt@gmail.com>)
List: pgsql-bugs
Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30 seconds up to 300 seconds for the source
> and destination DB, but we got the same error.
> We set REPLICA IDENTITY FULL on PostgreSQL 13 because we got an error at the destination DB: "ERROR: logical replication
> target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have
> REPLICA IDENTITY FULL". We didn't have such errors on PostgreSQL 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      18433
>> > Logged by:          Kostiantyn
>> > Email address:      tomahkvt@gmail.com
>> > PostgreSQL version: 13.14
>> > Operating system:   AWS RDS
>> > Description:
>> >
>> > On PostgreSQL 10 we used the following approach for the big tables:
>> > 1) Downloaded the schema from the source database instance.
>> > 2) Deleted PKs, FKs, and indexes for tables bigger than 100Gb from the
>> > schema.
>> > 3) Uploaded the schema to the destination DB.
>> > 4) Configured logical replication between the source and destination DB.
>> > 5) Waited until "logical replication table synchronization worker for
>> > subscription "db_name_public_subscription", table "table_name" has
>> > finished" was logged for the last table.
>> > 6) Created all the necessary PKs, FKs, and indexes.
>> > This approach allowed us to upload the data more quickly, and it worked
>> > great on PostgreSQL 10.
>> >
>> > We tried the same approach for PostgreSQL 13, but we got an error:
>> > 1) Downloaded the schema from the source database instance.
>> > 2) Deleted PKs, FKs, and indexes for tables bigger than 100Gb from the
>> > schema.
>> > 3) Uploaded the schema to the destination DB.
>> > 4) Configured REPLICA IDENTITY FULL at the source DB for tables bigger
>> > than 100Gb.
>> > 5) Configured logical replication between the source and destination DB.
>> > 6) During the catchup process on these big tables we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point: if there are no requests to the source DB, logical
>> > replication works fine for big tables.
>> > I saw messages in the PostgreSQL bugs list like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > I also did some tests: I increased wal_keep_size and max_slot_wal_keep_size
>> > to 1GB and set wal_sender_timeout to 1h, but without success. The setup
>> > works in PG 13 only with a small amount of data.
>>
>> I went through the issue and think that the given logs appear because of
>> some delay in the apply worker process.
>> I could reproduce it in my development environment by introducing delays
>> in the apply worker.
>>
>> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> value?
>>
>> Also, I noticed that while using PostgreSQL 13 you are configuring the
>> table in the source DB as REPLICA IDENTITY FULL but not doing the same on
>> PostgreSQL 10. Is there any specific reason for that?
>> I pointed it out because REPLICA IDENTITY FULL has a different
>> behaviour and uses the entire row as the key.

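As a side note on the REPLICA IDENTITY point quoted above: the current
setting of a table can be checked and changed roughly as below. This is
only a sketch, and public.big_table is a placeholder name:

  -- 'd' = default (PK), 'f' = full row, 'n' = nothing, 'i' = index
  SELECT relname, relreplident
  FROM pg_class
  WHERE oid = 'public.big_table'::regclass;

  -- fall back to the whole row when the table has no PK or unique index
  ALTER TABLE public.big_table REPLICA IDENTITY FULL;
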
A few suggestions:
1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
to more than 1 hour (for example, as sketched below)? In [1] you mentioned
that you were still getting the same issue with wal_sender_timeout set to 1h.
2. Do you see any other unusual logs in the source DB, or any error logs
in the destination DB? If so, please share them.
3. Is it possible to share a reproducible test case for the issue?
4. I observed that if we take a lock on a table in the destination DB that
is subscribed to a publication in the source DB, we can end up with the
same logs as you mentioned in [1]. Can you check whether anything similar
is happening? The query sketched below may help.
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org

Thanks and Regards
Shlok Kyal


