Discussion: BUG #18433: Logical replication timeout

BUG #18433: Logical replication timeout

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18433
Logged by:          Kostiantyn
Email address:      tomahkvt@gmail.com
PostgreSQL version: 13.14
Operating system:   AWS RDS
Description:

On PostgreSQL 10 we used the following approach for the big tables:
1) Downloaded the schema from the source database instance.
2) Deleted PKs, FKs, and indexes for tables bigger than 100 GB from the
schema.
3) Uploaded the schema to the destination DB.
4) Configured logical replication between the source and destination DB.
5) Waited until the last "logical replication table synchronization worker
for subscription "db_name_public_subscription", table "table_name" has
finished" message.
6) Created all the necessary PKs, FKs, and indexes.
This approach allowed us to upload data more quickly, and it worked great
on PostgreSQL 10.

We tried the same approach on PostgreSQL 13, but we got an error:
1) Downloaded the schema from the source database instance.
2) Deleted PKs, FKs, and indexes for tables bigger than 100 GB from the
schema.
3) Uploaded the schema to the destination DB.
4) Configured REPLICA IDENTITY FULL on the source DB for tables bigger
than 100 GB.
5) Configured logical replication between the source and destination DB.
6) During the catchup process on these big tables we got the following
messages:
Source DB
2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
walsender process due to replication timeout
2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
"db_name_public_subscription", output plugin "pgoutput", in the begin
callback, associated LSN 13705/2E913050
2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
(proto_version '1', publication_names '"db_name_public_publication"')
2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
walsender process due to replication timeout
2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
"db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
in the begin callback, associated LSN 13703/27942B70
2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
LOGICAL 13703/17622B58 (proto_version '1', publication_names
'"db_name_public_publication"')

One important point: if there are no requests to the source DB, logical
replication works fine even for big tables.
I saw messages in the PostgreSQL bug archives like
https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
but I also did some tests: I increased wal_keep_size
and max_slot_wal_keep_size to 1GB and set wal_sender_timeout to 1h,
without success. The setup works on PG 13 only with a small amount of data.
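As a side note, the walsender's progress and lag during the catchup can be
watched on the source with the standard statistics view, for example:

SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;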


Re: BUG #18433: Logical replication timeout

From: Shlok Kyal
Date:
Hi,

On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18433
> Logged by:          Kostiantyn
> Email address:      tomahkvt@gmail.com
> PostgreSQL version: 13.14
> Operating system:   AWS RDS
> Description:
>
> On Postgresql 10 we used the following approach for the big tables:
> 1) Download schema from the source database instance
> 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> schema
> 3)Upload the schema to the destination DB.
> 4) Configure Logical replication between source and destination DB.
> 5) When last table logical replication table synchronization worker for
> subscription "db_name_public_subscription", table "table_name" has
> finished
> 6) we created all the necessary PK, FK, and Indexes.
> This approach allowed to us upload data more quickly. This approach was
> working  great on PostgreSQL 10.
>
> We tried the same approach for Postgresql13, but we got an error.
> 1) Download schema from the source database instance
> 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> schema
> 3)Upload the schema to the destination DB.
> 4) configurated identity replication full at source DB for tables bigger
> than 100Gb
> 5) Configured Logical replication between source and destination DB.
> 6) During catchup on this big table  process we got the following
> messages:
> Source DB
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
> walsender process due to replication timeout
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
> "db_name_public_subscription", output plugin "pgoutput", in the begin
> callback, associated LSN 13705/2E913050
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
> START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
> (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
> walsender process due to replication timeout
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
> "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
> in the begin callback, associated LSN 13703/27942B70
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
> START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
> LOGICAL 13703/17622B58 (proto_version '1', publication_names
> '"db_name_public_publication"')
>
> One important point. If there is no request to source DB logical replication
> works fine for big tables.
> I saw the messages in PostgreSQL bugs like
> https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
> But I also did some tests and increased wal_keep_size
> and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
> without success. The setup works in PG 13 only with a small amount of data.

I went through the issue, and I think the given logs appear because of
a delay in the apply worker process.
I could reproduce it in my development environment by adding delays
in the apply worker.

I think this issue can be resolved by setting 'wal_sender_timeout' to
a greater value. Can you try setting 'wal_sender_timeout' to a greater
value?
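For example, on a self-managed instance this could look like the following
(a minimal sketch; on AWS RDS the same parameters are changed through the DB
parameter group, since ALTER SYSTEM is not available there):

-- on the source (publisher):
ALTER SYSTEM SET wal_sender_timeout = '1h';
-- on the destination (subscriber):
ALTER SYSTEM SET wal_receiver_timeout = '1h';
-- then, on each instance:
SELECT pg_reload_conf();
SHOW wal_sender_timeout;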

Also, I noticed that while using PostgreSQL 13 you are configuring the
table in the source DB as REPLICA IDENTITY FULL, but not doing the same on
PostgreSQL 10. Is there any specific reason for that?
I point it out because REPLICA IDENTITY FULL behaves differently and
uses the entire row as the key.
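For illustration (the table and index names below are placeholders), the two
usual ways to let UPDATE and DELETE be published for a table without a
primary key are:

ALTER TABLE public.big_table REPLICA IDENTITY FULL;          -- log the whole old row
ALTER TABLE public.big_table REPLICA IDENTITY USING INDEX big_table_c1_uidx;  -- use a unique index as the key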

Thanks and Regards,
Shlok Kyal



Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh
Date:
Hello, Shlok Kyal. Thank you very much for your answer.
Yes, we tried increasing wal_sender_timeout and wal_receiver_timeout from 30 seconds up to 300 seconds on both the source and destination DB, but we got the same error.
We set REPLICA IDENTITY FULL on PostgreSQL 13 because we got an error at the destination DB: "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on PostgreSQL 10.


On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
Hi,

On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18433
> Logged by:          Kostiantyn
> Email address:      tomahkvt@gmail.com
> PostgreSQL version: 13.14
> Operating system:   AWS RDS
> Description:
>
> On Postgresql 10 we used the following approach for the big tables:
> 1) Download schema from the source database instance
> 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> schema
> 3)Upload the schema to the destination DB.
> 4) Configure Logical replication between source and destination DB.
> 5) When last table logical replication table synchronization worker for
> subscription "db_name_public_subscription", table "table_name" has
> finished
> 6) we created all the necessary PK, FK, and Indexes.
> This approach allowed to us upload data more quickly. This approach was
> working  great on PostgreSQL 10.
>
> We tried the same approach for Postgresql13, but we got an error.
> 1) Download schema from the source database instance
> 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> schema
> 3)Upload the schema to the destination DB.
> 4) configurated identity replication full at source DB for tables bigger
> than 100Gb
> 5) Configured Logical replication between source and destination DB.
> 6) During catchup on this big table  process we got the following
> messages:
> Source DB
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
> walsender process due to replication timeout
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
> "db_name_public_subscription", output plugin "pgoutput", in the begin
> callback, associated LSN 13705/2E913050
> 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
> START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
> (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
> walsender process due to replication timeout
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
> "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
> in the begin callback, associated LSN 13703/27942B70
> 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
> START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
> LOGICAL 13703/17622B58 (proto_version '1', publication_names
> '"db_name_public_publication"')
>
> One important point. If there is no request to source DB logical replication
> works fine for big tables.
> I saw the messages in PostgreSQL bugs like
> https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
> But I also did some tests and increased wal_keep_size
> and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
> without success. The setup works in PG 13 only with a small amount of data.

I went through the issue and I think that the given logs are appearing
due to some delay in the Apply Worker process.
I could reproduce it in my development environment by applying delays
in Apply Worker.

I think this issue can be resolved by setting 'wal_sender_timeout' to
a greater value. Can you try setting 'wal_sender_timeout' to a greater
value?

Also, I noticed that while using Postgresql13 you are configuring the
table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
Postgresql10. Is there any specific reason for it?
I pointed it out because REPLICA IDENTITY FULL has a different
behaviour and sets the entire row as key.

Thanks and Regards,
Shlok Kyal

Re: BUG #18433: Logical replication timeout

From: Shlok Kyal
Date:
Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30seconds up to 300 seconds for source and destination DB but we got the same error.
> We set REPLICA IDENTITY FULL on Postgresql 13 because we got an error at destination DB "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on Postgresql 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      18433
>> > Logged by:          Kostiantyn
>> > Email address:      tomahkvt@gmail.com
>> > PostgreSQL version: 13.14
>> > Operating system:   AWS RDS
>> > Description:
>> >
>> > On Postgresql 10 we used the following approach for the big tables:
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) Configure Logical replication between source and destination DB.
>> > 5) When last table logical replication table synchronization worker for
>> > subscription "db_name_public_subscription", table "table_name" has
>> > finished
>> > 6) we created all the necessary PK, FK, and Indexes.
>> > This approach allowed to us upload data more quickly. This approach was
>> > working  great on PostgreSQL 10.
>> >
>> > We tried the same approach for Postgresql13, but we got an error.
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) configurated identity replication full at source DB for tables bigger
>> > than 100Gb
>> > 5) Configured Logical replication between source and destination DB.
>> > 6) During catchup on this big table  process we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point. If there is no request to source DB logical replication
>> > works fine for big tables.
>> > I saw the messages in PostgreSQL bugs like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > But I also did some tests and increased wal_keep_size
>> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
>> > without success. The setup works in PG 13 only with a small amount of data.
>>
>> I went through the issue and I think that the given logs are appearing
>> due to some delay in the Apply Worker process.
>> I could reproduce it in my development environment by applying delays
>> in Apply Worker.
>>
>> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> value?
>>
>> Also, I noticed that while using Postgresql13 you are configuring the
>> table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
>> Postgresql10. Is there any specific reason for it?
>> I pointed it out because REPLICA IDENTITY FULL has a different
>> behaviour and sets the entire row as key.

A few suggestions:
1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
to more than 1 hour? In [1] you mentioned that you were getting the same
issue with wal_sender_timeout set to 1h.
2. Do you see any other unusual logs on the source DB or any error logs
on the destination DB? If so, please share them.
3. Is it possible to share a reproducible test case for the issue?
4. I observe that if we take a lock on a table in the destination DB that
is subscribed to a publication in the source DB, we may end up getting the
same logs as you mentioned in [1]. Can you check whether anything similar
is happening? (A query sketch follows below.)
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org
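For point 4, a minimal sketch of a check that could be run on the
destination, using the standard pg_stat_activity view:

SELECT pid, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type LIKE 'logical replication%'
   OR wait_event_type = 'Lock';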

Thanks and Regards
Shlok Kyal



Fwd: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh
Date:

Hi, Shlok Kyal.
1. We set wal_sender_timeout=1h and wal_receiver_timeout=1h.
2. Logs after these changes, for a table of size 39 GB:
DB_source logs
2024-04-22 14:13:25 UTC:10.10.10.10(64442):replication_role@db:[16497]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: starting logical decoding for slot "db_name_public_subscription"
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Streaming transactions committing after 139C2/AD4A6FA0, reading WAL from 139C2/AD414CC0.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found initial starting point at 139C2/AD416018
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Waiting for transactions (approximately 3) older than 2782438580 to end.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found consistent point at 139C2/AD4A6F68
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_18989252_sync_17090" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: starting logical decoding for slot "db_name_public_subscription_18989252_sync_17090"
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: Streaming transactions committing after 139C2/AD637D90, reading WAL from 139C2/AD6317D8.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:CONTEXT: slot "db_name_public_subscription", output plugin "pgoutput", in the commit callback, associated LSN 139C6/3C286C98
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:CONTEXT: slot "db_name_public_subscription_18989252_sync_17090", output plugin "pgoutput", in the begin callback, associated LSN 139C5/4C0A92B8
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')

DB_destination logs

2024-04-22 14:13:25 UTC::@:[847]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
2024-04-22 14:13:25 UTC::@:[848]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-04-22 15:00:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:00:58 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.752 s, sync=0.005 s, total=1.782 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65540 kB, estimate=4065870 kB
2024-04-22 15:05:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:05:58 UTC::@:[403]:LOG: checkpoint complete: wrote 12 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.249 s, sync=0.005 s, total=1.277 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65487 kB, estimate=3665831 kB
2024-04-22 15:10:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:11:00 UTC::@:[403]:LOG: checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=2.455 s, sync=0.005 s, total=2.483 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65631 kB, estimate=3305811 kB
2024-04-22 15:15:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:15:59 UTC::@:[403]:LOG: checkpoint complete: wrote 16 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.652 s, sync=0.006 s, total=1.719 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65541 kB, estimate=2981784 kB
2024-04-22 15:20:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:20:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.006 s, total=1.783 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65522 kB, estimate=2690158 kB
2024-04-22 15:25:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:25:58 UTC::@:[403]:LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.645 s, sync=0.004 s, total=0.673 s; sync files=3, longest=0.004 s, average=0.002 s; distance=65456 kB, estimate=2427688 kB
2024-04-22 15:30:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:30:59 UTC::@:[403]:LOG: checkpoint complete: wrote 15 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.550 s, sync=0.005 s, total=1.580 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65573 kB, estimate=2191476 kB
2024-04-22 15:35:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:35:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.005 s, total=1.779 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65541 kB, estimate=1978883 kB
2024-04-22 15:40:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:40:58 UTC::@:[403]:LOG: checkpoint complete: wrote 7 buffers (0.0%); 0 WAL file(s) added, 2 removed, 0 recycled; write=0.746 s, sync=0.004 s, total=0.778 s; sync files=3, longest=0.004 s, average=0.002 s; distance=131032 kB, estimate=1794098 kB
2024-04-22 15:45:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:45:59 UTC::@:[403]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.148 s, sync=0.005 s, total=1.180 s; sync files=3, longest=0.005 s, average=0.002 s; distance=65533 kB, estimate=1621241 kB
2024-04-22 15:48:33 UTC::@:[847]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 847) exited with exit code 1
2024-04-22 15:48:33 UTC::@:[848]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 848) exited with exit code 1
3. This log was captured for a table of size 39 GB, but we have tables much bigger than this.
Our workflow for the 39 GB table:
1) Downloaded the schema from the source database instance.
2) Deleted PK, FK, and indexes from the table schema (we did it to speed up the data load / startup process).
3) Uploaded the schema to the destination DB.
4) Configured REPLICA IDENTITY FULL on the source DB for this table.
5) Configured logical replication of this table between the source and destination DB.
6) During the catchup process on this table we got the messages in the log that you can see above.
7) We also tried to create the primary key for this table during the catchup state, but this process was blocked by the logical replication worker, so we had to cancel the primary key creation.
I want to highlight that we used this workflow on PostgreSQL 10 and it worked.
4. Can you clarify the point 'One important point. If there is no request to source DB logical replication works fine for big tables.'?
I meant that if the source database doesn't receive delete, insert, or update queries, the catchup process is very short and we don't have this problem.
Thank you in advance.
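As a side note, while the catchup is running the slot's backlog on the source
can be checked with a query like this (a sketch; view and column names as in
PostgreSQL 13):

SELECT slot_name, active, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical';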



On Fri, 19 Apr 2024 at 14:51, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30seconds up to 300 seconds for source and destination DB but we got the same error.
> We set REPLICA IDENTITY FULL on Postgresql 13 because we got an error at destination DB "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on Postgresql 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      18433
>> > Logged by:          Kostiantyn
>> > Email address:      tomahkvt@gmail.com
>> > PostgreSQL version: 13.14
>> > Operating system:   AWS RDS
>> > Description:
>> >
>> > On Postgresql 10 we used the following approach for the big tables:
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) Configure Logical replication between source and destination DB.
>> > 5) When last table logical replication table synchronization worker for
>> > subscription "db_name_public_subscription", table "table_name" has
>> > finished
>> > 6) we created all the necessary PK, FK, and Indexes.
>> > This approach allowed to us upload data more quickly. This approach was
>> > working  great on PostgreSQL 10.
>> >
>> > We tried the same approach for Postgresql13, but we got an error.
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) configurated identity replication full at source DB for tables bigger
>> > than 100Gb
>> > 5) Configured Logical replication between source and destination DB.
>> > 6) During catchup on this big table  process we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point. If there is no request to source DB logical replication
>> > works fine for big tables.
>> > I saw the messages in PostgreSQL bugs like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > But I also did some tests and increased wal_keep_size
>> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
>> > without success. The setup works in PG 13 only with a small amount of data.
>>
>> I went through the issue and I think that the given logs are appearing
>> due to some delay in the Apply Worker process.
>> I could reproduce it in my development environment by applying delays
>> in Apply Worker.
>>
>> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> value?
>>
>> Also, I noticed that while using Postgresql13 you are configuring the
>> table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
>> Postgresql10. Is there any specific reason for it?
>> I pointed it out because REPLICA IDENTITY FULL has a different
>> behaviour and sets the entire row as key.

Few suggestions:
1. Can you try increasing  wal_sender_timeout and wal_receiver_timeout
to more than 1hrs? As in [1] you mentioned you were getting the same
issue for wal_sender_timeout set to 1hr.
2. Do you find any other unusual logs in SourceDB or any errors logs
in DestinationDB, If so please share them ?
3. Is it possible to share a reproducible testcase for the issue?
4. I observe that if we take a lock on table in DestinationDB which is
subscribed to publication in SourceDB, we may end up getting the same
logs as you mentioned in [1]. Can you check if anything similar is
happening?
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org

Thanks and Regards
Shlok Kyal

Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh
Date:
Hi, Shlok Kyal.
I want to add what happened when we tried to use the default REPLICA IDENTITY on PostgreSQL 13.
We got the following messages:
2024-03-29 06:53:52 UTC::@:[28650]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
2024-03-29 06:53:52 UTC::@:[28651]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-03-29 06:54:06 UTC::@:[408]:LOG: checkpoint starting: time
2024-03-29 06:55:36 UTC::@:[28651]:ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
2024-03-29 06:55:36 UTC::@:[396]:LOG: background worker "logical replication worker" (PID 28651) exited with exit code 1
2024-03-29 06:55:36 UTC::@:[31784]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
The startup state then starts from the beginning. The table at the destination DB is overwritten many times, and as a result the table at the destination DB becomes bigger than the source. This is the behavior on PostgreSQL 13.

On PostgreSQL 10, as I remember, there was no "background worker "logical replication worker" (PID 28651) exited with exit code 1" message. Instead:
1) The destination DB reported that the "logical replication table synchronization worker for subscription "db_name_public_subscription", table "table_name" has finished".
2) Then the destination DB periodically reported that the destination table didn't have a PRIMARY KEY, without any "background worker "logical replication worker" exited with exit code 1" message. The replication process waited until the problem was fixed.
3) We created a primary key at the destination DB.
4) The replication process continued without any failures or restarts.
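For reference, the per-table synchronization state on the destination can be
inspected with the standard catalog (srsubstate codes: i = initialize,
d = data copy, s = synchronized, r = ready):

SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel;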



On Tue, 23 Apr 2024 at 12:58, Kostiantyn Tomakh <tomahkvt@gmail.com> wrote:

Hi, Shlok Kyal.
1.  We set wal_sender_timeout=1h and wal_receiver_timeout=1h
2. Logs after this changes for table size 39GB
DB_source logs
2024-04-22 14:13:25 UTC:10.10.10.10(64442):replication_role@db:[16497]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: starting logical decoding for slot "db_name_public_subscription"
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Streaming transactions committing after 139C2/AD4A6FA0, reading WAL from 139C2/AD414CC0.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found initial starting point at 139C2/AD416018
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: Waiting for transactions (approximately 3) older than 2782438580 to end.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: logical decoding found consistent point at 139C2/AD4A6F68
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_18989252_sync_17090" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: starting logical decoding for slot "db_name_public_subscription_18989252_sync_17090"
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: Streaming transactions committing after 139C2/AD637D90, reading WAL from 139C2/AD6317D8.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:DETAIL: There are no running transactions.
2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:CONTEXT: slot "db_name_public_subscription", output plugin "pgoutput", in the commit callback, associated LSN 139C6/3C286C98
2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role@db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:LOG: terminating walsender process due to replication timeout
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:CONTEXT: slot "db_name_public_subscription_18989252_sync_17090", output plugin "pgoutput", in the begin callback, associated LSN 139C5/4C0A92B8
2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role@db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')

DB_destination logs

2024-04-22 14:13:25 UTC::@:[847]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
2024-04-22 14:13:25 UTC::@:[848]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-04-22 15:00:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:00:58 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.752 s, sync=0.005 s, total=1.782 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65540 kB, estimate=4065870 kB
2024-04-22 15:05:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:05:58 UTC::@:[403]:LOG: checkpoint complete: wrote 12 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.249 s, sync=0.005 s, total=1.277 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65487 kB, estimate=3665831 kB
2024-04-22 15:10:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:11:00 UTC::@:[403]:LOG: checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=2.455 s, sync=0.005 s, total=2.483 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65631 kB, estimate=3305811 kB
2024-04-22 15:15:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:15:59 UTC::@:[403]:LOG: checkpoint complete: wrote 16 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.652 s, sync=0.006 s, total=1.719 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65541 kB, estimate=2981784 kB
2024-04-22 15:20:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:20:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.006 s, total=1.783 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65522 kB, estimate=2690158 kB
2024-04-22 15:25:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:25:58 UTC::@:[403]:LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.645 s, sync=0.004 s, total=0.673 s; sync files=3, longest=0.004 s, average=0.002 s; distance=65456 kB, estimate=2427688 kB
2024-04-22 15:30:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:30:59 UTC::@:[403]:LOG: checkpoint complete: wrote 15 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.550 s, sync=0.005 s, total=1.580 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65573 kB, estimate=2191476 kB
2024-04-22 15:35:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:35:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.005 s, total=1.779 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65541 kB, estimate=1978883 kB
2024-04-22 15:40:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:40:58 UTC::@:[403]:LOG: checkpoint complete: wrote 7 buffers (0.0%); 0 WAL file(s) added, 2 removed, 0 recycled; write=0.746 s, sync=0.004 s, total=0.778 s; sync files=3, longest=0.004 s, average=0.002 s; distance=131032 kB, estimate=1794098 kB
2024-04-22 15:45:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:45:59 UTC::@:[403]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.148 s, sync=0.005 s, total=1.180 s; sync files=3, longest=0.005 s, average=0.002 s; distance=65533 kB, estimate=1621241 kB
2024-04-22 15:48:33 UTC::@:[847]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 847) exited with exit code 1
2024-04-22 15:48:33 UTC::@:[848]:FATAL: terminating logical replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 848) exited with exit code 1
3. This log was got for a table size of 39 GB but we have tables much bigger than this.
Our workflow for the tables size 39 GB
1) Download schema from the source database instance
2) Deleted PK, FK, and Indexes  from the table schema(we did it to increase data load, startup process)
3)Upload the schema to the destination DB.
4) configurated identity replication full at source DB for this table
5) Configured Logical replication between source and destination DB this table
6) During catchup on this table  process we got the messages in log that you can see above.
7) We also tried create primary key for this table during catchup state but this process was blocked by logical replication worker so we had to cancel primary key creation process.
I want to highlight that we used this workflow for PostgreSQL 10 and it worked.
4. Can you clarify the point 'One important point. If there is no request to source DB logical replication works fine for big tables.'?
I meant that if the source database doesn't have delete, insert, update queries catch up process very short and we don't have this problem.
Thank you in advance.



On Fri, 19 Apr 2024 at 14:51, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30seconds up to 300 seconds for source and destination DB but we got the same error.
> We set REPLICA IDENTITY FULL on Postgresql 13 because we got an error at destination DB "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on Postgresql 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference:      18433
>> > Logged by:          Kostiantyn
>> > Email address:      tomahkvt@gmail.com
>> > PostgreSQL version: 13.14
>> > Operating system:   AWS RDS
>> > Description:
>> >
>> > On Postgresql 10 we used the following approach for the big tables:
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) Configure Logical replication between source and destination DB.
>> > 5) When last table logical replication table synchronization worker for
>> > subscription "db_name_public_subscription", table "table_name" has
>> > finished
>> > 6) we created all the necessary PK, FK, and Indexes.
>> > This approach allowed to us upload data more quickly. This approach was
>> > working  great on PostgreSQL 10.
>> >
>> > We tried the same approach for Postgresql13, but we got an error.
>> > 1) Download schema from the source database instance
>> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> > schema
>> > 3)Upload the schema to the destination DB.
>> > 4) configurated identity replication full at source DB for tables bigger
>> > than 100Gb
>> > 5) Configured Logical replication between source and destination DB.
>> > 6) During catchup on this big table  process we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point. If there is no request to source DB logical replication
>> > works fine for big tables.
>> > I saw the messages in PostgreSQL bugs like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > But I also did some tests and increased wal_keep_size
>> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
>> > without success. The setup works in PG 13 only with a small amount of data.
>>
>> I went through the issue and I think that the given logs are appearing
>> due to some delay in the Apply Worker process.
>> I could reproduce it in my development environment by applying delays
>> in Apply Worker.
>>
>> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> value?
>>
>> Also, I noticed that while using Postgresql13 you are configuring the
>> table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
>> Postgresql10. Is there any specific reason for it?
>> I pointed it out because REPLICA IDENTITY FULL has a different
>> behaviour and sets the entire row as key.

Few suggestions:
1. Can you try increasing  wal_sender_timeout and wal_receiver_timeout
to more than 1hrs? As in [1] you mentioned you were getting the same
issue for wal_sender_timeout set to 1hr.
2. Do you find any other unusual logs in SourceDB or any errors logs
in DestinationDB, If so please share them ?
3. Is it possible to share a reproducible testcase for the issue?
4. I observe that if we take a lock on table in DestinationDB which is
subscribed to publication in SourceDB, we may end up getting the same
logs as you mentioned in [1]. Can you check if anything similar is
happening?
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org

Thanks and Regards
Shlok Kyal

Re: BUG #18433: Logical replication timeout

From: Shlok Kyal
Date:
Hi

> 3. This log was got for a table size of 39 GB but we have tables much bigger than this.
>
> Our workflow for the tables size 39 GB
>
> 1) Download schema from the source database instance
>
> 2) Deleted PK, FK, and Indexes  from the table schema(we did it to increase data load, startup process)
> 3)Upload the schema to the destination DB.
> 4) configurated identity replication full at source DB for this table
> 5) Configured Logical replication between source and destination DB this table
> 6) During catchup on this table  process we got the messages in log that you can see above.
>
> 7) We also tried create primary key for this table during catchup state but this process was blocked by logical replication worker so we had to cancel primary key creation process.
>
> I want to highlight that we used this workflow for PostgreSQL 10 and it worked.
>

I tried the steps you provided to reproduce the issue on PostgreSQL 13,
but I was not able to reproduce it. I tried it for a table of size
~120 GB and also, separately, for a table of size ~40 GB.

Steps I used:
1. Created a sourceDB and destinationDB.
2. Populated sourceDB with data:
"create table t1(c1 int PRIMARY KEY, c2 int);"
"insert into t1 values(generate_series(1,1500000000), 2)"
"update t1 set c1 = 0 where c1 = 1"
3. Deleted the primary key on sourceDB:
"alter table t1 drop constraint t1_pkey"
4. Created the schema on destinationDB:
"create table t1(c1 int, c2 int);"
5. Configured REPLICA IDENTITY FULL on sourceDB:
"ALTER TABLE t1 REPLICA IDENTITY FULL"
6. Configured the logical replication setup:
on sourceDB: "create publication pub1_new for all tables"
on destinationDB: "create subscription test1_new connection 'dbname=postgres host=localhost port=5432' publication pub1_new;"

For me the initial catchup was successful and I did not receive any
issue in sourceDB logs or destinationDB logs.
Can you point out if you did anything different?

Also, for me the initial catchup succeeded even when I did not configure
REPLICA IDENTITY FULL on sourceDB.
Can you point out how your steps were different, so that you got the error
related to REPLICA IDENTITY FULL?

Also, are you running any queries concurrently on sourceDB or destinationDB?

Thanks and Regards,
Shlok Kyal



Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh
Date:
Hello, Shlok Kyal.

I was able to reproduce the problem.
I did it on a Docker-based platform; I hope you will be able to reproduce the problem too.
1) I ran the docker-compose file with PostgreSQL 10:
docker-compose.yml:
version: '3.3'
services:
  postgres:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"
postgresql.conf:
hba_file = '/etc/pg_hba.conf'
listen_addresses = '*'
wal_level = logical
pg_hba.conf:
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host all all 0.0.0.0/0 md5
So I have two PostgreSQL servers.
2) I created the table at DB "postgres":
CREATE TABLE public.table (
    field_1 uuid NOT NULL,
    field_2 uuid NOT NULL,
    field_3 uuid NOT NULL,
    field_4 text NOT NULL,
    field_5 text NOT NULL,
    field_6 timestamp with time zone NOT NULL,
    field_7 timestamp with time zone NOT NULL,
    field_8 smallint NOT NULL,
    field_9 text NOT NULL,
    field_10 text NOT NULL,
    field_11 uuid NOT NULL,
    field_12 uuid,
    field_13 text,
    field_14 uuid,
    field_15 text NOT NULL,
    field_16 integer,
    field_17 text NOT NULL,
    field_18 uuid,
    field_19 character varying(20),
    field_20 character varying(29),
    field_21 text,
    field_22 bigint,
    field_23 character varying(20)
);

ALTER TABLE ONLY public.table
    ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
3) Created publication at DB "postgres":
CREATE ROLE replication_role  WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";
4) I started two Python scripts that were generating requests to "postgres":
insert.py:
import psycopg2
from faker import Faker
import uuid
import random
from datetime import datetime, timedelta

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to generate fake data for one row
def generate_fake_row():
    fake = Faker()
    row = (
        str(uuid.uuid4()),  # Convert UUID to string
        str(uuid.uuid4()),  # Convert UUID to string
        str(uuid.uuid4()),  # Convert UUID to string
        fake.text(),  # field_4
        fake.text(),  # field_5
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_6
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_7
        random.randint(0, 100),  # field_8
        fake.text(),  # field_9
        fake.text(),  # field_10
        str(uuid.uuid4()),  # Convert UUID to string
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # Convert UUID to string
        fake.text() if random.random() > 0.5 else None,  # field_13
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # Convert UUID to string
        fake.text(),  # field_15
        random.randint(0, 1000),  # field_16
        fake.text(),  # field_17
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # Convert UUID to string
        fake.random_letter() * 20,  # field_19
        fake.random_letter() * 29,  # field_20
        fake.text() if random.random() > 0.5 else None,  # field_21
        random.randint(0, 1000000),  # field_22
        fake.random_letter() * 20  # field_23
    )
    return row

# Function to insert one row of fake data into the table
def insert_fake_row(row):
    sql = """INSERT INTO public.table (
        field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8,
        field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16,
        field_17, field_18, field_19, field_20, field_21, field_22, field_23
    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
    cur.execute(sql, row)
    conn.commit()

# Insert one row at a time in an independent loop
while True:
    fake_row = generate_fake_row()  # Generate fake data for one row
    insert_fake_row(fake_row)  # Insert the row

# Close the cursor and the connection
cur.close()
conn.close()
update.py:
import psycopg2
import random
import time

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to update any row in the table
def update_random_row():
    # Generate a random value for field_8 and field_16
    new_field_8 = random.randint(0, 100)
    new_field_16 = random.randint(0, 1000)

    # Choose a random row to update
    cur.execute("SELECT field_1 FROM public.table;")
    rows = cur.fetchall()
    random_row = random.choice(rows)[0]

    # Update the chosen row with the new values
    cur.execute("UPDATE public.table SET field_8 = %s, field_16 = %s WHERE field_1 = %s;", (new_field_8, new_field_16, random_row))
    conn.commit()

# Infinite loop to continuously update rows
while True:
    update_random_row()
    time.sleep(1)  # Wait for 1 second before the next update

# Close the cursor and the connection (won't be executed in an infinite loop)
cur.close()
conn.close()
5) Created the subscription at DB postgres2:
CREATE SUBSCRIPTION db_name_public_subscription CONNECTION 'host=postgres port=5432 password=password user=replication_role dbname=db' PUBLICATION db_name_public_publication;

6) Then I got some messages in logs:
10-postgres2-1  | 2024-05-01 16:34:16.296 GMT [38] LOG:  logical replication apply worker for subscription "db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: starting logical decoding for slot "db_name_public_subscription"
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: streaming transactions committing after 0/1B5FFC8, reading WAL from 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: logical decoding found consistent point at 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.302 GMT [39] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] LOG: logical decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_16391_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: starting logical decoding for slot "db_name_public_subscription_16391_sync_16385"
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: streaming transactions committing after 0/1B61110, reading WAL from 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385" LOGICAL 0/1B61110 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: logical decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: There are no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385" LOGICAL 0/1B61110 (proto_version '1', publication_names '"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.479 GMT [39] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has finished
10-postgres2-1 | 2024-05-01 16:34:16.782 GMT [38] ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
10-postgres2-1 | 2024-05-01 16:34:16.783 GMT [1] LOG: worker process: logical replication worker for subscription 16391 (PID 38) exited with exit code 1
10-postgres2-1 | 2024-05-01 16:34:21.788 GMT [40] LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:21.791 GMT [41] LOG: starting logical decoding for slot "db_name_public_subscription"
After these logs the table size in DB postgres was 50 MB and the table size in DB postgres2 was 45 MB.
7) When I created the primary key at DB postgres2, the tables at DB postgres and DB postgres2 were the same size:
ALTER TABLE ONLY public."table"
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
Result: I had fully consistent data at DB postgres2 and a normal replication process.

Then I repeated the steps for Postgresql 13.14.

1) I ran the docker-compose file with PostgreSQL 13:
docker-compose.yml:
version: '3.3'
services:
  postgres:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"
So I have two Postgresql servers.
postgresql.conf:
hba_file = '/etc/pg_hba.conf'
listen_addresses = '*'
ssl = off
wal_level = logical  
pg_hba.conf:
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host all all all md5

2) I created the table at DB "postgres":
CREATE TABLE public."table" (
field_1 uuid NOT NULL,
field_2 uuid NOT NULL,
field_3 uuid NOT NULL,
field_4 text NOT NULL,
field_5 text NOT NULL,
field_6 timestamp with time zone NOT NULL,
field_7 timestamp with time zone NOT NULL,
field_8 smallint NOT NULL,
field_9 text NOT NULL,
field_10 text NOT NULL,
field_11 uuid NOT NULL,
field_12 uuid,
field_13 text,
field_14 uuid,
field_15 text NOT NULL,
field_16 integer,
field_17 text NOT NULL,
field_18 uuid,
field_19 character varying(20),
field_20 character varying(29),
field_21 text,
field_22 bigint,
field_23 character varying(20)
);

ALTER TABLE ONLY public."table"
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);
3) Created publication at DB "postgres":
CREATE ROLE replication_role  WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";
4) I started two Python scripts that were generating requests to "postgres". The scripts insert.py and update.py were described earlier.
5) Create subscription at DB postgres2:
CREATE SUBSCRIPTION db_name_public_subscription CONNECTION 'host=postgres port=5432 password=password user=replication_role dbname=db' PUBLICATION db_name_public_publication;

6) Then I got some messages in logs:
13-postgres-1   | 2024-05-01 16:26:15.348 GMT [40] STATEMENT:  CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
13-postgres2-1 | 2024-05-01 16:26:15.378 GMT [39] LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: starting logical decoding for slot "db_name_public_subscription"
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Streaming transactions committing after 0/A0E10F0, reading WAL from 0/A0E1048.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical decoding found initial starting point at 0/A0E1048
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Waiting for transactions (approximately 1) older than 66255 to end.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical decoding found consistent point at 0/A0E10B8
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:15.384 GMT [40] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] LOG: logical decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_16500_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: starting logical decoding for slot "db_name_public_subscription_16500_sync_16385"
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: Streaming transactions committing after 0/A0E4610, reading WAL from 0/A0E45D8.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: logical decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: There are no running transactions.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.994 GMT [40] ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] LOG: could not send data to client: Connection reset by peer
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] CONTEXT: slot "db_name_public_subscription_16500_sync_16385", output plugin "pgoutput", in the commit callback, associated LSN 0/A0F0F58
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385" LOGICAL 0/A0E4610 (proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.996 GMT [1] LOG: background worker "logical replication worker" (PID 40) exited with exit code 1
13-postgres2-1 | 2024-05-01 16:26:20.393 GMT [41] LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] LOG: logical decoding found consistent point at 0/A12E068
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] DETAIL: There are no running transactions.
After these logs the table size in DB postgres was 50 MB, while the table size in DB postgres2 was 450 MB and still increasing.
I was checking table sizes with the following query:
select
table_schema,
table_name,
pg_relation_size('"'||table_schema||'"."'||table_name||'"'),
pg_size_pretty(pg_relation_size('"'||table_schema||'"."'||table_name||'"'))
from information_schema.tables
where table_schema in ('public', 'saga', 'hangfire')
order by 3 DESC
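Note that pg_relation_size() counts only the table's main data fork. If indexes and TOAST data should be counted as well when comparing the two databases, a variant of the same query using pg_total_relation_size() could be used, for example:
-- same query as above, but pg_total_relation_size() also counts indexes and TOAST
select
table_schema,
table_name,
pg_total_relation_size('"'||table_schema||'"."'||table_name||'"'),
pg_size_pretty(pg_total_relation_size('"'||table_schema||'"."'||table_name||'"'))
from information_schema.tables
where table_schema in ('public', 'saga', 'hangfire')
order by 3 DESC;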
Please help to resolve this problem; this would allow us to migrate big tables.
I am looking forward to your answer.
I also attached files with logs to this email.

Mon, 29 Apr 2024 at 13:14, Shlok Kyal <shlok.kyal.oss@gmail.com>:
Hi

> 3. This log was got for a table size of 39 GB but we have tables much bigger than this.
>
> Our workflow for the tables size 39 GB
>
> 1) Download schema from the source database instance
>
> 2) Deleted PK, FK, and Indexes  from the table schema(we did it to increase data load, startup process)
> 3)Upload the schema to the destination DB.
> 4) configurated identity replication full at source DB for this table
> 5) Configured Logical replication between source and destination DB this table
> 6) During catchup on this table  process we got the messages in log that you can see above.
>
> 7) We also tried create primary key for this table during catchup state but this process was blocked by logical replication worker so we had to cancel primary key creation process.
>
> I want to highlight that we used this workflow for PostgreSQL 10 and it worked.
>

I tried the steps you provided to reproduce the issue on Postgresql 13,
but I was not able to reproduce it. I tried it for tables with size
~120GB and also tried it separately for a table with size ~40GB.

Steps I used:
1. Created a sourceDB and destinationDB.
2. Populated sourceDB with data.
"create table t1(c1 int PRIMARY KEY, c2 int);"
"insert into t1 values(generate_series(1,1500000000), 2)"
"update t1 set c1 = 0 where c1 = 1"
3. Deleted the primary key on sourceDB.
"alter table t1 drop constraint t1_pkey"
4. Created the schema on destinationDB.
"create table t1(c1 int, c2 int);"
5. Configured replica identity full on sourceDB.
"ALTER TABLE t1 REPLICA IDENTITY FULL"
6. Configured the logical replication setup.
on sourceDB: "create publication pub1_new for all tables"
on destinationDB: "create subscription test1_new connection
'dbname=postgres host=localhost port=5432' publication pub1_new;"

For me the initial catchup was successful and I did not receive any
issue in sourceDB logs or destinationDB logs.
Can you point out if you did anything different?

Also, for me the initial catchup succeeds even if I did not configure
replica identity full on sourceDB.
Can you point out how your steps were different to get the error related
to replica identity full?
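For reference, the replica identity actually in effect for the table can be checked on sourceDB with a catalog query such as the following (t1 is the table from the steps above):
-- 'd' = default (primary key), 'n' = nothing, 'f' = full, 'i' = index
SELECT relname, relreplident
FROM pg_class
WHERE relname = 't1';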

Also, are you running any queries concurrently on sourceDB or destinationDB?

Thanks and Regards,
Shlok Kyal
Вложения

Re: BUG #18433: Logical replication timeout

От
Bruce Momjian
Дата:
On Wed, May  1, 2024 at 08:25:35PM +0300, Kostiantyn Tomakh wrote:
> Hello, Shlok Kyal.
> 
> I was able to reproduce the problem.
> I did it on docker based platform I hope you will be able to reproduce this
> problem too.
> 1) I ran the docker-compose file with PostgreSQL 10:
> docker-compose.yml:

Postgres 10 was end-of-life on November 10, 2022:

    https://www.postgresql.org/support/versioning/

I suggest you reproduce on a supported version.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: BUG #18433: Logical replication timeout

От
Kostiantyn Tomakh
Дата:
Hello, Bruce Momjian.
I know of course that Postgres 10 reached end of life on November 10, 2022.
I compared the behavior of Postgresql 10 and Postgresql 13 in the previous email. 
You can see this comparison below in the previous email.

In the previous email, I described the approach that we used for transferring big tables (2 Tbytes) between databases. That approach worked for Postgresql 10. But we had the problem with Postgresql 13 when we used the approach that was described in the previous email.
Just now I completed the test with source DB Postgresql 13 and destination DB Postgresql 15.
The test was successful.
So we have the problem when the source DB is Postgresql 13 and the destination DB is Postgresql 13.
The test and problem were described in the previous email.
Bruce Momjian, Could you help with this problem?




Re: BUG #18433: Logical replication timeout

От
Shlok Kyal
Дата:
Hi,

> I was able to reproduce the problem.
> I did it on docker based platform I hope you will be able to reproduce this problem too.

Thanks for providing the detailed steps to reproduce the issue. I was
able to reproduce the issue with the steps you provided.
I noticed that the issue regarding the increased table size on the
subscriber can happen in all versions till Postgres 13 and I was able
to reproduce that. This is a timing issue and hence you may not be
getting this issue in postgres 10.

This issue occurs because the tablesync worker exits (due to the UPDATE
command) and restarts again, as seen in the logs:
2024-05-01 16:26:15.384 GMT [40] LOG:  logical replication table
synchronization worker for subscription "db_name_public_subscription",
table "table" has started
2024-05-01 16:26:16.994 GMT [40] ERROR:  logical replication target
relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY
KEY and published relation does not have REPLICA IDENTITY FULL
2024-05-01 16:26:20.393 GMT [41] LOG:  logical replication table
synchronization worker for subscription "db_name_public_subscription",
table "table" has started

The tablesync worker syncs the initial data from the publisher to the
subscriber using the COPY command. But in this case it exits (after the
copy phase is completed) and restarts, so it performs the entire copy
operation again, and hence we can see the increased table size on the
subscriber.
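
One way to watch this happening is to query the per-table synchronization state on the subscriber while the copy is running, for example:
-- run on the subscriber; srsubstate: 'i' = initialize, 'd' = data is being
-- copied, 's' = synchronized, 'r' = ready ('f' = finished copy exists only
-- in Postgres 14 and above)
SELECT srrelid::regclass AS relation, srsubstate
FROM pg_subscription_rel;
On Postgres 13 the table returns to state 'd' after the tablesync worker restarts, so the copy is done again from scratch.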

This issue is not reproducible in Postgres 14 and above versions. It
was mitigated by commit [1], which introduced a new state
'FINISHEDCOPY'. So if the tablesync worker exits (after the copy phase
is completed) and restarts, it does not perform the COPY command again
and proceeds directly to synchronizing the WAL position between the
tablesync worker and the apply worker.

code:
+   else if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+   {
+       /*
+        * The COPY phase was previously done, but tablesync then crashed
+        * before it was able to finish normally.
+        */
+       StartTransactionCommand();
+
+       /*
+        * The origin tracking name must already exist. It was created first
+        * time this tablesync was launched.
+        */
+       originid = replorigin_by_name(originname, false);
+       replorigin_session_setup(originid);
+       replorigin_session_origin = originid;
+       *origin_startpos = replorigin_session_get_progress(false);
+
+       CommitTransactionCommand();
+
+       goto copy_table_done;
+   }

Backpatching commit [1] to Postgres 13 and Postgres 12 will mitigate this issue.
Thoughts?

[1] https://github.com/postgres/postgres/commit/ce0fdbfe9722867b7fad4d3ede9b6a6bfc51fb4e

Thanks and Regards,
Shlok Kyal



Re: BUG #18433: Logical replication timeout

От
Kostiantyn Tomakh
Дата:
Hello, Shlok Kyal.
I found the solution for myself. I decided to migrate from PostgreSQL 13 to Postgresql 15. 
I used the following approach: source DB PostgreSQL 13 and destination DB PostgreSQL 15.
Fortunately, this problem exists only when the destination DB is PostgreSQL 13.
There are two solutions to this issue:
1) Fix this problem.
2) Inform people that they can have problems if they use PostgreSQL 13 as the destination DB during logical replication.
I think the best choice is the first option.
Shlok Kyal, thank you very much for your help. We really appreciate it.
