Re: Skipping logical replication transactions on subscriber side
От | vignesh C |
---|---|
Тема | Re: Skipping logical replication transactions on subscriber side |
Дата | |
Msg-id | CALDaNm1kDoB8TigAHUJ01s17AhU5DznBfNK2ywCedi95DvNFgA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Skipping logical replication transactions on subscriber side (Masahiko Sawada <sawada.mshk@gmail.com>) |
Ответы |
Re: Skipping logical replication transactions on subscriber side
(Masahiko Sawada <sawada.mshk@gmail.com>)
|
Список | pgsql-hackers |
On Tue, Nov 16, 2021 at 12:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 15, 2021 at 11:43 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, Nov 15, 2021 at 2:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Nov 15, 2021 at 4:49 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > > > > > > > On Mon, Nov 15, 2021 at 1:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > I've attached an updated patch that incorporates all comments I got so > > > > > far. Please review it. > > > > > > > > > > > > > Thanks for the updated patch. > > > > A few minor comments: > > > > > > > > doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > > > > > > > > (1) tab in doc updates > > > > > > > > There's a tab before "Otherwise,": > > > > > > > > + copy of the relation with <parameter>relid</parameter>. > > > > Otherwise, > > > > > > Fixed. > > > > > > > > > > > src/backend/utils/adt/pgstatfuncs.c > > > > > > > > (2) The function comment for "pg_stat_reset_subscription_worker_sub" > > > > seems a bit long and I expected it to be multi-line (did you run > > > > pg_indent?) > > > > > > I ran pg_indent on pgstatfuncs.c but it didn't become a multi-line comment. > > > > > > > > > > > src/include/pgstat.h > > > > > > > > (3) Remove PgStat_StatSubWorkerEntry.dbid? > > > > > > > > The "dbid" member of the new PgStat_StatSubWorkerEntry struct doesn't > > > > seem to be used, so I think it should be removed. > > > > (I could remove it and everything builds OK and tests pass). > > > > > > > > > > Fixed. > > > > > > Thank you for the comments! I've updated an updated version patch. > > > > Thanks for the updated patch. > > I found one issue: > > This Assert can fail in few cases: > > +void > > +pgstat_report_subworker_error(Oid subid, Oid subrelid, Oid relid, > > + > > LogicalRepMsgType command, TransactionId xid, > > + const char *errmsg) > > +{ > > + PgStat_MsgSubWorkerError msg; > > + int len; > > + > > + Assert(strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN); > > + len = offsetof(PgStat_MsgSubWorkerError, m_message[0]) + > > strlen(errmsg) + 1; > > + > > > > I could reproduce the problem with the following scenario: > > Publisher: > > create table t1 (c1 varchar); > > create publication pub1 for table t1; > > insert into t1 values(repeat('abcd', 5000)); > > > > Subscriber: > > create table t1(c1 smallint); > > create subscription sub1 connection 'dbname=postgres port=5432' > > publication pub1 with ( two_phase = true); > > postgres=# select * from pg_stat_subscription_workers; > > WARNING: terminating connection because of crash of another server process > > DETAIL: The postmaster has commanded this server process to roll back > > the current transaction and exit, because another server process > > exited abnormally and possibly corrupted shared memory. > > HINT: In a moment you should be able to reconnect to the database and > > repeat your command. > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > The connection to the server was lost. Attempting reset: Failed. > > > > Subscriber logs: > > 2021-11-15 19:27:56.380 IST [15685] LOG: logical replication apply > > worker for subscription "sub1" has started > > 2021-11-15 19:27:56.384 IST [15687] LOG: logical replication table > > synchronization worker for subscription "sub1", table "t1" has started > > TRAP: FailedAssertion("strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN", > > File: "pgstat.c", Line: 1946, PID: 15687) > > postgres: logical replication worker for subscription 16387 sync 16384 > > (ExceptionalCondition+0xd0)[0x55a18f3c727f] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (pgstat_report_subworker_error+0x7a)[0x55a18f126417] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (ApplyWorkerMain+0x493)[0x55a18f176611] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (StartBackgroundWorker+0x23c)[0x55a18f11f7e2] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (+0x54efc0)[0x55a18f134fc0] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (+0x54f3af)[0x55a18f1353af] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (+0x54e338)[0x55a18f134338] > > /lib/x86_64-linux-gnu/libpthread.so.0(+0x141f0)[0x7feef84371f0] > > /lib/x86_64-linux-gnu/libc.so.6(__select+0x57)[0x7feef81e3ac7] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (+0x5498c2)[0x55a18f12f8c2] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (PostmasterMain+0x134c)[0x55a18f12f1dd] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (+0x43c3d4)[0x55a18f0223d4] > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xd5)[0x7feef80fd565] > > postgres: logical replication worker for subscription 16387 sync 16384 > > (_start+0x2e)[0x55a18ecaf4fe] > > 2021-11-15 19:27:56.483 IST [15645] LOG: background worker "logical > > replication worker" (PID 15687) was terminated by signal 6: Aborted > > 2021-11-15 19:27:56.483 IST [15645] LOG: terminating any other active > > server processes > > 2021-11-15 19:27:56.485 IST [15645] LOG: all server processes > > terminated; reinitializing > > > > Here it fails because of a long error message ""invalid input syntax > > for type smallint: > > Good catch! > > > \"abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabc...." > > because we try to insert varchar type data into smallint type. Maybe > > we should trim the error message in this case. > > Right. I've fixed this issue and attached an updated patch. Thanks for the updated patch. The issue is fixed in the patch provided. I found that in one of the scenarios the statistics is getting lost: Test steps: Step 1: Setup Publisher(create 100 publications pub1...pub100 for t1...t100) like below: =============================================== create table t1(c1 int); create publication pub1 for table t1; insert into t1 values(10); insert into t1 values(10); create table t2(c1 int); create publication pub1 for table t2; insert into t2 values(10); insert into t2 values(10); .... Script can be generated using: while [ $a -lt 100 ] do a=`expr $a + 1` echo "./psql -d postgres -p 5432 -c \"create table t$a(c1 int);\"" >> publisher.sh echo "./psql -d postgres -p 5432 -c \"create publication pub$a for table t$a;\"" >> publisher.sh echo "./psql -d postgres -p 5432 -c \"insert into t$a values(10);\"" >> publisher.sh echo "./psql -d postgres -p 5432 -c \"insert into t$a values(10);\"" >> publisher.sh done Step 2: Setup Subscriber(create 100 subscriptions): =============================================== create table t1(c1 int primary key); create subscription sub1 connection 'dbname=postgres port=5432' publication pub1; create table t2(c1 int primary key); create subscription sub2 connection 'dbname=postgres port=5432' publication pub2; .... Script can be generated using: while [ $a -lt 100] do a=`expr $a + 1` echo "./psql -d postgres -p 5433 -c \"create table t$a(c1 int primary key);\"" >> subscriber.sh echo "./psql -d postgres -p 5433 -c \"create subscription sub$a connection 'dbname=postgres port=5432' publication pub$a;\"" >> subscriber.sh done Step 3: postgres=# select * from pg_stat_subscription_workers order by subid; subid | subname | subrelid | relid | command | xid | error_count | error_message | first_error_time | last_error_time -------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+---------------------------------- 16389 | sub1 | 16384 | 16384 | | | 17 | duplicate key value violates unique constraint "t1_pkey" | 2021-11-17 12:01:46.141086+05:30 | 2021-11-17 12:03:13.175698+05:30 16395 | sub2 | 16390 | 16390 | | | 16 | duplicate key value violates unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 | 2021-11-17 12:03:15.512249+05:30 16401 | sub3 | 16396 | 16396 | | | 16 | duplicate key value violates unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 | 2021-11-17 12:03:15.802225+05:30 16407 | sub4 | 16402 | 16402 | | | 16 | duplicate key value violates unique constraint "t4_pkey" | 2021-11-17 12:01:51.390638+05:30 | 2021-11-17 12:03:14.709496+05:30 16413 | sub5 | 16408 | 16408 | | | 16 | duplicate key value violates unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 | 2021-11-17 12:03:15.257235+05:30 Step 4: Then restart the publisher Step 5: postgres=# select * from pg_stat_subscription_workers order by subid; subid | subname | subrelid | relid | command | xid | error_count | error_message | first_error_time | last_error_time -------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------------------------------------------------------------------------------------+----- -----------------------------+---------------------------------- 16389 | sub1 | 16384 | 16384 | | | 1 | could not create replication slot "pg_16389_sync_16384_7031422794938304519": FATAL: terminating connection due to administrator command+| 2021 -11-17 12:03:28.201247+05:30 | 2021-11-17 12:03:28.201247+05:30 | | | | | | | server closed the connection unexpectedly +| | | | | | | | | This probably means the server terminated abnormally +| | | | | | | | | before or while proce | | 16395 | sub2 | 16390 | 16390 | | | 18 | duplicate key value violates unique constraint "t2_pkey" | 2021 -11-17 12:01:51.337055+05:30 | 2021-11-17 12:03:23.832585+05:30 16401 | sub3 | 16396 | 16396 | | | 18 | duplicate key value violates unique constraint "t3_pkey" | 2021 -11-17 12:01:51.352157+05:30 | 2021-11-17 12:03:26.567873+05:30 16407 | sub4 | 16402 | 16402 | | | 1 | could not create replication slot "pg_16407_sync_16402_7031422794938304519": FATAL: terminating connection due to administrator command+| 2021 -11-17 12:03:28.196958+05:30 | 2021-11-17 12:03:28.196958+05:30 | | | | | | | server closed the connection unexpectedly +| | | | | | | | | This probably means the server terminated abnormally +| | | | | | | | | before or while proce | | 16413 | sub5 | 16408 | 16408 | | | 18 | duplicate key value violates unique constraint "t5_pkey" | 2021 -11-17 12:01:51.418825+05:30 | 2021-11-17 12:03:25.595697+05:30 Step 6: postgres=# select * from pg_stat_subscription_workers order by subid; subid | subname | subrelid | relid | command | xid | error_count | error_message | first_error_time | last_error_time -------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+---------------------------------- 16389 | sub1 | 16384 | 16384 | | | 1 | duplicate key value violates unique constraint "t1_pkey" | 2021-11-17 12:03:33.346514+05:30 | 2021-11-17 12:03:33.346514+05:30 16395 | sub2 | 16390 | 16390 | | | 19 | duplicate key value violates unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 | 2021-11-17 12:03:33.437505+05:30 16401 | sub3 | 16396 | 16396 | | | 19 | duplicate key value violates unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 | 2021-11-17 12:03:33.482954+05:30 16407 | sub4 | 16402 | 16402 | | | 1 | duplicate key value violates unique constraint "t4_pkey" | 2021-11-17 12:03:33.327489+05:30 | 2021-11-17 12:03:33.327489+05:30 16413 | sub5 | 16408 | 16408 | | | 19 | duplicate key value violates unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 | 2021-11-17 12:03:33.374522+05:30 We can see that sub1 and sub4 statistics are lost, old error_count value is lost. I'm not sure if this behavior is ok or not. Thoughts? Regards, Vignesh
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Pavel StehuleДата:
Сообщение: Re: Schema variables - new implementation for Postgres 15