Thread: Introduce wait_for_subscription_sync for TAP tests
Hi,

In TAP tests for logical replication, we have the following code in many places:

$node_publisher->wait_for_catchup('tap_sub');
my $synced_query =
  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

Also, we sometimes forget to add one of these checks, as was fixed in commit 1f50918a6fb02207d151e7cb4aae4c36de9d827c.

I think we can have a new function to wait for all subscriptions to synchronize data. The attached patch introduces a new function, wait_for_subscription_sync(). With this function, we can replace the above code with a single call:

$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments
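For reference, a minimal sketch of the proposed helper as it could look inside the TAP test node module, based on the call shape shown above and on the patch excerpt quoted in the review below. The argument handling, the default 'postgres' database, and the reliance on Carp's croak being available in the module are assumptions of this sketch, not the literal attached patch:

# A minimal sketch, not the attached patch itself.  Assumes it lives in
# the node module (PostgresNode/Cluster), where Carp is already imported.
sub wait_for_subscription_sync
{
    my ($self, $publisher, $subname, $dbname) = @_;
    my $name = $self->name;

    # Defaulting to the 'postgres' database is an assumption of this sketch.
    $dbname = 'postgres' unless defined($dbname);

    # Wait for the publisher's walsender to catch up, if a publisher
    # node was passed in.
    if (defined($publisher))
    {
        croak 'subscription name must be specified' unless defined($subname);
        $publisher->wait_for_catchup($subname);
    }

    # Wait until every table of every subscription on this node has
    # reached the 'syncdone' ('s') or 'ready' ('r') state.
    print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
    my $query = qq[SELECT count(1) = 0
      FROM pg_subscription_rel
      WHERE srsubstate NOT IN ('r', 's');];
    $self->poll_query_until($dbname, $query)
      or croak "timed out waiting for subscriber to synchronize data";

    return;
}

With something like that in place, the boilerplate quoted above collapses to the single call shown in the proposal.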
On Tue, Jul 26, 2022 at 7:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> In tap tests for logical replication, we have the following code in many places:
>
> $node_publisher->wait_for_catchup('tap_sub');
>     my $synced_query =
>       "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
> IN ('r', 's');";
>     $node_subscriber->poll_query_until('postgres', $synced_query)
>       or die "Timed out while waiting for subscriber to synchronize data";
>
> Also, we sometime forgot to check either one, like we fixed in commit
> 1f50918a6fb02207d151e7cb4aae4c36de9d827c.
>
> I think we can have a new function to wait for all subscriptions to
> synchronize data. The attached patch introduce a new function
> wait_for_subscription_sync(). With this function, we can replace the
> above code with this one function as follows:
>
> $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
>

+1. This reduces quite some code in various tests and will make it easier to write future tests.

A few comments/questions:
====================
1.
-$node_publisher->wait_for_catchup('mysub1');
-
-# Also wait for initial table sync to finish.
-my $synced_query =
-  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
-$node_subscriber->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
-
 # Also wait for initial table sync to finish.
-$node_subscriber->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'mysub1');

It seems to me that without your patch there is an extra poll in the above test. If so, we can probably remove that in a separate patch?

2.
+	# wait for the replication to catchup if required.
+	if (defined($publisher))
+	{
+		croak 'subscription name must be specified' unless defined($subname);
+		$publisher->wait_for_catchup($subname, 'replay');
+	}
+
+	# then, wait for all table states to be ready.
+	print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
+	my $query = qq[SELECT count(1) = 0
+	  FROM pg_subscription_rel
+	  WHERE srsubstate NOT IN ('r', 's');];
+	$self->poll_query_until($dbname, $query)
+	  or croak "timed out waiting for subscriber to synchronize data";

In the tests, I noticed that a few places did wait_for_catchup after the subscription check, and at other places we did that check before, as you have it here. Ideally, I think wait_for_catchup should be after confirming the initial sync is over, as without the initial sync the publisher node won't be completely in sync with the subscriber. What do you think?

3. In the code quoted in the previous point, why did you pass the second parameter as 'replay' when we have not used that in the tests otherwise?

--
With Regards,
Amit Kapila.
On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 26, 2022 at 7:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > In tap tests for logical replication, we have the following code in many places: > > > > $node_publisher->wait_for_catchup('tap_sub'); > > my $synced_query = > > "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT > > IN ('r', 's');"; > > $node_subscriber->poll_query_until('postgres', $synced_query) > > or die "Timed out while waiting for subscriber to synchronize data"; > > > > Also, we sometime forgot to check either one, like we fixed in commit > > 1f50918a6fb02207d151e7cb4aae4c36de9d827c. > > > > I think we can have a new function to wait for all subscriptions to > > synchronize data. The attached patch introduce a new function > > wait_for_subscription_sync(). With this function, we can replace the > > above code with this one function as follows: > > > > $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub'); > > > > +1. This reduces quite some code in various tests and will make it > easier to write future tests. > > Few comments/questions: > ==================== > 1. > -$node_publisher->wait_for_catchup('mysub1'); > - > -# Also wait for initial table sync to finish. > -my $synced_query = > - "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT > IN ('r', 's');"; > -$node_subscriber->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > - > # Also wait for initial table sync to finish. > -$node_subscriber->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_subscriber->wait_for_subscription_sync($node_publisher, 'mysub1'); > > It seems to me without your patch there is an extra poll in the above > test. If so, we can probably remove that in a separate patch? Agreed. > > 2. > + # wait for the replication to catchup if required. > + if (defined($publisher)) > + { > + croak 'subscription name must be specified' unless defined($subname); > + $publisher->wait_for_catchup($subname, 'replay'); > + } > + > + # then, wait for all table states to be ready. > + print "Waiting for all subscriptions in \"$name\" to synchronize data\n"; > + my $query = qq[SELECT count(1) = 0 > + FROM pg_subscription_rel > + WHERE srsubstate NOT IN ('r', 's');]; > + $self->poll_query_until($dbname, $query) > + or croak "timed out waiting for subscriber to synchronize data"; > > In the tests, I noticed that a few places did wait_for_catchup after > the subscription check, and at other places, we did that check before > as you have it here. Ideally, I think wait_for_catchup should be after > confirming the initial sync is over as without initial sync, the > publisher node won't be completely in sync with the subscriber. What do you mean by the last sentence? I thought the order doesn't matter here. Even if we do wait_for_catchup first then the subscription check, we can make sure that the apply worker caught up and table synchronization has been done, no? > > 3. In the code quoted in the previous point, why did you pass the > second parameter as 'replay' when we have not used that in the tests > otherwise? It makes sure to use the (current) default value of $mode of wait_for_catchup(). But probably it's not necessary, I've removed it. I've attached an updated patch as well as a patch to remove duplicated waits in 007_ddl.pl. 
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachments
On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached an updated patch as well as a patch to remove duplicated
> waits in 007_ddl.pl.
>

Thanks for your patch. Here are some comments.

1.
I think some comments need to be changed in the patch. For example:

# Also wait for initial table sync to finish
# Wait for initial sync to finish as well

Words like "Also" and "as well" can be removed now; we originally used them because we wait for catchup and "also" wait for initial sync.

2.
In the following places, we can remove the separate wait_for_catchup() call and let wait_for_subscription_sync() perform it instead.

2.1.
030_origin.pl:
@@ -128,8 +120,7 @@ $node_B->safe_psql(

 $node_C->wait_for_catchup($appname_B2);

-$node_B->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
+$node_B->wait_for_subscription_sync;

2.2.
031_column_list.pl:
@@ -385,7 +373,7 @@ $node_subscriber->safe_psql(
 	ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3
 ));

-wait_for_subscription_sync($node_subscriber);
+$node_subscriber->wait_for_subscription_sync;

 $node_publisher->wait_for_catchup('sub1');

2.3.
100_bugs.pl:
@@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres',
 $node_publisher->wait_for_catchup('tap_sub');

 # Also wait for initial table sync to finish
-$node_subscriber->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
+$node_subscriber->wait_for_subscription_sync;

 is( $node_subscriber->safe_psql(
 	'postgres', "SELECT * FROM tab_replidentity_index"),

Regards,
Shi yu
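To make point 2.1 concrete, here is a small before/after sketch for the 030_origin.pl hunk quoted above. It assumes $appname_B2 is also the subscription name on node B, as is the convention in these tests:

# Before (as in the current patch version): two separate waits.
$node_C->wait_for_catchup($appname_B2);
$node_B->wait_for_subscription_sync;

# After (the suggestion): let the helper do the catchup as well, by
# passing the publisher node and the subscription name.
$node_B->wait_for_subscription_sync($node_C, $appname_B2);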
On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 2. > > + # wait for the replication to catchup if required. > > + if (defined($publisher)) > > + { > > + croak 'subscription name must be specified' unless defined($subname); > > + $publisher->wait_for_catchup($subname, 'replay'); > > + } > > + > > + # then, wait for all table states to be ready. > > + print "Waiting for all subscriptions in \"$name\" to synchronize data\n"; > > + my $query = qq[SELECT count(1) = 0 > > + FROM pg_subscription_rel > > + WHERE srsubstate NOT IN ('r', 's');]; > > + $self->poll_query_until($dbname, $query) > > + or croak "timed out waiting for subscriber to synchronize data"; > > > > In the tests, I noticed that a few places did wait_for_catchup after > > the subscription check, and at other places, we did that check before > > as you have it here. Ideally, I think wait_for_catchup should be after > > confirming the initial sync is over as without initial sync, the > > publisher node won't be completely in sync with the subscriber. > > What do you mean by the last sentence? I thought the order doesn't > matter here. Even if we do wait_for_catchup first then the > subscription check, we can make sure that the apply worker caught up > and table synchronization has been done, no? > That's right. I thought we should first ensure the subscriber has finished operations if possible, like in this case, it can ensure table sync has finished and then we can ensure whether publisher and subscriber are in sync. That sounds more logical to me. -- With Regards, Amit Kapila.
On Wed, Jul 27, 2022 at 7:08 PM shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote: > > On Tue, Jul 26, 2022 3:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've attached an updated patch as well as a patch to remove duplicated > > waits in 007_ddl.pl. > > > > Thanks for your patch. Here are some comments. Thank you for the comments! > > 1. > I think some comments need to be changed in the patch. > For example: > # Also wait for initial table sync to finish > # Wait for initial sync to finish as well > > Words like "Also" and "as well" can be removed now, we originally used them > because we wait for catchup and "also" wait for initial sync. Agreed. > > 2. > In the following places, we can remove wait_for_catchup() and then call it in > wait_for_subscription_sync(). > > 2.1. > 030_origin.pl: > @@ -128,8 +120,7 @@ $node_B->safe_psql( > > $node_C->wait_for_catchup($appname_B2); > > -$node_B->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_B->wait_for_subscription_sync; > > 2.2. > 031_column_list.pl: > @@ -385,7 +373,7 @@ $node_subscriber->safe_psql( > ALTER SUBSCRIPTION sub1 SET PUBLICATION pub2, pub3 > )); > > -wait_for_subscription_sync($node_subscriber); > +$node_subscriber->wait_for_subscription_sync; > > $node_publisher->wait_for_catchup('sub1'); > > 2.3. > 100_bugs.pl: > @@ -281,8 +276,7 @@ $node_subscriber->safe_psql('postgres', > $node_publisher->wait_for_catchup('tap_sub'); > > # Also wait for initial table sync to finish > -$node_subscriber->poll_query_until('postgres', $synced_query) > - or die "Timed out while waiting for subscriber to synchronize data"; > +$node_subscriber->wait_for_subscription_sync; > > is( $node_subscriber->safe_psql( > 'postgres', "SELECT * FROM tab_replidentity_index"), Agreed. I've attached updated patches that incorporated the above comments as well as the comment from Amit. BTW regarding 0001 patch to remove the duplicated wait, should we backpatch to v15? I think we can do that as it's an obvious fix and it seems to be an oversight in 8f2e2bbf145. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachments
On Wed, Jul 27, 2022 at 8:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 26, 2022 at 1:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 2. > > > + # wait for the replication to catchup if required. > > > + if (defined($publisher)) > > > + { > > > + croak 'subscription name must be specified' unless defined($subname); > > > + $publisher->wait_for_catchup($subname, 'replay'); > > > + } > > > + > > > + # then, wait for all table states to be ready. > > > + print "Waiting for all subscriptions in \"$name\" to synchronize data\n"; > > > + my $query = qq[SELECT count(1) = 0 > > > + FROM pg_subscription_rel > > > + WHERE srsubstate NOT IN ('r', 's');]; > > > + $self->poll_query_until($dbname, $query) > > > + or croak "timed out waiting for subscriber to synchronize data"; > > > > > > In the tests, I noticed that a few places did wait_for_catchup after > > > the subscription check, and at other places, we did that check before > > > as you have it here. Ideally, I think wait_for_catchup should be after > > > confirming the initial sync is over as without initial sync, the > > > publisher node won't be completely in sync with the subscriber. > > > > What do you mean by the last sentence? I thought the order doesn't > > matter here. Even if we do wait_for_catchup first then the > > subscription check, we can make sure that the apply worker caught up > > and table synchronization has been done, no? > > > > That's right. I thought we should first ensure the subscriber has > finished operations if possible, like in this case, it can ensure > table sync has finished and then we can ensure whether publisher and > subscriber are in sync. That sounds more logical to me. Make sense. I've incorporated it in the v3 patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
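In other words, the agreed ordering swaps the two steps of the helper, roughly as in the sketch below (same variables as the snippet quoted above; only the ordering changes):

# Reordered sketch: confirm the subscriber side first, then the publisher.

# First, wait for all table states on the subscriber to be ready.
print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
my $query = qq[SELECT count(1) = 0
  FROM pg_subscription_rel
  WHERE srsubstate NOT IN ('r', 's');];
$self->poll_query_until($dbname, $query)
  or croak "timed out waiting for subscriber to synchronize data";

# Then, wait for the publisher's walsender to catch up, if requested.
if (defined($publisher))
{
    croak 'subscription name must be specified' unless defined($subname);
    $publisher->wait_for_catchup($subname);
}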
On Thu, Jul 28, 2022 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jul 27, 2022 at 7:08 PM shiy.fnst@fujitsu.com > <shiy.fnst@fujitsu.com> wrote: > > I've attached updated patches that incorporated the above comments as > well as the comment from Amit. > > BTW regarding 0001 patch to remove the duplicated wait, should we > backpatch to v15? > I think it is good to clean this test case even for PG15 even though there is no major harm in keeping it. I'll push this early next week by Tuesday unless someone thinks otherwise. -- With Regards, Amit Kapila.
On Sat, Jul 30, 2022 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 28, 2022 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jul 27, 2022 at 7:08 PM shiy.fnst@fujitsu.com > > <shiy.fnst@fujitsu.com> wrote: > > > > I've attached updated patches that incorporated the above comments as > > well as the comment from Amit. > > > > BTW regarding 0001 patch to remove the duplicated wait, should we > > backpatch to v15? > > > > I think it is good to clean this test case even for PG15 even though > there is no major harm in keeping it. I'll push this early next week > by Tuesday unless someone thinks otherwise. > Pushed this one and now I'll look at your other patch. -- With Regards, Amit Kapila.
On Wed, Aug 3, 2022 at 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jul 30, 2022 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 28, 2022 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Jul 27, 2022 at 7:08 PM shiy.fnst@fujitsu.com > > > <shiy.fnst@fujitsu.com> wrote: > > > > > > I've attached updated patches that incorporated the above comments as > > > well as the comment from Amit. > > > > > > BTW regarding 0001 patch to remove the duplicated wait, should we > > > backpatch to v15? > > > > > > > I think it is good to clean this test case even for PG15 even though > > there is no major harm in keeping it. I'll push this early next week > > by Tuesday unless someone thinks otherwise. > > > > Pushed this one and now I'll look at your other patch. Thanks! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Pushed this one and now I'll look at your other patch. > I have pushed the second patch as well after making minor changes in the comments. Alvaro [1] and Tom [2] suggest to back-patch this and they sound reasonable to me. Will you be able to produce back branch patches? [1] - https://www.postgresql.org/message-id/20220803104544.k2luy5hr2ugnhgr2%40alvherre.pgsql [2] - https://www.postgresql.org/message-id/2966703.1659535343%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Pushed this one and now I'll look at your other patch. > > > > I have pushed the second patch as well after making minor changes in > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > they sound reasonable to me. Will you be able to produce back branch > patches? Yes. I've attached patches for backbranches. The updates are straightforward on v11 - v15. However, on v10, we don't use wait_for_catchup() in some logical replication test cases. The commit bbd3363e128dae refactored the tests to use wait_for_catchup but it's not backpatched. So in the patch for v10, I didn't change the code that was changed by the commit. Also, since wait_for_catchup requires to specify $target_lsn, unlike the one in v11 or later, I changed wait_for_subscription_sync() accordingly. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachments
- REL13_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL15_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL14_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL11_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL12_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL10_v4-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
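For the REL10 branch specifically, the catchup step presumably has to compute and pass a target LSN itself, since wait_for_catchup() there has no default. A hypothetical sketch follows; the choice of the 'write' LSN and the use of the lsn() helper are assumptions about the PG10 PostgresNode API, not a quote from the attached patch:

# Hypothetical REL10 variant of the catchup step only.
if (defined($publisher))
{
    croak 'subscription name must be specified' unless defined($subname);

    # PG10's wait_for_catchup() takes an explicit mode and target LSN
    # (per the message above); which LSN to use is an assumption here.
    $publisher->wait_for_catchup($subname, 'replay',
        $publisher->lsn('write'));
}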
On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > Pushed this one and now I'll look at your other patch. > > > > > > > I have pushed the second patch as well after making minor changes in > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > > they sound reasonable to me. Will you be able to produce back branch > > patches? > > Yes. I've attached patches for backbranches. The updates are > straightforward on v11 - v15. However, on v10, we don't use > wait_for_catchup() in some logical replication test cases. The commit > bbd3363e128dae refactored the tests to use wait_for_catchup but it's > not backpatched. So in the patch for v10, I didn't change the code > that was changed by the commit. Also, since wait_for_catchup requires > to specify $target_lsn, unlike the one in v11 or later, I changed > wait_for_subscription_sync() accordingly. > Thanks for your patches. In the patches for pg11 ~ pg14, it looks we need to add a "=pod" before the current change in PostgresNode.pm. Right? Regards, Shi yu
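For anyone cross-checking the back-branch patches, the convention in PostgresNode.pm is that each method is preceded by its own POD block, so the missing piece would look roughly like this (the description wording is illustrative, not taken from the patch):

=pod

=item $node->wait_for_subscription_sync(publisher, subname, dbname)

Wait for all tables in all subscriptions on this node to reach the
'ready' or 'syncdone' state, and, if a publisher node is given, for
the publisher to catch up as well.

=cut

sub wait_for_subscription_sync
{
    # ... body as in the master-branch version of the patch ...
}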
Masahiko Sawada <sawada.mshk@gmail.com> writes: > Yes. I've attached patches for backbranches. FWIW, I'd recommend waiting till after next week's wrap before pushing these. While I'm definitely in favor of doing this, the odds of introducing a bug are nonzero, so right before a release deadline doesn't seem like a good time. regards, tom lane
On Thu, Aug 4, 2022 5:49 PM shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote: > > On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > > > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila > <amit.kapila16@gmail.com> > > wrote: > > > > > > > > Pushed this one and now I'll look at your other patch. > > > > > > > > > > I have pushed the second patch as well after making minor changes in > > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > > > they sound reasonable to me. Will you be able to produce back branch > > > patches? > > > > Yes. I've attached patches for backbranches. The updates are > > straightforward on v11 - v15. However, on v10, we don't use > > wait_for_catchup() in some logical replication test cases. The commit > > bbd3363e128dae refactored the tests to use wait_for_catchup but it's > > not backpatched. So in the patch for v10, I didn't change the code > > that was changed by the commit. Also, since wait_for_catchup requires > > to specify $target_lsn, unlike the one in v11 or later, I changed > > wait_for_subscription_sync() accordingly. > > > > Thanks for your patches. > > In the patches for pg11 ~ pg14, it looks we need to add a "=pod" before the > current change in PostgresNode.pm. Right? > By the way, I notice that in 002_types.pl (on master branch), it seems the "as well" in the following comment should be removed. Is it worth being fixed? $node_subscriber->safe_psql('postgres', "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)" ); # Wait for initial sync to finish as well $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub'); Regards, Shi yu
On Thu, Aug 4, 2022 at 7:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > Yes. I've attached patches for backbranches. > > FWIW, I'd recommend waiting till after next week's wrap before > pushing these. While I'm definitely in favor of doing this, > the odds of introducing a bug are nonzero, so right before a > release deadline doesn't seem like a good time. > Agreed. I was planning to do it only after next week's wrap. Thanks for your suggestion. -- With Regards, Amit Kapila.
On Fri, Aug 5, 2022 at 10:39 AM shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote: > > On Thu, Aug 4, 2022 5:49 PM shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote: > > > > On Thu, Aug 4, 2022 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > > > > > > On Thu, Aug 4, 2022 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > On Wed, Aug 3, 2022 at 10:21 AM Amit Kapila > > <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > > > Pushed this one and now I'll look at your other patch. > > > > > > > > > > > > > I have pushed the second patch as well after making minor changes in > > > > the comments. Alvaro [1] and Tom [2] suggest to back-patch this and > > > > they sound reasonable to me. Will you be able to produce back branch > > > > patches? > > > > > > Yes. I've attached patches for backbranches. The updates are > > > straightforward on v11 - v15. However, on v10, we don't use > > > wait_for_catchup() in some logical replication test cases. The commit > > > bbd3363e128dae refactored the tests to use wait_for_catchup but it's > > > not backpatched. So in the patch for v10, I didn't change the code > > > that was changed by the commit. Also, since wait_for_catchup requires > > > to specify $target_lsn, unlike the one in v11 or later, I changed > > > wait_for_subscription_sync() accordingly. > > > > > > > Thanks for your patches. > > > > In the patches for pg11 ~ pg14, it looks we need to add a "=pod" before the > > current change in PostgresNode.pm. Right? > > > > By the way, I notice that in 002_types.pl (on master branch), it seems the "as > well" in the following comment should be removed. Is it worth being fixed? > > $node_subscriber->safe_psql('postgres', > "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub WITH (slot_name = tap_sub_slot)" > ); > > # Wait for initial sync to finish as well > $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub'); > Thank you for the comments. I've attached updated version patches. Please review them. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachments
- REL14_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL12_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL11_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL13_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL15_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
- REL10_v5-0001-Add-wait_for_subscription_sync-for-TAP-tests.patch
On Wed, Aug 10, 2022 at 10:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Aug 5, 2022 at 10:39 AM shiy.fnst@fujitsu.com > <shiy.fnst@fujitsu.com> wrote: > > Thank you for the comments. I've attached updated version patches. > Please review them. > Pushed. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
> Pushed.

Recently a number of buildfarm animals have failed at the same
place in src/test/subscription/t/100_bugs.pl [1][2][3][4]:

# Failed test '2x3000 rows in t'
# at t/100_bugs.pl line 149.
#          got: '9000'
#     expected: '6000'
# Looks like you failed 1 test of 7.
[09:30:56] t/100_bugs.pl ......................

This was the last commit to touch that test script. I'm thinking
maybe it wasn't adjusted quite correctly? On the other hand, since
I can't find any similar failures before the last 48 hours, maybe
there is some other more-recent commit to blame. Anyway, something
is wrong there.

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2022-09-09%2012%3A03%3A46
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2022-09-09%2011%3A16%3A36
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-09-09%2010%3A33%3A19
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-09-08%2010%3A56%3A59
On Fri, Sep 9, 2022 at 11:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Pushed.
>
> Recently a number of buildfarm animals have failed at the same
> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]:
>
> # Failed test '2x3000 rows in t'
> # at t/100_bugs.pl line 149.
> #          got: '9000'
> #     expected: '6000'
> # Looks like you failed 1 test of 7.
> [09:30:56] t/100_bugs.pl ......................
>
> This was the last commit to touch that test script. I'm thinking
> maybe it wasn't adjusted quite correctly? On the other hand, since
> I can't find any similar failures before the last 48 hours, maybe
> there is some other more-recent commit to blame. Anyway, something
> is wrong there.

It seems that this commit is innocent, as it changed only how to wait. Rather, looking at the logs, the tablesync worker errored out at an interesting point:

2022-09-09 09:30:19.630 EDT [631b3feb.840:13] pg_16400_sync_16392_7141371862484106124 ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8
2022-09-09 09:30:19.630 EDT [631b3feb.840:14] pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0 (proto_version '3', origin 'any', publication_names '"testpub"')
ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8
2022-09-09 09:30:19.631 EDT [631b3feb.26e8:2] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8

It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef is relevant.

Regards,

--
Masahiko Sawada
Masahiko Sawada <sawada.mshk@gmail.com> writes: > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Recently a number of buildfarm animals have failed at the same >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: >> >> # Failed test '2x3000 rows in t' >> # at t/100_bugs.pl line 149. >> # got: '9000' >> # expected: '6000' >> # Looks like you failed 1 test of 7. >> [09:30:56] t/100_bugs.pl ...................... >> >> This was the last commit to touch that test script. I'm thinking >> maybe it wasn't adjusted quite correctly? On the other hand, since >> I can't find any similar failures before the last 48 hours, maybe >> there is some other more-recent commit to blame. Anyway, something >> is wrong there. > It seems that this commit is innocent as it changed only how to wait. Yeah. I was wondering if it caused us to fail to wait somewhere, but I concur that's not all that likely. > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > is relevant. Noting that the errors have only appeared in the past couple of days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f (Fix recovery_prefetch with low maintenance_io_concurrency). regards, tom lane
On Sat, Sep 10, 2022 at 6:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl ...................... > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Probably I found the cause of this failure[1]. The commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef didn't fix the problem properly. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj%3DsLUOn4Gd2GjUAKG-fw%40mail.gmail.com -- Masahiko Sawada
On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl ...................... > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Yeah, I also just spotted the coincidence of those failures while monitoring the build farm. I'll look into this later today. My initial suspicion is that there was pre-existing code here that was (incorrectly?) relying on the lack of error reporting in that case. But maybe I misunderstood and it was incorrect to report the error for some reason that was not robustly covered with tests.
On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > > is relevant. > > > > Noting that the errors have only appeared in the past couple of > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > > (Fix recovery_prefetch with low maintenance_io_concurrency). > > Yeah, I also just spotted the coincidence of those failures while > monitoring the build farm. I'll look into this later today. My > initial suspicion is that there was pre-existing code here that was > (incorrectly?) relying on the lack of error reporting in that case. > But maybe I misunderstood and it was incorrect to report the error for > some reason that was not robustly covered with tests. After I wrote that I saw Sawada-san's message and waited for more information, and I see there was now a commit. I noticed that peripatus was already logging the 'missing contrecord' error even when it didn't fail the test, and still does. I'm still looking into that (ie whether I need to take that new report_invalid_record() call out and replace it with errormsg_deferred = true so that XLogReadRecord() returns NULL with no error message in this case).
On Mon, Sep 12, 2022 at 10:42 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Masahiko Sawada <sawada.mshk@gmail.com> writes: > > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > > > is relevant. > > > > > > Noting that the errors have only appeared in the past couple of > > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > > > (Fix recovery_prefetch with low maintenance_io_concurrency). > > > > Yeah, I also just spotted the coincidence of those failures while > > monitoring the build farm. I'll look into this later today. My > > initial suspicion is that there was pre-existing code here that was > > (incorrectly?) relying on the lack of error reporting in that case. > > But maybe I misunderstood and it was incorrect to report the error for > > some reason that was not robustly covered with tests. > > After I wrote that I saw Sawada-san's message and waited for more > information, and I see there was now a commit. I noticed that > peripatus was already logging the 'missing contrecord' error even when > it didn't fail the test, and still does. I'm still looking into that > (ie whether I need to take that new report_invalid_record() call out > and replace it with errormsg_deferred = true so that XLogReadRecord() > returns NULL with no error message in this case). I will go ahead and remove this new error message added by adb46615. The message was correlated with the problem on peripatus fixed by 88f48831, but not the cause of it -- but it's also not terribly helpful and might be confusing. It might be reported: (1) in pg_waldump when you hit the end of the segment with a missing contrecord after the end, arguably rightfully, but then perhaps someone might complain that they expect an error from pg_waldump only on the final live segment at the end of the WAL, and (2) in a walsender that is asked to shut down while between reads of pages with a spanning contrecord, reported by logical_read_xlog_page() with a messageless error (presumably peripatus's case?), (3) in crash recovery with wal_recycle off (whereas normally you'd expect to see the page_addr message from a recycled file), maybe more legitimately than the above. The problem I needed to solve can be solved without the message just by setting that flag as mentioned above, so I'll do that to remove the new noise.