Обсуждение: subscriptioncheck failure

Поиск
Список
Период
Сортировка

subscriptioncheck failure

От
Amit Kapila
Дата:
I noticed $SUBJECT in curculio.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23

The failure test report:
t/020_messages.pl                (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 5 tests but ran 1.
Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
cusr  9.47 csys = 31.89 CPU)
Result: FAIL
Makefile:21: recipe for target 'check' failed
gmake: *** [check] Error 1

As far as I can understand the failure happens due to error:
2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
replication slot "tap_sub" is active for PID 7354
2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
OFFSET 1 LIMIT 1

As per my initial analysis, this happens because in the test after
creating a subscription, we are not waiting for the subscriber to
catch up. I think there is a narrow window where it is possible that
after creating a subscription, the apply worker starts replication
even after we have done Alter Subscription .. DISABLE. This is also
visible in buildfarm logs:

2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
SELECT get_byte(data, 0)
FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
'proto_version', '1',
'publication_names', 'tap_pub',
'messages', 'true')
...

2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', publication_names '"tap_pub"')

Notice that we have checked the messages before receiving the start
replication command from the subscriber. Now, if we would have waited
for the subscription to catch up then our check in pg_stat_replication
wouldn't have given a false positive. I have noticed that we do have
such wait in other subscription tests but somehow missed it here.

Thoughts?

-- 
With Regards,
Amit Kapila.



Re: subscriptioncheck failure

От
vignesh C
Дата:
On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I noticed $SUBJECT in curculio.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23
>
> The failure test report:
> t/020_messages.pl                (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 5 tests but ran 1.
> Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
> cusr  9.47 csys = 31.89 CPU)
> Result: FAIL
> Makefile:21: recipe for target 'check' failed
> gmake: *** [check] Error 1
>
> As far as I can understand the failure happens due to error:
> 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
> replication slot "tap_sub" is active for PID 7354
> 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> OFFSET 1 LIMIT 1
>
> As per my initial analysis, this happens because in the test after
> creating a subscription, we are not waiting for the subscriber to
> catch up. I think there is a narrow window where it is possible that
> after creating a subscription, the apply worker starts replication
> even after we have done Alter Subscription .. DISABLE. This is also
> visible in buildfarm logs:
>
> 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 0)
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> ...
>
> 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
> (proto_version '2', publication_names '"tap_pub"')
>
> Notice that we have checked the messages before receiving the start
> replication command from the subscriber. Now, if we would have waited
> for the subscription to catch up then our check in pg_stat_replication
> wouldn't have given a false positive. I have noticed that we do have
> such wait in other subscription tests but somehow missed it here.

I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh

Вложения

Re: subscriptioncheck failure

От
Michael Paquier
Дата:
On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> +$node_publisher->wait_for_catchup('tap_sub');
> +
>  # Ensure a transactional logical decoding message shows up on the slot
>  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
>
>  # wait for the replication connection to drop from the publisher
>  $node_publisher->poll_query_until('postgres',
> -    'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> +    "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);

There are three places in this test where a slot is disabled, followed
by a wait to make sure that the slot is gone.  Perhaps it would be
better to wrap that in a small-ish routine?
--
Michael

Вложения

Re: subscriptioncheck failure

От
vignesh C
Дата:
On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > +$node_publisher->wait_for_catchup('tap_sub');
> > +
> >  # Ensure a transactional logical decoding message shows up on the slot
> >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> >
> >  # wait for the replication connection to drop from the publisher
> >  $node_publisher->poll_query_until('postgres',
> > -     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > +     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
>
> There are three places in this test where a slot is disabled, followed
> by a wait to make sure that the slot is gone.  Perhaps it would be
> better to wrap that in a small-ish routine?

Yes that would be better, I will make the changes for this and post a patch.

Regards,
Vignesh



Re: subscriptioncheck failure

От
vignesh C
Дата:
On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > +$node_publisher->wait_for_catchup('tap_sub');
> > +
> >  # Ensure a transactional logical decoding message shows up on the slot
> >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> >
> >  # wait for the replication connection to drop from the publisher
> >  $node_publisher->poll_query_until('postgres',
> > -     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > +     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
>
> There are three places in this test where a slot is disabled, followed
> by a wait to make sure that the slot is gone.  Perhaps it would be
> better to wrap that in a small-ish routine?

Thanks for the comments, Please find the attached patch having the changes.

Regards,
Vignesh

Вложения

Re: subscriptioncheck failure

От
Michael Paquier
Дата:
On Thu, May 13, 2021 at 07:05:55PM +0530, vignesh C wrote:
> Thanks for the comments, Please find the attached patch having the changes.

Cool, thanks for the new version.  I have spent some time
understanding the initial report from Amit as well as what you are
proposing here, and refactoring the test so as the set of CREATE/ALTER
SUBSCRIPTION commands are added within this routine is a good idea.

I would have made the comment on top of setup_subscription a bit more
talkative regarding the fact that it may reuse an existing
subscription, but that's a nit.  Let's wait for Amit and see what he
thinks about what you are proposing.
--
Michael

Вложения

Re: subscriptioncheck failure

От
Amit Kapila
Дата:
On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, May 13, 2021 at 04:14:55PM +0530, vignesh C wrote:
> > > +$node_publisher->wait_for_catchup('tap_sub');
> > > +
> > >  # Ensure a transactional logical decoding message shows up on the slot
> > >  $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> > >
> > >  # wait for the replication connection to drop from the publisher
> > >  $node_publisher->poll_query_until('postgres',
> > > -     'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
> > > +     "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
> >
> > There are three places in this test where a slot is disabled, followed
> > by a wait to make sure that the slot is gone.  Perhaps it would be
> > better to wrap that in a small-ish routine?
>
> Thanks for the comments, Please find the attached patch having the changes.
>

Few comments:
1.
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");

After you have encapsulated this command in the function, the above
comment doesn't make sense because we do this for both transactional
and non-transactional messages. I suggest we can change it to
something like: "This is done to ensure a logical decoding message is
shown up on the slot".

2.
+# Setup the subscription before checking pg_logical_slot_peek_binary_changes
+sub setup_subscription

I think here the functionality is more for the catchup of
subscription, so it might be better to name the function as
subscription_catchup or catchup_subscription. I think you can expand
the comments atop this function a bit as suggested by Michael.

-- 
With Regards,
Amit Kapila.



Re: subscriptioncheck failure

От
Amit Kapila
Дата:
On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> Few comments:
> 1.
> + # Ensure a transactional logical decoding message shows up on the slot
> + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
>
> After you have encapsulated this command in the function, the above
> comment doesn't make sense because we do this for both transactional
> and non-transactional messages. I suggest we can change it to
> something like: "This is done to ensure a logical decoding message is
> shown up on the slot".
>
> 2.
> +# Setup the subscription before checking pg_logical_slot_peek_binary_changes
> +sub setup_subscription
>
> I think here the functionality is more for the catchup of
> subscription, so it might be better to name the function as
> subscription_catchup or catchup_subscription. I think you can expand
> the comments atop this function a bit as suggested by Michael.
>

One more point:
+ $node_publisher->wait_for_catchup('tap_sub');
+
+ # Ensure a transactional logical decoding message shows up on the slot
+ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+ # wait for the replication connection to drop from the publisher
+ $node_publisher->poll_query_until('postgres',
+ "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
slot_name = 'tap_sub' AND active='f'", 1);

In the above sequence, wait_for_catchup will query pg_stat_replication
whereas after disabling subscription we are checking
pg_replication_slots. I understand from your explanation why we can't
rely on pg_stat_replication after DISABLE but it might be better to
check that the slot is active before disabling it. I think currently,
the test assumes that, isn't it better to have an explicit check for
that?

-- 
With Regards,
Amit Kapila.



Re: subscriptioncheck failure

От
vignesh C
Дата:
On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, May 13, 2021 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > Few comments:
> > 1.
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> >
> > After you have encapsulated this command in the function, the above
> > comment doesn't make sense because we do this for both transactional
> > and non-transactional messages. I suggest we can change it to
> > something like: "This is done to ensure a logical decoding message is
> > shown up on the slot".
> >
> > 2.
> > +# Setup the subscription before checking pg_logical_slot_peek_binary_changes
> > +sub setup_subscription
> >
> > I think here the functionality is more for the catchup of
> > subscription, so it might be better to name the function as
> > subscription_catchup or catchup_subscription. I think you can expand
> > the comments atop this function a bit as suggested by Michael.
> >
>
> One more point:
> + $node_publisher->wait_for_catchup('tap_sub');
> +
> + # Ensure a transactional logical decoding message shows up on the slot
> + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> +
> + # wait for the replication connection to drop from the publisher
> + $node_publisher->poll_query_until('postgres',
> + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> slot_name = 'tap_sub' AND active='f'", 1);
>
> In the above sequence, wait_for_catchup will query pg_stat_replication
> whereas after disabling subscription we are checking
> pg_replication_slots. I understand from your explanation why we can't
> rely on pg_stat_replication after DISABLE but it might be better to
> check that the slot is active before disabling it. I think currently,
> the test assumes that, isn't it better to have an explicit check for
> that?

I felt this is not required, wait_for_catchup will poll_query_until
the state = 'streaming', even if START_REPLICATION takes time, state
will be in 'startup' state, this way poll_query_until will take care
of handling this.

On further analysis I found that we need to do "Alter subscription
tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
time, Instead we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at appropriate steps. This way the
common function can be removed and the enable/disable multiple times
can be removed.  If we are going ahead with this approach the above
comments provided are no more valid.

I have made the changes in similar lines in the attached patch. If you
are ok we can go ahead with the new approach which will simplify the
changes required.
Thoughts?

Regards,
Vignesh

Вложения

Re: subscriptioncheck failure

От
Amit Kapila
Дата:
On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One more point:
> > + $node_publisher->wait_for_catchup('tap_sub');
> > +
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> > +
> > + # wait for the replication connection to drop from the publisher
> > + $node_publisher->poll_query_until('postgres',
> > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> > slot_name = 'tap_sub' AND active='f'", 1);
> >
> > In the above sequence, wait_for_catchup will query pg_stat_replication
> > whereas after disabling subscription we are checking
> > pg_replication_slots. I understand from your explanation why we can't
> > rely on pg_stat_replication after DISABLE but it might be better to
> > check that the slot is active before disabling it. I think currently,
> > the test assumes that, isn't it better to have an explicit check for
> > that?
>
> I felt this is not required, wait_for_catchup will poll_query_until
> the state = 'streaming', even if START_REPLICATION takes time, state
> will be in 'startup' state, this way poll_query_until will take care
> of handling this.
>

makes sense, but let's add some comments to clarify the same.

> On further analysis I found that we need to do "Alter subscription
> tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
> time, Instead we can change pg_logical_slot_peek_binary_changes to
> pg_logical_slot_get_binary_changes at appropriate steps. This way the
> common function can be removed and the enable/disable multiple times
> can be removed.
>

I think that is a valid point. This was probably kept so that we can
peek multiple times for the same message to test various things but
that can be achieved with the way you have changed the test.

One more thing, shouldn't we make auto_vacuum=off for this test by
using 'append_conf' before starting the publisher. That will avoid the
risk of empty transactions.

-- 
With Regards,
Amit Kapila.



Re: subscriptioncheck failure

От
vignesh C
Дата:
On Tue, May 18, 2021 at 9:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 5:48 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, May 17, 2021 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > One more point:
> > > + $node_publisher->wait_for_catchup('tap_sub');
> > > +
> > > + # Ensure a transactional logical decoding message shows up on the slot
> > > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
> > > +
> > > + # wait for the replication connection to drop from the publisher
> > > + $node_publisher->poll_query_until('postgres',
> > > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> > > slot_name = 'tap_sub' AND active='f'", 1);
> > >
> > > In the above sequence, wait_for_catchup will query pg_stat_replication
> > > whereas after disabling subscription we are checking
> > > pg_replication_slots. I understand from your explanation why we can't
> > > rely on pg_stat_replication after DISABLE but it might be better to
> > > check that the slot is active before disabling it. I think currently,
> > > the test assumes that, isn't it better to have an explicit check for
> > > that?
> >
> > I felt this is not required, wait_for_catchup will poll_query_until
> > the state = 'streaming', even if START_REPLICATION takes time, state
> > will be in 'startup' state, this way poll_query_until will take care
> > of handling this.
> >
>
> makes sense, but let's add some comments to clarify the same.
>

Modified.

> > On further analysis I found that we need to do "Alter subscription
> > tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
> > time, Instead we can change pg_logical_slot_peek_binary_changes to
> > pg_logical_slot_get_binary_changes at appropriate steps. This way the
> > common function can be removed and the enable/disable multiple times
> > can be removed.
> >
>
> I think that is a valid point. This was probably kept so that we can
> peek multiple times for the same message to test various things but
> that can be achieved with the way you have changed the test.
>
> One more thing, shouldn't we make auto_vacuum=off for this test by
> using 'append_conf' before starting the publisher. That will avoid the
> risk of empty transactions.

I felt that makes sense, added it.

Thanks for the comments, the attached patch has the changes for the same.

Regards,
Vignesh

Вложения

Re: subscriptioncheck failure

От
Amit Kapila
Дата:
On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for the comments, the attached patch has the changes for the same.
>

Thanks, I have pushed your patch after making minor changes in the comments.

-- 
With Regards,
Amit Kapila.



Re: subscriptioncheck failure

От
vignesh C
Дата:
On Wed, May 19, 2021 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 11:25 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Thanks for the comments, the attached patch has the changes for the same.
> >
>
> Thanks, I have pushed your patch after making minor changes in the comments.

Thanks for pushing this patch.

Regards,
Vignesh