Re: subscriptioncheck failure
От | vignesh C |
---|---|
Тема | Re: subscriptioncheck failure |
Дата | |
Msg-id | CALDaNm02JV5bfOWWKLPJN3wCMvFRL-AdzyUEQ3CyQqDUfygneg@mail.gmail.com обсуждение исходный текст |
Ответ на | subscriptioncheck failure (Amit Kapila <amit.kapila16@gmail.com>) |
Ответы |
Re: subscriptioncheck failure
|
Список | pgsql-hackers |
On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I noticed $SUBJECT in curculio. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23 > > The failure test report: > t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 5 tests but ran 1. > Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36 > cusr 9.47 csys = 31.89 CPU) > Result: FAIL > Makefile:21: recipe for target 'check' failed > gmake: *** [check] Error 1 > > As far as I can understand the failure happens due to error: > 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR: > replication slot "tap_sub" is active for PID 7354 > 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT: > SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape') > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > OFFSET 1 LIMIT 1 > > As per my initial analysis, this happens because in the test after > creating a subscription, we are not waiting for the subscriber to > catch up. I think there is a narrow window where it is possible that > after creating a subscription, the apply worker starts replication > even after we have done Alter Subscription .. DISABLE. This is also > visible in buildfarm logs: > > 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT: > SELECT get_byte(data, 0) > FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL, > 'proto_version', '1', > 'publication_names', 'tap_pub', > 'messages', 'true') > ... > > 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 > (proto_version '2', publication_names '"tap_pub"') > > Notice that we have checked the messages before receiving the start > replication command from the subscriber. Now, if we would have waited > for the subscription to catch up then our check in pg_stat_replication > wouldn't have given a false positive. I have noticed that we do have > such wait in other subscription tests but somehow missed it here. I agree with your analysis, apart from this issue there is one other small problem. Currently after disabling subscription pg_stat_replication is used to verify if the walsender is exited. The steps of the walsender exit includes cleaning up of walsender and then releasing a replication slot. There will be random test failure if pg_logical_slot_peek_binary_changes is called in this narrow time window. I think it is better to fix it by checking the active column in pg_replication_slot instead of pg_stat_replication which is more reliable. Attached patch which has the fixes for the same. Thoughts? Regards, Vignesh
Вложения
В списке pgsql-hackers по дате отправления: