Обсуждение: BF animal dikkop reported a failure in 035_standby_logical_decoding

Поиск
Список
Период
Сортировка

BF animal dikkop reported a failure in 035_standby_logical_decoding

От
"Yu Shi (Fujitsu)"
Дата:
Hi hackers,

I saw a buildfarm failure on "dikkop"[1]. It failed in
035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot and
row_removal_activeslot are not invalidated after vacuum.

regress_log_035_standby_logical_decoding:
```
[12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[12:15:05.945](0.003s) 
[12:15:05.946](0.000s) #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
#   at t/035_standby_logical_decoding.pl line 238.
[12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
[12:15:05.949](0.001s) 
[12:15:05.950](0.000s) #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at t/035_standby_logical_decoding.pl line 244.
[13:38:26.977](5001.028s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated
[13:38:26.982](0.002s) 
[13:38:26.982](0.000s) #   Failed test 'confl_active_logicalslot updated'
#   at t/035_standby_logical_decoding.pl line 251.
Timed out waiting confl_active_logicalslot to be updated at t/035_standby_logical_decoding.pl line 251.
```

035_standby_logical_decoding.pl:
```
# This should trigger the conflict
$node_primary->safe_psql(
    'testdb', qq[
  CREATE TABLE conflict_test(x integer, y text);
  DROP TABLE conflict_test;
  VACUUM pg_class;
  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
]);

$node_primary->wait_for_replay_catchup($node_standby);

# Check invalidation in the logfile and in pg_stat_database_conflicts
check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
```

Is it possible that the vacuum command didn't remove tuples and then the
conflict was not triggered? It seems we can't confirm this because there is not
enough information. Maybe "vacuum verbose" can be used to provide more
information.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2023-05-24%2006%3A16%3A18

Regards,
Shi Yu

Re: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
"Drouvot, Bertrand"
Дата:
Hi,

On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote:
> Hi hackers,
> 
> I saw a buildfarm failure on "dikkop"[1]. It failed in
> 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot and
> row_removal_activeslot are not invalidated after vacuum.

Thanks for sharing!

> 
> regress_log_035_standby_logical_decoding:
> ```
> [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
> [12:15:05.945](0.003s)
> [12:15:05.946](0.000s) #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
> #   at t/035_standby_logical_decoding.pl line 238.
> [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
> [12:15:05.949](0.001s)
> [12:15:05.950](0.000s) #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> #   at t/035_standby_logical_decoding.pl line 244.
> [13:38:26.977](5001.028s) # poll_query_until timed out executing this query:
> # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated
> [13:38:26.982](0.002s)
> [13:38:26.982](0.000s) #   Failed test 'confl_active_logicalslot updated'
> #   at t/035_standby_logical_decoding.pl line 251.
> Timed out waiting confl_active_logicalslot to be updated at t/035_standby_logical_decoding.pl line 251.
> ```
> 
> 035_standby_logical_decoding.pl:
> ```
> # This should trigger the conflict
> $node_primary->safe_psql(
>     'testdb', qq[
>    CREATE TABLE conflict_test(x integer, y text);
>    DROP TABLE conflict_test;
>    VACUUM pg_class;
>    INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
> ]);
> 
> $node_primary->wait_for_replay_catchup($node_standby);
> 
> # Check invalidation in the logfile and in pg_stat_database_conflicts
> check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
> ```
> 
> Is it possible that the vacuum command didn't remove tuples and then the
> conflict was not triggered? 

The flush_wal table added by Andres should guarantee that the WAL is flushed, so
the only reason I can think about is indeed that the vacuum did not remove tuples (
but I don't get why/how that could be the case).

> It seems we can't confirm this because there is not
> enough information. 

Right, and looking at its status history most of the time the test is green (making it
even more difficult to diagnose).

> Maybe "vacuum verbose" can be used to provide more
> information.

I can see that dikkop "belongs" to Tomas (adding Tomas to this thread).
Tomas, do you think it would be possible to run some 035_standby_logical_decoding.pl
manually with "vacuum verbose" in the test mentioned above? (or any other way you can think
about that would help diagnose this random failure?).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
"Yu Shi (Fujitsu)"
Дата:
On Monday, May 29, 2023 5:22 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
> 
> Hi,
> 
> On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote:
> > Hi hackers,
> >
> > I saw a buildfarm failure on "dikkop"[1]. It failed in
> > 035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot
> and
> > row_removal_activeslot are not invalidated after vacuum.
> 
> Thanks for sharing!
> 
> >
> > regress_log_035_standby_logical_decoding:
> > ```
> > [12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged with
> vacuum on pg_class
> > [12:15:05.945](0.003s)
> > [12:15:05.946](0.000s) #   Failed test 'inactiveslot slot invalidation is logged
> with vacuum on pg_class'
> > #   at t/035_standby_logical_decoding.pl line 238.
> > [12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with
> vacuum on pg_class
> > [12:15:05.949](0.001s)
> > [12:15:05.950](0.000s) #   Failed test 'activeslot slot invalidation is logged with
> vacuum on pg_class'
> > #   at t/035_standby_logical_decoding.pl line 244.
> > [13:38:26.977](5001.028s) # poll_query_until timed out executing this query:
> > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where
> datname = 'testdb'
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > [13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated
> > [13:38:26.982](0.002s)
> > [13:38:26.982](0.000s) #   Failed test 'confl_active_logicalslot updated'
> > #   at t/035_standby_logical_decoding.pl line 251.
> > Timed out waiting confl_active_logicalslot to be updated at
> t/035_standby_logical_decoding.pl line 251.
> > ```
> >
> > 035_standby_logical_decoding.pl:
> > ```
> > # This should trigger the conflict
> > $node_primary->safe_psql(
> >     'testdb', qq[
> >    CREATE TABLE conflict_test(x integer, y text);
> >    DROP TABLE conflict_test;
> >    VACUUM pg_class;
> >    INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
> > ]);
> >
> > $node_primary->wait_for_replay_catchup($node_standby);
> >
> > # Check invalidation in the logfile and in pg_stat_database_conflicts
> > check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
> > ```
> >
> > Is it possible that the vacuum command didn't remove tuples and then the
> > conflict was not triggered?
> 
> The flush_wal table added by Andres should guarantee that the WAL is flushed,
> so
> the only reason I can think about is indeed that the vacuum did not remove
> tuples (
> but I don't get why/how that could be the case).
> 
> > It seems we can't confirm this because there is not
> > enough information.
> 
> Right, and looking at its status history most of the time the test is green (making
> it
> even more difficult to diagnose).
> 
> > Maybe "vacuum verbose" can be used to provide more
> > information.
> 
> I can see that dikkop "belongs" to Tomas (adding Tomas to this thread).
> Tomas, do you think it would be possible to run some
> 035_standby_logical_decoding.pl
> manually with "vacuum verbose" in the test mentioned above? (or any other
> way you can think
> about that would help diagnose this random failure?).
> 

Thanks for your reply.

I saw another failure on "drongo" [1], which looks like a similar problem. 

Maybe a temporary patch can be committed to dump the result of "vacuum verbose".
And we can check this when the test fails.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-05-26%2018%3A05%3A57

Regards,
Shi Yu

Re: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
Tom Lane
Дата:
"Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> writes:
> On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote:
>> Is it possible that the vacuum command didn't remove tuples and then the
>> conflict was not triggered?

> The flush_wal table added by Andres should guarantee that the WAL is flushed, so
> the only reason I can think about is indeed that the vacuum did not remove tuples (
> but I don't get why/how that could be the case).

This test is broken on its face:

  CREATE TABLE conflict_test(x integer, y text);
  DROP TABLE conflict_test;
  VACUUM full pg_class;

There will be something VACUUM can remove only if there were no other
transactions holding back global xmin --- and there's not even a delay
here to give any such transaction a chance to finish.

Background autovacuum is the most likely suspect for breaking that,
but I wouldn't be surprised if something in the logical replication
mechanism itself could be running a transaction at the wrong instant.

Some of the other recovery tests set
autovacuum = off
to try to control such problems, but I'm not sure how much of
a solution that really is.

            regards, tom lane



Re: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
"Drouvot, Bertrand"
Дата:
Hi,

On 5/29/23 1:03 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> writes:
>> On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote:
>>> Is it possible that the vacuum command didn't remove tuples and then the
>>> conflict was not triggered?
> 
>> The flush_wal table added by Andres should guarantee that the WAL is flushed, so
>> the only reason I can think about is indeed that the vacuum did not remove tuples (
>> but I don't get why/how that could be the case).
> 
> This test is broken on its face:
> 
>    CREATE TABLE conflict_test(x integer, y text);
>    DROP TABLE conflict_test;
>    VACUUM full pg_class;
> 
> There will be something VACUUM can remove only if there were no other
> transactions holding back global xmin --- and there's not even a delay
> here to give any such transaction a chance to finish.
> 
> Background autovacuum is the most likely suspect for breaking that,

Oh right, I did not think autovacuum could start during this test, but yeah there
is no reasons it could not.

> but I wouldn't be surprised if something in the logical replication
> mechanism itself could be running a transaction at the wrong instant.
> 
> Some of the other recovery tests set
> autovacuum = off
> to try to control such problems, but I'm not sure how much of
> a solution that really is.

One option I can think of is to:

1) set autovacuum = off (as it looks like the usual suspect).
2) trigger the vacuum in verbose mode (as suggested by Shi-san) and
depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output
and so on.. (n times max). If n is reached, then skip this test.

As this test is currently failing randomly (and it looks like there is more success that failures, even without
autovacuum = off), then the test should still validate that the invalidation works as expected for the large
majority of runs (and skipping the test should be pretty rare then).

Would that make sense?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
Andres Freund
Дата:
Hi,

On 2023-05-29 14:31:24 +0200, Drouvot, Bertrand wrote:
> On 5/29/23 1:03 PM, Tom Lane wrote:
> > but I wouldn't be surprised if something in the logical replication
> > mechanism itself could be running a transaction at the wrong instant.
> > 
> > Some of the other recovery tests set
> > autovacuum = off
> > to try to control such problems, but I'm not sure how much of
> > a solution that really is.
> 
> One option I can think of is to:
> 
> 1) set autovacuum = off (as it looks like the usual suspect).
> 2) trigger the vacuum in verbose mode (as suggested by Shi-san) and
> depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output
> and so on.. (n times max). If n is reached, then skip this test.

I think the best fix would be to wait for a new snapshot that has a newer
horizon, before doing the vacuum full.

Greetings,

Andres Freund



Re: BF animal dikkop reported a failure in 035_standby_logical_decoding

От
"Drouvot, Bertrand"
Дата:
Hi,

On 5/30/23 5:24 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-05-29 14:31:24 +0200, Drouvot, Bertrand wrote:
>> On 5/29/23 1:03 PM, Tom Lane wrote:
>>> but I wouldn't be surprised if something in the logical replication
>>> mechanism itself could be running a transaction at the wrong instant.
>>>
>>> Some of the other recovery tests set
>>> autovacuum = off
>>> to try to control such problems, but I'm not sure how much of
>>> a solution that really is.
>>
>> One option I can think of is to:
>>
>> 1) set autovacuum = off (as it looks like the usual suspect).
>> 2) trigger the vacuum in verbose mode (as suggested by Shi-san) and
>> depending of its output run the "invalidation" test or: re-launch the vacuum, re-check the output
>> and so on.. (n times max). If n is reached, then skip this test.
> 
> I think the best fix would be to wait for a new snapshot that has a newer
> horizon, before doing the vacuum full.
> 

Thanks for the proposal! I think that's a great idea, I'll look at it
and update the patch I've submitted in [1] accordingly.


[1]: https://www.postgresql.org/message-id/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com