Обсуждение: Stabilizing the test_decoding checks, take N

Поиск

Список

Период

Сортировка

Stabilizing the test_decoding checks, take N

От

Tom Lane

Дата:

16 апреля 2022 г., 20:11:59

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out
/Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out    2022-04-15
07:59:17.000000000-0400 
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out    2022-04-15 09:06:36.000000000
-0400
@@ -77,10 +77,12 @@
  streaming change for transaction
  streaming change for transaction
  streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
  streaming change for transaction
  closing a streamed block for transaction
  committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one.  That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions.  But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day.  Since test_decoding already adds
some custom settings to postgresql.conf, this'll take just a one-line
addition to test_decoding/logical.conf.

I wonder whether we ought to then revert these tests' use of
skip-empty-xacts, or at least start having a mix of cases.
It seems to me that we'd rather know about it if there are unexpected
empty transactions.  Is there anything we're using that for other than
to hide the effects of autovacuum?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-04-15%2011%3A59%3A16

[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-12%2010%3A24%3A22

[3] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b779d7d8fdae088d70da5ed9fcd8205035676df3

Re: Stabilizing the test_decoding checks, take N

От

Andres Freund

Дата:

17 апреля 2022 г., 17:31:04

Hi,

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
> My pet dinosaur prairiedog just failed in the contrib/test_decoding
> tests [1]:
> 
> diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out
/Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out    2022-04-15
07:59:17.000000000-0400
 
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out    2022-04-15
09:06:36.000000000-0400
 
> @@ -77,10 +77,12 @@
>   streaming change for transaction
>   streaming change for transaction
>   streaming change for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
>   streaming change for transaction
>   closing a streamed block for transaction
>   committing streamed transaction
> -(13 rows)
> +(15 rows)
> 
> Looking at the postmaster log, it's obvious where this extra transaction
> came from: auto-analyze ran on pg_type concurrently with the test step
> just before this one.  That could only happen if the tests ran long enough
> for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> (And I hasten to point out that some other animals, such as those running
> valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
> 
> We've seen this sort of problem before [2], and attempted to fix it [3]
> by making these tests ignore empty transactions.  But of course
> auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes.  I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.


> I think the most expedient way to prevent this type of failure is to run
> the test_decoding tests with autovacuum_naptime cranked up so far as to
> make it a non-issue, like maybe a day.  Since test_decoding already adds
> some custom settings to postgresql.conf, this'll take just a one-line
> addition to test_decoding/logical.conf.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

Greetings,

Andres Freund

Re: Stabilizing the test_decoding checks, take N

От

Andres Freund

Дата:

17 апреля 2022 г., 17:32:36

Hi,

Adding Amit, I think this is stuff he worked on...

On 2022-04-17 07:31:04 -0700, Andres Freund wrote:
> On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
> > My pet dinosaur prairiedog just failed in the contrib/test_decoding
> > tests [1]:
> > 
> > diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out
/Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> > --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out    2022-04-15
07:59:17.000000000-0400
 
> > +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out    2022-04-15
09:06:36.000000000-0400
 
> > @@ -77,10 +77,12 @@
> >   streaming change for transaction
> >   streaming change for transaction
> >   streaming change for transaction
> > + closing a streamed block for transaction
> > + opening a streamed block for transaction
> >   streaming change for transaction
> >   closing a streamed block for transaction
> >   committing streamed transaction
> > -(13 rows)
> > +(15 rows)
> > 
> > Looking at the postmaster log, it's obvious where this extra transaction
> > came from: auto-analyze ran on pg_type concurrently with the test step
> > just before this one.  That could only happen if the tests ran long enough
> > for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> > (And I hasten to point out that some other animals, such as those running
> > valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
> > 
> > We've seen this sort of problem before [2], and attempted to fix it [3]
> > by making these tests ignore empty transactions.  But of course
> > auto-analyze's transaction wasn't empty, so that didn't help.
> 
> I don't quite understand this bit - the logic test_decoding uses to
> decide if a transaction is "empty" is just whether a tuple was
> output. And there shouldn't be any as part of auto-analyze, because we
> don't decode catalog changes.  I suspect there's something broken in the
> streaming logic (potentially just in test_decoding) around
> skip_empty_xacts.
> 
> 
> > I think the most expedient way to prevent this type of failure is to run
> > the test_decoding tests with autovacuum_naptime cranked up so far as to
> > make it a non-issue, like maybe a day.  Since test_decoding already adds
> > some custom settings to postgresql.conf, this'll take just a one-line
> > addition to test_decoding/logical.conf.
> 
> I'm a bit worried about this approach - we've IIRC had past bugs that
> came only to light because of autovacuum starting. I wonder if we rather
> should do the opposite and reduce naptime so it'll be seen on fast
> machines, rather than very slow ones.

Greetings,

Andres Freund

Re: Stabilizing the test_decoding checks, take N

От

Tom Lane

Дата:

17 апреля 2022 г., 19:01:53

Andres Freund <andres@anarazel.de> writes:
>> On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
>>> We've seen this sort of problem before [2], and attempted to fix it [3]
>>> by making these tests ignore empty transactions.  But of course
>>> auto-analyze's transaction wasn't empty, so that didn't help.

>> I don't quite understand this bit - the logic test_decoding uses to
>> decide if a transaction is "empty" is just whether a tuple was
>> output. And there shouldn't be any as part of auto-analyze, because we
>> don't decode catalog changes.  I suspect there's something broken in the
>> streaming logic (potentially just in test_decoding) around
>> skip_empty_xacts.

Hmm, I'll defer to somebody who knows that code better about whether
there's an actual bug.  However ...

>>> I think the most expedient way to prevent this type of failure is to run
>>> the test_decoding tests with autovacuum_naptime cranked up so far as to
>>> make it a non-issue, like maybe a day.

>> I'm a bit worried about this approach - we've IIRC had past bugs that
>> came only to light because of autovacuum starting. I wonder if we rather
>> should do the opposite and reduce naptime so it'll be seen on fast
>> machines, rather than very slow ones.

It seems likely to me that trying to make a test like this one blind to
autovacuum/autoanalyze activity will make it less useful, not more so.
Why is such blindness desirable?

            regards, tom lane

Re: Stabilizing the test_decoding checks, take N

От

Andres Freund

Дата:

18 апреля 2022 г., 00:52:03

Hi

On 2022-04-17 12:01:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> >> On 2022-04-16 13:11:59 -0400, Tom Lane wrote:
> >>> We've seen this sort of problem before [2], and attempted to fix it [3]
> >>> by making these tests ignore empty transactions.  But of course
> >>> auto-analyze's transaction wasn't empty, so that didn't help.
> 
> >> I don't quite understand this bit - the logic test_decoding uses to
> >> decide if a transaction is "empty" is just whether a tuple was
> >> output. And there shouldn't be any as part of auto-analyze, because we
> >> don't decode catalog changes.  I suspect there's something broken in the
> >> streaming logic (potentially just in test_decoding) around
> >> skip_empty_xacts.
> 
> Hmm, I'll defer to somebody who knows that code better about whether
> there's an actual bug.  However ...
> 
> >>> I think the most expedient way to prevent this type of failure is to run
> >>> the test_decoding tests with autovacuum_naptime cranked up so far as to
> >>> make it a non-issue, like maybe a day.
> 
> >> I'm a bit worried about this approach - we've IIRC had past bugs that
> >> came only to light because of autovacuum starting. I wonder if we rather
> >> should do the opposite and reduce naptime so it'll be seen on fast
> >> machines, rather than very slow ones.
> 
> It seems likely to me that trying to make a test like this one blind to
> autovacuum/autoanalyze activity will make it less useful, not more so.
> Why is such blindness desirable?

Maybe I misunderstood - I thought you were proposing to prevent
autovacuum by increasing naptime? Won't that precisely blind us to
autovacuum/analyze?  Hiding empty xacts happens "very late", so all the
decoding etc still happens.

Greetings,

Andres Freund

Re: Stabilizing the test_decoding checks, take N

От

Tom Lane

Дата:

18 апреля 2022 г., 00:55:43

Andres Freund <andres@anarazel.de> writes:
> On 2022-04-17 12:01:53 -0400, Tom Lane wrote:
>> It seems likely to me that trying to make a test like this one blind to
>> autovacuum/autoanalyze activity will make it less useful, not more so.
>> Why is such blindness desirable?

> Maybe I misunderstood - I thought you were proposing to prevent
> autovacuum by increasing naptime? Won't that precisely blind us to
> autovacuum/analyze?  Hiding empty xacts happens "very late", so all the
> decoding etc still happens.

My concern is basically that if we hack the code so it does not report
autovacuum activity, that might result in it also not reporting other
things that are more interesting.  So I think an external method of
suppressing test noise due to autovac is more advisable.

            regards, tom lane

Re: Stabilizing the test_decoding checks, take N

От

Amit Kapila

Дата:

18 апреля 2022 г., 05:55:14

On Sun, Apr 17, 2022 at 8:02 PM Andres Freund <andres@anarazel.de> wrote:
>
> Adding Amit, I think this is stuff he worked on...
>

I'll look into this and share my findings.

-- 
With Regards,
Amit Kapila.

Re: Stabilizing the test_decoding checks, take N

От

Amit Kapila

Дата:

18 апреля 2022 г., 08:49:14

On Sat, Apr 16, 2022 at 10:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> My pet dinosaur prairiedog just failed in the contrib/test_decoding
> tests [1]:
>
> diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out
/Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out 2022-04-15 07:59:17.000000000
-0400
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out  2022-04-15 09:06:36.000000000
-0400
> @@ -77,10 +77,12 @@
>   streaming change for transaction
>   streaming change for transaction
>   streaming change for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
>   streaming change for transaction
>   closing a streamed block for transaction
>   committing streamed transaction
> -(13 rows)
> +(15 rows)
>
> Looking at the postmaster log, it's obvious where this extra transaction
> came from: auto-analyze ran on pg_type concurrently with the test step
> just before this one.  That could only happen if the tests ran long enough
> for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> (And I hasten to point out that some other animals, such as those running
> valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
>
> We've seen this sort of problem before [2], and attempted to fix it [3]
> by making these tests ignore empty transactions.  But of course
> auto-analyze's transaction wasn't empty, so that didn't help.
>

The possible reason here is that this extra (auto-analyze) transaction
causes the logical decoding work mem to reach before the last change
of the test's transaction. As can be seen from the logs, it just
closed the stream before the last change and then opened a new stream
for the last change. Now, it is true that the auto-analyze changes
won't be decoded as they don't perform DML operation on any
non-catalog table but it could generate some invalidation message
which needs to be processed even though we won't send anything related
to it to the downstream.

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

-- 
With Regards,
Amit Kapila.

Re: Stabilizing the test_decoding checks, take N

От

Dilip Kumar

Дата:

18 апреля 2022 г., 12:59:37

On Mon, Apr 18, 2022 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Apr 16, 2022 at 10:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> My pet dinosaur prairiedog just failed in the contrib/test_decoding
> tests [1]:
>
> diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out 2022-04-15 07:59:17.000000000 -0400
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out 2022-04-15 09:06:36.000000000 -0400
> @@ -77,10 +77,12 @@
> streaming change for transaction
> streaming change for transaction
> streaming change for transaction
> + closing a streamed block for transaction
> + opening a streamed block for transaction
> streaming change for transaction
> closing a streamed block for transaction
> committing streamed transaction
> -(13 rows)
> +(15 rows)
>
> Looking at the postmaster log, it's obvious where this extra transaction
> came from: auto-analyze ran on pg_type concurrently with the test step
> just before this one. That could only happen if the tests ran long enough
> for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
> (And I hasten to point out that some other animals, such as those running
> valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)
>
> We've seen this sort of problem before [2], and attempted to fix it [3]
> by making these tests ignore empty transactions. But of course
> auto-analyze's transaction wasn't empty, so that didn't help.
>

The possible reason here is that this extra (auto-analyze) transaction
causes the logical decoding work mem to reach before the last change
of the test's transaction. As can be seen from the logs, it just
closed the stream before the last change and then opened a new stream
for the last change. Now, it is true that the auto-analyze changes
won't be decoded as they don't perform DML operation on any
non-catalog table but it could generate some invalidation message
which needs to be processed even though we won't send anything related
to it to the downstream.

This analysis seems right to me.

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

I will try to reproduce this, maybe by reducing the autovacuum_naptime or parallelly running some script that continuously performs DDL-only transactions.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Stabilizing the test_decoding checks, take N

От

Dilip Kumar

Дата:

19 апреля 2022 г., 09:08:14

On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

>
> This needs to be verified once by doing some manual testing as it may
> not be easily reproducible every time. If this happens to be true then
> I think your suggestion related to increasing autovacuum_naptime would
> work.
>
>
> I will try to reproduce this, maybe by reducing the autovacuum_naptime or parallelly running some script that
continuouslyperforms DDL-only transactions.
 

I have reproduced it [1] by repeatedly running the attached
script(stream.sql) from one session and parallely running the vacuum
analysis from the another session.

I have also changed the config for testing decoding to set the
autovacuum_naptime to 1d (patch attached)

[1]
Result without vacuum analyze:
                 data
------------------------------------------
 opening a streamed block for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 closing a streamed block for transaction
 committing streamed transaction
(13 rows)

Result with parallely running VACUUM ANALYZE

                   data
------------------------------------------
 opening a streamed block for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 streaming change for transaction
 closing a streamed block for transaction
 opening a streamed block for transaction
 streaming change for transaction
 closing a streamed block for transaction
 committing streamed transaction
(15 rows)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: Stabilizing the test_decoding checks, take N

От

Amit Kapila

Дата:

19 апреля 2022 г., 12:46:11

On Tue, Apr 19, 2022 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > This needs to be verified once by doing some manual testing as it may
> > not be easily reproducible every time. If this happens to be true then
> > I think your suggestion related to increasing autovacuum_naptime would
> > work.
> >
> >
> > I will try to reproduce this, maybe by reducing the autovacuum_naptime or parallelly running some script that
continuouslyperforms DDL-only transactions.
 
>
> I have reproduced it [1] by repeatedly running the attached
> script(stream.sql) from one session and parallely running the vacuum
> analysis from the another session.
>
> I have also changed the config for testing decoding to set the
> autovacuum_naptime to 1d (patch attached)
>

Thanks, I am also able to see similar results. This shows the analysis
was right. I will push the autovacuum_naptime change in HEAD and 14
(as both contains this test) tomorrow unless someone thinks otherwise.


-- 
With Regards,
Amit Kapila.

Re: Stabilizing the test_decoding checks, take N

От

Amit Kapila

Дата:

20 апреля 2022 г., 09:00:02

On Tue, Apr 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > >
> > > This needs to be verified once by doing some manual testing as it may
> > > not be easily reproducible every time. If this happens to be true then
> > > I think your suggestion related to increasing autovacuum_naptime would
> > > work.
> > >
> > >
> > > I will try to reproduce this, maybe by reducing the autovacuum_naptime or parallelly running some script that
continuouslyperforms DDL-only transactions.
 
> >
> > I have reproduced it [1] by repeatedly running the attached
> > script(stream.sql) from one session and parallely running the vacuum
> > analysis from the another session.
> >
> > I have also changed the config for testing decoding to set the
> > autovacuum_naptime to 1d (patch attached)
> >
>
> Thanks, I am also able to see similar results. This shows the analysis
> was right. I will push the autovacuum_naptime change in HEAD and 14
> (as both contains this test) tomorrow unless someone thinks otherwise.
>

Pushed.

-- 
With Regards,
Amit Kapila.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Stabilizing the test_decoding checks, take N

Вложения