Обсуждение: Re: Improve error reporting in 027_stream_regress test

Поиск
Список
Период
Сортировка

Re: Improve error reporting in 027_stream_regress test

От
Andres Freund
Дата:
Hi,

On 2025-06-30 16:01:04 +0300, Nazir Bilal Yavuz wrote:
> This patch aims to improve 027_stream_regress test's regression test
> error reporting per Andres' suggestion [1].

Thanks for working on that!


One thing I don't yet like is that I think we should report if the primary is
alive *before* reporting the diff and skipping reporting it if the primary
crashed. It's not interesting to report a long diff if the server crashed IMO.

Greetings,

Andres Freund



Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Tue, Jul 01, 2025 at 10:57:11AM +0300, Nazir Bilal Yavuz wrote:
> On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:
>> One thing I don't yet like is that I think we should report if the primary is
>> alive *before* reporting the diff and skipping reporting it if the primary
>> crashed. It's not interesting to report a long diff if the server crashed IMO.

Right.  That's just more noise one needs to dig into to find out
what's wrong.

> I agree with you. So, the current logic is:
>
> If primary is not alive: Do not report anything.

Okay.

> If only primary is alive: Report the entire diff file.

Hmm.  Is that actually useful as we know that the standby has been
stalen down when running the test?  Even if we report something, we
could always trim the output, like the standby case.

> If both primary and standby are alive: Report entire diff file and add
> head+tail of diff to the failure message.

+=item $node->is_alive()
+
+Check if the node is alive.
+
+=cut
+
+sub is_alive {
+    my ($self) = @_;
+
+    my $host = $self->host;
+    my $port = $self->port;
+    my $null = File::Spec->devnull;
+
+    my $cmd = "pg_isready -h $host -p $port";
+        return !system("$cmd >$null 2>&1");
+}

It's strange to re-add a dependency to system() while we have wrappers
around it, like system_log() in Utils.pm.

You may want to add a "local %ENV = $self->_get_env();"  to make sure
that we use a command related to the build of the node initialized.

+sub regression_log_helper
+{
+    my ($diff_file, $lines_to_dump) = @_;
+    my @lines;
+
+    open my $fh, '<', $diff_file or die "couldn't open file: $diff_file\n";
+
+    # Read all lines to process them below
+    while (my $line = <$fh>)
+    {
+        push @lines, $line;
+    }
+    close $fh;

We could use that as a routine of its own in Utils.pm?  It's not
really something only for diff files, more something that we can use
to trim the contents of a file, which could be the tail of a server
log file as well..  It's always difficult to measure what a "good"
number of lines is, but perhaps we could use an environment variable
to make that configurable rather than enforce a policy of 50 lines
because we think it's good enough?

The fact that we expect the primary and the standby to be alive once
the tests are run and the addition of two tests related to them seems
unrelated to the diff report improvements, so I'd suggest to do both
things separately.
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

Thanks for looking into this!

On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 01, 2025 at 10:57:11AM +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:
> >> One thing I don't yet like is that I think we should report if the primary is
> >> alive *before* reporting the diff and skipping reporting it if the primary
> >> crashed. It's not interesting to report a long diff if the server crashed IMO.
>
> Right.  That's just more noise one needs to dig into to find out
> what's wrong.
>
> > I agree with you. So, the current logic is:
> >
> > If primary is not alive: Do not report anything.
>
> Okay.
>
> > If only primary is alive: Report the entire diff file.
>
> Hmm.  Is that actually useful as we know that the standby has been
> stalen down when running the test?  Even if we report something, we
> could always trim the output, like the standby case.

I guess you are right. Also, I tried killing standby while the
027_stream_regress test was running and the test did not fail; instead
it continued until timeout. If that is always the case, then it makes
sense to report if and only if both primary and standby are alive.

> > If both primary and standby are alive: Report entire diff file and add
> > head+tail of diff to the failure message.
>
> +=item $node->is_alive()
> +
> +Check if the node is alive.
> +
> +=cut
> +
> +sub is_alive {
> +    my ($self) = @_;
> +
> +    my $host = $self->host;
> +    my $port = $self->port;
> +    my $null = File::Spec->devnull;
> +
> +    my $cmd = "pg_isready -h $host -p $port";
> +        return !system("$cmd >$null 2>&1");
> +}
>
> It's strange to re-add a dependency to system() while we have wrappers
> around it, like system_log() in Utils.pm.

system_log() logs which command is run. I actually do not want to log
the command because it needs to be run after the regression tests and
before the 'regression tests pass'. If I log the command it looks like
this:

# All 228 tests passed.
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
(21.939s) ok 2 - regression tests pass
(0.000s) ok 3 - primary is alive after the regression tests
(0.000s) ok 4 - standby is alive after the regression tests

I think it looks like these pg_isready commands run randomly.

> You may want to add a "local %ENV = $self->_get_env();"  to make sure
> that we use a command related to the build of the node initialized.

You are right, I will apply this.

> +sub regression_log_helper
> +{
> +       my ($diff_file, $lines_to_dump) = @_;
> +       my @lines;
> +
> +       open my $fh, '<', $diff_file or die "couldn't open file: $diff_file\n";
> +
> +       # Read all lines to process them below
> +       while (my $line = <$fh>)
> +       {
> +               push @lines, $line;
> +       }
> +       close $fh;
>
> We could use that as a routine of its own in Utils.pm?  It's not
> really something only for diff files, more something that we can use
> to trim the contents of a file, which could be the tail of a server
> log file as well..  It's always difficult to measure what a "good"
> number of lines is, but perhaps we could use an environment variable
> to make that configurable rather than enforce a policy of 50 lines
> because we think it's good enough?

I think this is a good idea. How about a function called file_trim()
with three arguments: the file name, a mode ('head' or 'tail'), and
the number of lines to trim from that end. This approach may require
reading the file more than once, but it is easy to use.

> The fact that we expect the primary and the standby to be alive once
> the tests are run and the addition of two tests related to them seems
> unrelated to the diff report improvements, so I'd suggest to do both
> things separately.

Correct, I will separate them.

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Wed, Jul 16, 2025 at 02:32:53PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:
>> Hmm.  Is that actually useful as we know that the standby has been
>> stalen down when running the test?  Even if we report something, we
>> could always trim the output, like the standby case.

My fingers and brain have flipped here.  I meant likely
s/stalen/taken/.

> I guess you are right. Also, I tried killing standby while the
> 027_stream_regress test was running and the test did not fail; instead
> it continued until timeout. If that is always the case, then it makes
> sense to report if and only if both primary and standby are alive.

Okay.  Understood.  I'm pretty sure that somebody around you has also
run the test and crashed the standby on replay, to note that the
pattern on HEAD was bad.

> system_log() logs which command is run. I actually do not want to log
> the command because it needs to be run after the regression tests and
> before the 'regression tests pass'. If I log the command it looks like
> this:
>
> # All 228 tests passed.
> # Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
> # Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
> (21.939s) ok 2 - regression tests pass
> (0.000s) ok 3 - primary is alive after the regression tests
> (0.000s) ok 4 - standby is alive after the regression tests
>
> I think it looks like these pg_isready commands run randomly.

FWIW, I'm usually in favor of a bit more logging, to understand what
happens in the test script under-the-hood.  That's less guessing when
calling these routines.  If I'm outvoted, that's fine.

>> We could use that as a routine of its own in Utils.pm?  It's not
>> really something only for diff files, more something that we can use
>> to trim the contents of a file, which could be the tail of a server
>> log file as well..  It's always difficult to measure what a "good"
>> number of lines is, but perhaps we could use an environment variable
>> to make that configurable rather than enforce a policy of 50 lines
>> because we think it's good enough?
>
> I think this is a good idea. How about a function called file_trim()
> with three arguments: the file name, a mode ('head' or 'tail'), and
> the number of lines to trim from that end. This approach may require
> reading the file more than once, but it is easy to use.

Sounds fine to me.  Thanks!
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Thu, 17 Jul 2025 at 03:08, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jul 16, 2025 at 02:32:53PM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:
> >> Hmm.  Is that actually useful as we know that the standby has been
> >> stalen down when running the test?  Even if we report something, we
> >> could always trim the output, like the standby case.
>
> My fingers and brain have flipped here.  I meant likely
> s/stalen/taken/.
>
> > I guess you are right. Also, I tried killing standby while the
> > 027_stream_regress test was running and the test did not fail; instead
> > it continued until timeout. If that is always the case, then it makes
> > sense to report if and only if both primary and standby are alive.
>
> Okay.  Understood.  I'm pretty sure that somebody around you has also
> run the test and crashed the standby on replay, to note that the
> pattern on HEAD was bad.

Updated, now error reporting happens only if regression tests are
failed and both primary and standby are alive.

> > system_log() logs which command is run. I actually do not want to log
> > the command because it needs to be run after the regression tests and
> > before the 'regression tests pass'. If I log the command it looks like
> > this:
> >
> > # All 228 tests passed.
> > # Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
> > # Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
> > (21.939s) ok 2 - regression tests pass
> > (0.000s) ok 3 - primary is alive after the regression tests
> > (0.000s) ok 4 - standby is alive after the regression tests
> >
> > I think it looks like these pg_isready commands run randomly.
>
> FWIW, I'm usually in favor of a bit more logging, to understand what
> happens in the test script under-the-hood.  That's less guessing when
> calling these routines.  If I'm outvoted, that's fine.

I don't have a strong opinion on this. My point was 'pg_isready' being
on top of 'regression test  pass' and not being on top of
'primary|standby is alive after the regression tests' gives a false
impression to me. However, if you say this is okay; then I can update
it.

> >> We could use that as a routine of its own in Utils.pm?  It's not
> >> really something only for diff files, more something that we can use
> >> to trim the contents of a file, which could be the tail of a server
> >> log file as well..  It's always difficult to measure what a "good"
> >> number of lines is, but perhaps we could use an environment variable
> >> to make that configurable rather than enforce a policy of 50 lines
> >> because we think it's good enough?
> >
> > I think this is a good idea. How about a function called file_trim()
> > with three arguments: the file name, a mode ('head' or 'tail'), and
> > the number of lines to trim from that end. This approach may require
> > reading the file more than once, but it is easy to use.
>
> Sounds fine to me.  Thanks!

I added that as 0001. I used a shifting method for the 'tail'
direction to not use too much memory. I found that there is
'File::ReadBackwards' in Perl but you need to install it, so I didn't
use it.

Now patch is divided into three:

0001: 'Add trim_file() helper to Utils.pm' -> Which effectively does
nothing, just adds a function to be used for a subsequent patch. This
function accepts 'line_count' as an argument but
'PG_TEST_FILE_TRIM_LINES' environment variable overrides it. Should I
document 'PG_TEST_FILE_TRIM_LINES' somewhere?

0002: 'Improve error reporting in 027_stream_regress test' -> Uses
trim_file() function to improve error reporting by including the head
and tail of regression.diffs directly in the failure message.

0003: 'Check primary and standby are alive after regression tests in
027_stream_regress' -> Add test for checking status of primary and
standby after the regression tests in 027_stream_regress. Also, it
makes error reporting happen only if regression tests are failed and
both primary and standby are alive.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Fri, Jul 18, 2025 at 11:57:07AM +0300, Nazir Bilal Yavuz wrote:
> I added that as 0001. I used a shifting method for the 'tail'
> direction to not use too much memory. I found that there is
> 'File::ReadBackwards' in Perl but you need to install it, so I didn't
> use it.

Some old buildfarm may complain when it comes to that, like some of
Tom's stuff (?).

> 0001: 'Add trim_file() helper to Utils.pm' -> Which effectively does
> nothing, just adds a function to be used for a subsequent patch. This
> function accepts 'line_count' as an argument but
> 'PG_TEST_FILE_TRIM_LINES' environment variable overrides it. Should I
> document 'PG_TEST_FILE_TRIM_LINES' somewhere?

Documentation is required in regress.sgml, yes.

> 0002: 'Improve error reporting in 027_stream_regress test' -> Uses
> trim_file() function to improve error reporting by including the head
> and tail of regression.diffs directly in the failure message.

The structure is strange, it seems to me that we should target things
so as we have only one PG_TEST_FILE_TRIM_LINES defined in the tree,
not two with one local to 027.

> 0003: 'Check primary and standby are alive after regression tests in
> 027_stream_regress' -> Add test for checking status of primary and
> standby after the regression tests in 027_stream_regress. Also, it
> makes error reporting happen only if regression tests are failed and
> both primary and standby are alive.

This one looks acceptable to me, so applied to begin with something,
splitting things into two pieces for clarity with some tweaks.  I have
changed things to use system_log() at the end, with fat-commas to link
the long options and their arguments.
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jul 18, 2025 at 11:57:07AM +0300, Nazir Bilal Yavuz wrote:
>> I added that as 0001. I used a shifting method for the 'tail'
>> direction to not use too much memory. I found that there is
>> 'File::ReadBackwards' in Perl but you need to install it, so I didn't
>> use it.

> Some old buildfarm may complain when it comes to that, like some of
> Tom's stuff (?).

We're trying to avoid requiring anything that's not in a "standard"
Perl installation, with the exception of course of IPC::Run which
there's no way to avoid using.  Now, the "standard" set is a bit
fuzzy, but certainly if a module is not present in some major
distro's minimum Perl package set then using it is a problem.

            regards, tom lane



Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Sat, 19 Jul 2025 at 09:06, Michael Paquier <michael@paquier.xyz> wrote:
>
> This one looks acceptable to me, so applied to begin with something,
> splitting things into two pieces for clarity with some tweaks.  I have
> changed things to use system_log() at the end, with fat-commas to link
> the long options and their arguments.

Thanks!

I realized that we actually don't trim the file, we do the opposite;
read the file from both ends. Sorry for not realizing earlier. I will
update the remaining patches according to that but I think trim_file()
is helpful, too. What do you think about adding both

```
trim_file() -> trims $n lines from head or tail of the file and
returns the remaining lines.
read_file_ends() -> returns $n lines from head or tail of the file.
```

although trim_file() will not be used in these particular patches?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Sat, 19 Jul 2025 at 09:06, Michael Paquier <michael@paquier.xyz> wrote:
>
> The structure is strange, it seems to me that we should target things
> so as we have only one PG_TEST_FILE_TRIM_LINES defined in the tree,
> not two with one local to 027.

I see your point. Then the problem is regression_log_helper() can't
check if $lines_to_dump * 2 is bigger than the number of lines in the
regression.diffs file. So, it can't decide whether or not to dump all
of the regression.diffs. Would it be okay if we repeat some lines in
both the head and tail lines?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:
> I realized that we actually don't trim the file, we do the opposite;
> read the file from both ends. Sorry for not realizing earlier. I will
> update the remaining patches according to that but I think trim_file()
> is helpful, too. What do you think about adding both

I did not review the contents of the patch yet, as I was rather unsure
about the right semantics here.

> ```
> trim_file() -> trims $n lines from head or tail of the file and
> returns the remaining lines.
> read_file_ends() -> returns $n lines from head or tail of the file.
> ```
>
> although trim_file() will not be used in these particular patches?

And this made me wonder over the weekend if only showing the head
and/or tail of a file is always the best set of properties.
Then I came up that  this could be something close to what git-grep
-A/-B is able to do.  For example, for a crash, it would be much
cheaper to target a log entry that matches with what we see in the
usual crash stacks, then print the surroundings.  The "trim" behavior
you have proposed is a subset of that, where the matching patterns are
the end and/or the beginning of the file to show.

So the API could com down to a pattern matching, with "head" and
"tail" being the optional number of lines we'd want to print around
the pattern we are looking for.
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Tue, 22 Jul 2025 at 03:56, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:
> > I realized that we actually don't trim the file, we do the opposite;
> > read the file from both ends. Sorry for not realizing earlier. I will
> > update the remaining patches according to that but I think trim_file()
> > is helpful, too. What do you think about adding both
>
> I did not review the contents of the patch yet, as I was rather unsure
> about the right semantics here.
>
> > ```
> > trim_file() -> trims $n lines from head or tail of the file and
> > returns the remaining lines.
> > read_file_ends() -> returns $n lines from head or tail of the file.
> > ```
> >
> > although trim_file() will not be used in these particular patches?
>
> And this made me wonder over the weekend if only showing the head
> and/or tail of a file is always the best set of properties.
> Then I came up that  this could be something close to what git-grep
> -A/-B is able to do.  For example, for a crash, it would be much
> cheaper to target a log entry that matches with what we see in the
> usual crash stacks, then print the surroundings.  The "trim" behavior
> you have proposed is a subset of that, where the matching patterns are
> the end and/or the beginning of the file to show.
>
> So the API could com down to a pattern matching, with "head" and
> "tail" being the optional number of lines we'd want to print around
> the pattern we are looking for.

That makes sense and could be really useful in more general cases but
I think that discussion might be better suited for a separate thread.

For this specific case, we are dealing with regression.diffs; which
already contains the relevant diff output. We don't need to search for
patterns here; we just want to include a small portion of the file in
the error message, to spare users from having to open the file
manually.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Thu, 24 Jul 2025 at 13:34, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 22 Jul 2025 at 03:56, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:
> > > I realized that we actually don't trim the file, we do the opposite;
> > > read the file from both ends. Sorry for not realizing earlier. I will
> > > update the remaining patches according to that but I think trim_file()
> > > is helpful, too. What do you think about adding both
> >
> > I did not review the contents of the patch yet, as I was rather unsure
> > about the right semantics here.
> >
> > > ```
> > > trim_file() -> trims $n lines from head or tail of the file and
> > > returns the remaining lines.
> > > read_file_ends() -> returns $n lines from head or tail of the file.
> > > ```
> > >
> > > although trim_file() will not be used in these particular patches?
> >
> > And this made me wonder over the weekend if only showing the head
> > and/or tail of a file is always the best set of properties.
> > Then I came up that  this could be something close to what git-grep
> > -A/-B is able to do.  For example, for a crash, it would be much
> > cheaper to target a log entry that matches with what we see in the
> > usual crash stacks, then print the surroundings.  The "trim" behavior
> > you have proposed is a subset of that, where the matching patterns are
> > the end and/or the beginning of the file to show.
> >
> > So the API could com down to a pattern matching, with "head" and
> > "tail" being the optional number of lines we'd want to print around
> > the pattern we are looking for.
>
> That makes sense and could be really useful in more general cases but
> I think that discussion might be better suited for a separate thread.
>
> For this specific case, we are dealing with regression.diffs; which
> already contains the relevant diff output. We don't need to search for
> patterns here; we just want to include a small portion of the file in
> the error message, to spare users from having to open the file
> manually.

I wanted to show what is in my mind, v4 is attached. Summary is:

- 0001 introduces the read_file_ends() function, which reads lines
from either the beginning or end of a given file. It includes a
force_line_count argument that, when set to true, ensures that the
specified number of lines is read from the file regardless of whether
the PG_TEST_FILE_READ_LINES environment variable is set.

- 0002 is the actual patch that improves error reporting in the
027_stream_regress test by using the read_file_ends() function. It
adds a regression_log_helper() function, which reads the
PG_TEST_FILE_READ_LINES environment variable and then calls
read_file_ends() with force_line_count set to true. This approach
avoids any potential race condition where the environment variable
might be modified after being read in the regression_log_helper() and
before used in the read_file_ends().

--
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Mon, Jul 28, 2025 at 03:34:06PM +0300, Nazir Bilal Yavuz wrote:
> I wanted to show what is in my mind, v4 is attached. Summary is:
>
> - 0001 introduces the read_file_ends() function, which reads lines
> from either the beginning or end of a given file. It includes a
> force_line_count argument that, when set to true, ensures that the
> specified number of lines is read from the file regardless of whether
> the PG_TEST_FILE_READ_LINES environment variable is set.

Hmm.  I am really wondering if 'head' and 'tail' should just be
shortcuts of some regexp patterns based on ^ and $, and that we could
allow the subroutine to use a regexp for the pattern matching.  I'll
look at that in details, but it may take me a couple of days before
getting down to it.

> - 0002 is the actual patch that improves error reporting in the
> 027_stream_regress test by using the read_file_ends() function. It
> adds a regression_log_helper() function, which reads the
> PG_TEST_FILE_READ_LINES environment variable and then calls
> read_file_ends() with force_line_count set to true. This approach
> avoids any potential race condition where the environment variable
> might be modified after being read in the regression_log_helper() and
> before used in the read_file_ends().

Why do you think the environment variable could be changed mid-flight
in the context of a single test run?
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Andres Freund
Дата:
Hi,

On 2025-07-29 10:46:16 +0900, Michael Paquier wrote:
> On Mon, Jul 28, 2025 at 03:34:06PM +0300, Nazir Bilal Yavuz wrote:
> > I wanted to show what is in my mind, v4 is attached. Summary is:
> > 
> > - 0001 introduces the read_file_ends() function, which reads lines
> > from either the beginning or end of a given file. It includes a
> > force_line_count argument that, when set to true, ensures that the
> > specified number of lines is read from the file regardless of whether
> > the PG_TEST_FILE_READ_LINES environment variable is set.
> 
> Hmm.  I am really wondering if 'head' and 'tail' should just be
> shortcuts of some regexp patterns based on ^ and $, and that we could
> allow the subroutine to use a regexp for the pattern matching.  I'll
> look at that in details, but it may take me a couple of days before
> getting down to it.

I don't really get what the point of designing that mechanism is before we
have a usecase. If we need it, we can expand it at that time.

Greetings,

Andres Freund



Re: Improve error reporting in 027_stream_regress test

От
Alexander Lakhin
Дата:
Hello Nazir and Michael!

01.07.2025 10:57, Nazir Bilal Yavuz wrote:
> I agree with you. So, the current logic is:
>
> If primary is not alive: Do not report anything.
> If only primary is alive: Report the entire diff file.
> If both primary and standby are alive: Report entire diff file and add
> head+tail of diff to the failure message.
>
> Done like above in v2.

The new check has failed on mamba [1], apparently because this animal is
too slow for pg_isready:

regress_log_027_stream_regress:
...
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) #   Failed test 'standby alive after regression test run'
#   at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) #          got: '0'
#     expected: '1'


027_stream_regress_standby_1.log:
2025-07-28 07:37:27.237 EDT [22920:1] LOG:  starting PostgreSQL 19devel on powerpc-unknown-netbsd10.1, compiled by cc 
(nb3 20231008) 10.5.0, 32-bit
2025-07-28 07:37:27.239 EDT [22920:2] LOG:  listening on Unix socket
"/home/buildfarm/bf-data/tmp/ep8AOH4m7l/.s.PGSQL.24781"
2025-07-28 07:37:27.281 EDT [25395:1] LOG:  database system was interrupted; last known up at 2025-07-28 07:36:48 EDT
2025-07-28 07:37:27.282 EDT [25395:2] LOG:  starting backup recovery with redo LSN 0/02000028, checkpoint LSN 
0/02000080, on timeline ID 1
2025-07-28 07:37:27.283 EDT [25395:3] LOG:  entering standby mode
2025-07-28 07:37:27.287 EDT [25395:4] LOG:  redo starts at 0/02000028
...
2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG:  connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG:  connection authenticated: user="buildfarm" method=trust 

(/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG:  connection authorized: user=buildfarm database=postgres 
application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG: could not send data to client: Broken pipe

### 3 seconds is the pg_isready's default timeout

2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL: connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.670 user=buildfarm

database=postgres host=[local]
...

What do you think of increasing the timeout (e.g. , to 10 seconds)?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-07-28%2007%3A46%3A26

Best regards,
Alexander



Re: Improve error reporting in 027_stream_regress test

От
Tom Lane
Дата:
Alexander Lakhin <exclusion@gmail.com> writes:
> The new check has failed on mamba [1], apparently because this animal is
> too slow for pg_isready:

There is something strange happening on mamba --- not sure what,
but its cycle time for the past week has been a lot more than normal.
I plan to power-cycle it tomorrow and see if that does anything.
In the meantime, I'd not put a lot of stock in that failure.

            regards, tom lane



Re: Improve error reporting in 027_stream_regress test

От
Michael Paquier
Дата:
On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
> > The new check has failed on mamba [1], apparently because this animal is
> > too slow for pg_isready:
>
> There is something strange happening on mamba --- not sure what,
> but its cycle time for the past week has been a lot more than normal.
> I plan to power-cycle it tomorrow and see if that does anything.
> In the meantime, I'd not put a lot of stock in that failure.

As far as I can see, based on the logs, the standby seems to be
lagging behind in terms of replay.  Anyway, a consistent state is
reached way before the pg_isready call is done (07:37:27 vs 08:01:50),
so pg_isready should report something as the standby is ready for
connections.  And it's true that 3s would be very short in smallish
environments.

We are getting PQPING_NO_RESPONSE meaning a lack of report activity
from the postmaster.  An increase in timeout may help, but the host
seems like it's facing a high workload so it's not really possible to
come with a perfect number, just an estimation.  How about adding a
--timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
attached?  At least that would be more in line with the other tests,
and we'd have more leverage over the timing of is_alive().  Default is
180s.
--
Michael

Вложения

Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

On Tue, 29 Jul 2025 at 04:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> > - 0002 is the actual patch that improves error reporting in the
> > 027_stream_regress test by using the read_file_ends() function. It
> > adds a regression_log_helper() function, which reads the
> > PG_TEST_FILE_READ_LINES environment variable and then calls
> > read_file_ends() with force_line_count set to true. This approach
> > avoids any potential race condition where the environment variable
> > might be modified after being read in the regression_log_helper() and
> > before used in the read_file_ends().
>
> Why do you think the environment variable could be changed mid-flight
> in the context of a single test run?

I mean it is a run-time variable, so it is up to the user. The timing
between these two operations are extremely tight, so it is very
unlikely but not impossible. Do you think this edge case is not worth
accounting for?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Nazir Bilal Yavuz
Дата:
Hi,

Thank you all for reporting the failure and sharing your inputs!

On Tue, 29 Jul 2025 at 09:07, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:
> > Alexander Lakhin <exclusion@gmail.com> writes:
> > > The new check has failed on mamba [1], apparently because this animal is
> > > too slow for pg_isready:
> >
> > There is something strange happening on mamba --- not sure what,
> > but its cycle time for the past week has been a lot more than normal.
> > I plan to power-cycle it tomorrow and see if that does anything.
> > In the meantime, I'd not put a lot of stock in that failure.
>
> As far as I can see, based on the logs, the standby seems to be
> lagging behind in terms of replay.  Anyway, a consistent state is
> reached way before the pg_isready call is done (07:37:27 vs 08:01:50),
> so pg_isready should report something as the standby is ready for
> connections.  And it's true that 3s would be very short in smallish
> environments.
>
> We are getting PQPING_NO_RESPONSE meaning a lack of report activity
> from the postmaster.  An increase in timeout may help, but the host
> seems like it's facing a high workload so it's not really possible to
> come with a perfect number, just an estimation.  How about adding a
> --timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
> attached?  At least that would be more in line with the other tests,
> and we'd have more leverage over the timing of is_alive().  Default is
> 180s.

I think it makes sense to add this timeout, as it would be more in
line with the other tests like you said.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Improve error reporting in 027_stream_regress test

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:
>> There is something strange happening on mamba --- not sure what,
>> but its cycle time for the past week has been a lot more than normal.

> We are getting PQPING_NO_RESPONSE meaning a lack of report activity
> from the postmaster.  An increase in timeout may help, but the host
> seems like it's facing a high workload so it's not really possible to
> come with a perfect number, just an estimation.  How about adding a
> --timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
> attached?

Yeah, looking closely at the failure logs [1], the standby postmaster
reported pg_isready's connection request in good order at about
08:01:47, but then nothing happened for a few seconds:

2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG:  connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG:  connection authenticated: user="buildfarm" method=trust
(/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG:  could not send data to client: Broken pipe
2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL:  connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG:  disconnection: session time: 0:00:03.670
user=buildfarmdatabase=postgres host=[local] 

The client side reported failure at 08:01:50 and a bit, matching
pg_isready's default 3-second timeout:

# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24780
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24780 - accepting connections
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) #   Failed test 'standby alive after regression test run'
#   at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) #          got: '0'
#     expected: '1'

So I agree that this looks a lot like pg_isready timed out.  It's not
too surprising if the machine was heavily loaded at that point, since
the standby would probably still be replaying the last actions of the
regression tests.

+1 for lengthening the timeout, as I see you already did.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-07-28%2007%3A46%3A26