Обсуждение: [HACKERS] Testlib.pm vs msys
It turns out I was wrong about the problem jacana has been having with the pg_ctl tests hanging. The problem was not the use of select as a timeout mechanism, although I think the change to using Time::Hires::usleep() is correct and shouldn't be reverted. The problem is command_like's use of redirection to strings. Why this should be a problem for this particular use is a matter of speculation. I suspect it's to do with the fact that in this instance pg_ctl is leaving behind some child processes (i.e. postmaster and children) after it exits, and so on this particular path IPC::Run isn't detecting the exit properly. The workaround I have found to work is to redirect command_like's output instead to a couple of files and then slurp in those files and delete them. A bit hacky, I know, so I'm open to other suggestions. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> It turns out I was wrong about the problem jacana has been having with
> the pg_ctl tests hanging. The problem was not the use of select as a
> timeout mechanism, although I think the change to using
> Time::Hires::usleep() is correct and shouldn't be reverted.
> The problem is command_like's use of redirection to strings. Why this
> should be a problem for this particular use is a matter of speculation.
> I suspect it's to do with the fact that in this instance pg_ctl is
> leaving behind some child processes (i.e. postmaster and children) after
> it exits, and so on this particular path IPC::Run isn't detecting the
> exit properly. The workaround I have found to work is to redirect
> command_like's output instead to a couple of files and then slurp in
> those files and delete them. A bit hacky, I know, so I'm open to other
> suggestions.
Yeah, I'd been eyeing that behavior of IPC::Run a month or so back,
though from the opposite direction. If you are reading either stdout
or stderr of the executed command into Perl, then it detects command
completion by waiting till it gets EOF on those stream(s). If you
are reading neither, then it goes into this wonky backoff behavior
where it sleeps a bit and then checks waitpid(WNOHANG), with the
value of "a bit" continually increasing until it reaches a fairly
large value, half a second or a second (I forget). So you have
potentially some sizable fraction of a second that's just wasted after
command termination. I'd been able to make a small but noticeable
improvement in the runtime of some of our TAP test suites by forcing
the first behavior, ie reading stdout even if we were going to throw
it away.
So I'm not really that excited about going in the other direction ;-).
It shouldn't matter much time-wise for short-lived commands, but it's
disturbing if the EOF technique fails entirely for some cases.
I looked at jacana's two recent pg_ctlCheck failures, and they both
seem to have failed on this:
command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",'-l', "$TestLib::log_path/001_start_stop_server.log"
],qr/done.*serverstarted/s, 'pg_ctl start');
That is redirecting the postmaster's stdout/stderr into a file,
for sure, so the child processes shouldn't impact EOF detection AFAICS.
It's also hard to explain this way why it only fails some of the time.
I think we need to look at what the recent changes were in this area
and try to form a better theory of why it's started to fail here.
regards, tom lane
I wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> The problem is command_like's use of redirection to strings. Why this
>> should be a problem for this particular use is a matter of speculation.
> I looked at jacana's two recent pg_ctlCheck failures, and they both
> seem to have failed on this:
> command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",
> '-l', "$TestLib::log_path/001_start_stop_server.log" ],
> qr/done.*server started/s, 'pg_ctl start');
> That is redirecting the postmaster's stdout/stderr into a file,
> for sure, so the child processes shouldn't impact EOF detection AFAICS.
> It's also hard to explain this way why it only fails some of the time.
I reflected a bit more on this issue, and realized that there's a pretty
obvious mechanism for this to happen. While on Unix, we have pg_ctl
fork-and-exec the postmaster so that no extra processes are laying about,
that's not the case on Windows. The Windows code in pg_ctl.c creates a
subprocess that runs CMD.EXE, which in turn runs the postmaster as a
subprocess. The CMD.EXE process doesn't disappear until the postmaster
exits. Now, we tell CMD to redirect the postmaster's stdout and stderr
into files, but there's no way to tell CMD to redirect its own handles.
So if the CMD process inherits pg_ctl's stdout and stderr, then the prove
process would not see EOF on those pipes after pg_ctl exits.
Now, this theory still has a Mack-truck-sized hole in it, which is
if that's the failure mechanism then why isn't it 100% effective?
Seems like the TAP test should fail every time, if this is a full
description. But maybe it's part of the answer, so it seems worth
experimenting in this direction.
A bit of googling Microsoft's documentation suggests that maybe all
we have to do is pass CreateProcessAsUser's bInheritHandles parameter
as FALSE not TRUE in pg_ctl.c. It's not apparent to me that there
are any handles we do need the CMD process to inherit.
regards, tom lane
On 07/24/2017 09:33 PM, Tom Lane wrote: > > [good theory about why pg_ctl hangs in TAP test] > > Now, this theory still has a Mack-truck-sized hole in it, which is > if that's the failure mechanism then why isn't it 100% effective? > Seems like the TAP test should fail every time, if this is a full > description. But maybe it's part of the answer, so it seems worth > experimenting in this direction. The test that hangs is the only case where we call pg_ctl via command_like. If we use other mechanisms such as command_ok that don't try to read the output there is no problem. Another data point is that this test doesn't hang on bowerbird, which is an animal on the same machine but doing an MSVC build. Thus my thesis that it's probably to do with the imteraction with the MSys perl and shell. A simple workaround might be to provide two flavors of command_like, one that uses files it then slurps in and one that uses direct scalars and EOF detection. cheers andrew > > A bit of googling Microsoft's documentation suggests that maybe all > we have to do is pass CreateProcessAsUser's bInheritHandles parameter > as FALSE not TRUE in pg_ctl.c. It's not apparent to me that there > are any handles we do need the CMD process to inherit. > > Maybe. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/24/2017 09:33 PM, Tom Lane wrote:
>> Seems like the TAP test should fail every time, if this is a full
>> description. But maybe it's part of the answer, so it seems worth
>> experimenting in this direction.
> The test that hangs is the only case where we call pg_ctl via
> command_like. If we use other mechanisms such as command_ok that don't
> try to read the output there is no problem.
What I'm on about is that jacana still succeeds entirely, more than half
the time. If this is a handle-duplication problem, how could it ever
succeed?
regards, tom lane
On 07/25/2017 11:11 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/24/2017 09:33 PM, Tom Lane wrote: >>> Seems like the TAP test should fail every time, if this is a full >>> description. But maybe it's part of the answer, so it seems worth >>> experimenting in this direction. >> The test that hangs is the only case where we call pg_ctl via >> command_like. If we use other mechanisms such as command_ok that don't >> try to read the output there is no problem. > What I'm on about is that jacana still succeeds entirely, more than half > the time. If this is a handle-duplication problem, how could it ever > succeed? > No, it hangs 100% of the time. The only time you see a result at all is if I manually intervene. The pg_ctl test is thus currently disabled on jacana. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/25/2017 11:11 AM, Tom Lane wrote:
>> What I'm on about is that jacana still succeeds entirely, more than half
>> the time. If this is a handle-duplication problem, how could it ever
>> succeed?
> No, it hangs 100% of the time. The only time you see a result at all is
> if I manually intervene. The pg_ctl test is thus currently disabled on
> jacana.
Oh. So when was the last time it worked? And why do the buildfarm
archives contain apparently-successful log_stage entries for pg_ctl-check
on jacana, up through yesterday when I looked?
regards, tom lane
On 07/25/2017 11:25 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/25/2017 11:11 AM, Tom Lane wrote: >>> What I'm on about is that jacana still succeeds entirely, more than half >>> the time. If this is a handle-duplication problem, how could it ever >>> succeed? >> No, it hangs 100% of the time. The only time you see a result at all is >> if I manually intervene. The pg_ctl test is thus currently disabled on >> jacana. > Oh. So when was the last time it worked? And why do the buildfarm > archives contain apparently-successful log_stage entries for pg_ctl-check > on jacana, up through yesterday when I looked? > > There was a buildfarm bug, since corrected, which was not properly picking up errors in a corner case. You will see the errors if you look at all the logs for pg_ctl-check from 12 days back if they exist. The last known actual good run of this test was 33 days ago, i.e. 2017-06-22 cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/25/2017 11:25 AM, Tom Lane wrote:
>> Oh. So when was the last time it worked? And why do the buildfarm
>> archives contain apparently-successful log_stage entries for pg_ctl-check
>> on jacana, up through yesterday when I looked?
> There was a buildfarm bug, since corrected, which was not properly
> picking up errors in a corner case. You will see the errors if you look
> at all the logs for pg_ctl-check from 12 days back if they exist. The
> last known actual good run of this test was 33 days ago, i.e. 2017-06-22
Hm. This failure presumably started with commit f13ea95f9, which
introduced use of command_like() in the pg_ctl TAP test; but that
didn't go in until 2017-06-28. Was jacana not running in the
interim?
Anyway, if we believe that it broke with f13ea95f9, hen it's plausible
that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we
just had not noticed before. Could you check what happens if we
change the bInheritHandles parameter as I suggested upthread?
regards, tom lane
On 07/25/2017 01:41 PM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/25/2017 11:25 AM, Tom Lane wrote: >>> Oh. So when was the last time it worked? And why do the buildfarm >>> archives contain apparently-successful log_stage entries for pg_ctl-check >>> on jacana, up through yesterday when I looked? >> There was a buildfarm bug, since corrected, which was not properly >> picking up errors in a corner case. You will see the errors if you look >> at all the logs for pg_ctl-check from 12 days back if they exist. The >> last known actual good run of this test was 33 days ago, i.e. 2017-06-22 > Hm. This failure presumably started with commit f13ea95f9, which > introduced use of command_like() in the pg_ctl TAP test; but that > didn't go in until 2017-06-28. Was jacana not running in the > interim? Looks like not, possibly because one of the other machines on this box was hung (probably lorikeet). > > Anyway, if we believe that it broke with f13ea95f9, hen it's plausible > that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we > just had not noticed before. Could you check what happens if we > change the bInheritHandles parameter as I suggested upthread? > > I'll try when I get all the animals caught up. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/25/2017 02:45 PM, Andrew Dunstan wrote: >> Anyway, if we believe that it broke with f13ea95f9, hen it's plausible >> that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we >> just had not noticed before. Could you check what happens if we >> change the bInheritHandles parameter as I suggested upthread? >> >> > > I'll try when I get all the animals caught up. > This made no difference. And I'm not really surprised, since the test is not hanging when run from an MSVC build. The latter fact makes me theorize that the problem arises from the fairly ancient perl that Msys provides, and which we need to use to run prove so the TAP tests understand the environment's virtualized file paths. The attached patch should get around the problem without upsetting the good work you've been doing in reducing TAP test times generally. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> This made no difference. And I'm not really surprised, since the test is
> not hanging when run from an MSVC build.
Oh well.
> The latter fact makes me
> theorize that the problem arises from the fairly ancient perl that Msys
> provides, and which we need to use to run prove so the TAP tests
> understand the environment's virtualized file paths.
It might be interesting to look into its copy of IPC::Run and see if
the wait technology is different from later versions.
> The attached patch should get around the problem without upsetting the
> good work you've been doing in reducing TAP test times generally.
I still find this pretty ugly :-(.
regards, tom lane
On 07/26/2017 09:33 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > >> The latter fact makes me >> theorize that the problem arises from the fairly ancient perl that Msys >> provides, and which we need to use to run prove so the TAP tests >> understand the environment's virtualized file paths. > It might be interesting to look into its copy of IPC::Run and see if > the wait technology is different from later versions. It's using 0.94 just like my Linux box. > >> The attached patch should get around the problem without upsetting the >> good work you've been doing in reducing TAP test times generally. > I still find this pretty ugly :-(. > > I'm open to alternative suggestions. But I don't want to spend a heck of a lot of time on this. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/26/2017 09:33 AM, Tom Lane wrote:
>> It might be interesting to look into its copy of IPC::Run and see if
>> the wait technology is different from later versions.
> It's using 0.94 just like my Linux box.
Oh well, I hoped maybe we could update that.
>> I still find this pretty ugly :-(.
> I'm open to alternative suggestions. But I don't want to spend a heck of
> a lot of time on this.
Well, let's at least do the temp files a little more safely.
Maybe like this?
regards, tom lane
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 9c3551f..3acc80e 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -32,9 +32,17 @@ else
print $conf "listen_addresses = '127.0.0.1'\n";
}
close $conf;
-command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",
- '-l', "$TestLib::log_path/001_start_stop_server.log" ],
- qr/done.*server started/s, 'pg_ctl start');
+my $ctlcmd = [ 'pg_ctl', 'start', '-D', "$tempdir/data",
+ '-l', "$TestLib::log_path/001_start_stop_server.log" ];
+if ($Config{osname} ne 'msys')
+{
+ command_like($ctlcmd, qr/done.*server started/s, 'pg_ctl start');
+}
+else
+{
+ # use the version of command_like that doesn't hang on Msys here
+ command_like_safe($ctlcmd, qr/done.*server started/s, 'pg_ctl start');
+}
# sleep here is because Windows builds can't check postmaster.pid exactly,
# so they may mistake a pre-existing postmaster.pid for one created by the
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index fe09689..758c920 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -37,6 +37,7 @@ our @EXPORT = qw(
program_version_ok
program_options_handling_ok
command_like
+ command_like_safe
command_fails_like
$windows_os
@@ -300,6 +301,24 @@ sub command_like
like($stdout, $expected_stdout, "$test_name: matches");
}
+sub command_like_safe
+{
+ # Doesn't rely on detecting end of file on the file descriptors,
+ # which can fail, causing the process to hang, notably on Msys
+ # when used with 'pg_ctl start'
+ my ($cmd, $expected_stdout, $test_name) = @_;
+ my ($stdout, $stderr);
+ my $stdoutfile = File::Temp->new();
+ my $stderrfile = File::Temp->new();
+ print("# Running: " . join(" ", @{$cmd}) . "\n");
+ my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile;
+ $stdout = slurp_file($stdoutfile);
+ $stderr = slurp_file($stderrfile);
+ ok($result, "$test_name: exit code 0");
+ is($stderr, '', "$test_name: no stderr");
+ like($stdout, $expected_stdout, "$test_name: matches");
+}
+
sub command_fails_like
{
my ($cmd, $expected_stderr, $test_name) = @_;
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/26/2017 11:12 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > >>> I still find this pretty ugly :-(. >> I'm open to alternative suggestions. But I don't want to spend a heck of >> a lot of time on this. > Well, let's at least do the temp files a little more safely. > Maybe like this? > > OK tested with that and it works. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services