Thread: Re: wrong query results on bf leafhopper

Re: wrong query results on bf leafhopper

From: David Rowley
Date:
On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Failures like this one [1]:
>
> @@ -340,9 +340,13 @@
>  create function myinthash(myint) returns integer strict immutable language
>    internal as 'hashint4';
>  NOTICE:  argument type myint is only a shell
> +ERROR:  ROWS is not applicable when function does not return a set
>
> are hard to explain as anything besides "that machine is quite
> broken".  Whether it's flaky hardware, broken compiler, or what is
> undeterminable from here, but I don't believe it's our bug.  So I'm
> unexcited about putting effort into it.

There are certainly far fewer moving parts in PostgreSQL code for
that one as this failure doesn't seem to rely on anything stored in
any tables or the catalogues.

I'd have thought it would be unlikely to be a compiler bug as wouldn't
that mean it'd fail every time?

Are there any Prime95-like stress testers for ARM that could be run on
this machine?

It would be good to kick this one out of the pool if there are hardware issues.

David



Re: wrong query results on bf leafhopper

From: Tomas Vondra
Date:

On 5/20/25 07:50, David Rowley wrote:
> On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Failures like this one [1]:
>>
>> @@ -340,9 +340,13 @@
>>  create function myinthash(myint) returns integer strict immutable language
>>    internal as 'hashint4';
>>  NOTICE:  argument type myint is only a shell
>> +ERROR:  ROWS is not applicable when function does not return a set
>>
>> are hard to explain as anything besides "that machine is quite
>> broken".  Whether it's flaky hardware, broken compiler, or what is
>> undeterminable from here, but I don't believe it's our bug.  So I'm
>> unexcited about putting effort into it.
> 
> There are certainly far fewer moving parts in PostgreSQL code for
> that one as this failure doesn't seem to rely on anything stored in
> any tables or the catalogues.
> 
> I'd have thought it would be unlikely to be a compiler bug as wouldn't
> that mean it'd fail every time?
> 
> Are there any Prime95-like stress testers for ARM that could be run on
> this machine?
> 
> It would be good to kick this one out of the pool if there are hardware issues.
> 

There are tools like "stress" and "stressant". They work on my rpi5, but
whether they are available depends on the packager.

I'd probably just look at dmesg first. In my experience hardware issues
are often pretty visible there - reports of failed I/O requests, thermal
issues on the CPU, that kind of stuff.
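
As a rough illustration of the "look at dmesg first" advice, the sketch below filters kernel log lines for a few hardware-related keywords. The pattern list is an assumption made for this example (real kernel messages vary by kernel version, driver, and platform), and `find_hardware_hints` and `scan_dmesg` are hypothetical helper names, not part of any existing tool.

```python
import re
import subprocess

# Hypothetical keyword patterns, chosen for illustration only.
# Actual kernel messages differ between kernel versions and drivers.
HARDWARE_PATTERNS = [
    re.compile(r"i/o error", re.IGNORECASE),
    re.compile(r"thermal", re.IGNORECASE),
    re.compile(r"machine check|hardware error", re.IGNORECASE),
    re.compile(r"\bedac\b|\becc\b", re.IGNORECASE),
]


def find_hardware_hints(log_lines):
    """Return the log lines that match any of the patterns above."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(pattern.search(line) for pattern in HARDWARE_PATTERNS)
    ]


def scan_dmesg():
    """Run dmesg and filter its output (may need elevated privileges)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return find_hardware_hints(result.stdout.splitlines())
```

Calling `scan_dmesg()` on the affected machine would print nothing useful if the kernel never logged the fault, so an empty result does not prove the hardware is healthy.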


regards

-- 
Tomas Vondra




Re: wrong query results on bf leafhopper

From: Robins Tharakan
Date:

On Tue, 20 May 2025 at 15:20, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Failures like this one [1]:
> >
> > @@ -340,9 +340,13 @@
> >  create function myinthash(myint) returns integer strict immutable language
> >    internal as 'hashint4';
> >  NOTICE:  argument type myint is only a shell
> > +ERROR:  ROWS is not applicable when function does not return a set
> >
> > are hard to explain as anything besides "that machine is quite
> > broken".  Whether it's flaky hardware, broken compiler, or what is
> > undeterminable from here, but I don't believe it's our bug.  So I'm
> > unexcited about putting effort into it.
>
> There are certainly far fewer moving parts in PostgreSQL code for
> that one as this failure doesn't seem to rely on anything stored in
> any tables or the catalogues.
>
> I'd have thought it would be unlikely to be a compiler bug as wouldn't
> that mean it'd fail every time?


Recently leafhopper failed again on the same test. For now I've paused it.
To rule out the compiler (and its maturity on the architecture), I'll upgrade
gcc (to nightly, or something more recent) and then re-enable to see if it
changes anything.

I didn't dive in deeper but I see that indri failed recently [1] on what seems
like the exact same test / line-number (at t/027_stream_regress.pl line 95)
that leafhopper has been tripping on recently. The error is not verbatim,
but it was a little too coincidental to not highlight here.

-
robins

Ref:

Re: wrong query results on bf leafhopper

From: Tom Lane
Date:
Robins Tharakan <tharakan@gmail.com> writes:
> I didn't dive in deeper but I see that indri failed recently [1] on what seems
> like the exact same test / line-number (at t/027_stream_regress.pl line 95)
> that leafhopper has been tripping on recently. The error is not verbatim,
> but it was a little too coincidental to not highlight here.

027_stream_regress.pl is quite a large/complicated test, and for
reasons that are not clear to me it seems more prone to intermittent
timing problems than most other tests.  I would not read very much
into that being the test that failed for you, especially since the
detailed symptoms are not like indri's.

            regards, tom lane



Re: wrong query results on bf leafhopper

From: Andres Freund
Date:
Hi,

On 2025-05-28 22:51:14 +0930, Robins Tharakan wrote:
> On Tue, 20 May 2025 at 15:20, David Rowley <dgrowleyml@gmail.com> wrote:
> 
> > On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Failures like this one [1]:
> > >
> > > @@ -340,9 +340,13 @@
> > >  create function myinthash(myint) returns integer strict immutable language
> > >    internal as 'hashint4';
> > >  NOTICE:  argument type myint is only a shell
> > > +ERROR:  ROWS is not applicable when function does not return a set
> > >
> > > are hard to explain as anything besides "that machine is quite
> > > broken".  Whether it's flaky hardware, broken compiler, or what is
> > > undeterminable from here, but I don't believe it's our bug.  So I'm
> > > unexcited about putting effort into it.
> >
> > There are certainly far fewer moving parts in PostgreSQL code for
> > that one as this failure doesn't seem to rely on anything stored in
> > any tables or the catalogues.
> >
> > I'd have thought it would be unlikely to be a compiler bug as wouldn't
> > that mean it'd fail every time?
> >
> 
> 
> Recently leafhopper failed again on the same test. For now I've paused it.
> To rule out the compiler (and its maturity on the architecture), I'll upgrade
> gcc (to nightly, or something more recent) and then re-enable to see if it
> changes anything.

+1 to a gcc upgrade, gcc 11 is rather old and out of upstream support.

A kernel upgrade would be good too. My completely baseless gut feeling is that
some SIMD registers occasionally get corrupted, e.g. due to a kernel
interrupt / context switch not properly storing & restoring them. Weirdly
enough, the instrumentation code is among the pieces of PG code most
vulnerable to that because we mostly don't do enough auto-vectorizable math,
but InstrEndLoop(), InstrStopNode() etc. are trivially auto-vectorizable.  I'm
pretty sure I've previously analyzed problems around this, but don't remember
the details (IA64 maybe?).
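
One way to probe for this kind of intermittent corruption is the repeat-and-compare idea behind Prime95-style testers: run the same deterministic computation many times and count the runs whose result differs from the first. The sketch below shows only that idea. `count_mismatches` and `float_checksum` are hypothetical names, and CPython's floating-point loop is not auto-vectorized, so this particular workload is unlikely to exercise SIMD registers; a real test would need compiled, vectorized code.

```python
import math


def float_checksum(n=100_000):
    """A deterministic floating-point workload (illustrative only)."""
    total = 0.0
    for i in range(1, n + 1):
        total += math.sqrt(i) * 1.000001
    return total


def count_mismatches(compute, runs=50):
    """Call compute() repeatedly and count results that differ from the first.

    On healthy hardware a deterministic compute() should yield 0.
    """
    reference = compute()
    return sum(1 for _ in range(runs) if compute() != reference)
```

A non-zero count from `count_mismatches(float_checksum)` would point at the machine rather than at PostgreSQL, but a zero count proves little, since rare faults can take many hours of load to show up.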


> I didn't dive in deeper but I see that indri failed recently [1] on what seems
> like the exact same test / line-number (at t/027_stream_regress.pl line 95)
> that leafhopper has been tripping on recently. The error is not verbatim,
> but it was a little too coincidental to not highlight here.

For 027_stream_regress.pl you really need to look at
regress_log_027_stream_regress.log, as that specific line just tests whether
the standard regression tests passed. The failure on indri is rather different
than your issue, I doubt there's an overlap between the problems...

I think we should spruce up 027_stream_regress.pl a bit around this. Before
the "regression tests pass" check we should
a) check if primary is still alive
b) check if standby is still alive

and then, iff a) & b) pass, in addition to printing the entire regression test
file, we should add the head and tail of regression.diffs to the failure
message, so one can quickly glean what went wrong.
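
The proposed reporting order can be sketched as follows. This is a Python illustration of the logic only, not a patch for the Perl test itself; `head_and_tail`, `failure_message`, and the default of 20 lines are assumptions made for the example.

```python
def head_and_tail(lines, count=20):
    """Return the first and last `count` lines, marking any omitted middle."""
    if count <= 0:
        raise ValueError("count must be positive")
    lines = list(lines)
    if len(lines) <= 2 * count:
        return lines  # short enough to show in full
    omitted = len(lines) - 2 * count
    marker = f"... ({omitted} lines omitted) ..."
    return lines[:count] + [marker] + lines[-count:]


def failure_message(primary_alive, standby_alive, diffs_lines, count=20):
    """Build a failure message in the order proposed above."""
    # a) and b): report a dead node first, since the diffs would mislead.
    if not primary_alive:
        return "primary is not running"
    if not standby_alive:
        return "standby is not running"
    # Both nodes are alive: show where the regression output diverged.
    excerpt = head_and_tail(diffs_lines, count)
    return (
        "regression tests failed; excerpt of regression.diffs:\n"
        + "\n".join(excerpt)
    )
```

Checking node liveness first keeps a crashed server from being reported as an ordinary regression diff.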

Greetings,

Andres Freund



Re: wrong query results on bf leafhopper

From: Robins Tharakan
Date:
Hi,

On Thu, 29 May 2025 at 02:32, Andres Freund <andres@anarazel.de> wrote:
> On 2025-05-28 22:51:14 +0930, Robins Tharakan wrote:
> > Recently leafhopper failed again on the same test. For now I've paused it.
> > To rule out the compiler (and its maturity on the architecture), I'll upgrade
> > gcc (to nightly, or something more recent) and then re-enable to see if it
> > changes anything.
>
> +1 to a gcc upgrade, gcc 11 is rather old and out of upstream support.


Ack. I've updated leafhopper to gcc master. For now (to get the machine
green / running), I've disabled some flags, which I'll revisit in some time,
but hopefully that's not about compiler maturity - which is what I'm after here.

 
> A kernel upgrade would be good too. My completely baseless gut feeling is that
> some SIMD registers occasionally get corrupted, e.g. due to a kernel
> interrupt / context switch not properly storing & restoring them. Weirdly
> enough, the instrumentation code is among the pieces of PG code most
> vulnerable to that because we mostly don't do enough auto-vectorizable math,
> but InstrEndLoop(), InstrStopNode() etc. are trivially auto-vectorizable.  I'm
> pretty sure I've previously analyzed problems around this, but don't remember
> the details (IA64 maybe?).

Fair point, I'll keep that option open. Originally, the machine was spun up to
evaluate the graviton4 ec2 instance and I'd like to explore whether the
stock-kernel / kernel-updates are able to keep the instance green (and resort
to updating the kernel only if I exhaust all other options - pg / compiler etc.).

-
robins