Thread: wrong query results on bf leafhopper

wrong query results on bf leafhopper

From
Andres Freund
Date:
Hi,

I noticed this recent BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-15%2008%3A10%3A04

=== dumping /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs ===
diff -U3 /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out
--- /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out    2025-05-15 08:10:04.211926695+0000
+++ /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out    2025-05-15 08:18:29.117733601+0000
@@ -42,7 +42,7 @@
    ->  Nested Loop (actual rows=1000.00 loops=N)
          ->  Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
                Filter: (unique1 < 1000)
-               Rows Removed by Filter: 9000
+               Rows Removed by Filter: 8982
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t2.twenty
                Cache Mode: logical
@@ -178,7 +178,7 @@
    ->  Nested Loop (actual rows=1000.00 loops=N)
          ->  Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
                Filter: (unique1 < 1000)
-               Rows Removed by Filter: 9000
+               Rows Removed by Filter: 8981
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t1.two, t1.twenty
                Cache Mode: binary


For a moment I thought this could be a bug in memoize, but that doesn't
actually make sense - the failure isn't in memoize, it's the seqscan.

Subsequently I got worried that this is an AIO bug or such causing wrong query
results. But there are instances of this error well before AIO was
merged. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-18%2023%3A35%3A04

The same error is also present on the back branches down to 16.

In 15, I saw a potentially related error:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03


There have been other odd things on leafhopper, see e.g.:
https://www.postgresql.org/message-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987%40gmail.com
https://postgr.es/m/Z4npAKvchWzKfb_r%40paquier.xyz

Greetings,

Andres Freund



Re: wrong query results on bf leafhopper

From
David Rowley
Date:
On Sat, 17 May 2025 at 01:19, Andres Freund <andres@anarazel.de> wrote:
> @@ -42,7 +42,7 @@
>     ->  Nested Loop (actual rows=1000.00 loops=N)
>           ->  Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
>                 Filter: (unique1 < 1000)
> -               Rows Removed by Filter: 9000
> +               Rows Removed by Filter: 8982
>           ->  Memoize (actual rows=1.00 loops=N)
>                 Cache Key: t2.twenty
>                 Cache Mode: logical
> @@ -178,7 +178,7 @@
>     ->  Nested Loop (actual rows=1000.00 loops=N)
>           ->  Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
>                 Filter: (unique1 < 1000)
> -               Rows Removed by Filter: 9000
> +               Rows Removed by Filter: 8981
>           ->  Memoize (actual rows=1.00 loops=N)
>                 Cache Key: t1.two, t1.twenty
>                 Cache Mode: binary

Note that the actual row count is 1000 still, so that pretty much
discounts corruption with the stored unique1 values. Unfortunately,
that doesn't reduce the number of possible other reasons by very much.
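
To spell out the arithmetic behind that observation, here is a minimal sketch in Python. It assumes tenk1 holds 10000 rows, which is inferred from the expected plan output (1000 rows returned plus 9000 removed by the filter); the function name is hypothetical and only serves the illustration.

```python
# Assumption: tenk1 has 10000 rows, as implied by the expected plan output
# (1000 rows returned + 9000 rows removed by the filter).
EXPECTED_TOTAL = 1000 + 9000


def missing_tuples(rows_returned: int, rows_removed: int) -> int:
    """Tuples the Seq Scan reported neither as returned nor as filtered out."""
    return EXPECTED_TOTAL - (rows_returned + rows_removed)


# Values taken from the two failing hunks in regression.diffs.
first_hunk = missing_tuples(1000, 8982)
second_hunk = missing_tuples(1000, 8981)
print(first_hunk, second_hunk)
```

If the stored unique1 values are intact, all missing tuples fall on the filtered-out side (unique1 >= 1000), and the number of missing tuples differs between the two scans of the same table.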

> For a moment I thought this could be a bug in memoize, but that doesn't
> actually make sense - the failure isn't in memoize, it's the seqscan.

I don't have any bright ideas what the cause might be right now, but I
agree that it seems unlikely to be anything related to Memoize.

It might be worth adding a query like: "select count(odd),min(ctid)
from tenk1;" that should use a Seq Scan plan (ideally max(ctid) too,
but that won't be stable over CPU architectures). Maybe also "select
unique1/1000,count(odd) from tenk1 group by 1 order by 1;" so we can
see if there's any sort of consistency or pattern as to which tuples
are missing. Maybe those will provoke some ideas.
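
As a rough illustration of what the second query would reveal, the following Python sketch mimics its grouping on made-up data. Everything here is an assumption for illustration: the data is synthetic, the set of skipped rows is arbitrary, and `odd` is assumed never to be NULL so that count(odd) counts every row seen.

```python
from collections import Counter


def bucket_counts(unique1_values):
    """Mimic: select unique1/1000, count(odd) from tenk1 group by 1 order by 1
    (assumes odd is never NULL, so every scanned row is counted)."""
    counts = Counter(v // 1000 for v in unique1_values)
    return sorted(counts.items())


# Hypothetical healthy scan: unique1 takes each value 0..9999 exactly once.
healthy = list(range(10000))

# Hypothetical faulty scan that skips 18 rows; which rows are skipped
# is chosen arbitrarily for this sketch.
faulty = [v for v in healthy if not (4200 <= v < 4218)]

# Buckets whose count deviates from the expected 1000 rows.
short = [(bucket, n) for bucket, n in bucket_counts(faulty) if n != 1000]
print(short)
```

On a healthy scan every bucket holds 1000 rows, so any bucket reporting fewer would show where the missing tuples sit in terms of unique1.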

David



Re: wrong query results on bf leafhopper

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Note that the actual row count is 1000 still, so that pretty much
> discounts corruption with the stored unique1 values. Unfortunately,
> that doesn't reduce the number of possible other reasons by very much.

Failures like this one [1]:

@@ -340,9 +340,13 @@
 create function myinthash(myint) returns integer strict immutable language
   internal as 'hashint4';
 NOTICE:  argument type myint is only a shell
+ERROR:  ROWS is not applicable when function does not return a set

are hard to explain as anything besides "that machine is quite
broken".  Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug.  So I'm
unexcited about putting effort into it.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-19%2007%3A07%3A04