Discussion: wrong query results on bf leafhopper
Hi,

I noticed this recent BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-15%2008%3A10%3A04

=== dumping /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs ===
diff -U3 /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out
--- /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out 2025-05-15 08:10:04.211926695+0000
+++ /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out 2025-05-15 08:18:29.117733601+0000
@@ -42,7 +42,7 @@
    -> Nested Loop (actual rows=1000.00 loops=N)
          -> Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
                Filter: (unique1 < 1000)
-               Rows Removed by Filter: 9000
+               Rows Removed by Filter: 8982
          -> Memoize (actual rows=1.00 loops=N)
                Cache Key: t2.twenty
                Cache Mode: logical
@@ -178,7 +178,7 @@
    -> Nested Loop (actual rows=1000.00 loops=N)
          -> Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
                Filter: (unique1 < 1000)
-               Rows Removed by Filter: 9000
+               Rows Removed by Filter: 8981
          -> Memoize (actual rows=1.00 loops=N)
                Cache Key: t1.two, t1.twenty
                Cache Mode: binary

For a moment I thought this could be a bug in memoize, but that doesn't
actually make sense - the failure isn't in memoize, it's the seqscan.

Subsequently I got worried that this is an AIO bug or such causing wrong query
results. But there are instances of this error well before AIO was merged. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-18%2023%3A35%3A04

The same error is also present down to 16.

In 15, I saw a potentially related error
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03

There have been other odd things on leafhopper, see e.g.:
https://www.postgresql.org/message-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987%40gmail.com
https://postgr.es/m/Z4npAKvchWzKfb_r%40paquier.xyz

Greetings,

Andres Freund
On Sat, 17 May 2025 at 01:19, Andres Freund <andres@anarazel.de> wrote:
> @@ -42,7 +42,7 @@
>     -> Nested Loop (actual rows=1000.00 loops=N)
>           -> Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
>                 Filter: (unique1 < 1000)
> -               Rows Removed by Filter: 9000
> +               Rows Removed by Filter: 8982
>           -> Memoize (actual rows=1.00 loops=N)
>                 Cache Key: t2.twenty
>                 Cache Mode: logical
> @@ -178,7 +178,7 @@
>     -> Nested Loop (actual rows=1000.00 loops=N)
>           -> Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
>                 Filter: (unique1 < 1000)
> -               Rows Removed by Filter: 9000
> +               Rows Removed by Filter: 8981
>           -> Memoize (actual rows=1.00 loops=N)
>                 Cache Key: t1.two, t1.twenty
>                 Cache Mode: binary

Note that the actual row count is 1000 still, so that pretty much
discounts corruption with the stored unique1 values. Unfortunately,
that doesn't reduce the number of possible other reasons by very much.

> For a moment I thought this could be a bug in memoize, but that doesn't
> actually make sense - the failure isn't in memoize, it's the seqscan.

I don't have any bright ideas what the cause might be right now, but I
agree that it seems unlikely to be anything related to Memoize.

It might be worth adding a query like: "select count(odd),min(ctid)
from tenk1;" that should use a Seq Scan plan (ideally max(ctid) too,
but that won't be stable over CPU architectures). Maybe also "select
unique1/1000,count(odd) from tenk1 group by 1 order by 1;" so we can
see if there's any sort of consistency or pattern as to which tuples
are missing. Maybe those will provoke some ideas.

David
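As a rough illustration of the arithmetic behind the row counts in the quoted diff, the sketch below works out how many tuples each Seq Scan apparently never saw. It assumes that tenk1 holds 10000 rows (inferred from 1000 returned plus 9000 filtered in the expected output) and that rows returned plus rows removed by the filter equals the number of tuples the scan visited. The function name `missing_tuples` is hypothetical and exists only for this example.

```python
# Assumption: tenk1 has 10000 rows, inferred from the expected output
# (1000 rows returned + 9000 rows removed by the filter).
EXPECTED_TOTAL = 10000


def missing_tuples(rows_returned: int, rows_removed: int,
                   expected_total: int = EXPECTED_TOTAL) -> int:
    """Return how many tuples the scan did not account for.

    Assumes every visited tuple is either returned or counted under
    "Rows Removed by Filter".
    """
    return expected_total - (rows_returned + rows_removed)


# Values taken from the two hunks of the regression diff.
first_hunk = missing_tuples(1000, 8982)
second_hunk = missing_tuples(1000, 8981)

print(f"first hunk:  {first_hunk} tuples unaccounted for")
print(f"second hunk: {second_hunk} tuples unaccounted for")
```

Under those assumptions, all of the unaccounted tuples fall on the filtered side (unique1 >= 1000), since both scans still returned the full 1000 matching rows. The number differs between the two scans, which is the kind of pattern the suggested "group by" query might help pin down.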
David Rowley <dgrowleyml@gmail.com> writes:
> Note that the actual row count is 1000 still, so that pretty much
> discounts corruption with the stored unique1 values. Unfortunately,
> that doesn't reduce the number of possible other reasons by very much.

Failures like this one [1]:

@@ -340,9 +340,13 @@
 create function myinthash(myint) returns integer strict immutable language internal as 'hashint4';
 NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set

are hard to explain as anything besides "that machine is quite broken".
Whether it's flaky hardware, broken compiler, or what is undeterminable
from here, but I don't believe it's our bug. So I'm unexcited about
putting effort into it.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-19%2007%3A07%3A04