Обсуждение: pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum
pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum
Hello everyone!
I've been examining a floating bug in pg_visibility. To reproduce it, you have to:
1) apply a patch to pg_visibility's tests (you can find it attached)
2) run test-icx.sh (attached) from the root of the postgresql's source tree
pg_vis_clean.patch adds one additional test to pg_visibility test suite and normally it is expected to pass. However, when (I assume so) it happens to run in parallel with autovacuum, the test is failed. test-icx.sh runs several parallel installchecks on contrib/pg_visibility tests and there is a chance that some of them will catch this error.
This is the set-up for a bug reproduction:
create table vacuum_test as select 42 i;
vacuum (disable_page_skipping) vacuum_test;
This is the actual test result:
diff -U3 /tmp/icx/ic01-1/contrib/pg_visibility/expected/pg_visibility.out /tmp/icx/ic01-1/contrib/pg_visibility/results/pg_visibility.out
--- /tmp/icx/ic01-1/contrib/pg_visibility/expected/pg_visibility.out 2022-02-17 23:46:26.005500630 +0300
+++ /tmp/icx/ic01-1/contrib/pg_visibility/results/pg_visibility.out 2022-02-17 23:47:36.066838872 +0300
@@ -247,7 +247,7 @@
select count(*) > 0 from pg_check_visible('vacuum_test');
?column?
----------
- f
+ t
(1 row)
-- cleanup
The thing is that the underlying file is one page long which should contain the only all-visible tuple and pg_check_visible should return nothing. Is this a real bug or do I miss something? Thank you in advance for the response
P. S.: SELECT version(); result: PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
Вложения
Hi, On 2022-02-18 00:09:48 +0300, Daniel Shelepanov wrote: > I've been examining a floating bug in pg_visibility. To reproduce it, you > have to: > > 1) apply a patch to pg_visibility's tests (you can find it attached) > > 2) run test-icx.sh (attached) from the root of the postgresql's source tree Could you try to minimize the script? A 300 line reproducer is quite long. And it looks like it won't even work in non postgres-pro tree. One thing to do would be to modify pg_visibility to elog(PANIC, "something") when it encounters corruption. Then you would have a chance of inspecting the state of the tuple/page in that moment. Greetings, Andres Freund
Hi, On 2022-02-18 08:56:37 -0800, Andres Freund wrote: > Could you try to minimize the script? A 300 line reproducer is quite long. And > it looks like it won't even work in non postgres-pro tree. > > One thing to do would be to modify pg_visibility to elog(PANIC, "something") > when it encounters corruption. Then you would have a chance of inspecting the > state of the tuple/page in that moment. Oh, I have been able to reliably reproduce this on HEAD. I modified record_corrupt_item() to PANIC and then: psql regression: BEGIN ;SELECT txid_current(); <leave open> psql postgres DROP TABLE IF EXISTS vacuum_test_0; create table vacuum_test_0 as select 42 i; vacuum (disable_page_skipping) vacuum_test_0; select * from pg_check_visible('vacuum_test_0'); At which point there immediately is a crash. This reproduces in earlier versions too, at least back to 10. I *think* this is a false positive: - PGPROC->xmin is computed without regard for the database in which the other sessions are running. Due to the the txid_current() session this includes an older xid. - During the VACUUM in vis.sql the only connection to the database pgbench connects to is VACUUM and thus ignored when determining horizons (due to PROC_IN_VACUUM). Therefore the horizon is computed to latestCompletedXid + 1. - But during pg_check_visible(), the current session is *not* marked as PROC_IN_VACUUM. So the horizon is the xid from the txid_current(). Boom. Greetings, Andres Freund