Thread: BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18127
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.0
Operating system:   Ubuntu 22.04
Description:

A server compiled --with-blocksize=1 produces an assertion failure on
`make check`. More specifically, the failure is triggered by queries like:
CREATE TABLE concur_reindex_tab (c1 int PRIMARY KEY);
CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
REINDEX INDEX CONCURRENTLY concur_reindex_tab_pkey;

Core was generated by `postgres: law regression [local] REINDEX'.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/3738467' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=139868701853504)
    at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=139868701853504) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139868701853504) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139868701853504, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x00007f35b7af0476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x00007f35b7ad67f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000559cd6118131 in ExceptionalCondition (conditionName=0x559cd61a5470
"HaveRegisteredOrActiveSnapshot()", fileName=0x559cd61a50c0
"toast_internals.c", lineNumber=670) at assert.c:66
#6  0x0000559cd59ad7ff in init_toast_snapshot
(toast_snapshot=0x7ffdf91b4d00) at toast_internals.c:670
#7  0x0000559cd59ad22c in toast_delete_datum (rel=0x7f35b8466c58,
value=139868524341189, is_speculative=false) at toast_internals.c:429
#8  0x0000559cd5a625e0 in toast_tuple_cleanup (ttc=0x7ffdf91b4e50) at
toast_helper.c:309
#9  0x0000559cd5a0c31c in heap_toast_insert_or_update (rel=0x7f35b8466c58,
newtup=0x559cd69dae18, oldtup=0x7ffdf91c2440, options=0) at
heaptoast.c:333
#10 0x0000559cd59f7418 in heap_update (relation=0x7f35b8466c58,
otid=0x559cd69dae1c, newtup=0x559cd69dae18, cid=0, crosscheck=0x0,
wait=true, tmfd=0x7ffdf91c24d0, lockmode=0x7ffdf91c24c8,
update_indexes=0x7ffdf91c252c) at heapam.c:3595
#11 0x0000559cd59f8109 in simple_heap_update (relation=0x7f35b8466c58,
otid=0x559cd69dae1c, tup=0x559cd69dae18, update_indexes=0x7ffdf91c252c) at
heapam.c:4053
#12 0x0000559cd5ad3b45 in CatalogTupleUpdate (heapRel=0x7f35b8466c58,
otid=0x559cd69dae1c, tup=0x559cd69dae18) at indexing.c:322
#13 0x0000559cd5aceda6 in index_concurrently_swap (newIndexId=20330,
oldIndexId=20299, oldName=0x559cd69dc368 "concur_reindex_ind1_ccold") at
index.c:1659
#14 0x0000559cd5be2573 in ReindexRelationConcurrently (relationOid=20299,
params=0x7ffdf91c2b68) at indexcmds.c:4088
#15 0x0000559cd5bdfe3a in ReindexIndex (indexRelation=0x559cd691a100,
params=0x7ffdf91c2b68, isTopLevel=true) at indexcmds.c:2809
#16 0x0000559cd5bdfc7a in ExecReindex (pstate=0x559cd6c3ae28,
stmt=0x559cd691a150, isTopLevel=true) at indexcmds.c:2739
#17 0x0000559cd5f38b7c in standard_ProcessUtility (pstmt=0x559cd691a2a0,
queryString=0x559cd69196e8 "REINDEX INDEX CONCURRENTLY
concur_reindex_ind1;", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, 
    dest=0x559cd691a560, qc=0x7ffdf91c3010) at utility.c:964
#18 0x0000559cd5f37e7d in ProcessUtility (pstmt=0x559cd691a2a0,
queryString=0x559cd69196e8 "REINDEX INDEX CONCURRENTLY
concur_reindex_ind1;", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, 
    dest=0x559cd691a560, qc=0x7ffdf91c3010) at utility.c:530
#19 0x0000559cd5f3674f in PortalRunUtility (portal=0x559cd698c8f8,
pstmt=0x559cd691a2a0, isTopLevel=true, setHoldSnapshot=false,
dest=0x559cd691a560, qc=0x7ffdf91c3010) at pquery.c:1158
#20 0x0000559cd5f369c6 in PortalRunMulti (portal=0x559cd698c8f8,
isTopLevel=true, setHoldSnapshot=false, dest=0x559cd691a560,
altdest=0x559cd691a560, qc=0x7ffdf91c3010) at pquery.c:1315
#21 0x0000559cd5f35e10 in PortalRun (portal=0x559cd698c8f8,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x559cd691a560, altdest=0x559cd691a560, qc=0x7ffdf91c3010) at
pquery.c:791
#22 0x0000559cd5f2eb1f in exec_simple_query (query_string=0x559cd69196e8
"REINDEX INDEX CONCURRENTLY concur_reindex_ind1;") at postgres.c:1274
#23 0x0000559cd5f33b8d in PostgresMain (dbname=0x559cd6950ac8 "regression",
username=0x559cd69159c8 "law") at postgres.c:4637
#24 0x0000559cd5e54bf3 in BackendRun (port=0x559cd69452f0) at
postmaster.c:4464
#25 0x0000559cd5e5447f in BackendStartup (port=0x559cd69452f0) at
postmaster.c:4192
#26 0x0000559cd5e507c4 in ServerLoop () at postmaster.c:1782
#27 0x0000559cd5e5006e in PostmasterMain (argc=8, argv=0x559cd6913850) at
postmaster.c:1466
#28 0x0000559cd5d046f9 in main (argc=8, argv=0x559cd6913850) at main.c:198

(gdb) frame 10
#10 0x0000559cd59f7418 in heap_update (relation=0x7f35b8466c58,
otid=0x559cd69dae1c, 
    newtup=0x559cd69dae18, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffdf91c24d0, 
    lockmode=0x7ffdf91c24c8, update_indexes=0x7ffdf91c252c) at
heapam.c:3595
3595                            heaptup =
heap_toast_insert_or_update(relation, newtup, &oldtup, 0);
(gdb) p need_toast
$1 = true
(gdb) p newtupsize
$2 = 256

(gdb) frame 6
#6  0x0000559cd59ad7ff in init_toast_snapshot
(toast_snapshot=0x7ffdf91b4d00) at toast_internals.c:670
670             Assert(HaveRegisteredOrActiveSnapshot());
(gdb) p RegisteredSnapshots
$4 = {ph_compare = 0x559cd61775e0 <xmin_cmp>, ph_arg = 0x0, 
  ph_root = 0x559cd64e35a8 <CatalogSnapshotData+72>}
(gdb)  p *RegisteredSnapshots.ph_root
$5 = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x0}
(gdb) p ActiveSnapshot
$6 = (ActiveSnapshotElt *) 0x0

I think that the block size matters here only because it makes it possible
to reach toast_delete_datum() inside index_concurrently_swap(). Perhaps the
same effect could be seen with the default block size, but with larger tuples
in pg_constraint (I haven't tried to construct such tuples yet).
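
To illustrate why the block size matters, here is a minimal sketch of how the TOAST threshold scales with the block size. The formula is an assumption modeled on MaximumBytesPerTuple() in heaptoast.h and should be checked against the actual source. The constants (8-byte maximum alignment, 24-byte page header, 4-byte line pointer, 4 tuples per page) are assumed values for a typical 64-bit build, and every toy_* name is hypothetical, not a PostgreSQL identifier.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout constants for a typical 64-bit build (not from the thread). */
#define TOY_MAXALIGN              8   /* maximum alignment in bytes */
#define TOY_PAGE_HEADER_SIZE      24  /* assumed size of the page header */
#define TOY_ITEM_ID_SIZE          4   /* assumed size of one line pointer */
#define TOY_TOAST_TUPLES_PER_PAGE 4   /* assumed tuples-per-page target */

/* Round len up to the next multiple of TOY_MAXALIGN. */
size_t toy_maxalign(size_t len)
{
    return (len + TOY_MAXALIGN - 1) & ~(size_t) (TOY_MAXALIGN - 1);
}

/* Round len down to the previous multiple of TOY_MAXALIGN. */
size_t toy_maxalign_down(size_t len)
{
    return len & ~(size_t) (TOY_MAXALIGN - 1);
}

/*
 * Largest tuple size (in bytes) that still lets the assumed number of
 * tuples fit on one page of block_size bytes. Tuples longer than this
 * are candidates for toasting in this model.
 */
size_t toy_toast_threshold(size_t block_size)
{
    size_t overhead = toy_maxalign(TOY_PAGE_HEADER_SIZE +
                                   TOY_TOAST_TUPLES_PER_PAGE * TOY_ITEM_ID_SIZE);

    assert(block_size > overhead);
    return toy_maxalign_down((block_size - overhead) / TOY_TOAST_TUPLES_PER_PAGE);
}
```

Under these assumptions, toy_toast_threshold(1024) is 240 and toy_toast_threshold(8192) is 2032, so the 256-byte tuple seen in frame 10 exceeds the threshold only with the 1 kB block size.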

This issue is not the same as [1], because in this case I really see no
registered or active snapshots.

Reproduced on REL_15_STABLE .. master.

[1]
https://www.postgresql.org/message-id/dc9dd229-ed30-6c62-4c41-d733ffff776b%40xs4all.nl


Re: BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

From
Alexander Lakhin
Date:
21.09.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18127
>
> A server compiled --with-blocksize=1 produces an assertion failure on
> `make check`. More specifically, the failure is triggered by queries like:
> CREATE TABLE concur_reindex_tab (c1 int PRIMARY KEY);
> CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
> REINDEX INDEX CONCURRENTLY concur_reindex_tab_pkey;

With the attached patch, which I believe models a situation that can
really occur during code execution, I could run check-world without
changing the block size. It revealed other code paths that reach that
assertion failure:
1) ALTER TABLE range_parted2 DETACH PARTITION part_rp CONCURRENTLY
#5  0x0000555e8b668a01 in ExceptionalCondition (conditionName=conditionName@entry=0x555e8b6ca710 
"HaveRegisteredOrActiveSnapshot()", fileName=fileName@entry=0x555e8b6dbe02 "toast_helper.c", 
lineNumber=lineNumber@entry=281)
     at assert.c:66
...
#9  0x0000555e8b15c637 in simple_heap_update (...) at heapam.c:4044
#10 0x0000555e8b20510c in CatalogTupleUpdate (...) at indexing.c:322
#11 0x0000555e8b2e1bf6 in DetachPartitionFinalize (...) at tablecmds.c:19428
...

2) src/test/subscription/t/021_twophase.pl
Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16389 '.
...
#5  0x000055dae548da01 in ExceptionalCondition (conditionName=conditionName@entry=0x55dae54ef710 
"HaveRegisteredOrActiveSnapshot()", fileName=fileName@entry=0x55dae5500e02 "toast_helper.c", 
lineNumber=lineNumber@entry=281)
     at assert.c:66
...
#9  0x000055dae4f81637 in simple_heap_update (...) at heapam.c:4044
#10 0x000055dae502a10c in CatalogTupleUpdate (...) at indexing.c:322
#11 0x000055dae52c70ca in UpdateTwoPhaseState (...) at tablesync.c:1752
#12 0x000055dae52ce2e1 in run_apply_worker () at worker.c:4539

3) src/test/subscription/t/029_on_error.pl
Core was generated by `postgres: subscriber: logical replication tablesync worker for subscription 163'.
...
#5  0x0000556c27961a01 in ExceptionalCondition (conditionName=conditionName@entry=0x556c279c3710 
"HaveRegisteredOrActiveSnapshot()", fileName=fileName@entry=0x556c279d4e02 "toast_helper.c", 
lineNumber=lineNumber@entry=281)
     at assert.c:66
...
#9  0x0000556c27455637 in simple_heap_update (...) at heapam.c:4044
#10 0x0000556c274fe10c in CatalogTupleUpdate (...) at indexing.c:322
#11 0x0000556c2751fc98 in DisableSubscription (...) at pg_subscription.c:196
#12 0x0000556c277a1d03 in DisableSubscriptionAndExit () at worker.c:4725
#13 0x0000556c2779a935 in start_table_sync (...) at tablesync.c:1623

These are all code paths that were detected during check-world.

Andres, could you please look at this and determine whether the state
highlighted by the assertion is unexpected?

Best regards,
Alexander
Attachments

Re: BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

From
Michael Paquier
Date:
On Fri, Sep 22, 2023 at 03:00:01PM +0300, Alexander Lakhin wrote:
> These are all code paths that were detected during check-world.

Hmm, yeah.  That's annoying.  I am not sure what to think here yet.
--
Michael

Attachments

Re: BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

From
Andres Freund
Date:
Hi,

On 2023-09-21 11:00:01 +0000, PG Bug reporting form wrote:
> A server compiled --with-blocksize=1 produces an assertion failure on
> `make check`. More specifically, the failure is triggered by queries like:
> CREATE TABLE concur_reindex_tab (c1 int PRIMARY KEY);
> CREATE TABLE concur_reindex_tab2 (c1 int REFERENCES concur_reindex_tab);
> REINDEX INDEX CONCURRENTLY concur_reindex_tab_pkey;

> I think that the block size matters here only because it makes it possible
> to reach toast_delete_datum() inside index_concurrently_swap(). Perhaps the
> same effect could be seen with the default block size, but with larger tuples
> in pg_constraint (I haven't tried to construct such tuples yet).

It seems we ought to move some assertions further up the call stacks, so that
they're hit independent of whether we actually end up toasting or
not. Otherwise it's too hard to find these problems.

Seems like we ought to have assertions in places like pg_detoast_datum(),
heap_insert(), heap_update(), heap_delete()? They all can lead to needing
toasting.

It's possible that there are some bootstrap cases or such where it's ok that
we don't have a snapshot, and that we need to weaken the assertions for that,
but that'd be ok.
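
The point about assertion placement can be shown with a toy model. This is only a sketch: ToyBackend and both functions are hypothetical stand-ins, not PostgreSQL code. The have_snapshot field loosely models what HaveRegisteredOrActiveSnapshot() reports, and toast_threshold models the tuple size above which toasting is attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant backend state. */
typedef struct ToyBackend
{
    bool   have_snapshot;    /* models HaveRegisteredOrActiveSnapshot() */
    size_t toast_threshold;  /* models the size above which tuples get toasted */
} ToyBackend;

/*
 * Check placed deep in the TOAST path: it is reached only when the tuple
 * is large enough to need toasting, so a missing snapshot goes unnoticed
 * for small tuples.
 */
bool deep_check_detects(const ToyBackend *b, size_t tuple_len)
{
    assert(b != NULL);

    bool need_toast = tuple_len > b->toast_threshold;

    return need_toast && !b->have_snapshot;
}

/*
 * Check placed at the entry of the update: it is reached for every tuple,
 * so a missing snapshot is reported regardless of the tuple size.
 */
bool early_check_detects(const ToyBackend *b, size_t tuple_len)
{
    assert(b != NULL);
    (void) tuple_len;        /* the tuple size is irrelevant for this check */

    return !b->have_snapshot;
}
```

In this model, a 256-byte tuple with no snapshot and a threshold of 2032 passes the deep check silently, while the early check flags it. Lowering the threshold to 240 makes the deep check fire too, which corresponds to what the 1 kB block size exposed.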


> This issue is not the same as [1], because in this case I really see no
> registered or active snapshots.

Yea, it indeed looks like a real issue.

Greetings,

Andres Freund



Re: BUG #18127: Assertion HaveRegisteredOrActiveSnapshot failed on REINDEX CONCURRENTLY when blocksize=1

From
Andres Freund
Date:
Hi,

On 2023-09-22 15:00:01 +0300, Alexander Lakhin wrote:
> Andres, could you please look at this and determine whether the state
> highlighted by the assertion is unexpected?

Yes, I think these are a problem.  I don't really see another solution than
making the assertions trigger more widely and going through and fixing the
cases one by one.

Greetings,

Andres Freund