pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

From: Justin Pryzby
Subject: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue
Msg-id: ZKy4AdrLEfbqrxGJ@telsasoft.com
Responses: Re: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue  (Justin Pryzby <pryzby@telsasoft.com>)
List: pgsql-hackers
An instance compiled locally, without assertions, failed like this:

< 2023-07-09 22:04:46.470 UTC  >LOG:  process 30002 detected deadlock while waiting for ShareLock on transaction 813219577 after 333.228 ms
< 2023-07-09 22:04:46.470 UTC  >DETAIL:  Process holding the lock: 2103. Wait queue: 30001.
< 2023-07-09 22:04:46.470 UTC  >CONTEXT:  while checking uniqueness of tuple (549,4) in relation
"pg_statistic_ext_data"
< 2023-07-09 22:04:46.470 UTC  >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.474 UTC  >ERROR:  deadlock detected
< 2023-07-09 22:04:46.474 UTC  >DETAIL:  Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
        Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
        Process 30001: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
        Process 2103: autovacuum: ANALYZE child.ericsson_sgsn_rac_202307
< 2023-07-09 22:04:46.474 UTC  >HINT:  See server log for query details.
< 2023-07-09 22:04:46.474 UTC  >CONTEXT:  while checking uniqueness of tuple (549,4) in relation
"pg_statistic_ext_data"
< 2023-07-09 22:04:46.474 UTC  >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.483 UTC  >LOG:  background worker "parallel worker" (PID 30002) exited with exit code 1
< 2023-07-09 22:04:46.487 UTC postgres >ERROR:  deadlock detected
< 2023-07-09 22:04:46.487 UTC postgres >DETAIL:  Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
        Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
< 2023-07-09 22:04:46.487 UTC postgres >HINT:  See server log for query details.
< 2023-07-09 22:04:46.487 UTC postgres >CONTEXT:  while checking uniqueness of tuple (549,4) in relation
"pg_statistic_ext_data"
        parallel worker
< 2023-07-09 22:04:46.487 UTC postgres >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.848 UTC  >LOG:  server process (PID 30001) was terminated by signal 11: Segmentation fault
< 2023-07-09 22:04:46.848 UTC  >DETAIL:  Failed process was running: REINDEX INDEX
pg_statistic_ext_data_stxoid_inh_index

=> REINDEX was running with parallel workers, but it deadlocked with
ANALYZE and then crashed.

It looks like parallel workers are needed to hit this issue.
I'm not sure if the issue is specific to extended stats - probably not.

I reproduced the crash with manual REINDEX+ANALYZE, with assertions enabled
(none were hit), and on a more recent commit (1124cb2cf).  The crash is hit
about 30% of the time when running REINDEX in a loop while also running
ANALYZE.

I hope someone has a hunch where to look; so far, I wasn't able to create a
minimal reproducer.  
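To spell out what I mean by that loop, here is a rough libpq sketch.  To be
clear, this is only a sketch: the connection string and the some_table /
some_index names are placeholders, and on its own it won't reproduce anything
without a schema and enough data that REINDEX runs parallel workers and the
concurrent ANALYZE can deadlock with it.

/*
 * Sketch of the REINDEX+ANALYZE driver loop (placeholder object names;
 * not a self-contained reproducer).
 * Build with something like:
 *   cc repro.c -o repro -I`pg_config --includedir` -L`pg_config --libdir` -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

static void
drain(PGconn *conn)
{
    PGresult   *res;

    while ((res = PQgetResult(conn)) != NULL)
    {
        /* deadlock errors are expected; the bug is in the error cleanup path */
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);
    }
}

int
main(void)
{
    PGconn     *c1 = PQconnectdb("dbname=ts");
    PGconn     *c2 = PQconnectdb("dbname=ts");

    if (PQstatus(c1) != CONNECTION_OK || PQstatus(c2) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed\n");
        return 1;
    }

    for (int i = 0; i < 1000; i++)
    {
        /* issue both commands so they run concurrently and can deadlock */
        PQsendQuery(c1, "REINDEX INDEX some_index");
        PQsendQuery(c2, "ANALYZE some_table");
        drain(c1);
        drain(c2);
    }

    PQfinish(c1);
    PQfinish(c2);
    return 0;
}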

Core was generated by `postgres: pryzbyj ts [local] REINDEX                 '.
Program terminated with signal 11, Segmentation fault.
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
1898            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
#1  0x00000000007ab56b in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2  0x0000000000548a7e in AbortTransaction () at ../src/backend/access/transam/xact.c:2735
#3  0x0000000000549405 in AbortCurrentTransaction () at ../src/backend/access/transam/xact.c:3414
#4  0x00000000007b6414 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at
../src/backend/tcop/postgres.c:4352
#5  0x0000000000730e9a in BackendRun (port=<optimized out>, port=<optimized out>) at
../src/backend/postmaster/postmaster.c:4461
#6  BackendStartup (port=0x12a8a50) at ../src/backend/postmaster/postmaster.c:4189
#7  ServerLoop () at ../src/backend/postmaster/postmaster.c:1779
#8  0x000000000073207d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x127a230) at
../src/backend/postmaster/postmaster.c:1463
#9  0x00000000004b5535 in main (argc=3, argv=0x127a230) at ../src/backend/main/main.c:198

(gdb) l
1893    RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
1894    {
1895            LOCK       *waitLock = proc->waitLock;
1896            PROCLOCK   *proclock = proc->waitProcLock;
1897            LOCKMODE        lockmode = proc->waitLockMode;
1898            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
1899
1900            /* Make sure proc is waiting */
1901            Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
1902            Assert(proc->links.next != NULL);

(gdb) p waitLock
$1 = (LOCK *) 0x0
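
If I'm reading src/include/storage/lock.h right, LOCK_LOCKMETHOD is just a
field access through the LOCK pointer, so the NULL waitLock is dereferenced
before the Asserts two lines later can ever fire; that would explain why the
assert-enabled build crashed the same way instead of reporting an assertion
failure.  Roughly:

/* from src/include/storage/lock.h, quoted for reference */
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)

/* so lock.c line 1898 effectively does this, faulting when waitLock == NULL */
LOCKMETHODID lockmethodid = waitLock->tag.locktag_lockmethodid;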

Another variant on this crash:

Jul 11 00:55:19 telsa kernel: postgres[25415]: segfault at f ip 000000000081111a sp 00007ffdbc01ea90 error 4 in
postgres[400000+8df000]

Core was generated by `postgres: parallel worker for PID 27096  waiting     '.

(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc154040, hashcode=2029421528) at ../src/backend/storage/lmgr/lock.c:1874
#1  0x000000000081de2f in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2  0x0000000000826990 in ProcessInterrupts () at ../src/backend/tcop/postgres.c:3207
#3  0x000000000081e355 in ProcSleep (locallock=locallock@entry=0x252a9d0, lockMethodTable=lockMethodTable@entry=0xee1260 <default_lockmethod>) at ../src/backend/storage/lmgr/proc.c:1295
#4  0x000000000080eff1 in WaitOnLock (locallock=locallock@entry=0x252a9d0, owner=owner@entry=0x253b548) at ../src/backend/storage/lmgr/lock.c:1818
#5  0x00000000008107ce in LockAcquireExtended (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x0) at ../src/backend/storage/lmgr/lock.c:1082
#6  0x00000000008110a4 in LockAcquire (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at ../src/backend/storage/lmgr/lock.c:740
#7  0x000000000080e316 in XactLockTableWait (xid=xid@entry=816478533, rel=rel@entry=0x7f7332090468, ctid=ctid@entry=0x2596374, oper=oper@entry=XLTW_InsertIndexUnique) at ../src/backend/storage/lmgr/lmgr.c:702
#8  0x00000000005190bb in heapam_index_build_range_scan (heapRelation=0x7f7332090468, indexRelation=0x7f7332099008, indexInfo=0x2596888, allow_sync=<optimized out>, anyvisible=false, progress=false, start_blockno=0, numblocks=4294967295, callback=0x53c8c0 <_bt_build_callback>, callback_state=0x7ffdbc01f310, scan=0x2596318) at ../src/backend/access/heap/heapam_handler.c:1496
#9  0x000000000053ca77 in table_index_build_scan (scan=<optimized out>, callback_state=0x7ffdbc01f310, callback=0x53c8c0 <_bt_build_callback>, progress=false, allow_sync=true, index_info=0x2596888, index_rel=<optimized out>, table_rel=<optimized out>) at ../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x2596d08, btspool2=btspool2@entry=0x2596d38, btshared=btshared@entry=0x2aaaaad423a0, sharedsort=sharedsort@entry=0x2aaaaad42340, sharedsort2=sharedsort2@entry=0x2aaaaad422e0, sortmem=<optimized out>, progress=progress@entry=false) at ../src/backend/access/nbtree/nbtsort.c:1985
#11 0x000000000053ef6f in _bt_parallel_build_main (seg=<optimized out>, toc=0x2aaaaad42080) at ../src/backend/access/nbtree/nbtsort.c:1888
#12 0x0000000000564ec8 in ParallelWorkerMain (main_arg=<optimized out>) at ../src/backend/access/transam/parallel.c:1520
#13 0x00000000007892f8 in StartBackgroundWorker () at ../src/backend/postmaster/bgworker.c:861
#14 0x0000000000493269 in do_start_bgworker (rw=0x2531fc0) at ../src/backend/postmaster/postmaster.c:5762
#15 maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5986
#16 0x000000000078dc5a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5149
#17 ServerLoop () at ../src/backend/postmaster/postmaster.c:1770
#18 0x0000000000790635 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x24fe230) at ../src/backend/postmaster/postmaster.c:1463
#19 0x00000000004b80c5 in main (argc=3, argv=0x24fe230) at ../src/backend/main/main.c:198

-- 
Justin


