Thread: BUG #18404: Select from pg_stat_activity in a plpgsql block can lead to a global locking issue

BUG #18404: Select from pg_stat_activity in a plpgsql block can lead to a global locking issue

From: PG Bug reporting form
Date: Fri, 22 Mar 2024 13:00:01 +0000
The following bug has been logged on the website:

Bug reference:      18404
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

The following script:
createuser u

cat << 'EOF' | psql -U u &
DO '
BEGIN
  PERFORM count(*) FROM pg_stat_activity;
  RAISE NOTICE ''sleeping...'';
  PERFORM pg_sleep(1800);
END';
EOF
sleep 0.5

cat << EOF | psql &
VACUUM;
REINDEX SYSTEM;
EOF
sleep 0.5

cat << EOF | psql -U u
SELECT 1
EOF

ends with the last session stuck on startup:
law      3318525 3318511  0 15:00 ?        00:00:00 postgres: u regression [local] DO
law      3318530 3318511  0 15:00 ?        00:00:00 postgres: law regression [local] REINDEX waiting
law      3318533 3318511  0 15:00 ?        00:00:00 postgres: u regression [local] startup waiting
law      3318628 3318511  0 15:01 ?        00:00:00 postgres: autovacuum worker  waiting
law      3318654 3318511  0 15:01 ?        00:00:00 postgres: autovacuum worker  waiting
law      3318676 3318511  0 15:02 ?        00:00:00 postgres: autovacuum worker  waiting

The backtrace of the last session backend is:
(gdb) bt
#0  0x00007fa5514ecf9a in epoll_wait (epfd=11, events=0x55970cae3d38,
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055970b04064b in WaitEventSetWaitBlock (set=0x55970cae3cd8,
cur_timeout=-1, occurred_events=0x7ffff6af57f0, nevents=1) at latch.c:1529
#2  0x000055970b040535 in WaitEventSetWait (set=0x55970cae3cd8, timeout=-1,
occurred_events=0x7ffff6af57f0, nevents=1, wait_event_info=50331648) at
latch.c:1475
#3  0x000055970b03f7d2 in WaitLatch (latch=0x7fa54e361734, wakeEvents=33,
timeout=0, wait_event_info=50331648) at latch.c:513
#4  0x000055970b06d8ea in ProcSleep (locallock=0x55970cb107d0,
lockMethodTable=0x55970b5e88e0 <default_lockmethod>) at proc.c:1294
#5  0x000055970b059f62 in WaitOnLock (locallock=0x55970cb107d0,
owner=0x55970cb73988) at lock.c:1818
#6  0x000055970b058a82 in LockAcquireExtended (locktag=0x7ffff6af5ba0,
lockmode=1, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffff6af5b98) at lock.c:1082
#7  0x000055970b055724 in LockRelationOid (relid=2676, lockmode=1) at
lmgr.c:117
#8  0x000055970b25319d in load_critical_index (indexoid=2676, heapoid=1260)
at relcache.c:4335
#9  0x000055970b252cee in RelationCacheInitializePhase3 () at
relcache.c:4150
#10 0x000055970b27d9be in InitPostgres (in_dbname=0x55970cb21978
"regression", dboid=16384, username=0x55970cae57c8 "u", useroid=0,
load_session_libraries=true, override_allow_connections=false,
out_dbname=0x0) at postinit.c:1180
#11 0x000055970b07dca4 in PostgresMain (dbname=0x55970cb21978 "regression",
username=0x55970cae57c8 "u") at postgres.c:4195
#12 0x000055970af9f014 in BackendRun (port=0x55970cb11c90) at
postmaster.c:4464
#13 0x000055970af9e8a0 in BackendStartup (port=0x55970cb11c90) at
postmaster.c:4192

The backtrace of the REINDEXing backend is:
(gdb) bt
#0  0x00007fa5514ecf9a in epoll_wait (epfd=11, events=0x55970cae3d38,
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055970b04064b in WaitEventSetWaitBlock (set=0x55970cae3cd8,
cur_timeout=-1, occurred_events=0x7ffff6af5090, nevents=1) at latch.c:1529
#2  0x000055970b040535 in WaitEventSetWait (set=0x55970cae3cd8, timeout=-1,
occurred_events=0x7ffff6af5090, nevents=1, wait_event_info=50331648) at
latch.c:1475
#3  0x000055970b03f7d2 in WaitLatch (latch=0x7fa54e361aa4, wakeEvents=33,
timeout=0, wait_event_info=50331648) at latch.c:513
#4  0x000055970b06d8ea in ProcSleep (locallock=0x55970cb105c0,
lockMethodTable=0x55970b5e88e0 <default_lockmethod>) at proc.c:1294
#5  0x000055970b059f62 in WaitOnLock (locallock=0x55970cb105c0,
owner=0x55970cb77a68) at lock.c:1818
#6  0x000055970b058a82 in LockAcquireExtended (locktag=0x7ffff6af5440,
lockmode=8, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffff6af5438) at lock.c:1082
#7  0x000055970b055724 in LockRelationOid (relid=2676, lockmode=8) at
lmgr.c:117
#8  0x000055970aaef8de in try_relation_open (relationId=2676, lockmode=8) at
relation.c:97
#9  0x000055970ab64a85 in try_index_open (relationId=2676, lockmode=8) at
indexam.c:161
#10 0x000055970ac1b146 in reindex_index (indexId=2676,
skip_constraint_checks=false, persistence=112 'p', params=0x7ffff6af56d8) at
index.c:3637
#11 0x000055970ac1bbe5 in reindex_relation (relid=1260, flags=5,
params=0x7ffff6af56d8) at index.c:3989
#12 0x000055970ad29ec6 in ReindexMultipleInternal (relids=0x55970cb17ef8,
params=0x7ffff6af5838) at indexcmds.c:3341
#13 0x000055970ad29923 in ReindexMultipleTables (objectName=0x0,
objectKind=REINDEX_OBJECT_SYSTEM, params=0x7ffff6af5838) at
indexcmds.c:3138
#14 0x000055970ad28e4e in ExecReindex (pstate=0x55970cb13958,
stmt=0x55970cae9ee0, isTopLevel=true) at indexcmds.c:2734
#15 0x000055970b0832c8 in standard_ProcessUtility (pstmt=0x55970cae9f90,
queryString=0x55970cae9518 "REINDEX SYSTEM;", readOnlyTree=false,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55970caea250, 
    qc=0x7ffff6af5ce0) at utility.c:964
#16 0x000055970b0825c9 in ProcessUtility (pstmt=0x55970cae9f90,
queryString=0x55970cae9518 "REINDEX SYSTEM;", readOnlyTree=false,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55970caea250, qc=0x7ffff6af5ce0)
    at utility.c:530
#17 0x000055970b080e9b in PortalRunUtility (portal=0x55970cb63a18,
pstmt=0x55970cae9f90, isTopLevel=true, setHoldSnapshot=false,
dest=0x55970caea250, qc=0x7ffff6af5ce0) at pquery.c:1158
#18 0x000055970b081112 in PortalRunMulti (portal=0x55970cb63a18,
isTopLevel=true, setHoldSnapshot=false, dest=0x55970caea250,
altdest=0x55970caea250, qc=0x7ffff6af5ce0) at pquery.c:1315
#19 0x000055970b08055c in PortalRun (portal=0x55970cb63a18,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x55970caea250, altdest=0x55970caea250, qc=0x7ffff6af5ce0) at
pquery.c:791
#20 0x000055970b07926b in exec_simple_query (query_string=0x55970cae9518
"REINDEX SYSTEM;") at postgres.c:1274
#21 0x000055970b07e2d9 in PostgresMain (dbname=0x55970cb21978 "regression",
username=0x55970cae57c8 "law") at postgres.c:4637
#22 0x000055970af9f014 in BackendRun (port=0x55970cb11a90) at
postmaster.c:4464
#23 0x000055970af9e8a0 in BackendStartup (port=0x55970cb11a90) at
postmaster.c:4192

Without the VACUUM in session 2, session 3 fails to connect due to an
authentication timeout, but with the VACUUM it hangs until the plpgsql
block in session 1 ends.
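
For reference, relid 2676 in the backtraces is pg_authid_rolname_index, and the
lock the DO block holds on it can be seen with a query along these lines, run
from a session that was already open (in this state a brand-new connection can
itself hang at startup):

SELECT pid, locktype, mode, granted
FROM pg_locks
WHERE relation = 'pg_authid_rolname_index'::regclass;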


On Fri, Mar 22, 2024 at 01:00:01PM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      18404
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16.2
> Operating system:   Ubuntu 22.04
> Description:        
> 
> The following script:
> createuser u
> 
> cat << 'EOF' | psql -U u &
> DO '
> BEGIN
>   PERFORM count(*) FROM pg_stat_activity;
>   RAISE NOTICE ''sleeping...'';
>   PERFORM pg_sleep(1800);
> END';
> EOF
> sleep 0.5
> 
> cat << EOF | psql &
> VACUUM;
> REINDEX SYSTEM;
> EOF
> sleep 0.5
> 
> cat << EOF | psql -U u
> SELECT 1
> EOF
> 
> ends with the last session stuck on startup:
> law      3318525 3318511  0 15:00 ?        00:00:00 postgres: u regression [local] DO
> law      3318530 3318511  0 15:00 ?        00:00:00 postgres: law regression [local] REINDEX waiting
> law      3318533 3318511  0 15:00 ?        00:00:00 postgres: u regression [local] startup waiting
> law      3318628 3318511  0 15:01 ?        00:00:00 postgres: autovacuum worker  waiting
> law      3318654 3318511  0 15:01 ?        00:00:00 postgres: autovacuum worker  waiting
> law      3318676 3318511  0 15:02 ?        00:00:00 postgres: autovacuum worker  waiting
> 
> The backtrace of the last session backend is:
[waiting for AccessShareLock on pg_authid_rolname_index]

> The backtrace of the REINDEXing backend is:
[waiting for AccessExclusiveLock on pg_authid_rolname_index]


As a partial workaround, you can set lock_timeout before REINDEX SYSTEM.
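
For example (the timeout value is arbitrary):

SET lock_timeout = '10s';
REINDEX SYSTEM;
RESET lock_timeout;

If REINDEX cannot obtain its AccessExclusiveLock within the timeout, it fails
with an error and its queued lock request is dropped, so sessions waiting
behind it (including backends stuck in startup) can proceed.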

Each backend is doing the right thing in isolation, given its current
objectives.  The first backend holds AccessShareLock on
pg_authid_rolname_index, because pg_stat_activity is a view that joins to
pg_authid.  We follow
https://en.wikipedia.org/wiki/Two-phase_locking#Strong_strict_two-phase_locking
and hold relation locks until end of transaction.  That could be the easiest
part to change, but it's still hard.  REINDEX acquires AccessExclusiveLock so
that the code for transitioning readers to the replacement index can stay
trivial.  Backend startup needs to read the catalogs; that is the least likely
to change.  I don't see a general way to keep the rest of the system
productive while an AccessExclusiveLock on a system relation is being acquired
or held.  I can imagine mostly one-off changes, like making pg_stat_activity
release its locks earlier.
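
The resulting wait chain (backend startup waits behind REINDEX, which waits
behind the DO block) can be inspected from an already-established session with
something like the query below.  It relies only on pg_locks and
pg_blocking_pids(); a fresh query on the pg_stat_activity view can end up
requesting the same pg_authid index locks and block here as well.

SELECT pid, locktype, relation::regclass AS relation, mode,
       pg_blocking_pids(pid) AS blocked_by
FROM pg_locks
WHERE NOT granted;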