Thread: Segfault due to NULL ParamExecData value
Hi,
We had multiple segfaults on PG 14.17. All coredumps showed the following backtrace:
#0: postgres`toast_raw_datum_size(value=0) at detoast.c:550:6
#1: postgres`textne(fcinfo=0x0000c823e4759338) at varlena.c:1848:10
#2: postgres`ExecInterpExpr(state=0x0000c823e4758a48, econtext=0x0000c823e4757b88, isnull=<unavailable>) at execExprInterp.c:749:8
#3: postgres`ExecScan at executor.h:342:13
#4: postgres`ExecScan at executor.h:411:8
#5: postgres`ExecScan(node=0x0000c823e4757978, accessMtd=(postgres`FunctionNext at nodeFunctionscan.c:61:1), recheckMtd=(postgres`FunctionRecheck)) at execScan.c:226:23
#6: postgres`ExecSubPlan [inlined] ExecProcNode(node=0x0000c823e4757978) at executor.h:260:9
#7: postgres`ExecSubPlan at nodeSubplan.c:302:14
#8: postgres`ExecSubPlan(node=0x0000c823e47814e8, econtext=0x0000c823e475a658, isNull=0x0000c823e47814c0) at nodeSubplan.c:89:12
#9: postgres`ExecInterpExpr at execExprInterp.c:3954:18
#10: postgres`ExecInterpExpr(state=0x0000c823e47813c0, econtext=0x0000c823e475a658, isnull=<unavailable>) at execExprInterp.c:1576:4
#11: postgres`ExecNestLoop [inlined] ExecEvalExprSwitchContext(isNull=0x0000ffffebb9d637, econtext=0x0000c823e475a658, state=<unavailable>) at executor.h:342:13
#12: postgres`ExecNestLoop [inlined] ExecProject(projInfo=<unavailable>) at executor.h:376:9
#13: postgres`ExecNestLoop(pstate=<unavailable>) at nodeNestloop.c:241:12
#14: postgres`EvalPlanQual at executor.h:260:9
#15: postgres`ExecUpdate(mtstate=0x0000c823e4651a98, resultRelInfo=0x0000c823e4651ca8, tupleid=0x0000ffffebb9d858, oldtuple=0x0000000000000000, slot=<unavailable>, planSlot=0x0000c823e4661800, epqstate=0x0000c823e4651b80, estate=0x0000c823e46ace18, canSetTag=<unavailable>) at nodeModifyTable.c:2007:18
#16: postgres`ExecModifyTable(pstate=0x0000c823e4651a98) at nodeModifyTable.c:2760:12
#17: postgres`standard_ExecutorRun [inlined] ExecProcNode(node=0x0000c823e4651a98) at executor.h:260:9
#18: postgres`standard_ExecutorRun at execMain.c:1555:10
#19: postgres`standard_ExecutorRun(queryDesc=0x0000c823e45d51a0, direction=<unavailable>, count=0, execute_once=<unavailable>) at execMain.c:360:3
textne's second argument (arg2) is a zero Datum, i.e. a null pointer, which toast_raw_datum_size then dereferences, causing the segfault. The segfaults all happened with the following query:
WITH RECURSIVE
params AS (SELECT $1::text AS schema, $2::text AS name, $3::text AS version),
seed AS (SELECT p.schema || E'\t^\t' || p.name AS node FROM params p),
reach AS (SELECT v."schema" || E'\t^\t' || v."name" AS node
FROM definitions v WHERE (SELECT node FROM seed) = ANY(v.used_tables)
UNION
SELECT v."schema" || E'\t^\t' || v."name" AS node
FROM definitions v, reach r WHERE r.node = ANY(v.used_tables)),
to_update AS (SELECT DISTINCT split_part(r.node, E'\t^\t', 1) AS "schema",
split_part(r.node, E'\t^\t', 2) AS "name" FROM reach r),
kv AS (SELECT (p.schema || '.' || p.name) AS key_str,
(p.schema || '.' || p.name || ':' || p.version) AS new_entry
FROM params p)
UPDATE definitions v SET dependencies = array_cat(
COALESCE(ARRAY(SELECT e
FROM unnest(COALESCE(v.dependencies, ARRAY[]::text[])) AS e
WHERE split_part(e, ':', 1) <> (SELECT key_str FROM kv)
), ARRAY[]::text[]
), ARRAY[(SELECT new_entry FROM kv)]
)
FROM to_update u
WHERE v."schema" = u."schema" AND v."name" = u."name";
Which has the following plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on definitions v (cost=63318199175.74..63318200084.69 rows=0 width=0)
CTE params
-> Result (cost=0.00..0.01 rows=1 width=96)
CTE reach
-> Recursive Union (cost=0.03..61108645382.88 rows=73651793079 width=32)
-> Seq Scan on definitions v_1 (cost=0.03..370199.45 rows=27139 width=32)
Filter: ($2 = ANY (used_tables))
InitPlan 2 (returns $2)
-> CTE Scan on params p (cost=0.00..0.03 rows=1 width=32)
-> Nested Loop (cost=0.00..5963523932.18 rows=7365176594 width=32)
Join Filter: (r_1.node = ANY (v_2.used_tables))
-> Seq Scan on definitions v_2 (cost=0.00..363124.99 rows=555099 width=315)
-> WorkTable Scan on reach r_1 (cost=0.00..5427.80 rows=271390 width=32)
CTE kv
-> CTE Scan on params p_1 (cost=0.00..0.04 rows=1 width=64)
InitPlan 7 (returns $7)
-> CTE Scan on kv kv_1 (cost=0.00..0.02 rows=1 width=32)
-> Nested Loop (cost=2209553792.80..2209554701.75 rows=100 width=126)
-> Subquery Scan on u (cost=2209553792.37..2209553797.37 rows=200 width=152)
-> HashAggregate (cost=2209553792.37..2209553795.37 rows=200 width=64)
Group Key: split_part(r.node, ' ^ '::text, 1), split_part(r.node, ' ^ '::text, 2)
-> CTE Scan on reach r (cost=0.00..1841294826.97 rows=73651793079 width=64)
-> Index Scan using definitions_schema_name_idx on definitions v (cost=0.42..4.40 rows=3 width=68)
Index Cond: ((schema = u.schema) AND (name = u.name))
SubPlan 6
-> Function Scan on unnest e (cost=0.02..0.17 rows=9 width=32)
Filter: (split_part(e, ':'::text, 1) <> $5)
InitPlan 5 (returns $5)
-> CTE Scan on kv (cost=0.00..0.02 rows=1 width=32)
Unfortunately, I wasn't able to reproduce the segfault, so the only information available is what's in the coredumps.
The failure happens when the textne call in 'WHERE split_part(e, ':', 1) <> (SELECT key_str FROM kv)' is evaluated. Looking at the ExprState, there are 7 steps with the following opcodes:
0: SCAN_FETCHSOME
1: SCAN_VAR
2: FUNC_EXPR_STRICT
3: PARAM_EXEC
4: FUNC_EXPR_STRICT
5: EEOP_QUAL
6: EEOP_DONE
Step 3 fetches the output of InitPlan 5 (running the subplan first if it is still pending), which provides arg2 for textne (step 4). Looking at step 3's param:
p state->steps[3].d.param
((unnamed struct)) $219 = (paramid = 5, paramtype = 25)
Then, looking at the matching ParamExecData:
p econtext->ecxt_param_exec_vals[5]
(ParamExecData) $220 = (execPlan = 0x0000000000000000, value = 0, isnull = false)
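To make the failure mode concrete, here is a minimal standalone model (my reading of the EEOP_PARAM_EXEC and EEOP_FUNC_EXPR_STRICT steps, not the actual backend code) of why this combination is fatal: the param step copies value/isnull from the ParamExecData as-is, and the strict-function step only checks the isnull flags, so a zero Datum with isnull = false is handed to textne, which passes it to toast_raw_datum_size:

/*
 * Minimal standalone model (not PostgreSQL source) of how an
 * EEOP_PARAM_EXEC step feeds a strict function such as textne.
 * Names mirror the backend for readability; the real logic lives in
 * ExecInterpExpr() in execExprInterp.c.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef uintptr_t Datum;

typedef struct ParamExecData
{
    void   *execPlan;   /* init plan to run on first use, NULL once done */
    Datum   value;
    bool    isnull;
} ParamExecData;

/* EEOP_PARAM_EXEC: run the init plan if still pending, then copy the result */
static void
eval_param_exec(ParamExecData *prm, Datum *resvalue, bool *resnull)
{
    if (prm->execPlan != NULL)
    {
        /* In the backend, ExecSetParamPlan() fills value/isnull here and
         * clears execPlan.  In the coredump execPlan is already NULL, so
         * this branch is not taken. */
    }
    *resvalue = prm->value;
    *resnull = prm->isnull;
}

int
main(void)
{
    /* State observed in the coredump: execPlan already cleared, but the
     * value is a zero Datum while isnull claims false. */
    ParamExecData prm = { .execPlan = NULL, .value = 0, .isnull = false };

    Datum arg2;
    bool  arg2null;

    eval_param_exec(&prm, &arg2, &arg2null);

    /* EEOP_FUNC_EXPR_STRICT only skips the call when an argument's isnull
     * flag is true; a zero Datum with isnull = false goes straight in. */
    if (arg2null)
        printf("strict function skipped, result is NULL\n");
    else
        printf("textne would be called with Datum %p: toast_raw_datum_size()"
               " dereferences it and crashes\n", (void *) arg2);

    return 0;
}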
When looking at the matching WAL records, we also see at least two updates before the segfault is triggered:
rmgr: Heap len (rec/tot): 59/ 2139, tx: 2549003939, lsn: B4D/21956518, prev B4D/219564E8, desc: LOCK off 1: xid 2549003939: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/16386/16899 blk 160730 FPW
rmgr: Heap len (rec/tot): 2055/ 2055, tx: 2549003939, lsn: B4D/21956D78, prev B4D/21956518, desc: UPDATE off 1 xmax 2549003939 flags 0x11 KEYS_UPDATED ; new off 3 xmax 0, blkref #0: rel 1663/16386/16899 blk 160730
rmgr: Btree len (rec/tot): 55/ 1971, tx: 2549003939, lsn: B4D/21957698, prev B4D/21957668, desc: INSERT_LEAF off 122, blkref #0: rel 1663/16386/16905 blk 3261 FPW
rmgr: Btree len (rec/tot): 104/ 104, tx: 2549003939, lsn: B4D/21957E50, prev B4D/21957698, desc: INSERT_LEAF off 18, blkref #0: rel 1663/16386/16907 blk 1517
rmgr: Btree len (rec/tot): 104/ 104, tx: 2549003939, lsn: B4D/21957EB8, prev B4D/21957E50, desc: INSERT_LEAF off 90, blkref #0: rel 1663/16386/53784856 blk 1020
rmgr: Btree len (rec/tot): 55/ 1156, tx: 2549003939, lsn: B4D/21957F20, prev B4D/21957EB8, desc: INSERT_LEAF off 11, blkref #0: rel 1663/16386/57258051 blk 7015 FPW
rmgr: Btree len (rec/tot): 55/ 209, tx: 2549003939, lsn: B4D/219583C0, prev B4D/21957F20, desc: INSERT_LEAF off 4, blkref #0: rel 1663/16386/57459940 blk 1921 FPW
rmgr: Gin len (rec/tot): 566/ 566, tx: 2549003939, lsn: B4D/21958498, prev B4D/219583C0, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16386/57459942 blk 0, blkref #1: rel 1663/16386/57459942 blk 808
rmgr: Heap len (rec/tot): 54/ 54, tx: 2549003939, lsn: B4D/25A4C7F0, prev B4D/25A4C7B8, desc: LOCK off 9: xid 2549003939: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16386/16899 blk 40
rmgr: Heap len (rec/tot): 1827/ 1827, tx: 2549003939, lsn: B4D/25A4CAC8, prev B4D/25A4CA88, desc: HOT_UPDATE off 9 xmax 2549003939 flags 0x10 ; new off 10 xmax 2549003939, blkref #0: rel 1663/16386/16899 blk 40
rmgr: Heap2 len (rec/tot): 56/ 56, tx: 2549003939, lsn: B4D/25A4D1F0, prev B4D/25A4CAC8, desc: PRUNE latestRemovedXid 0 nredirected 0 ndead 0, blkref #0: rel 1663/16386/16899 blk 100
rmgr: Heap len (rec/tot): 54/ 54, tx: 2549003939, lsn: B4D/25A4D228, prev B4D/25A4D1F0, desc: LOCK off 1: xid 2549003939: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16386/16899 blk 19749
On the log side, we see row-lock contention happening before the segfault:
2025-11-04T17:02:56.507Z,process 289871 still waiting for ShareLock on transaction 2549003939 after 1000.053 ms
2025-11-04T17:02:56.507Z,Process holding the lock: 292365. Wait queue: 289871.
2025-11-04T17:02:58.938Z,process 292365 still waiting for ShareLock on transaction 2549003931 after 1000.052 ms
2025-11-04T17:02:58.938Z,Process holding the lock: 292716. Wait queue: 285801, 292365.
2025-11-04T17:02:58.938Z,while updating tuple (40,8) in relation "definitions"
2025-11-04T17:03:00.041Z,process 292365 acquired ShareLock on transaction 2549003931 after 2102.985 ms
2025-11-04T17:03:00.041Z,while updating tuple (40,8) in relation "definitions"
2025-11-04T17:03:00.041Z,process 283964 acquired ExclusiveLock on tuple (40,8) of relation 16899 of database 16386 after 1997.621 ms
2025-11-04T17:03:00.201Z,server process (PID 292365) was terminated by signal 11: Segmentation fault
So it looks like the param for InitPlan 5 was evaluated at some point (since execPlan is NULL) and its value was probably used during the first two updates. But for the third update, the ParamExecData's value was a zero Datum while isnull was still false, leading to the segfault.
All coredumps (or rather the matching WAL records) show a similar pattern of two updates before the segfault.
I haven't been able to reproduce the segfault, so I couldn't pinpoint what could have reset the ParamExecData's value to zero.
Regards,
Anthonin Bonnefoy
Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> writes:
> So it looks like the ParamExec for the InitPlan 5 was correctly executed
> (since execPlan is null) and the value was probably used during the first
> two updates. But for the third update, the ParamExecData's value was null
> leading to the segfault.
> All coredumps (or rather WAL records) show a similar pattern of 2 updates
> before segfaults.
> I haven't been able to reproduce the segfault so I wasn't able to pinpoint
> what could have set ParamExecData's value to null.
I'm not volunteering to look into this without a reproducer.
However, seeing that EvalPlanQual is in the stack trace,
my gut feeling is that the EPQ mechanism is somehow mis-managing
output Params for InitPlans. I vaguely recall some definitional
issues around whether it'd be okay to pass down already-computed
InitPlan results into the EPQ sub-evaluation, or whether we should
force the sub-evaluation to do those afresh. It was a while back
and I don't remember what was decided.
Don't suppose you can try to reproduce this on something newer
than 14.17?
regards, tom lane
On Thu, Dec 4, 2025 at 4:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not volunteering to look into this without a reproducer.
However, seeing that EvalPlanQual is in the stack trace,
my gut feeling is that the EPQ mechanism is somehow mis-managing
output Params for InitPlans. I vaguely recall some definitional
issues around whether it'd be okay to pass down already-computed
InitPlan results into the EPQ sub-evaluation, or whether we should
force the sub-evaluation to do those afresh. It was awhile back
and I don't remember what was decided.
That sounds like an interesting lead. The impacted cluster definitely had a lot of long transactions updating the same rows, with occasional deadlocks.
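For what it's worth, my recollection (from memory, to be verified against the 14.17 sources) is that when the EPQ recheck estate is built, the exec-param array is copied from the parent value/isnull-wise but without the execPlan link. A small standalone model of that copy, assuming the parent slot were still in its reset, not-yet-evaluated state at that point, produces exactly the state we see in the coredump (execPlan = NULL, value = 0, isnull = false):

/*
 * Standalone model (not PostgreSQL source) of the exec-param copy into a
 * recheck estate as I remember it: value and isnull are copied from the
 * parent, but the execPlan link is not.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef uintptr_t Datum;

typedef struct ParamExecData
{
    void   *execPlan;
    Datum   value;
    bool    isnull;
} ParamExecData;

/* Model of the copy into the recheck estate's param array. */
static void
copy_exec_params(ParamExecData *recheck, const ParamExecData *parent, int n)
{
    for (int i = 0; i < n; i++)
    {
        /* copy value and isnull, but not the execPlan link */
        recheck[i].value = parent[i].value;
        recheck[i].isnull = parent[i].isnull;
        recheck[i].execPlan = NULL;
    }
}

int
main(void)
{
    /* Hypothetical parent slot for the InitPlan output, still in its reset
     * state: execPlan set, value/isnull left at their zeroed defaults. */
    ParamExecData parent[1] = {{ .execPlan = (void *) 0x1,
                                 .value = 0, .isnull = false }};
    ParamExecData recheck[1];

    copy_exec_params(recheck, parent, 1);

    printf("recheck param: execPlan=%p value=%lu isnull=%d\n",
           recheck[0].execPlan, (unsigned long) recheck[0].value,
           (int) recheck[0].isnull);
    /* Nothing on the recheck side will re-run the InitPlan here, and a
     * strict function downstream receives a zero Datum. */
    return 0;
}

I don't know yet whether the parent slot can actually be in that state at that point in 14.17, but it would explain the pattern we observed.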
Don't suppose you can try to reproduce this on something newer
than 14.17?
That would be hard. On the production cluster, we've stopped the segfaults by rewriting the query (faster execution time, so fewer rechecks going through EPQ?), and I don't have much leeway to experiment there.
I'm currently working on a backup of the cluster, trying to replay the same queries on 14.17 since that's the version where the issue happened. If I manage to get a reproducer, I will check newer versions. I will focus on triggering EPQ since that looks like a good lead; a rough sketch of the kind of two-session driver I have in mind is below.
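The connection string, the placeholder schema/name values and the simplified UPDATE in this sketch are just stand-ins (in practice session B would run the real production UPDATE); the only requirement is that session B blocks on a row that session A has already updated, so that B goes through EvalPlanQual once A commits:

/*
 * Rough two-session driver to force an UPDATE through an EvalPlanQual
 * recheck.  Build with: cc epq_driver.c -I$(pg_config --includedir) -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

static PGconn *
connect_or_die(const char *conninfo)
{
    PGconn *conn = PQconnectdb(conninfo);

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    return conn;
}

static void
run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "\"%s\" failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int
main(void)
{
    PGconn   *sess_a = connect_or_die("dbname=repro");
    PGconn   *sess_b = connect_or_die("dbname=repro");
    PGresult *res;

    /* Session A: update the target row and keep the transaction open. */
    run(sess_a, "BEGIN");
    run(sess_a, "UPDATE definitions SET dependencies = dependencies "
                "WHERE \"schema\" = 'public' AND \"name\" = 'victim'");

    /* Session B: send the conflicting UPDATE asynchronously (in the real
     * test this would be the production UPDATE); it blocks on the row
     * lock held by A. */
    PQsendQuery(sess_b, "UPDATE definitions SET dependencies = dependencies "
                        "WHERE \"schema\" = 'public' AND \"name\" = 'victim'");

    sleep(1);               /* give B time to reach the lock wait */
    run(sess_a, "COMMIT");  /* releases the lock; B re-evaluates via EPQ */

    while ((res = PQgetResult(sess_b)) != NULL)
        PQclear(res);

    PQfinish(sess_a);
    PQfinish(sess_b);
    return 0;
}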
Regards,
Anthonin Bonnefoy