Обсуждение: Stack-based tracking of per-node WAL/buffer usage
Hi,
Please find attached a patch series that introduces a new paradigm for how per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated distinct buffer hits [0].
Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in InstrStopNode we call the rather expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity happened within a given node cycle.
This proposal instead uses a stack, where each time we enter a node (InstrStartNode) we point a new global (pgInstrStack) to the current stack entry. Whilst we're in that node we increment buffer/WAL usage statistics to the stack entry. On exit (InstrStopNode) we restore the previous entry.
This change provides about a 10% performance benefit for EXPLAIN ANALYZE on paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):
CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);
master (124ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms
patched (109ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)
I have also prototyped a more ambitious approach that completely removes pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g. pg_stat_statements), but for now this patch set does not include that change, but instead keeps adding to these legacy globals as well.
Patches attached:
0001: Separate node instrumentation from other use of Instrumentation struct
Previously different places (e.g. query "total time") were repurposing the per-node Instrumentation struct. Instead, simplify the Instrumentation struct to only track time, WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information.
Previously different places (e.g. query "total time") were repurposing the per-node Instrumentation struct. Instead, simplify the Instrumentation struct to only track time, WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information.
0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros
0003: Introduce stack for tracking per-node WAL/buffer usage
Feedback/thoughts welcome!
CCing Andres since he had expressed interest in this off-list.
[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based EXPLAIN (BUFFERS DISTINCT): https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf
Thanks,
Lukas
Lukas Fittl
Вложения
Hi, On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote: > Please find attached a patch series that introduces a new paradigm for how > per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce > overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated > distinct buffer hits [0]. I like this for a third reason: To separate out buffer access statistics for the index and the table in index scans. Right now it's very hard to figure out if a query is slow because of the index lookups or finding the tuples in the table. > 0001: Separate node instrumentation from other use of Instrumentation struct > > Previously different places (e.g. query "total time") were repurposing > the per-node Instrumentation struct. Instead, simplify the Instrumentation > struct to only track time, WAL/buffer usage, and tuple counts. Similarly, > drop the use of InstrEndLoop outside of per-node instrumentation. Introduce > the NodeInstrumentation struct to carry forward the per-node > instrumentation information. It's mildly odd that the two types of instrumentation have a different definition of 'total' (one double, one instr_time). > 0003: Introduce stack for tracking per-node WAL/buffer usage > From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <lukas@fittl.com> > Date: Sun, 31 Aug 2025 16:37:05 -0700 > Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage > diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c > index b83ced9a57a..1c2268bc608 100644 > --- a/src/backend/executor/execMain.c > +++ b/src/backend/executor/execMain.c > @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc, > DestReceiver *dest; > bool sendTuples; > MemoryContext oldcontext; > + InstrStackResource *res; > > /* sanity checks */ > Assert(queryDesc != NULL); > @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc, > if (queryDesc->totaltime) > InstrStart(queryDesc->totaltime); > > + /* Start up per-query node level instrumentation */ > + res = InstrStartQuery(); > + > /* > * extract information from the query descriptor and the query feature. > */ > @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc, > if (sendTuples) > dest->rShutdown(dest); > > + /* Shut down per-query node level instrumentation */ > + InstrShutdownQuery(res); > + > if (queryDesc->totaltime) > InstrStop(queryDesc->totaltime, estate->es_processed); Why are we doing Instr{Start,Stop}Query when not using instrumentation? Resowner stuff ain't free, so I'd skip them when not necessary. > diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c > index d286471254b..7436f307994 100644 > --- a/src/backend/executor/execProcnode.c > +++ b/src/backend/executor/execProcnode.c > @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context) > > /* Stop the node if we started it above, reporting 0 tuples. */ > if (node->instrument && node->instrument->running) > + { > InstrStopNode(node->instrument, 0); > > + /* > + * Propagate WAL/buffer stats to the parent node on the > + * instrumentation stack (which is where InstrStopNode returned us > + * to). > + */ > + InstrNodeAddToCurrent(&node->instrument->stack); > + } > + > return false; > } Can we rely on this being reached? Note that ExecutePlan() calls ExecShutdownNode() conditionally: /* * If we know we won't need to back up, we can release resources at this * point. */ if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD)) ExecShutdownNode(planstate); > +static void > +ResOwnerReleaseInstrStack(Datum res) > +{ > + /* > + * XXX: Registered resources are *not* called in reverse order, i.e. we'll > + * get what was first registered first at shutdown. To avoid handling > + * that, we are resetting the stack here on abort (instead of recovering > + * to previous). > + */ > + pgInstrStack = NULL; > +} Hm, doesn't that mean we loose track of instrumentation if you e.g. do an EXPLAIN ANALYZE of a query that executes a function, which internally triggers an error and catches it? I wonder if the solution could be to walk the stack and search for the to-be-released element. Greetings, Andres Freund
Hi Andres,
Thanks for the review!
Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.
FWIW, pgWalUsage would also be nice to remove, but it has some interesting interactions with the cumulative statistics system and heap_page_prune_and_freeze, and seems less performance critical.
On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:
> 0001: Separate node instrumentation from other use of Instrumentation struct
...
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).
Yeah, agreed. I added in a new 0001 patch that changes this to instr_time consistently. I don't see a good reason to keep the transformation to seconds in the Instrumentation logic, since all in-tree callers convert it to milliseconds anyway.
> 0003: Introduce stack for tracking per-node WAL/buffer usage
Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.
Makes sense, I've adjusted that to be conditional (in the now renamed 0004 patch).
In the updated patch I've also decided to piggyback on QueryDesc totaltime as the "owning" Instrumentation here for the query's lifetime. It seems simpler that way and avoids having special purpose methods. To go along with that I've changed the general purpose Instrumentation struct to use stack-based instrumentation at the same time.
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
...
> + InstrNodeAddToCurrent(&node->instrument->stack);
...
Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:
You are of course correct, I didn't consider cursors correctly here.
It seems there isn't an existing executor node walk that could be repurposed, so I added a new one in ExecutorFinish ("ExecAccumNodeInstrumentation"). From my read of the code there are no use cases where we need aggregated instrumentation data before ExecutorFinish.
> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}
Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?
I wonder if the solution could be to walk the stack and search for the
to-be-released element.
Yes, good point, I did not consider that case. I've addressed this by only updating the current stack if its not already a parent of the element being released. We are also always adding the element's statistics to the (updated) active stack element at abort.
Thanks,
Lukas
--
Lukas Fittl
Вложения
- v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch
- v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch
- v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch
- v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch
- v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch
- v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch
- v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch