Discussion: Organize working memory under per-PlanState context
Right now, the work_mem limit is tracked and enforced ad hoc throughout the executor. Different nodes tally their total memory usage differently, may use different internal data structures (each of which can consume work_mem independently), decide when to spill based on different criteria, and so on.

Note: this is *not* an attempt to manage memory usage across an entire query. This is only about memory usage within a single executor node.

The attached proof-of-concept patch adds a MemoryContext field, ps_WorkMem, to PlanState; nodes that care about it initialize it with GetWorkMem() and switch to that context as necessary.

It doesn't do much yet, but it creates infrastructure that will be useful for subsequent patches to make the memory accounting and enforcement more consistent throughout the executor. In particular, it uses MemoryContextMemAllocated(), which includes fragmentation, chunk headers, and free space. In the current code, fragmentation is ignored in most places, so (for example) switching to the Bump allocator doesn't show the savings.

This isn't ready yet, but I'd appreciate some thoughts on the overall architectural direction.

Regards,
	Jeff Davis
Attachments
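A minimal sketch, not from the patch itself, of how an executor node might use the proposed per-PlanState context; GetWorkMem(), ps_WorkMem and ps_WorkMemLimit are the names described in this thread, while the function name and the spill reaction are illustrative assumptions:

#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/memutils.h"

/*
 * Illustrative only: route a node's allocations through the proposed
 * per-PlanState work-mem context and check the limit afterwards.
 */
static void
build_state_under_work_mem(PlanState *ps)
{
    MemoryContext oldcxt;

    /* All node-local allocations land under ps_WorkMem (lazily created). */
    oldcxt = MemoryContextSwitchTo(GetWorkMem(ps));

    /* ... build the hash table, sort state, etc. ... */

    MemoryContextSwitchTo(oldcxt);

    /*
     * Enforcement is based on MemoryContextMemAllocated(), i.e. blocks
     * obtained from the OS (including fragmentation, chunk headers and
     * free space), not on the sum of individual palloc'd chunks.
     */
    if (MemoryContextMemAllocated(ps->ps_WorkMem, true) > ps->ps_WorkMemLimit)
    {
        /* spill to disk, switch to a cheaper strategy, etc. */
    }
}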
On Aug 20, 2025, at 07:34, Jeff Davis <pgsql@j-davis.com> wrote:
<v1-0001-Use-per-PlanState-memory-context-for-work-mem.patch>
I understand this is not a final patch, so I will focus on the design:
1. This patch adds ps_WorkMem to PlanState and makes it the parent of the other per-node memory contexts, so all memory usage within a node is grouped and measurable, which is good.
I am also thinking that this change may create an opportunity to free per-node memory earlier, once a node finishes. Before this patch, the parent/grandparent of those node execution memory contexts is the portal context, so the memory is only released once the query is done. With this change, maybe we can free per-node memory earlier, which could benefit large, complex query executions.
I know some memory must be retained until the entire query finishes, but per-node memory such as a hash table might be destroyed immediately after the node finishes.
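As a rough illustration of this idea (a hypothetical helper, assuming ps_WorkMem is the parent of all per-node contexts): once a node is known to be finished for good, resetting its context would return that memory well before the query ends. Whether that is safe to do is discussed later in the thread.

#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/memutils.h"

/* Hypothetical helper: release a finished node's working memory early. */
static void
ReleaseNodeWorkMem(PlanState *ps)
{
    /*
     * Resetting the per-PlanState context also deletes its children, so a
     * hash table or other state parented under ps_WorkMem would be freed
     * here, well before the portal/query contexts go away.
     */
    if (ps->ps_WorkMem != NULL)
        MemoryContextReset(ps->ps_WorkMem);
}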
2. For the new function ValidateWorkMemParent(), the logic is that if a parent is given, the parent must be a child of the work-mem context; otherwise the node's ps_WorkMem is used as the parent. Can you add a comment to explain why it is designed like this, and when a parent should be explicitly specified?
3. The new function WorkMemClamBlockSize() changes the existing logic, because it always uses Min(maxBlockSize, size). But in nodeAgg.c, the old logic is:
    /*
     * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem,
     * to avoid large jumps in memory usage.
     */
    maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);

    /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
    maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

    /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
    maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

    aggstate->hash_tablecxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
                                                "HashAgg table context",
                                                ALLOCSET_DEFAULT_MINSIZE,
                                                ALLOCSET_DEFAULT_INITSIZE,
                                                maxBlockSize);
The new function doesn’t cover the “no smaller than ALLOCSET_DEFAULT_INITSIZE” part. Is this logic change intended?
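For comparison, a sketch of a clamp that would preserve both bounds from the old nodeAgg.c code; the helper's actual name and signature in the patch may differ, so this is only illustrative:

#include "postgres.h"

#include "port/pg_bitutils.h"
#include "utils/memutils.h"

/* Illustrative clamp preserving the old nodeAgg.c bounds. */
static Size
clamp_workmem_block_size(int work_mem_kb)
{
    Size        maxBlockSize;

    /* Scale the max block size with work_mem, to avoid large jumps. */
    maxBlockSize = pg_prevpower2_size_t((Size) work_mem_kb * 1024 / 16);

    /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE ... */
    maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);
    /* ... and no smaller than ALLOCSET_DEFAULT_INITSIZE. */
    maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

    return maxBlockSize;
}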
4. You add a MemoryContextSwitchTo(GetWorkMem(ps)) wrapper around all tuplesort_begin_heap() calls:
    oldcontext = MemoryContextSwitchTo(GetWorkMem(&aggstate->ss.ps));
    aggstate->sort_out = tuplesort_begin_heap(tupDesc,
                                              sortnode->numCols,
                                              sortnode->sortColIdx,
                                              sortnode->sortOperators,
                                              sortnode->collations,
                                              sortnode->nullsFirst,
                                              work_mem,
                                              NULL, TUPLESORT_NONE);
    MemoryContextSwitchTo(oldcontext);
Should we just pass a memory context into tuplesort_begin_heap()? Also, for the “work_mem” argument, should we change it to GetWorkMemLimit()?
5. For the new function CheckWorkMemLimit():
bool
CheckWorkMemLimit(PlanState *ps)
{
    size_t      allocated = MemoryContextMemAllocated(GetWorkMem(ps), true);

    return allocated <= ps->ps_WorkMemLimit;
}
It calls GetWorkMem(), and GetWorkMem() creates the work-mem context if it is NULL. That means that once a caller invokes CheckWorkMemLimit() on a node that has no work-mem context yet, one will be created automatically for that node. Will that lead to unneeded context creation? If the context is NULL, it is not used at all, and therefore the limit cannot be exceeded.
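One possible way to address this, sketched as a variant of the quoted function (field names taken from the patch): treat a node that has no work-mem context yet as trivially under the limit, instead of creating a context as a side effect of the check.

#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/memutils.h"

bool
CheckWorkMemLimit(PlanState *ps)
{
    /* No context yet means nothing has been allocated under it. */
    if (ps->ps_WorkMem == NULL)
        return true;

    return MemoryContextMemAllocated(ps->ps_WorkMem, true) <= ps->ps_WorkMemLimit;
}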
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 20/8/2025 07:38, Chao Li wrote:
> I know some memory must be retained until the entire query finishes, but
> per-node memory such as a hash table might be destroyed immediately after
> the node finishes.

I'm not sure I understand your reasoning clearly. How do you know that the current subtree will not be rescanned with the same parameter set? Building a hash table repeatedly may be pretty costly, no?

-- 
regards, Andrei Lepikhov
On 20/8/2025 01:34, Jeff Davis wrote:
> It doesn't do much yet, but it creates infrastructure that will be
> useful for subsequent patches to make the memory accounting and
> enforcement more consistent throughout the executor.

Does this mean that you are considering flexible memory allocation during execution based on a specific memory quota? If so, I believe this should be taken into account during the optimisation stage. If the planner calculates the cost of the nodes using only a single work_mem value, it could lead to suboptimal execution. For example, this might result in too many intermediate results being written to disk, which in turn can reduce correlation between the estimated plan cost and the actual execution time.

-- 
regards, Andrei Lepikhov
On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
> I'm not sure I understand your reasoning clearly. How do you know that
> the current subtree will not be rescanned with the same parameter set?
> Building a hash table repeatedly may be pretty costly, no?

We can check the eflags for EXEC_FLAG_REWIND. That might not be the only condition we need to check, but we should know at plan time whether a subtree might be executed more than once.

Regards,
	Jeff Davis
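As a sketch of that kind of check (the flag combination mirrors how nodes such as Sort decide on random access today and is only illustrative; note the caveat in the next message about what EXEC_FLAG_REWIND actually promises):

#include "postgres.h"

#include "executor/executor.h"

/*
 * Illustrative predicate: could this node drop its working memory as soon
 * as it is exhausted?  EXEC_FLAG_REWIND / BACKWARD / MARK being clear is
 * necessary but, per the follow-up below, not by itself sufficient.
 */
static bool
node_may_release_early(int eflags)
{
    return (eflags & (EXEC_FLAG_REWIND |
                      EXEC_FLAG_BACKWARD |
                      EXEC_FLAG_MARK)) == 0;
}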
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
>> Building a hash table repeatedly may be pretty costly, no?

> We can check the eflags for EXEC_FLAG_REWIND. That might not be the
> only condition we need to check, but we should know at plan time
> whether a subtree might be executed more than once.

Side note: EXEC_FLAG_REWIND is defined as "you should be prepared to handle REWIND efficiently". Not as "if this is off, you are guaranteed not to see a REWIND". I'm not sure that this affects what Jeff wants to do, but let's not be fuzzy about what information is available at execution time.

			regards, tom lane
On 20/8/2025 19:00, Jeff Davis wrote:
> On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
>> I'm not sure I understand your reasoning clearly. How do you know that
>> the current subtree will not be rescanned with the same parameter set?
>> Building a hash table repeatedly may be pretty costly, no?
>
> We can check the eflags for EXEC_FLAG_REWIND. That might not be the
> only condition we need to check, but we should know at plan time
> whether a subtree might be executed more than once.

Postgres builds the plan tree from the bottom up, no? When estimating costs and choosing a specific operator at one level of the query tree, the planner never knows what will come next on the upper levels of that tree. To work such problems out, in my 'optimiser support' extensions I use one extra 'Top-Bottom' pass, reconsidering decisions that have been made, based on information grabbed from the upper levels of the almost-finished plan. Does your project move in a similar direction?

[1] https://github.com/danolivo/conf/blob/main/2025-MiddleOut/MiddleOut.pdf

-- 
regards, Andrei Lepikhov