Discussion: Organize working memory under per-PlanState context


Organize working memory under per-PlanState context

From:
Jeff Davis
Date:
Right now, the work_mem limit is tracked and enforced ad-hoc throughout
the executor. Different nodes tally the total memory usage differently,
may use different internal data structures (each of which can consume
work_mem independently), and decide when to spill based on different
criteria, etc.

Note: this is *not* an attempt to manage the memory usage across an
entire query. This is just about memory usage within a single executor
node.

The attached proof-of-concept patch adds a MemoryContext field
ps_WorkMem to PlanState, and nodes that care about it initialize it
with GetWorkMem(), and switch to that context as necessary.
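Roughly, the shape is as sketched below. This is only an illustration, not
the exact code in the patch; the allocator parameters are placeholders:

/* Sketch only; the patch's actual GetWorkMem() may differ. */
static MemoryContext
GetWorkMem(PlanState *ps)
{
    if (ps->ps_WorkMem == NULL)
        ps->ps_WorkMem = AllocSetContextCreate(ps->state->es_query_cxt,
                                               "per-node work_mem",
                                               ALLOCSET_DEFAULT_SIZES);
    return ps->ps_WorkMem;
}

/*
 * Typical caller pattern in a node that wants its allocations counted:
 *
 *     oldcontext = MemoryContextSwitchTo(GetWorkMem(&aggstate->ss.ps));
 *     ... build hash table, tuplesort state, etc. ...
 *     MemoryContextSwitchTo(oldcontext);
 */

The context is created lazily, so nodes that never allocate working memory
don't pay for an empty context.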

It doesn't do much yet, but it creates infrastructure that will be
useful for subsequent patches to make the memory accounting and
enforcement more consistent throughout the executor.

In particular, it uses MemoryContextMemAllocated(), which includes
fragmentation, chunk headers, and free space. In the current code,
fragmentation is ignored in most places, so (for example) switching to the
Bump allocator doesn't show the savings.

This isn't ready yet, but I'd appreciate some thoughts on the overall
architectural direction.

Regards,
    Jeff Davis




Attachments:
v1-0001-Use-per-PlanState-memory-context-for-work-mem.patch

Re: Organize working memory under per-PlanState context

From:
Chao Li
Date:


On Aug 20, 2025, at 07:34, Jeff Davis <pgsql@j-davis.com> wrote:


<v1-0001-Use-per-PlanState-memory-context-for-work-mem.patch>

I understand this is not a final patch, so I will focus on the design:

1. This patch adds ps_WorkMem to PlanState and makes it the parent of the other per-node memory contexts, so all memory usage within a node is grouped and measurable, which is good.

I am thinking that this change may open up a chance to free that per-node memory earlier, once a node finishes. Before this patch, those node execution memory contexts' parent or grandparent is the portal context, so the memory is only released once the query is done. Now, with this change, maybe we can free per-node memory earlier, which may benefit large, complex query executions.

I know some memory must be retained until the entire query finishes. But per-node memory, such as a hash table, might be destroyed immediately after a node finishes.
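To make that concrete, here is a rough sketch of an early-release helper
(hypothetical name, not part of the patch; the conditions under which it is
safe to call would of course need more care than this):

/* Hypothetical sketch; not part of the patch. */
static void
ReleaseNodeWorkMem(PlanState *ps)
{
    if (ps->ps_WorkMem != NULL)
    {
        /* frees the hash table and anything else parked under the node */
        MemoryContextDelete(ps->ps_WorkMem);
        ps->ps_WorkMem = NULL;
    }
}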

2. For the new function ValidateWorkMemParent(), the logic is: if a parent is given, then the parent must be a child of the work-mem context; otherwise, the node's ps_WorkMem is used as the parent. Can you add a comment explaining why it is designed this way, and when a parent should be explicitly specified?

3. The new function WorkMemClamBlockSize() changes the existing logic, because it always uses Min(maxBlockSize, size). But in nodeAgg.c, the old logic is:

  /*
   * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem,
   * to avoid large jumps in memory usage.
   */
  maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);

  /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
  maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

  /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
  maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

  aggstate->hash_tablecxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
                        "HashAgg table context",
                        ALLOCSET_DEFAULT_MINSIZE,
                        ALLOCSET_DEFAULT_INITSIZE,
                        maxBlockSize);

The new function doesn't cover the "no smaller than ALLOCSET_DEFAULT_INITSIZE" part. Is this logic change intended?
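For comparison, this is roughly the old nodeAgg.c behaviour written as a
helper (hypothetical name, for illustration only), keeping both the upper
and the lower clamp:

/* Sketch of the pre-patch clamping behaviour. */
static Size
ClampHashTableMaxBlockSize(int workMemKB)
{
    Size    maxBlockSize;

    /* use smaller sizings for smaller work_mem, to avoid large jumps */
    maxBlockSize = pg_prevpower2_size_t((Size) workMemKB * 1024 / 16);

    /* no bigger than ALLOCSET_DEFAULT_MAXSIZE ... */
    maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

    /* ... and no smaller than ALLOCSET_DEFAULT_INITSIZE */
    maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

    return maxBlockSize;
}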

4. You add a MemoryContextSwitchTo(GetWorkMem(ps)) wrapper around every tuplesort_begin_heap() call:

    oldcontext = MemoryContextSwitchTo(GetWorkMem(&aggstate->ss.ps));

    aggstate->sort_out = tuplesort_begin_heap(tupDesc,
                          sortnode->numCols,
                          sortnode->sortColIdx,
                          sortnode->sortOperators,
                          sortnode->collations,
                          sortnode->nullsFirst,
                          work_mem,
                          NULL, TUPLESORT_NONE);

    MemoryContextSwitchTo(oldcontext);

Should we just pass a memory context into tuplesort_begin_heap() instead? Also, for the "work_mem" argument, should we change it to GetWorkMemLimit()?
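For illustration only, a hypothetical wrapper showing the shape of the
suggestion; tuplesort_begin_heap() itself has no context argument today:

/* Hypothetical helper, not an existing or proposed API. */
static Tuplesortstate *
tuplesort_begin_heap_in_context(MemoryContext cxt,
                                TupleDesc tupDesc,
                                int nkeys, AttrNumber *attNums,
                                Oid *sortOperators, Oid *sortCollations,
                                bool *nullsFirstFlags,
                                int workMem, SortCoordinate coordinate,
                                int sortopt)
{
    MemoryContext   oldcontext = MemoryContextSwitchTo(cxt);
    Tuplesortstate *state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                                 sortOperators, sortCollations,
                                                 nullsFirstFlags, workMem,
                                                 coordinate, sortopt);

    MemoryContextSwitchTo(oldcontext);
    return state;
}

Whether the context (or only the limit) should be pushed further down into
tuplesort itself is a separate question.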

5. For the new function CheckWorkMemLimit():

bool
CheckWorkMemLimit(PlanState *ps)
{
  size_t allocated = MemoryContextMemAllocated(GetWorkMem(ps), true);
  return allocated <= ps->ps_WorkMemLimit;
}

It calls GetWorkMem(), and GetWorkMem() will create the work-mem context if it is NULL. That means that when a caller calls CheckWorkMemLimit() on a node that has no work-mem context yet, one gets created automatically. Will that lead to unneeded context creation? If ps_WorkMem is NULL, the node isn't using any working memory, so it cannot be exceeding the limit.
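Something like the following sketch would avoid that side effect while
keeping the same semantics otherwise:

/* Sketch of the suggested behaviour: don't create the context just to check. */
bool
CheckWorkMemLimit(PlanState *ps)
{
    size_t  allocated;

    if (ps->ps_WorkMem == NULL)
        return true;            /* nothing allocated yet, cannot exceed limit */

    allocated = MemoryContextMemAllocated(ps->ps_WorkMem, true);
    return allocated <= ps->ps_WorkMemLimit;
}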



--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Organize working memory under per-PlanState context

From:
Andrei Lepikhov
Date:
On 20/8/2025 07:38, Chao Li wrote:
> I know some memory must be retained until the entire query finishes. But 
> those per-node memories, such as hash table, might be destroyed 
> immediately after a node finishes.
I'm not sure I understand your reasoning clearly. How do you know that 
the current subtree will not be rescanned with the same parameter set? 
Building a hash table repeatedly may be pretty costly, no?

-- 
regards, Andrei Lepikhov



Re: Organize working memory under per-PlanState context

From:
Andrei Lepikhov
Date:
On 20/8/2025 01:34, Jeff Davis wrote:
> It doesn't do much yet, but it creates infrastructure that will be
> useful for subsequent patches to make the memory accounting and
> enforcement more consistent throughout the executor.

Does this mean that you are considering flexible memory allocation
during execution based on a specific memory quota? If so, I believe this
should be taken into account during the optimisation stage. If the 
planner calculates the cost of the nodes using only a single work_mem 
value, it could lead to suboptimal execution. For example, this might 
result in too many intermediate results being written to disk, which in 
turn can reduce correlation between the estimated plan cost and the 
actual execution time.

-- 
regards, Andrei Lepikhov



Re: Organize working memory under per-PlanState context

From:
Jeff Davis
Date:
On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
> I'm not sure I understand your reasoning clearly. How do you know
> that
> the current subtree will not be rescanned with the same parameter
> set?
> Building a hash table repeatedly may be pretty costly, no?

We can check the eflags for EXEC_FLAG_REWIND. That might not be the
only condition we need to check, but we should know at plan time
whether a subtree might be executed more than once.
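For illustration, something along these lines at node initialization (for
example in ExecInitAgg(), where eflags is available); the field name is
hypothetical and this is probably not the complete set of conditions:

    /* Sketch only: decide up front whether early release might be safe. */
    if ((eflags & (EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) == 0)
        aggstate->release_early = true;     /* hypothetical field */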

Regards,
    Jeff Davis




Re: Organize working memory under per-PlanState context

From:
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
>> Building a hash table repeatedly may be pretty costly, no?

> We can check the eflags for EXEC_FLAG_REWIND. That might not be the
> only condition we need to check, but we should know at plan time
> whether a subtree might be executed more than once.

Side note: EXEC_FLAG_REWIND is defined as "you should be prepared
to handle REWIND efficiently".  Not as "if this is off, you are
guaranteed not to see a REWIND".  I'm not sure that this affects
what Jeff wants to do, but let's not be fuzzy about what information
is available at execution time.

            regards, tom lane



Re: Organize working memory under per-PlanState context

From:
Andrei Lepikhov
Date:
On 20/8/2025 19:00, Jeff Davis wrote:
> On Wed, 2025-08-20 at 09:22 +0200, Andrei Lepikhov wrote:
>> I'm not sure I understand your reasoning clearly. How do you know
>> that
>> the current subtree will not be rescanned with the same parameter
>> set?
>> Building a hash table repeatedly may be pretty costly, no?
> 
> We can check the eflags for EXEC_FLAG_REWIND. That might not be the
> only condition we need to check, but we should know at plan time
> whether a subtree might be executed more than once.

Postgres builds the plan tree from the bottom up, no? Estimating costs
and choosing a specific operator at one level of the query tree, the
planner never knows what will come next on the upper levels of this tree.
To work such problems out, in my 'optimiser support' extensions [1], I use
one extra 'Top-Bottom' pass, reconsidering decisions that have already been
made, based on information gathered from the upper levels of the almost
finished plan. Does your project move in a similar direction?

[1] https://github.com/danolivo/conf/blob/main/2025-MiddleOut/MiddleOut.pdf

-- 
regards, Andrei Lepikhov