Discussion: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

I want customers to be able to run large OLAP queries on PostgreSQL,
using as much memory as possible, to avoid spilling — without running
out of memory.

There are other ways to run out of memory, but the fastest and easiest
way, on an OLAP query, is to use a lot of work_mem. (This is true for
any SQL database: SQL operators are “usually” streaming operators...
except for those that use work_mem.) PostgreSQL already supports the
work_mem GUC, and every PostgreSQL operator tries very hard to spill
to disk rather than exceed its work_mem limit. For now, I’m not
concerned about other ways for queries to run out of memory — just
work_mem.

I like the way PostgreSQL operators respect work_mem, but I can’t find
a good way to set the work_mem GUC. Oracle apparently had the same
problem, with their RDBMS, 20 years ago [1]:

“In releases earlier than Oracle Database 10g, the database
administrator controlled the maximum size of SQL work areas by setting
the following parameters: SORT_AREA_SIZE, HASH_AREA_SIZE, ... Setting
these parameters is difficult, because the maximum work area size is
ideally selected from the data input size and the total number of work
areas active in the system. These two factors vary greatly from one
work area to another and from one time to another. Thus, the various
*_AREA_SIZE parameters are difficult to tune under the best of
circumstances.

“For this reason, Oracle strongly recommends that you leave automatic
PGA memory management enabled.”

It’s difficult to tune PostgreSQL’s work_mem and hash_mem_multiplier
GUCs, under the best of circumstances, yeah. The work_mem and
hash_mem_multiplier GUCs apply to all operators of a given type, even
though two operators of the same type, even in the same query, might
need vastly different amounts of work_mem.

I would like a “query_work_mem” GUC, similar to what’s proposed in
[2]. This GUC would be useful on its own, because it would be much
easier to tune than the existing work_mem + hash_mem_multiplier GUCs;
and it would serve as a milestone on a path to my ultimate goal of
something like Oracle’s “automatic PGA memory management.”

I call it “query_work_mem,” rather than “max_total_backend_memory,”
because (a) for now, I care only about limiting work_mem, I’ll deal
with other types of memory separately; and (b) “query” instead of
“backend” avoids ambiguity over how much memory a recursively-compiled
query gets.

(Re (b), see “crosstab()”, [3]. The “sql text” executed by crosstab()
would get its own query_work_mem allocation, separate from the query
that called the crosstab() function.)

The main problem I have with the “max_total_backend_memory” proposal,
however, is that it “enforces” its limit by killing the offending
query. This seems an overreaction to me, especially since PostgreSQL
operators already know how to spill to disk. If a customer’s OLAP
query exceeds its memory limit by 1%, I would rather spill 1% of their
data to disk, instead of cancelling their entire query.

(And if their OLAP query exceeds its memory limit by 1,000x... I still
don’t want PostgreSQL to preemptively cancel it, because either the
customer ends up OK with the overall performance, even with the
spilling; or else they decide the query is taking too long, and cancel
it themselves. I don’t want to be in the business of preemptively
cancelling customer queries.)

So, I want a query_work_mem GUC, and I want PostgreSQL to distribute
that total query_work_mem to the query’s individual SQL operators, so
that each operator will spill rather than exceed its per-operator
limit.

Making query_work_mem a session GUC makes it feasible for a DBA or an
automated system to distribute memory from a global memory pool among
individual queries, e.g. via pg_hint_plan(). So (as mentioned above),
“query_work_mem” is useful to a DBA, and also a step toward a
fully-automated memory-management system.

How should “query_work_mem” work? Let’s start with an example: suppose
we have an OLAP query that has 2 Hash Joins, and no other operators
that use work_mem. Suppose we’re pretty sure that one of the Hash
Joins will use 10 KB of work_mem, while the other will use 1 GB. And
suppose we know that the PostgreSQL instance has 1 GB of memory
available, for use by our OLAP query. (Perhaps we reserve 1 GB for
OLAP queries, and we allow only 1 OLAP query at a time; or perhaps we
have some sort of dynamic memory manager.)

How should we configure PostgreSQL so that our OLAP query spills as
little as possible, without running out of memory?

-- First, we could just say: 2 operators, total available working
memory is 1 GB — give each operator 512 MB. Then the large Hash Join
would spill around 512 MB, because around 512 MB sits idle, reserved
for the small Hash Join that needs only 10 KB. We’re undersubscribing,
to be safe, but our
performance suffers. That’s bad! We’re basically wasting memory that
the query would like to use.

-- Second, we could say, instead: the small Hash Join is *highly
unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
expecting that the small Hash Join won’t use more than 1 MB of its
1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
oversubscribing, betting that the small Hash Join will just stay
within some smaller, “unenforced” memory limit.

In this example, this bet is probably fine — but it won’t work in
general. I don’t want to be in the business of gambling with customer
resources: if the small Hash Join is unlikely to use more than 1 MB,
then let’s just assign it 1 MB of work_mem. That way, if I’m wrong,
the customer’s query will just spill, instead of running out of
memory. I am very strongly opposed to cancelling queries if/when we
can just spill to disk.

-- Third, we could just rewrite the existing “work_mem” logic so that
all of the query’s operators draw, at runtime, from a single,
“query_work_mem” pool. So, an operator won’t spill until all of
“query_work_mem” is exhausted — by the operator itself, or by some
other operator in the same query.

But doing that runs into starvation/fairness problems, where an
unlucky operator, executing later in the query, can’t get any
query_work_mem, because earlier, greedy operators used up all of it.

The solution I propose here is just to distribute the “query_work_mem”
into individual, per-operator, work_mem limits.

**Proposal:**

I propose that we add a “query_work_mem” GUC, which works by
distributing (using some algorithm to be described in a follow-up
email) the entire “query_work_mem” to the query’s operators. And then
each operator will spill when it exceeds its own work_mem limit. So
we’ll preserve the existing “spill” logic as much as possible.

To enable this to-be-described algorithm, I would add an “nbytes”
field to the Path struct, and display this (and related info) in
EXPLAIN PLAN. So the customer will be able to see how much work_mem
the SQL compiler thinks they’ll need, per operator; and so will the
algorithm.
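
To make the idea concrete, here is a rough, purely illustrative sketch
of the kind of estimate that could populate an "nbytes" field, for a
Hash Join. The function name and the constant are made up; the planner
already does a more careful calculation of this sort when it sizes hash
tables:

/*
 * Illustration only: the kind of estimate that could populate a
 * hypothetical Path "nbytes" field.  For a hash join, working memory is
 * roughly the inner relation's row count times a per-tuple footprint.
 */
static double
estimated_hashjoin_nbytes(double inner_rows, int inner_tuple_width)
{
    const int per_tuple_overhead = 64;  /* made-up constant for the sketch */

    return inner_rows * (inner_tuple_width + per_tuple_overhead);
}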

I wouldn’t change the existing planning logic (at least not in the
initial implementation). So the existing planning logic would choose
between different SQL operators, still on the assumption that every
operator that needs working memory will get work_mem [*
hash_mem_multiplier]. This assumption might understate or overstate
the actual working memory we’ll give the operator, at runtime. If it
understates, the planner will be biased in favor of operators that
don’t use much working memory. If it overstates, the planner will be
biased in favor of operators that use too much working memory.

(We could add a feedback loop to the planner, or even something simple
like generating multiple paths at different “work_mem” limits, but
everything I can think of here adds complexity without much potential
benefit. So I would defer any changes to the planner behavior until
later, if ever.)

The to-be-described algorithm would look at a query’s Paths’ “nbytes”
fields, as well as the session “work_mem” GUC (which would, now, serve
as a hint to the SQL compiler), and decide how much of
“query_work_mem” to assign to the corresponding Plan node.

It would assign that limit to a new “work_mem” field, on the Plan
node. And this limit would also be exposed, of course, in EXPLAIN
ANALYZE, along with the actual work_mem usage, which might very well
exceed the limit. This will let the customer know when a query spills,
and why.

I would write the algorithm to maintain the existing work_mem
behavior, as much as possible. (Backward compatibility is good!) Most
likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
*minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
work_mem [* hash_mem_multiplier], for all operators in the query,
all operators would be assigned at least work_mem, which would make my
proposal a Pareto improvement.
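
The actual algorithm is deferred to the follow-up email; purely as a
placeholder, here is a minimal, self-contained sketch (toy types and
names, nothing from the real code) of one shape it could take: treat
work_mem [* hash_mem_multiplier] as the floor, and split the rest of
query_work_mem in proportion to each operator's estimated nbytes. The
under-budget branch, where query_work_mem can't even cover the floors,
is just a guess:

#include <stddef.h>

/* Toy sketch only -- one operator's working-memory accounting, as the
 * distribution algorithm would see it. */
typedef struct OpWorkMem
{
    size_t  nbytes;       /* planner's estimate (the proposed Path field) */
    size_t  floor_bytes;  /* work_mem, times hash_mem_multiplier for hash ops */
    size_t  limit_bytes;  /* result: per-operator limit copied to the Plan node */
} OpWorkMem;

/*
 * Give every operator its floor first (preserving today's behavior), then
 * hand out whatever query_work_mem remains in proportion to how much more
 * than its floor each operator is estimated to need.
 */
static void
distribute_query_work_mem(OpWorkMem *ops, int nops, size_t query_work_mem)
{
    size_t  total_floor = 0;
    size_t  total_extra_wanted = 0;

    for (int i = 0; i < nops; i++)
    {
        total_floor += ops[i].floor_bytes;
        if (ops[i].nbytes > ops[i].floor_bytes)
            total_extra_wanted += ops[i].nbytes - ops[i].floor_bytes;
    }

    if (query_work_mem <= total_floor)
    {
        /* Can't even cover the floors: shrink them proportionally. */
        for (int i = 0; i < nops; i++)
            ops[i].limit_bytes = (total_floor > 0) ? (size_t)
                ((double) ops[i].floor_bytes / total_floor * query_work_mem) : 0;
        return;
    }

    /* Everyone keeps at least its floor: a Pareto improvement. */
    for (int i = 0; i < nops; i++)
    {
        size_t  wanted = (ops[i].nbytes > ops[i].floor_bytes) ?
                          ops[i].nbytes - ops[i].floor_bytes : 0;
        size_t  extra = (total_extra_wanted > 0) ? (size_t)
            ((double) wanted / total_extra_wanted * (query_work_mem - total_floor)) : 0;

        ops[i].limit_bytes = ops[i].floor_bytes + extra;
    }
}

With the two-Hash-Join example above, and assuming work_mem = 1 MB and
hash_mem_multiplier = 1, this would give the small Hash Join its 1 MB
floor and the large one roughly 1023 MB, which matches the allocation
argued for earlier.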

Last, at runtime, each PlanState would check its plan -> work_mem
field, rather than the global work_mem GUC. Execution would otherwise
be the same as today.
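
In other words, the executor-side change is small. A toy illustration
(made-up types, not the real Plan/PlanState structs; the GUC fallback
is only there to show backward compatibility):

/* Toy types for illustration -- not the actual Plan/PlanState structs. */
typedef struct ToyPlan
{
    int     work_mem_kb;    /* per-node limit assigned by the planner; 0 = unset */
} ToyPlan;

typedef struct ToyPlanState
{
    const ToyPlan *plan;
} ToyPlanState;

extern int work_mem;        /* the existing GUC, in kilobytes */

/*
 * Each operator would ask its node, not the GUC, for its spill threshold,
 * falling back to work_mem when no per-query budget was assigned.
 */
static inline int
node_work_mem_kb(const ToyPlanState *ps)
{
    return (ps->plan->work_mem_kb > 0) ? ps->plan->work_mem_kb : work_mem;
}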

What do you think?

James

[1]
https://docs.oracle.com/en//database/oracle/oracle-database/23/admin/managing-memory.html#GUID-8D7FC70A-56D8-4CA1-9F74-592F04172EA7
[2] https://www.postgresql.org/message-id/flat/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel%40crunchydata.com
[3] https://www.postgresql.org/docs/current/tablefunc.html



On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
> How should “query_work_mem” work? Let’s start with an example:
> suppose
> we have an OLAP query that has 2 Hash Joins, and no other operators
> that use work_mem.

So we plan first, and then assign available memory afterward? If we do
it that way, then the costing will be inaccurate, because the original
costs are based on the original work_mem.

It may be better than killing the query, but not ideal.

> -- Second, we could say, instead: the small Hash Join is *highly
> unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
> expecting that the small Hash Join won’t use more than 1 MB of its
> 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
> oversubscribing, betting that the small Hash Join will just stay
> within some smaller, “unenforced” memory limit.
>
> In this example, this bet is probably fine — but it won’t work in
> general. I don’t want to be in the business of gambling with customer
> resources: if the small Hash Join is unlikely to use more than 1 MB,
> then let’s just assign it 1 MB of work_mem.

I like this idea. Operators that either know they don't need much
memory, or estimate that they don't need much memory, can constrain
themselves. That would protect against misestimations and advertise to
the higher levels of the planner how much memory the operator actually
wants. Right now, the planner doesn't know which operators need a lot
of memory and which ones don't need any significant amount at all.

The challenge, of course, is what the higher levels of the planner
would do with that information, which goes to the rest of your
proposal. But tracking the information seems very reasonable to me.

> I propose that we add a “query_work_mem” GUC, which works by
> distributing (using some algorithm to be described in a follow-up
> email) the entire “query_work_mem” to the query’s operators. And then
> each operator will spill when it exceeds its own work_mem limit. So
> we’ll preserve the existing “spill” logic as much as possible.

The description above sounds too "top-down" to me. That may work, but
has the disadvantage that costing has already happened. We should also
consider:

* Reusing the path generation infrastructure so that both "high memory"
and "low memory" paths can be considered, and if a path requires too
much memory in aggregate, then it would be rejected in favor of a path
that uses less memory. This feels like it fits within the planner
architecture the best, but it also might lead to a path explosion, so
we may need additional controls.

* Some kind of negotiation where the top level of the planner finds
that the plan uses too much memory, and replans some or all of it. (I
think this is similar to what you described as the "feedback loop" later in
your email.) I agree that this is complex and may not have enough
benefit to justify.

Regards,
    Jeff Davis




On 1/21/25 22:26, Jeff Davis wrote:
> On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
>> How should “query_work_mem” work? Let’s start with an example:
>> suppose
>> we have an OLAP query that has 2 Hash Joins, and no other operators
>> that use work_mem.
> 
> So we plan first, and then assign available memory afterward? If we do
> it that way, then the costing will be inaccurate, because the original
> costs are based on the original work_mem.
> 
> It may be better than killing the query, but not ideal.
> 
>> -- Second, we could say, instead: the small Hash Join is *highly
>> unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
>> expecting that the small Hash Join won’t use more than 1 MB of its
>> 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
>> oversubscribing, betting that the small Hash Join will just stay
>> within some smaller, “unenforced” memory limit.
>>
>> In this example, this bet is probably fine — but it won’t work in
>> general. I don’t want to be in the business of gambling with customer
>> resources: if the small Hash Join is unlikely to use more than 1 MB,
>> then let’s just assign it 1 MB of work_mem.
> 
> I like this idea. Operators that either know they don't need much
> memory, or estimate that they don't need much memory, can constrain
> themselves. That would protect against misestimations and advertise to
> the higher levels of the planner how much memory the operator actually
> wants. Right now, the planner doesn't know which operators need a lot
> of memory and which ones don't need any significant amount at all.
> 

I'm not sure I like the idea that much.

At first, restricting the operator to the amount the optimizer predicts
it will need seems reasonable, because that's generally the best estimate
of memory usage we have without running the query.

But these estimates are often pretty fundamentally unreliable - maybe
not for simple examples, but once you put an aggregate on top of a join,
the errors can be pretty wild. And allowing the operator to still use
more work_mem makes this more adaptive. I suspect forcing the operator
to adhere to the estimated work_mem might make this much worse (but I
haven't tried, maybe spilling to temp files is not that bad).

> The challenge, of course, is what the higher levels of the planner
> would do with that information, which goes to the rest of your
> proposal. But tracking the information seems very reasonable to me.
> 

I agree. Tracking additional information seems like a good idea, but
it's not clear to me how the planner would use it. I can imagine
various approaches - e.g. we might do the planning as usual and then
distribute the query_work_mem between the nodes in proportion to the
estimated amount of memory. But it all seems like a very ad hoc
heuristic that would be easy to confuse into making the wrong decision.

>> I propose that we add a “query_work_mem” GUC, which works by
>> distributing (using some algorithm to be described in a follow-up
>> email) the entire “query_work_mem” to the query’s operators. And then
>> each operator will spill when it exceeds its own work_mem limit. So
>> we’ll preserve the existing “spill” logic as much as possible.
> 
> The description above sounds too "top-down" to me. That may work, but
> has the disadvantage that costing has already happened. We should also
> consider:
> 
> * Reusing the path generation infrastructure so that both "high memory"
> and "low memory" paths can be considered, and if a path requires too
> much memory in aggregate, then it would be rejected in favor of a path
> that uses less memory. This feels like it fits within the planner
> architecture the best, but it also might lead to a path explosion, so
> we may need additional controls.
> 
> * Some kind of negotiation where the top level of the planner finds
> that the plan uses too much memory, and replans some or all of it. (I
> think this is similar to what you described as the "feedback loop" later in
> your email.) I agree that this is complex and may not have enough
> benefit to justify.
> 

Right, it seems rather at odds with the bottom-up construction of paths.
The amount of memory an operator may use seems like a pretty fundamental
piece of information, but if it's available only after the whole plan is built,
that seems ... not great.

I don't know if generating (and keeping) low/high-memory paths is quite
feasible. Isn't that really a continuum for many paths? A hash join may
need very little memory (with batching) or a lot of memory (if keeping
everything in memory), so how would this work? Would we generate paths
for a range of work_mem values (with different costs)?


regards

-- 
Tomas Vondra




On 1/10/25 19:00, James Hunter wrote:
> ...
>
> **Proposal:**
> 
> I propose that we add a “query_work_mem” GUC, which works by
> distributing (using some algorithm to be described in a follow-up
> email) the entire “query_work_mem” to the query’s operators. And then
> each operator will spill when it exceeds its own work_mem limit. So
> we’ll preserve the existing “spill” logic as much as possible.
> 
> To enable this to-be-described algorithm, I would add an “nbytes”
> field to the Path struct, and display this (and related info) in
> EXPLAIN PLAN. So the customer will be able to see how much work_mem
> the SQL compiler thinks they’ll need, per operator; and so will the
> algorithm.
> 
> I wouldn’t change the existing planning logic (at least not in the
> initial implementation). So the existing planning logic would choose
> between different SQL operators, still on the assumption that every
> operator that needs working memory will get work_mem [*
> hash_mem_multiplier].

All this seems generally feasible, but it hinges on whether the
to-be-described algorithm can distribute the memory in a sensible way
that doesn't affect the costing too much. If we plan a hash join with
nbatch=1, and then come back and make it use nbatch=1024, maybe we
wouldn't have used a hash join at all. Not sure.

The fundamental issue seems to be not having any information about how
much memory might be available to the operator. And in principle we
can't have that during the bottom-up part of the planning, until after
we construct the whole plan. Only at that point do we know how many
operators will need work_mem.

Could we get at least some initial estimate of how many such operators
the query *might* end up using? Maybe that'd be just enough a priori
information to set an effective work_mem for the planning phase and
make this practical.
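
For instance (hypothetical names, just to make the idea concrete), even
a crude guess at the operator count would give the planner a
provisional per-operator budget to cost against, rather than the raw
work_mem GUC:

/*
 * Hypothetical pre-planning heuristic, not existing PostgreSQL code: if we
 * could guess, from the parse tree, roughly how many operators will want
 * working memory (joins, sorts, grouping, etc.), we could derive a
 * provisional per-operator budget for the planner to cost against.
 */
static int
provisional_work_mem_kb(int query_work_mem_kb, int guessed_mem_ops)
{
    if (guessed_mem_ops < 1)
        guessed_mem_ops = 1;
    return query_work_mem_kb / guessed_mem_ops;
}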

> This assumption might understate or overstate
> the actual working memory we’ll give the operator, at runtime. If it
> understates, the planner will be biased in favor of operators that
> don’t use much working memory. If it overstates, the planner will be
> biased in favor of operators that use too much working memory.
> 

I'm not quite sure I understand this part. Could you elaborate?

> (We could add a feedback loop to the planner, or even something simple
> like generating multiple paths at different “work_mem” limits, but
> everything I can think of here adds complexity without much potential
> benefit. So I would defer any changes to the planner behavior until
> later, if ever.)
> 

What would be the feedback? I can imagine improving the estimate of how
much memory a given operator needs during the bottom-up phase, but it
doesn't quite help with knowing what will happen above the current node.

> The to-be-described algorithm would look at a query’s Paths’ “nbytes”
> fields, as well as the session “work_mem” GUC (which would, now, serve
> as a hint to the SQL compiler), and decide how much of
> “query_work_mem” to assign to the corresponding Plan node.
> 
> It would assign that limit to a new “work_mem” field, on the Plan
> node. And this limit would also be exposed, of course, in EXPLAIN
> ANALYZE, along with the actual work_mem usage, which might very well
> exceed the limit. This will let the customer know when a query spills,
> and why.
> 
> I would write the algorithm to maintain the existing work_mem
> behavior, as much as possible. (Backward compatibility is good!) Most
> likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
> *minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
> work_mem [* hash_mem_multiplier], for all operators in the query,
> all operators would be assigned at least work_mem, which would make my
> proposal a Pareto improvement.
> 
> Last, at runtime, each PlanState would check its plan -> work_mem
> field, rather than the global work_mem GUC. Execution would otherwise
> be the same as today.
> 
> What do you think?
> 

I find it a bit hard to discuss an abstract proposal without knowing
the really crucial ingredient. It might be helpful to implement some
sort of PoC of this approach; I'm sure that'd give us a lot of insights
and means to experiment with it (instead of just speculating about what
might happen).


regards

-- 
Tomas Vondra




On Wed, 2025-01-22 at 21:48 +0100, Tomas Vondra wrote:
> But these estimates are often pretty fundamentally unreliable - maybe
> not for simple examples, but once you put an aggregate on top of a
> join,
> the errors can be pretty wild.

It would be conditional on whether there's some kind of memory
constraint or not. Setting aside the difficulty of implementing a new
memory constraint, if we assume there is one, then it would be good to
know how much memory an operator estimates that it needs.

(Also, if extra memory is available, spill files will be able to use
the OS filesystem cache, which mitigates the spilling cost.)

Another thing that would be good to know about is concurrent memory
usage. That is, if it's a blocking executor node, then it can release
all the memory from child nodes when it completes. Therefore the
concurrent memory usage might be less than just the sum of memory used
by all operators in the plan.
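
To sketch that idea (made-up types, and it ignores plenty of real
execution details, e.g. a Hash Join keeps its hash table while
probing), peak concurrent usage under that assumption might be
estimated along these lines:

#include <stdbool.h>
#include <stddef.h>

/*
 * Deliberately crude toy, not PostgreSQL code: a blocking node (Sort,
 * HashAgg, ...) buffers its whole input, so once it has finished reading
 * its children their memory can be released, and only the node's own
 * memory stays held while the parent consumes it.  A streaming node keeps
 * its inputs live the whole time.  This only illustrates why peak
 * concurrent usage can be less than the sum over all operators.
 */
typedef struct ToyNode
{
    size_t          own_kb;       /* this operator's working memory */
    bool            is_blocking;
    int             nchildren;
    struct ToyNode *children[2];
} ToyNode;

typedef struct ToyMemUse
{
    size_t held_kb;   /* still held while the parent reads this node's output */
    size_t peak_kb;   /* worst moment anywhere while executing this subtree */
} ToyMemUse;

static ToyMemUse
toy_mem_use(const ToyNode *n)
{
    size_t     held_below = 0;
    size_t     peak_below = 0;
    ToyMemUse  r;

    for (int i = 0; i < n->nchildren; i++)
    {
        ToyMemUse c = toy_mem_use(n->children[i]);

        held_below += c.held_kb;
        if (c.peak_kb > peak_below)
            peak_below = c.peak_kb;          /* toy: assume peaks don't coincide */
    }

    /* Worst case: this node running with its inputs' held memory still
     * around, or some child's own internal peak. */
    r.peak_kb = n->own_kb + held_below;
    if (peak_below > r.peak_kb)
        r.peak_kb = peak_below;

    /* Blocking nodes release everything below once their input is buffered. */
    r.held_kb = n->is_blocking ? n->own_kb : n->own_kb + held_below;
    return r;
}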

> I don't know if generating (and keeping) low/high-memory paths is
> quite
> feasible. Isn't that really a continuum for many paths? A hash join
> may
> need very little memory (with batching) or a lot of memory (if
> keeping
> everything in memory), so how would this work? Would we generate
> paths
> for a range of work_mem values (with different costs)?

A range might cause too much of an explosion. Let's do something simple
like define "low" to mean 1/16th, or have a separate low_work_mem GUC
(that could be an absolute number or a fraction).

There are a few ways we could pass the information down. We could just
have every operator generate twice as many paths (at least those
operators that want to use as much memory as they can get). Or we could
pass down the query_work_mem by subtracting the current operator's
memory needs and dividing what's left among its input paths.
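
A toy sketch of that second option (made-up types; a real version would
hang off path generation and presumably weight the split by estimated
need rather than dividing evenly):

#include <stddef.h>

/*
 * Toy sketch of the "pass the budget down" idea: each node keeps what it
 * needs (capped at the budget it was handed) and splits the remainder
 * evenly among its inputs.
 */
typedef struct ToyPathNode
{
    size_t              need_kb;    /* this operator's estimated need */
    size_t              budget_kb;  /* result: budget granted to this node */
    int                 ninputs;
    struct ToyPathNode *inputs[2];
} ToyPathNode;

static void
push_down_budget(ToyPathNode *node, size_t budget_kb)
{
    size_t own = (node->need_kb < budget_kb) ? node->need_kb : budget_kb;

    node->budget_kb = own;

    if (node->ninputs > 0)
    {
        size_t share = (budget_kb - own) / node->ninputs;

        for (int i = 0; i < node->ninputs; i++)
            push_down_budget(node->inputs[i], share);
    }
}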

We may have to track extra information to make sure that high-memory
paths don't dominate low-memory paths that are still useful (similar to
path keys).

Regards,
    Jeff Davis