Discussion: XPRS
Hello,

After rereading some old papers recently, I wanted to share some thoughts about XPRS and modern PostgreSQL. XPRS stood for "eXtended Postgres on RAID and Sprite", and was a research project done nearly three decades ago at Berkeley by the POSTGRES group working with operating system researchers, on a shared memory machine with a lot of CPUs for the time (12).

As far as I can tell (and if anyone knows how to find out, I'd love to know), the parallel query parts of the XPRS system described in various writings by Wei Hong and Michael Stonebraker are actually present in the POSTGRES 4.2 tarball and were removed in Postgres95. Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we know that XPRS ran on a Sequent Symmetry; the parallel hash join algorithm matches the description given in Hong's doctoral thesis; the page- and range-based parallel scans seen in various places also seem to match.

Hong's thesis covers a lot of material and I certainly haven't understood all of it, but basically it's about how to share CPU, I/O bandwidth and memory out fairly and dynamically at execution time so you're using the whole system efficiently. Facets of this problem obviously keep coming up on this mailing list (see practically any discussion of parallel degree, admission control, prefetch or work_mem, all of which we punt on by deferring to user-supplied GUCs and scan-size-based heuristics).

Here are three things I wanted to highlight from Hong's 1991 paper[1] (later writings elaborate greatly, but this one is short, much easier to read than the thesis, and sets the scene):

1. "The overall performance goal of a multiprocessor database system is to obtain increased throughput as well as reduced response time in a multiuser environment. The objective function that XPRS uses for query optimization is a combination of resource consumption and response time as follows: cost = resource_consumption + w * response_time, where w is a system-specific weighting factor."

2. The "Buffer-Size-Independent Hypothesis" ("buffer" here meaning roughly what work_mem means to us): "The choice of the best sequential plan is insensitive to the amount of buffer space available as long as the buffer size is above the hash join threshold" (with a caveat about localised special cases that can be handled by choosing alternative subplans at runtime).

3. The "Two-Phase Hypothesis": "The best parallel plan is a parallelization of the best sequential plan."

I read all of that a while back while working on bits of parallel query machinery (though I only realised after the fact that I'd implemented parallel hash the same way as Hong did 27 years earlier, that is, shared no-partition, which is now apparently back in vogue due to the recent ubiquity of high core count shared memory systems, so that every server looks a bit like a Sequent Symmetry; for example, Oracle is rumoured to have a "parallel shared hash join" like ours in the pipeline). I didn't understand the context or importance of XPRS, though, until I read this bit of Hellerstein's "Looking Back at Postgres"[2]:

"In principle, parallelism “blows up” the plan space for a query optimizer by making it multiply the traditional choices made during query optimization (data access, join algorithms, join orders) against all possible ways of parallelizing each choice.
The basic idea of what Stonebraker called “The Wei Hong Optimizer” was to cut the problem in two: run a traditional single-node query optimizer in the style of System R, and then “parallelize” the resulting single-node query plan by scheduling the degree of parallelism and placement of each operator based on data layouts and system configuration. This approach is heuristic, but it makes parallelism an additive cost to traditional query optimization, rather than a multiplicative cost. Although “The Wei Hong Optimizer” was designed in the context of Postgres, it became the standard approach for many of the parallel query optimizers in industry."

I don't know what to think about the buffer-size-independent hypothesis, but the two-phase hypothesis and the claim that it is the standard approach caught my attention. Firstly, I don't think the hypothesis holds on our system currently, because (for example) we lack parallel merge joins and sorts, so you couldn't parallelise such serial plans, and yet we'd already have thrown away a hash join based plan that would be vastly better in parallel. That might be just an implementation completeness problem, but I wonder what fundamental problems lurk here. (Perhaps the non-availability of globally unique partial paths?) Anyway, AFAICS we do the exact thing Hong wanted to avoid: we plan parallel queries as extra paths at planning time. We don't really suffer too much of a planning explosion, though, because we don't consider different parallel degrees. If we did, then because our cost model doesn't include any penalty for resource usage, I suspect we'd always go for the maximum number of workers because they're 'free', which creates a perverse incentive to burn resources (CPU + copies of work_mem). Those are all problems Hong solved with execution-time resource allocation, as part of a bigger picture.

I have no idea what to do about any of this, but thought that was an interesting bit of our project's history worth sharing. It's really humbling to read these old papers. I wonder if we're missing a chance to stand on the shoulders of giants.

[1] http://db.cs.berkeley.edu/jmh/tmp/pdis91-xprs.pdf
[2] https://arxiv.org/pdf/1901.01973.pdf

--
Thomas Munro
https://enterprisedb.com
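Hong's objective function is easy to play with. Here is a toy C sketch (all names and numbers are invented; this is not XPRS code or our planner's) showing how the resource term puts a ceiling on useful parallel degree, while a response-time-only cost function always prefers more workers:

    /* Toy illustration of Hong's objective: cost = resources + w * response_time. */
    #include <stdio.h>

    int
    main(void)
    {
        double total_work = 1000.0;        /* total CPU cost of the plan */
        double per_worker_cost = 40.0;     /* startup, IPC, an extra work_mem copy */
        double w = 0.5;                    /* system-specific weighting factor */

        for (int workers = 1; workers <= 12; workers++)
        {
            double resources = total_work + workers * per_worker_cost;
            double response_time = total_work / workers;
            double hong_cost = resources + w * response_time;   /* Hong's objective */
            double our_cost = response_time;                    /* no resource penalty */

            printf("workers=%2d  hong_cost=%7.1f  our_cost=%7.1f\n",
                   workers, hong_cost, our_cost);
        }
        return 0;
    }

With the resource term, the combined cost bottoms out at three or four workers in this toy model and then climbs again; with response time alone, 12 workers always beats 11, which is exactly the "workers are free" incentive described above.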
On Thu, Aug 22, 2019 at 11:41:45AM +1200, Thomas Munro wrote:
>Hello,
>
>After rereading some old papers recently, I wanted to share some
>thoughts about XPRS and modern PostgreSQL.
>[...]
>
>"In principle, parallelism “blows up” the plan space for a query
>optimizer by making it multiply the traditional choices made during
>query optimization (data access, join algorithms, join orders) against
>all possible ways of parallelizing each choice.
>The basic idea of what Stonebraker called “The Wei Hong Optimizer” was
>to cut the problem in two: run a traditional single-node query
>optimizer in the style of System R, and then “parallelize” the
>resulting single-node query plan by scheduling the degree of
>parallelism and placement of each operator based on data layouts and
>system configuration. This approach is heuristic, but it makes
>parallelism an additive cost to traditional query optimization, rather
>than a multiplicative cost.
>

I think this relies on a huge assumption that all steps in any sequential plan can be parallelized, which certainly is not true for PostgreSQL - as you point out later on the join example. That means the optimal join order may be different for a parallel plan, and so on.

The other thing is that "parallelizations" of different sequential plans may have different requirements for resources (say, work_mem), so I'd expect cases where a parallel version of a "worse" sequential plan ends up being superior thanks to allowing a larger number of workers.

>Although “The Wei Hong Optimizer” was designed in the context of
>Postgres, it became the standard approach for many of the parallel
>query optimizers in industry."
>

I assume this quote is from 30 years ago. I wonder if the claim is still true on current hardware (including e.g. distributed databases).

>I don't know what to think about the buffer-size-independent
>hypothesis, but the two-phase hypothesis and the claim that it is the
>standard approach caught my attention.
>[...]
>Those are all problems Hong solved with execution-time resource
>allocation, as part of a bigger picture.
>

I don't know. I'd guess the hardware has changed quite a bit, so maybe some of the assumptions from the paper are too simplistic nowadays? Consider for example the memory hierarchy - 30 years ago the amount of on-CPU cache was minuscule, while now we have L1/L2/L3 caches that are tens of megabytes. It's usually much faster to do sorts that fit into L3, for example.

FWIW I think we'll have to do something about resource acquisition, sooner or later. It was always quite annoying that we don't really consider the memory consumption of the query as a whole during planning, and parallel query made it a bit more painful.

>I have no idea what to do about any of this, but thought that was an
>interesting bit of our project's history worth sharing. It's really
>humbling to read these old papers. I wonder if we're missing a chance
>to stand on the shoulders of giants.
>

Thanks, I think it's always useful / interesting to look at papers like this.
I don't know if we can use the stuff described in those papers directly, but maybe we can build on those ideas and see which of the assumptions are no longer true.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
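A toy example of the point about resource requirements (plan names, costs and memory numbers are all invented): the best serial plan can lose after parallelization if its per-worker memory demand caps the number of workers it can afford.

    /* Toy example: the cheaper serial plan loses in parallel when a
     * per-query memory budget caps its parallel degree. */
    #include <stdio.h>

    typedef struct
    {
        const char *name;
        double      serial_cost;        /* cost of the serial plan */
        int         mem_per_worker_mb;  /* work_mem needed per worker */
    } Plan;

    int
    main(void)
    {
        Plan plans[] = {
            {"hash join", 1000.0, 512},     /* best serial plan, memory hungry */
            {"index nestloop", 1300.0, 4},  /* worse serially, needs almost no memory */
        };
        int query_mem_budget_mb = 1024;
        int max_workers = 8;

        for (int i = 0; i < 2; i++)
        {
            /* the memory budget caps how many workers this plan can afford */
            int workers = query_mem_budget_mb / plans[i].mem_per_worker_mb;

            if (workers > max_workers)
                workers = max_workers;
            printf("%s: serial=%.0f, %d workers, parallel=%.0f\n",
                   plans[i].name, plans[i].serial_cost, workers,
                   plans[i].serial_cost / workers);
        }
        return 0;
    }

Here the hash join wins the serial phase (1000 vs 1300) but can only afford 2 workers within a 1GB budget, while the nestloop can run 8, so the "worse" serial plan comes out roughly three times faster in parallel - a plan a strict two-phase optimizer would never find.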
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Although “The Wei Hong Optimizer” was designed in the context of
> > Postgres, it became the standard approach for many of the parallel
> > query optimizers in industry."
>
> I assume this quote is from 30 years ago. I wonder if the claim is still
> true on current hardware (including e.g. distributed databases).

The quote is from 2018, and appears in the article I linked (it's a chapter from the book Making Databases Work: The Wisdom of Michael Stonebraker), but I'm not sure which systems it's referring to.

Speculation: many other systems have what we call parallel-oblivious operators only, and then insert various exchange operators to make a parallel plan. That is, they don't necessarily have a direct equivalent of our "Parallel Sequential Scan", "Parallel Index Scan", "Parallel Foreign Scan": they just use their regular scans, possibly with the addition of some kind of "Parallel Scatter" node (that's my made-up name; it's the opposite of Gather, and called various things like "page supplier" or "block iterator") or "Parallel Repartition" inserted in the right places. Perhaps they create a serial plan first, and then try to parallelise it by inserting various parallel operators and recomputing the costs, rather than considering the separate paths in the first phase of the optimiser, as we do. The cases where Hong's best-parallel-plan hypothesis isn't true for us now might go away if we had Parallel Repartition, so that each 'stream' would be the complete set of tuples for some known partition.

To be clear, I'm not suggesting we do that necessarily, just pointing out some interesting claims about ancient POSTGRES wisdom, in a highly speculative water cooler thread. Actually, this isn't the first time it's occurred to me that elements of our design have fallen out of the order in which we chose to implement things. Another example is the "Shared Hash" that I had in an early version of my work on Parallel Hash Join, where just one process would run a parallel-safe but non-parallel-oblivious plan to build a shared hash table while other workers twiddled their thumbs; I dropped it because our cost model has no penalty for running N copies of the same plan rather than just one, so there was no way to pick that plan - and that's because we don't have a cost model like Hong's that considers resource usage too.

Another more speculative observation: maybe no-partition/shared Parallel Hash Join is only obvious if you already have the general concept of parallel-aware executor nodes. AFAIK Robert and Amit invented those to be able to coordinate parallel scans between processes, where thread-based systems might be able to share a single scan state somehow under a Scatter-like operator. If you had threads, you might not need that concept that allows arbitrary executor nodes to communicate with each other between workers, and then it might be more obvious and natural to use repartitioning for parallelising hash joins.

> FWIW I think we'll have to do something about resource acquisition, sooner
> or later. It was always quite annoying that we don't really consider the
> memory consumption of the query as a whole during planning, and parallel
> query made it a bit more painful.

Agreed.

Here's an approach I have been wondering about to cap total executor memory usage, which is a much more down-to-Earth idea than any of the above space cadet planner problems.
Let's start at the other end of the problem, by introducing admission control and memory quotas. That is, keep using work_mem with its current per-node-copy meaning at planning time, for now, and then:

1. Compute the peak amount of memory each plan thinks it will need. Initially that could be done by summing estimates from all nodes and considering workers. A later refinement could deal with nodes that give back memory early, if we get around to doing that. The estimate could be shown by EXPLAIN. (Some details to work out: worst case vs expected etc.)

2. Introduce a new GUC global_work_mem, which limits the total for all plans that are allowed to run concurrently, according to their memory estimates. Introduce a shared memory counter of currently allocated quota.

3. Introduce a new GUC session_work_mem, which is the amount of quota that every session tries to acquire when it connects or perhaps first runs a query, and that it won't give back until the end of the session. Or perhaps sessions acquire less than that if they need less, but that's the amount they never give back once they've got it. The idea is to allow queries with estimates under that limit, for example high frequency OLTP queries, to avoid any extra communication overhead from this scheme.

4. To run queries that have estimates higher than the session's current allocated quota, the session must acquire more quota for the duration of the query. If it can't be acquired right now without exceeding global_work_mem, it has to join a queue and wait. A refinement could be that you are allowed to run with fewer workers than planned, to reduce the requirement.

5. While executing, executor nodes could opportunistically ask for more quota than was planned for, up to some limit, to avoid having to spill to disk. If the request is unsuccessful, that's OK, they can deal with that.

6. So long as we have nodes that have no escape mechanism in certain edge cases (hash aggregates and joins with bad stats and extreme skew), you could perhaps have the option of raising an error or forcing the total to exceed global_work_mem temporarily with a warning (which would at least prevent other large queries from running and making it worse).

7. Regular heap memory and DSM memory should be counted together, since it makes no difference to the operating system: it's all memory, and we should count it against the same quota. You'd probably want to consider hidden allocator fragmentation too, as well as other hidden overheads, to get decent accuracy.

This is a sort of fudging together of ideas from conversations with Kevin Grittner (who talked about admission control a few years back), Peter Geoghegan (who mentioned opportunistically asking for more), and things I've heard of on SQL Server ("memory grants"). I think it would provide some relief from the problems we see today: it's hard to set work_mem so that you never get OOM but you can still use a decent amount of your precious memory, especially with mixed parallel and non-parallel query workloads, thanks to our current work_mem-multiplying design.

--
Thomas Munro
https://enterprisedb.com
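A minimal sketch of the shared counter behind points 2 and 4 might look like this, using the existing LWLock and ConditionVariable primitives (all names are hypothetical; initialization, the session_work_mem floor from point 3, queue fairness and error cleanup are omitted):

    /* Hypothetical sketch of the shared quota counter. Not a patch. */
    #include "postgres.h"
    #include "storage/condition_variable.h"
    #include "storage/lwlock.h"

    #define WAIT_EVENT_MEMORY_QUOTA 0   /* hypothetical; a real patch would register one */

    typedef struct QuotaState
    {
        LWLock      lock;
        int64       allocated;          /* quota currently handed out, in bytes */
        ConditionVariable cv;           /* signalled when quota is given back */
    } QuotaState;

    static QuotaState *quota;           /* lives in shared memory */
    static int64 global_work_mem_bytes; /* the new GUC from point 2 */

    /* Acquire "amount" bytes of quota, waiting if it isn't available yet. */
    static void
    AcquireQuota(int64 amount)
    {
        for (;;)
        {
            bool        granted = false;

            LWLockAcquire(&quota->lock, LW_EXCLUSIVE);
            if (quota->allocated + amount <= global_work_mem_bytes)
            {
                quota->allocated += amount;
                granted = true;
            }
            LWLockRelease(&quota->lock);

            if (granted)
                break;
            /* Point 4: sleep until somebody releases quota, then retry. */
            ConditionVariableSleep(&quota->cv, WAIT_EVENT_MEMORY_QUOTA);
        }
        ConditionVariableCancelSleep();
    }

    static void
    ReleaseQuota(int64 amount)
    {
        LWLockAcquire(&quota->lock, LW_EXCLUSIVE);
        quota->allocated -= amount;
        LWLockRelease(&quota->lock);
        ConditionVariableBroadcast(&quota->cv);     /* wake queued sessions */
    }

The opportunistic requests from point 5 would then be a non-waiting variant that simply returns false instead of sleeping.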
On Mon, Sep 02, 2019 at 02:19:15PM +1200, Thomas Munro wrote:
>On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> > Although “The Wei Hong Optimizer” was designed in the context of
>> > Postgres, it became the standard approach for many of the parallel
>> > query optimizers in industry."
>>
>> I assume this quote is from 30 years ago. I wonder if the claim is still
>> true on current hardware (including e.g. distributed databases).
>
>The quote is from 2018, and appears in the article I linked (it's a
>chapter from the book Making Databases Work: The Wisdom of Michael
>Stonebraker), but I'm not sure which systems it's referring to.
>

Hmm, that's unfortunate - it'd be quite interesting to know which databases it's referring to. I suspect no optimizer is ideal in this regard, i.e. each database has some "gaps" where some nodes don't have a straightforward parallel version.

>Speculation: many other systems have what we call parallel-oblivious
>operators only, and then insert various exchange operators to make a
>parallel plan.
>[...]
>The cases where Hong's best-parallel-plan hypothesis isn't true for us
>now might go away if we had Parallel Repartition, so that each 'stream'
>would be the complete set of tuples for some known partition.
>

I don't know. It kinda reminds me of planning in distributed databases, which also need to exchange data between nodes in various cases - say, when joining two relations distributed in different ways. The redistribution is however pretty costly (network I/O, bandwidth etc.), to the extent that it's often much better to pick a very different join to reduce the amount of data exchanged, or to eliminate the redistribution altogether. For parallelism the costs are much lower, of course, but I don't think we can just ignore them.

FWIW it's not clear to me why the cost would need to be recomputed after constructing the parallel version of the plan? My understanding is that the idea is to do cost-based planning for the serial plan, and then just "mechanically" construct a parallel plan. Although, maybe there could be multiple parallel alternatives ...

>To be clear, I'm not suggesting we do that necessarily, just pointing
>out some interesting claims about ancient POSTGRES wisdom, in a highly
>speculative water cooler thread.
>[...]
>
>> FWIW I think we'll have to do something about resource acquisition, sooner
>> or later. It was always quite annoying that we don't really consider the
>> memory consumption of the query as a whole during planning, and parallel
>> query made it a bit more painful.
>
>Agreed.
>
>Here's an approach I have been wondering about to cap total executor
>memory usage, which is a much more down-to-Earth idea than any of the
>above space cadet planner problems. Let's start at the other end of
>the problem, by introducing admission control and memory quotas.
>[...]
>
>This is a sort of fudging together of ideas from conversations with
>Kevin Grittner (who talked about admission control a few years back),
>Peter Geoghegan (who mentioned opportunistically asking for more), and
>things I've heard of on SQL Server ("memory grants"). I think it
>would provide some relief from the problems we see today: it's hard to
>set work_mem so that you never get OOM but you can still use a decent
>amount of your precious memory, especially with mixed parallel and
>non-parallel query workloads, thanks to our current
>work_mem-multiplying design.
>

I think this is probably the simplest and most realistic first step.

Whenever I was thinking about memory acquisition, I've assumed we'd monitor how much memory the plan is expected to use while we're constructing it. My main problem was what to do when we reach the per-query limit - whether to (a) simply reject the plan, (b) go back and see if we can replan with lower work_mem (but how much, and for which nodes?), or (c) just continue.

The proposed plan deals with this by not limiting the per-query (or rather per-session) budget directly, and instead requesting additional budget. Which is nice.

I suspect we should also keep an additional plan that is expected to meet the session_work_mem limit, aside from the regular cheapest plan, and use it if it's not much worse. Imagine you have a plan with cost 1000 that needs (global_work_mem/2 + 1kB) of memory, essentially serializing executions of this query. And then there's an alternative plan with cost 1100 that can run within session_work_mem. It seems better to just accept the second plan, because it won't need to wait.

Another challenge with work_mem is that anyone can modify it arbitrarily, i.e. a user can do

  SET work_mem = '1TB';

and use as much memory as they wish, or even crash the system. I wonder if we could define the new GUCs (session_work_mem and global_work_mem) in a way that prevents this. We probably don't want to make them PGC_POSTMASTER (it seems useful to allow overriding them in ALTER USER/DATABASE), but I don't think we have a good way to do that at the moment. Any ideas in this direction?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
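One possibly relevant detail: the GUC machinery can already express "ordinary users may not change this" via the PGC_SUSET context, which still allows a superuser to override it per role or database with ALTER USER/DATABASE ... SET. A sketch of what a guc.c table entry might look like (the variable name, default and description are invented):

    /* Hypothetical addition to ConfigureNamesInt[] in guc.c. */
    static int  global_work_mem_kb = 16 * 1024 * 1024;     /* 16GB, in kB */

    {
        {"global_work_mem", PGC_SUSET, RESOURCES_MEM,
            gettext_noop("Sets the total memory quota shared by all sessions."),
            NULL,
            GUC_UNIT_KB
        },
        &global_work_mem_kb,
        16 * 1024 * 1024,           /* boot value: 16GB */
        64, MAX_KILOBYTES,          /* min, max */
        NULL, NULL, NULL            /* check, assign, show hooks */
    },

With the quota GUCs declared PGC_SUSET, work_mem itself could remain user-settable: a SET work_mem = '1TB' would merely produce plans whose estimates fail the admission check, rather than letting the user exceed the quota.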
On Tue, Sep 3, 2019 at 5:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> FWIW it's not clear to me why the cost would need to be recomputed after
> constructing the parallel version of the plan? My understanding is that
> the idea is to do cost-based planning for the serial plan, and then just
> "mechanically" construct a parallel plan. Although, maybe there could be
> multiple parallel alternatives ...

Presumably you still need to choose between the serial and parallel plans by comparing costs. You lose some by adding exchange operators, but you win some by dividing cardinality estimates.

> I think this is probably the simplest and most realistic first step.
>
> Whenever I was thinking about memory acquisition, I've assumed we'd
> monitor how much memory the plan is expected to use while we're
> constructing it. My main problem was what to do when we reach the
> per-query limit - whether to (a) simply reject the plan, (b) go back and
> see if we can replan with lower work_mem (but how much, and for which
> nodes?), or (c) just continue.

Yeah, it's all quite tricky and circular. But I'm pretty sure that we need caps at execution time anyway, so I think it's OK to start at that end of the problem and later try to improve the planner side.

> The proposed plan deals with this by not limiting the per-query (or rather
> per-session) budget directly, and instead requesting additional
> budget. Which is nice.
>
> I suspect we should also keep an additional plan that is expected to meet
> the session_work_mem limit, aside from the regular cheapest plan, and use
> it if it's not much worse. Imagine you have a plan with cost 1000 that
> needs (global_work_mem/2 + 1kB) of memory, essentially serializing
> executions of this query. And then there's an alternative plan with cost
> 1100 that can run within session_work_mem. It seems better to just accept
> the second plan, because it won't need to wait.

Hmm. I wonder if it's worth it. You could also just replan as you said, but I'm wondering if just rejecting the query would be OK.

> Another challenge with work_mem is that anyone can modify it arbitrarily,
> i.e. a user can do
>
>   SET work_mem = '1TB';
>
> and use as much memory as they wish, or even crash the system. I wonder if
> we could define the new GUCs (session_work_mem and global_work_mem) in a
> way that prevents this. We probably don't want to make them PGC_POSTMASTER
> (it seems useful to allow overriding them in ALTER USER/DATABASE), but I
> don't think we have a good way to do that at the moment. Any ideas in this
> direction?
How about something like this, giving the superuser the following GUCs:

global_work_mem = 16GB
session_min_work_mem = 0.5% -- the amount of quota sessions keep, for fast small queries
session_max_work_mem = 20% -- the maximum quota any one session is allowed
session_extra_work_mem = 5% -- opportunistic execution-time boost

Users are free to plan queries with work_mem = 1TB, but if you do that and the plan estimates that it wants 512GB, it will be rejected when you try to execute it because it exceeds session_max_work_mem, with a hint telling you to turn down work_mem. Otherwise it either runs or joins the queue if it can't get the quota it needs immediately.

Eventually we could try to figure out how to set work_mem automatically (I don't want to propose a concrete rule, but maybe something based on session_max_work_mem / njoins, with various fudge factors, and some accounting for parallel workers; it's probably good to low-ball it and rely on session_extra_work_mem).

Yeah, I think you'd want to be able to set session_XXX on databases and roles so that you can say your regular users can't eat more than 10% of memory each, but a big reporting thing is allowed more.

--
Thomas Munro
https://enterprisedb.com
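The rejection path described here might look roughly like this (a sketch only: the GUC names come from the proposal above, while the helper variables, error code choice and message wording are my guesses):

    /* Hypothetical admission check, run before executor startup.
     * global_work_mem_bytes, session_max_work_mem_pct and session_quota
     * are invented names for the GUCs/state described above. */
    static void
    CheckMemoryAdmission(int64 plan_estimate)
    {
        int64       session_max =
            global_work_mem_bytes * session_max_work_mem_pct / 100;

        if (plan_estimate > session_max)
            ereport(ERROR,
                    (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                     errmsg("query is estimated to need %lld kB, but session_max_work_mem allows only %lld kB",
                            (long long) (plan_estimate / 1024),
                            (long long) (session_max / 1024)),
                     errhint("Consider reducing work_mem.")));

        /* Within the session cap: top up the quota, queueing if necessary. */
        if (plan_estimate > session_quota)
            AcquireQuota(plan_estimate - session_quota);
    }

So the 512GB example above fails fast with a hint, while a query that merely exceeds the session's current allocation waits in the quota queue instead of erroring out.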
On Tue, Sep 03, 2019 at 11:04:43AM +1200, Thomas Munro wrote:
>On Tue, Sep 3, 2019 at 5:20 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> FWIW it's not clear to me why the cost would need to be recomputed after
>> constructing the parallel version of the plan? My understanding is that
>> the idea is to do cost-based planning for the serial plan, and then just
>> "mechanically" construct a parallel plan. Although, maybe there could be
>> multiple parallel alternatives ...
>
>Presumably you still need to choose between the serial and parallel
>plans by comparing costs. You lose some by adding exchange operators,
>but you win some by dividing cardinality estimates.
>

Oh, right. Silly me.

>> I think this is probably the simplest and most realistic first step.
>>[...]
>
>Yeah, it's all quite tricky and circular. But I'm pretty sure that we
>need caps at execution time anyway, so I think it's OK to start at
>that end of the problem and later try to improve the planner side.
>

True.

>> The proposed plan deals with this by not limiting the per-query (or rather
>> per-session) budget directly, and instead requesting additional
>> budget. Which is nice.
>>
>> I suspect we should also keep an additional plan that is expected to meet
>> the session_work_mem limit, aside from the regular cheapest plan, and use
>> it if it's not much worse.
>>[...]
>
>Hmm. I wonder if it's worth it. You could also just replan as you
>said, but I'm wondering if just rejecting the query would be OK.
>

I think we should not reject queries unnecessarily, if there's a workable execution plan. It's just another optimization criterion, and erroring out right after planning is essentially "can't find a plan". When there is a plan that we could use, rejecting the query seems like a bad idea.

>> Another challenge with work_mem is that anyone can modify it arbitrarily,
>> i.e. a user can do
>>
>>   SET work_mem = '1TB';
>>
>> and use as much memory as they wish, or even crash the system. I wonder if
>> we could define the new GUCs (session_work_mem and global_work_mem) in a
>> way that prevents this.
>> We probably don't want to make them PGC_POSTMASTER
>> (it seems useful to allow overriding them in ALTER USER/DATABASE), but I
>> don't think we have a good way to do that at the moment. Any ideas in this
>> direction?
>
>How about something like this, giving the superuser the following GUCs:
>
>global_work_mem = 16GB
>session_min_work_mem = 0.5% -- the amount of quota sessions keep, for
>fast small queries
>session_max_work_mem = 20% -- the maximum quota any one session is allowed
>session_extra_work_mem = 5% -- opportunistic execution-time boost
>
>Users are free to plan queries with work_mem = 1TB, but if you do that
>and the plan estimates that it wants 512GB, it will be rejected when
>you try to execute it because it exceeds session_max_work_mem, with a
>hint telling you to turn down work_mem. Otherwise it either runs or
>joins the queue if it can't get the quota it needs immediately.
>

Seems reasonable, certainly for v1. I'd keep it as simple as possible.

>Eventually we could try to figure out how to set work_mem automatically
>(I don't want to propose a concrete rule, but maybe something based on
>session_max_work_mem / njoins, with various fudge factors, and some
>accounting for parallel workers; it's probably good to low-ball it and
>rely on session_extra_work_mem).
>

Hmm, so you'd tweak work_mem for individual queries? I'm not sure that's something I'd do at this point - it may seem simple, but I think it's actually way harder to get right. For example, let's say you have two views that are planned nicely; then you join them, and suddenly the plan is much worse, because the actual work_mem suddenly got much lower. That's not great. Of course, if it's just optional behavior, and the current behavior with an explicit work_mem value is the default, then this is not an issue.

Anyway, I'd focus on an MVP doing the bare minimum - simply enforcing a session limit - and leave this for the future.

>Yeah, I think you'd want to be able to set session_XXX on databases
>and roles so that you can say your regular users can't eat more than
>10% of memory each, but a big reporting thing is allowed more.
>

Yeah, something like that.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
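For illustration, the kind of automatic work_mem rule gestured at above (and that Tomas is wary of) could be as crude as this; every name and fudge factor here is invented:

    /* Toy "automatic work_mem": split the session's quota across the
     * memory-hungry node copies, low-balling per the discussion above.
     * Entirely hypothetical. */
    static int64
    auto_work_mem(int64 session_max_quota, int n_mem_nodes, int n_workers)
    {
        double      fudge = 0.75;   /* leave headroom; rely on session_extra_work_mem */
        int         copies = n_mem_nodes * (n_workers + 1); /* each worker gets a copy */

        if (copies < 1)
            copies = 1;
        return (int64) (session_max_quota * fudge / copies);
    }

With a 3.2GB session_max_work_mem, four memory-hungry nodes and three workers, that yields about 150MB per node copy - deliberately conservative, with the opportunistic boost making up the difference when memory happens to be available.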