Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
From: Thomas Munro
Subject: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Msg-id: CA+hUKGK-donWK+AiJtmpnxbq4tyTG0Mr-xZyNei08Eu1sh2uWA@mail.gmail.com
In reply to: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List: pgsql-bugs
On Thu, Feb 29, 2024 at 4:37 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 21/2/2024 19:52, Tomas Vondra wrote:
> > It's a bit weird it needs 1.8GB of memory, but perhaps that's also
> > linked to the number of batches, somehow?
> I found one possible weak point in the code of PHJ:
> ExecParallelHashJoinSetUpBatches:
>
>     pstate->batches = dsa_allocate0(hashtable->area,
>         EstimateParallelHashJoinBatch(hashtable) * nbatch);
>
> It could explain why we have such a huge memory allocation with a size
> not bound to a power of 2.

Hmm, a couple of short-term ideas:

1. Maybe the planner should charge a high cost for exceeding some soft
limit on the expected number of batches; perhaps it could be linked to
the number of file descriptors we can open (something like 1000),
because during the build phase we'll be opening and closing random file
descriptors like crazy due to vfd pressure, which is not free. That
should hopefully discourage the planner from reaching cases like this,
but of course only in cases the planner can predict.

2. It sounds like we should clamp nbatch. We don't want the
per-partition state to exceed MaxAllocSize, which I guess is what
happened here if the above-quoted line produced the error? (The flag
that would allow "huge" allocations exceeding MaxAllocSize seems likely
to make the world worse while we have so many footguns -- I think once a
query has reached this stage of terrible execution, it's better to
limit the damage.)

The performance and memory usage will still be terrible. We just don't
deal well with huge numbers of tiny partitions -- tiny relative to input
size, with input size being effectively unbounded.

Thoughts for later: a lower limit would of course be possible and
likely desirable. Once the partition-bookkeeping memory exceeds
work_mem * hash_mem_multiplier, it becomes stupid to double it just
because a hash table has hit that size, because that actually increases
total memory usage (and fast, because it's quadratic). We'd be better
off in practice giving up on the hash table size limit and hoping for
the best.

But the real long-term question is what strategy we're going to use to
actually deal with this situation properly *without* giving up our
memory usage policies and hoping for the best, and that remains an open
question. To summarise the two main ideas put forward so far: (1) allow
a very high number of batches, but process at most N of M batches at a
time, using a temporary "all-the-rest" batch to be re-partitioned to
feed the next N batches + the rest in a later cycle; (2) fall back to
looping over batches multiple times in order to keep nbatch <= a small
limit while also not exceeding a hash table size limit. Both have some
tricky edge cases, especially with parallelism in the picture, but
probably even without it. I'm willing to work more on exploring this
some time after the 17 cycle.
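To make the failure mode concrete: the reported request size factors
neatly into a per-batch struct size times a power-of-2 batch count.
Here is a toy calculation in C (the 432-byte per-batch figure is purely
an assumption for illustration; the real value would come from
EstimateParallelHashJoinBatch() and depends on the number of
participants):

#include <stdio.h>

#define MaxAllocSize ((size_t) 0x3fffffff) /* 1 GB - 1, as in memutils.h */

int
main(void)
{
    size_t per_batch = 432;              /* assumed bookkeeping bytes/batch */
    size_t nbatch = 4194304;             /* 2^22 batches */
    size_t request = per_batch * nbatch; /* what dsa_allocate0() is asked for */

    /* 432 * 4194304 = 1811939328, matching the error message; and a
     * non-power-of-2 struct size times a power-of-2 nbatch is itself
     * not a power of 2. */
    printf("request = %zu, limit = %zu -> %s\n",
           request, (size_t) MaxAllocSize,
           request > MaxAllocSize ? "invalid alloc request" : "ok");
    return 0;
}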
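And a rough sketch of the clamp from idea 2 (hypothetical code, not
from the PostgreSQL tree; the function name and shape are invented for
illustration). Halving rather than truncating keeps nbatch a power of
2, which the batch-number masking in ExecHashGetBucketAndBatch()
relies on:

#include <stddef.h>

#define MaxAllocSize ((size_t) 0x3fffffff)

/*
 * Hypothetical helper: shrink nbatch until the shared bookkeeping array
 * that ExecParallelHashJoinSetUpBatches() would allocate fits in an
 * ordinary (non-"huge") DSA allocation.  Assumes 64-bit size_t, so the
 * multiplication cannot overflow for sane per-batch sizes.
 */
static int
clamp_parallel_nbatch(size_t per_batch_size, int nbatch)
{
    while (nbatch > 1 && per_batch_size * (size_t) nbatch > MaxAllocSize)
        nbatch /= 2;
    return nbatch;
}

A clamp like this only limits the damage at execution time, of course;
the cost-based discouragement in idea 1 would still be wanted so the
planner avoids such plans in the first place.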