Discussion: DSA overflow in hash join
Hi hackers!
There is a weird error, rarely reproduced with sqlancer: `ERROR: invalid DSA memory alloc request size 1140850688`:
-------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=114075075706156875776.00..114075075706156875776.00 rows=1 width=8)
   ->  Gather  (cost=114075075706156875776.00..114075075706156875776.00 rows=4 width=8)
         Workers Planned: 4
         ->  Partial Aggregate  (cost=114075075706156875776.00..114075075706156875776.00 rows=1 width=8)
               ->  Parallel Hash Join  (cost=6510542055.80..97778619693677723648.00 rows=6518582404991658491904 width=0)
                     Hash Cond: ((t2_1.c1 * t2_1.c1) = t4_1.c1)
                     ->  Nested Loop  (cost=0.00..57152791114.20 rows=2539498720000 width=32)
                           ->  Parallel Seq Scan on t2 t2_1  (cost=0.00..12.20 rows=220 width=32)
                           ->  Nested Loop  (cost=0.00..144353654.10 rows=11543176000 width=0)
                                 ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=0)
                                       ->  Seq Scan on t5  (cost=0.00..32.60 rows=2260 width=0)
                                       ->  Materialize  (cost=0.00..43.90 rows=2260 width=0)
                                             ->  Seq Scan on t0  (cost=0.00..32.60 rows=2260 width=0)
                                 ->  Materialize  (cost=0.00..43.90 rows=2260 width=0)
                                       ->  Seq Scan on t0 t0_1  (cost=0.00..32.60 rows=2260 width=0)
                     ->  Parallel Hash  (cost=4028895759.80..4028895759.80 rows=128343727600 width=32)
                           ->  Nested Loop  (cost=0.00..4028895759.80 rows=128343727600 width=32)
                                 ->  Nested Loop  (cost=0.00..3569757.80 rows=145845145 width=0)
                                       ->  Nested Loop Left Join  (cost=0.00..7536.20 rows=64533 width=0)
                                             Join Filter: (upper((t2.c1 + t2.c1)))::boolean
                                             ->  Parallel Seq Scan on t2  (cost=0.00..12.20 rows=220 width=32)
                                             ->  Seq Scan on t4  (cost=0.00..18.80 rows=880 width=0)
                                       ->  Seq Scan on t5 t5_1  (cost=0.00..32.60 rows=2260 width=0)
                                 ->  Seq Scan on t4 t4_1  (cost=0.00..18.80 rows=880 width=32)
The problem is reproduced only with max_parallel_workers_per_gather=4.
The error is produced by `dsa_allocate*`, most likely in nodeHash.c (fortunately, there are not that many calls to dsa_allocate).
Certainly there are checks which should prevent allocation of more than MaxAllocSize:
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
But it looks like they are not everywhere.
In particular, I suspect these places:
nodeHash.c:1668:
/* Double the size of the bucket array. */
pstate->nbuckets *= 2;
size = pstate->nbuckets * sizeof(dsa_pointer_atomic);
hashtable->batches[0].shared->size += size / 2;
dsa_free(hashtable->area, hashtable->batches[0].shared->buckets);
hashtable->batches[0].shared->buckets =
dsa_allocate(hashtable->area, size);
nodeHash.c:3290:
batch->buckets =
dsa_allocate(hashtable->area, sizeof(dsa_pointer_atomic) * nbuckets);
In the first case the number of buckets is doubled; in the second case it may be doubled in ExecChooseHashTableSize:
/*
* It's better to use half the batches, so do that and adjust the
* nbucket in the opposite direction, and double the allowance.
*/
nbatch /= 2;
nbuckets *= 2;
Maybe I am missing something, but do we have any protection from overflow here?
One more note: as you can see, the estimated number of rows in this case is 6518582404991658491904 (which doesn't fit in 64 bits).
It is not difficult to create a query with an even larger estimated number of tuples, for example:
create table t1(pk1 integer, sk1 integer);
create table t2(pk2 integer, sk2 integer);
create table t3(pk3 integer, sk3 integer);
create table t4(pk4 integer, sk4 integer);
create table t5(pk5 integer, sk5 integer);
create table t6(pk6 integer, sk6 integer);
insert into t1 values (generate_series(1,1000000), 0);
insert into t2 values (generate_series(1,1000000), 0);
insert into t3 values (generate_series(1,1000000), 0);
insert into t4 values (generate_series(1,1000000), 0);
insert into t5 values (generate_series(1,1000000), 0);
insert into t6 values (generate_series(1,1000000), 0);
analyze t1;
analyze t2;
analyze t3;
analyze t4;
analyze t5;
analyze t6;
explain select count(*) from t1 join t2 on sk1=sk2 join t3 on sk2=sk3 join t4 on sk3=sk4 join t5 on sk4=sk5 join t6 on sk5=sk6;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=2994791666679459822091698054365184.00..2994791666679459822091698054365184.00 rows=1 width=8)
   ->  Gather  (cost=2994791666679459822091698054365184.00..2994791666679459822091698054365184.00 rows=2 width=8)
         Workers Planned: 4
         ->  Partial Aggregate  (cost=2994791666679459822091698054365184.00..2994791666679459822091698054365184.00 rows=1 width=8)
               ->  Parallel Hash Join  (cost=9195963541715053182976.00..1953125000012793188098301096886272.00 rows=416666666666666672044102856701116416 width=0)
                     Hash Cond: (t1.sk1 = t3.sk3)
                     ->  Merge Join  (cost=180939.82..6250190523.16 rows=416666666667 width=8)
                           Merge Cond: (t1.sk1 = t2.sk2)
                           ->  Sort  (cost=53182.48..54224.15 rows=416667 width=4)
                                 Sort Key: t1.sk1
                                 ->  Parallel Seq Scan on t1  (cost=0.00..8591.67 rows=416667 width=4)
                           ->  Materialize  (cost=127757.34..132757.34 rows=1000000 width=4)
                                 ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=4)
                                       Sort Key: t2.sk2
                                       ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=4)
                     ->  Parallel Hash  (cost=1953125000038385975296.00..1953125000038385975296.00 rows=416666666666666659676160 width=16)
                           ->  Parallel Hash Join  (cost=23086308963.32..1953125000038385975296.00 rows=416666666666666659676160 width=16)
                                 Hash Cond: (t3.sk3 = t5.sk5)
                                 ->  Merge Join  (cost=180939.82..6250190523.16 rows=416666666667 width=8)
                                       Merge Cond: (t3.sk3 = t4.sk4)
                                       ->  Sort  (cost=53182.48..54224.15 rows=416667 width=4)
                                             Sort Key: t3.sk3
                                             ->  Parallel Seq Scan on t3  (cost=0.00..8591.67 rows=416667 width=4)
                                       ->  Materialize  (cost=127757.34..132757.34 rows=1000000 width=4)
                                             ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=4)
                                                   Sort Key: t4.sk4
                                                   ->  Seq Scan on t4  (cost=0.00..14425.00 rows=1000000 width=4)
                                 ->  Parallel Hash  (cost=6250190523.16..6250190523.16 rows=416666666667 width=8)
                                       ->  Merge Join  (cost=180939.82..6250190523.16 rows=416666666667 width=8)
                                             Merge Cond: (t5.sk5 = t6.sk6)
                                             ->  Sort  (cost=53182.48..54224.15 rows=416667 width=4)
                                                   Sort Key: t5.sk5
                                                   ->  Parallel Seq Scan on t5  (cost=0.00..8591.67 rows=416667 width=4)
                                             ->  Materialize  (cost=127757.34..132757.34 rows=1000000 width=4)
                                                   ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=4)
                                                         Sort Key: t6.sk6
                                                         ->  Seq Scan on t6  (cost=0.00..14425.00 rows=1000000 width=4)
Please notice that this query doesn't cause the alloc error.
But I wonder if we should limit the maximal number of estimated rows, because it seems clear that no system can process 10^25 rows in any reasonable amount of time. So if we have such an estimate, then either there is some error in the estimation (and the query fortunately finishes much faster), or the query doesn't finish at all. But from a resource allocation point of view there is no sense in trying to allocate resources for 10^25 rows.
I am still trying to understand the reason for the DSA overflow in hash join.
In addition to the two suspicious places where the number of buckets is doubled without a check for overflow (nodeHash.c:1668 and nodeHash.c:3290),
there is one more place where the number of batches is multiplied by `EstimateParallelHashJoinBatch(hashtable)`, which is
sizeof(ParallelHashJoinBatch) + (sizeof(SharedTuplestore) + sizeof(SharedTuplestoreParticipant) * participants) * 2
which is 480 bytes!
But when we calculate the maximal number of batches, we limit it by the maximal number of pointers (8 bytes):
max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
/* If max_pointers isn't a power of 2, must round it down to one */
max_pointers = pg_prevpower2_size_t(max_pointers);
/* Also ensure we avoid integer overflow in nbatch and nbuckets */
/* (this step is redundant given the current value of MaxAllocSize) */
max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
dbuckets = Min(dbuckets, max_pointers);
nbuckets = (int) dbuckets;
But as we can see, the multiplier here is 480 bytes, not 8 bytes.
Below is a script to reproduce the problem:
CREATE TABLE IF NOT EXISTS t0(c0 FLOAT, PRIMARY KEY(c0)) WITH (parallel_workers=966);
CREATE TABLE t2(c0 DECIMAL, c1 int4range ) WITH (parallel_workers=393);
CREATE TABLE t4(LIKE t2);
CREATE TABLE t5(LIKE t0);
INSERT INTO t4(c0) VALUES(0.5934077416223362);
set work_mem='10MB';
set max_parallel_workers_per_gather=5;
explain SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) as count FROM ONLY t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON (upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROM ONLY t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0) as res;
SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) as count FROM ONLY t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON (upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROM ONLY t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0) as res;
And attached please find a patch fixing the issue.
On 31/07/2025 18:13, Konstantin Knizhnik wrote:
> On 27/07/2025 8:24 PM, Konstantin Knizhnik wrote:
>> I am still trying to understand the reason for the DSA overflow in hash join.
>> In addition to the two suspicious places where the number of buckets is
>> doubled without a check for overflow (nodeHash.c:1668 and nodeHash.c:3290),
>> there is one more place where the number of batches is multiplied by
>> `EstimateParallelHashJoinBatch(hashtable)`, which is
>>
>> sizeof(ParallelHashJoinBatch) + (sizeof(SharedTuplestore) +
>> sizeof(SharedTuplestoreParticipant) * participants) * 2
>>
>> which is 480 bytes!
>>
>> But when we calculate the maximal number of batches, we limit it by the
>> maximal number of pointers (8 bytes):
>> [...]
>> But as we can see, the multiplier here is 480 bytes, not 8 bytes.
>
> Below is a script to reproduce the problem:
> [...]

Great repro, thanks!

> And attached please find a patch fixing the issue.

There are a lot of allocations in hash join. For example this in
ExecParallelHashEnsureBatchAccessors():

    /* Allocate this backend's accessor array. */
    hashtable->nbatch = pstate->nbatch;
    hashtable->batches =
        palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch);

That could also exceed MaxAllocSize, right? I think we need to take that
into account in calculating the max number of batches, too.

Is this only a problem for parallel hash joins?

In general, it's really hard to follow the logic of where we enforce
what limits in ExecChooseHashTableSize(). How can we be sure that we
accounted for all allocations? Could we have more (static) assertions or
something, to ensure we've covered everything?

A different approach might be to make ExecChooseHashTableSize() return
just estimates, and enforce the hard limits later. But that could lead
to worse decisions: it's good for ExecChooseHashTableSize() to take the
maximum limits into account.

On the proposed patch:

> @@ -775,6 +777,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>      /* If max_pointers isn't a power of 2, must round it down to one */
>      max_pointers = pg_prevpower2_size_t(max_pointers);
>
> +    /*
> +     * Prevent DSA overflow. This is expanded definition of EstimateParallelHashJoinBatch function
> +     * used in ExecParallelHashJoinSetUpBatches:
> +     *    dsa_allocate0(hashtable->area,
> +     *                  EstimateParallelHashJoinBatch(hashtable) * nbatch)
> +     */
> +    max_batches = MaxAllocSize / (MAXALIGN(sizeof(ParallelHashJoinBatch)) +
> +                                  MAXALIGN(sts_estimate(max_parallel_workers_per_gather + 1) * 2));
> +    max_batches = pg_prevpower2_size_t(max_batches);
> +
>      /* Also ensure we avoid integer overflow in nbatch and nbuckets */
>      /* (this step is redundant given the current value of MaxAllocSize) */
>      max_pointers = Min(max_pointers, INT_MAX / 2 + 1);

Would be nice to not copy the macro here..

BTW I don't like the name EstimateParallelHashJoinBatch(). It's not just
an estimate, it's used to allocate memory for an array, and it better be
accurate. See also NthParallelHashJoinBatch.

> @@ -910,7 +923,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>       * this during the initial sizing - once we start building the hash,
>       * nbucket is fixed.
>       */
> -    while (nbatch > 0)
> +    while (nbatch > 0 &&
> +           nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */

Hmm, that doesn't seem quite right. 'max_pointers' is derived from
work_mem, but the point of this loop is to reduce memory usage, when
you're already over the work_mem limit. I think we should use a hard
limit derived only from MaxAllocSize here.

>      {
>          /* how much memory are we using with current nbatch value */
>          size_t        current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);

Hmm, should we add the memory usage calculated by
EstimateParallelHashJoinBatch() to this per-batch overhead, too?

- Heikki
Thank you for the review!

On 14/08/2025 10:34 PM, Heikki Linnakangas wrote:
> On 31/07/2025 18:13, Konstantin Knizhnik wrote:
>> On 27/07/2025 8:24 PM, Konstantin Knizhnik wrote:
>>> I am still trying to understand the reason for the DSA overflow in hash join.
>>> [...]
>>> But as we can see, the multiplier here is 480 bytes, not 8 bytes.
>>
>> Below is a script to reproduce the problem:
>> [...]
>
> Great repro, thanks!
>
>> And attached please find a patch fixing the issue.
>
> There are a lot of allocations in hash join. For example this in
> ExecParallelHashEnsureBatchAccessors():
>
>     /* Allocate this backend's accessor array. */
>     hashtable->nbatch = pstate->nbatch;
>     hashtable->batches =
>         palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch);
>
> That could also exceed MaxAllocSize, right? I think we need to take
> that into account in calculating the max number of batches, too.

Yes, I concentrated on DSA and didn't notice it. I have also added it to
the limit on the number of batches.

> Is this only a problem for parallel hash joins?
>
> In general, it's really hard to follow the logic of where we enforce
> what limits in ExecChooseHashTableSize(). How can we be sure that we
> accounted for all allocations? Could we have more (static) assertions
> or something, to ensure we've covered everything?

Definitely I cannot guarantee that there are no more such places in the
Postgres code. But at least among the executor files, only nodeHash
takes MaxAllocSize into account when choosing hash table parameters.
Certainly, if some module doesn't take `MaxAllocSize` into account, it
doesn't mean that there can be no OOM problems (either in a Postgres
memory context or in shared memory). Maybe the OOM problem was just not
considered...

> A different approach might be to make ExecChooseHashTableSize() return
> just estimates, and enforce the hard limits later. But that could lead
> to worse decisions: it's good for ExecChooseHashTableSize() to take
> the maximum limits into account.

Yes, I think that the number of batches should be correlated with the
number of buckets. So it is better to restrict their values from the
very beginning (in ExecChooseHashTableSize) rather than do it right
before allocation.

> On the proposed patch:
>
>> [...]
>
> Would be nice to not copy the macro here..

Yeah... extracting the macro looks really ugly. But I don't know how to
avoid it, because the macro uses a pointer to the hash table, which is
not allocated yet. What is worse, the result depends on the number of
participants, which is also not known at this stage. So I have to use an
upper limit (`max_parallel_workers_per_gather`) instead.

> BTW I don't like the name EstimateParallelHashJoinBatch(). It's not
> just an estimate, it's used to allocate memory for an array, and it
> better be accurate. See also NthParallelHashJoinBatch.
>
>> @@ -910,7 +923,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>>       * this during the initial sizing - once we start building the hash,
>>       * nbucket is fixed.
>>       */
>> -    while (nbatch > 0)
>> +    while (nbatch > 0 &&
>> +           nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */
>
> Hmm, that doesn't seem quite right. 'max_pointers' is derived from
> work_mem, but the point of this loop is to reduce memory usage, when
> you're already over the work_mem limit. I think we should use a hard
> limit derived only from MaxAllocSize here.

Fixed.

>>      {
>>          /* how much memory are we using with current nbatch value */
>>          size_t        current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
>
> Hmm, should we add the memory usage calculated by
> EstimateParallelHashJoinBatch() to this per-batch overhead, too?

We should also take into account
`palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch)` - it
is ~80*nbatches. But maybe this multiplier (80), as well as the
multiplier calculated by EstimateParallelHashJoinBatch (480), can be
ignored compared with BLCKSZ=8kB? Maybe it will not have much influence
on the choice of the optimal ratio between the number of buckets and
batches?

Attached please find a new version of the patch (without changing the
current_space calculation).
Hi,

I did look at this because of the thread about "nbatch overflow" [1].
And the patches I just posted in that thread resolve the issue for me,
in the sense that the reproducer [2] no longer fails for me.

But I think that's actually mostly an accident - the balancing reduces
nbatch, exchanging it for in-memory hash table. In this case we start
with nbatch=2M, but it gets reduced to 64k. Which is low enough to fit
into the 1GB allocation limit.

Which is nice, but I can't guarantee it will always work out like this.
It's unlikely we'd need 2M batches, but is it impossible? So we may
still need something like the max_batches protection. I don't think we
should apply this to non-parallel hash joins, though. Which is what the
last patch would do, I think.

However, why don't we simply allow huge allocations for this?

    /* Allocate space. */
    pstate->batches =
        dsa_allocate_extended(hashtable->area,
                              EstimateParallelHashJoinBatch(hashtable) * nbatch,
                              (DSA_ALLOC_ZERO | DSA_ALLOC_HUGE));

This fixes the issue for me, even with the balancing disabled. Or is
there a reason why this would be a bad idea? It seems a bit strange to
force parallel scans to use fewer batches, when (presumably) parallelism
is more useful for larger data sets.

regards

[1] https://www.postgresql.org/message-id/244dc6c1-3b3d-4de2-b3de-b1511e6a6d10%40vondra.me
[2] https://www.postgresql.org/message-id/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru

--
Tomas Vondra