Discussion: Fix overflow of nbatch
Hi Everyone,
With a1b4f28, to compute current_space, nbatch is being multiplied
by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
easily overflow the int limit. To keep the calculation safe for
current_space, convert nbatch to size_t.
Please find a patch for the same.
Thanks,
Vaibhav
Attachments
On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
> With a1b4f28, to compute current_space, nbatch is being multiplied
> by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
> easily overflow the int limit. To keep the calculation safe for
> current_space, convert nbatch to size_t.

Thanks for finding and reporting this.

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
the above two lines. All that code is new in a1b4f289b, and if you
change the nbatch to a size_t it affects the code that existed before
a1b4f289b, and from looking over that code, it seems to ensure that
nbatch does not overflow int by doing "dbatch = Min(dbatch,
max_pointers);", where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).

Also, making nbatches size_t makes the final line of "*numbatches =
nbatch;" somewhat questionable since numbatches is an int pointer.

David
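For reference, the suggested cast amounts to something like the following sketch (variable names are taken from the two lines quoted above; the committed change may differ):

```c
/* Sketch only: force the multiplication into size_t before it can
 * overflow int.  Variable names follow the lines quoted above. */
size_t current_space = hash_table_bytes + (2 * (size_t) nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + ((size_t) nbatch * BLCKSZ);
```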
On 9/22/25 22:45, David Rowley wrote:
> On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
>> With a1b4f28, to compute current_space, nbatch is being multiplied
>> by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
>> easily overflow the int limit. To keep the calculation safe for
>> current_space, convert nbatch to size_t.
>
> Thanks for finding and reporting this.
>
> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
> in the following two lines because the final result is a size_t:
>
> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
>

Yeah, I failed to notice this part of the formula can overflow.

> I'd rather see this fixed by adding a cast, i.e.: "(size_t) nbatch" on
> the above two lines. All that code is new in a1b4f289b, and if you
> change the nbatch to a size_t it affects the code that existed before
> a1b4f289b, and from looking over that code, it seems to ensure that
> nbatch does not overflow int by doing "dbatch = Min(dbatch,
> max_pointers);", where max_pointers is constrained by MaxAllocSize /
> sizeof(HashJoinTuple).
>
> Also, making nbatches size_t makes the final line of "*numbatches =
> nbatch;" somewhat questionable since numbatches is an int pointer.
>

It seems cleaner to add the cast to the formula, if only because of the
assignment to numbatches.

thanks

--
Tomas Vondra
On Sep 22, 2025, at 21:20, Vaibhav Jain <jainva@google.com> wrote:
> Hi Everyone,
>
> With a1b4f28, to compute current_space, nbatch is being multiplied
> by BLCKSZ. nbatch is int and when multiplied with BLCKSZ, it can
> easily overflow the int limit. To keep the calculation safe for
> current_space, convert nbatch to size_t.
>
> Please find a patch for the same.
>
> Thanks,
> Vaibhav
>
> <0001-Fix-overflow-of-nbatch.patch>
I guess that's because earlier in the function, nbatch is always
clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

So, in practice, nbatch won't grow very large. But yes, if nbatch
reaches, say, 1 million, it will overflow.
A simple program shows that changing nbatch to size_t prevents the calculation from overflowing:
```
#include <stdio.h>
int main(void){
    size_t nbatch = 1000000; // 1 million
    int BLCKSZ = 8192;
    size_t result = 2 * nbatch * BLCKSZ;
    printf("%zu\n", result); // will output 16384000000
    return 0;
}
```
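For contrast, here is a companion sketch (an illustration added for this discussion, not taken from the thread) that computes the threshold at which the original int arithmetic would overflow, rather than triggering the undefined signed overflow itself:

```c
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int blcksz = 8192;  /* stand-in for BLCKSZ with 8kB blocks */

    /* 2 * nbatch * BLCKSZ stays within int only while
     * nbatch <= INT_MAX / (2 * BLCKSZ), i.e. 131071 here. */
    printf("int overflow once nbatch > %d\n", INT_MAX / (2 * blcksz));
    return 0;
}
```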
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
> I guess that's because earlier in the function, nbatch is always clamped with:
>
> nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line could be constituted as
clamping. Maybe you've confused Max with Min?

David
On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
> On 9/22/25 22:45, David Rowley wrote:
>> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
>> in the following two lines because the final result is a size_t:
>>
>> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
>> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
>>
>
> Yeah, I failed to notice this part of the formula can overflow.

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

David
On 9/23/25 02:02, David Rowley wrote:
> On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
>> On 9/22/25 22:45, David Rowley wrote:
>>> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
>>> in the following two lines because the final result is a size_t:
>>>
>>> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
>>> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
>>>
>>
>> Yeah, I failed to notice this part of the formula can overflow.
>
> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
> over, should I take care of this, or do you want to?
>

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

However, looking at the code now, I think the code should adjust the
hash_table_bytes value, not just space_allowed. It's meant to be the
same thing. Will check tomorrow.

thanks

--
Tomas Vondra
On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>
> On 9/23/25 02:02, David Rowley wrote:
>> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
>> over, should I take care of this, or do you want to?
>>
>
> Feel free to fix, but I can take care of it once 18 is out the door.
> It's my bug, after all.
>
> BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.

> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
> 8KB blocks), which is a lot. But with the balancing logic, it'd also
> mean each batch is about ~2GB. So the whole "hash table" would be about
> 500GB. Possible, but unlikely.

I think no matter how low the chances of overflow are, the code isn't
written the way it was intended to be, so it should just be put the
way it was intended to be without question of the likelihood of
overflow. Otherwise, we'll just end up with harder to hit bugs which
could take much longer to [re]discover. Also, in these terms, what's
unlikely today may not be in the future.

David
On Sep 23, 2025, at 07:35, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
>> I guess that's because earlier in the function, nbatch is always clamped with:
>>
>> nbatch = pg_nextpower2_32(Max(2, minbatch));
>
> I don't follow which part of that line could be constituted as
> clamping. Maybe you've confused Max with Min?
>
> David

Sorry for the misleading wording. I actually meant "minbatch".
I have traced the function several times before. First, with a normal
(not much data involved) query, the condition

if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{

is hard to meet, so nbatch will be just 1. With big data involved, it
will enter the "if" clause, but minbatch is still hard to push very
high.
To clarify, I just created a test:
```
evantest=# SET enable_nestloop = off;
SET
evantest=# SET enable_mergejoin = off;
SET
evantest=# SET enable_hashjoin = on;
SET
evantest=# CREATE TEMP TABLE inner_tbl AS
evantest-# SELECT g AS id, repeat('x', 2000) AS filler
evantest-# FROM generate_series(1, 200000) g;
SELECT 200000
evantest=# CREATE TEMP TABLE outer_tbl AS
evantest-# SELECT g AS id FROM generate_series(1, 1000000) g;
SELECT 1000000
evantest=#
evantest=#
evantest=# EXPLAIN ANALYZE
evantest-# SELECT *
evantest-# FROM outer_tbl o
evantest-# JOIN inner_tbl i
evantest-# ON o.id = i.id;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=34647.00..1270978635.52 rows=36306020352 width=40) (actual time=353.908..1355.735 rows=200000.00 loops=1)
   Hash Cond: (i.id = o.id)
   Buffers: local read=54528 dirtied=54425 written=54418, temp read=45853 written=45853
   ->  Seq Scan on inner_tbl i  (cost=0.00..113608.96 rows=6356096 width=36) (actual time=1.132..460.711 rows=200000.00 loops=1)
         Buffers: local read=50048 dirtied=50000 written=49993
   ->  Hash  (cost=15904.00..15904.00 rows=1142400 width=4) (actual time=351.280..351.282 rows=1000000.00 loops=1)
         Buckets: 262144  Batches: 8  Memory Usage: 6446kB
         Buffers: local read=4480 dirtied=4425 written=4425, temp written=2560
         ->  Seq Scan on outer_tbl o  (cost=0.00..15904.00 rows=1142400 width=4) (actual time=0.760..162.229 rows=1000000.00 loops=1)
               Buffers: local read=4480 dirtied=4425 written=4425
 Planning:
   Buffers: shared hit=14
 Planning Time: 389649.420 ms
 Execution Time: 1362.392 ms
(14 rows)
```
In this test, minbatch is just 64.
But I agree, I have never tested with a really large amount of data. I don't actually know how much data it takes for nbatch to reach ~130K (the value at which the calculation overflows while nbatch is of int type).
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Thank you everyone for the reviews, feedback and the test.
Please find attached a new version of the patch.
Attachments
On 9/23/25 03:20, David Rowley wrote:
> On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/23/25 02:02, David Rowley wrote:
>>> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
>>> over, should I take care of this, or do you want to?
>>>
>>
>> Feel free to fix, but I can take care of it once 18 is out the door.
>> It's my bug, after all.
>>
>> BTW ExecHashIncreaseBatchSize needs the same fix, I think.
>
> I think it's probably best you handle this. I didn't notice that one.
> You know this area much better than I do.
>

OK, will do.

>> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
>> 8KB blocks), which is a lot. But with the balancing logic, it'd also
>> mean each batch is about ~2GB. So the whole "hash table" would be about
>> 500GB. Possible, but unlikely.
>
> I think no matter how low the chances of overflow are, the code isn't
> written the way it was intended to be, so it should just be put the
> way it was intended to be without question of the likelihood of
> overflow. Otherwise, we'll just end up with harder to hit bugs which
> could take much longer to [re]discover. Also, in these terms, what's
> unlikely today may not be in the future.
>

I wasn't disputing the validity of the bug. I was just thinking aloud
about how likely it is to hit.

regards

--
Tomas Vondra
Hi,

I kept looking at this, and unfortunately it seems a bit worse :-(

Fixing the overflow is easy enough - adding the cast does the trick.

But then there's the second issue I mentioned - the loop does not
adjust the hash_table_bytes. It updates the *space_allowed, but that's
not what the current_space/new_space formulas use.

This breaks the "balancing", as nbatch gets decreased until

nbatch <= (work_mem / BLCKSZ)

while the hash table "space_allowed" keeps increasing. This may result
in an *increased* memory usage :-(

I also noticed the code does not clamp nbuckets properly as it should.

The question is what to do about this. If we got this a week ago, I'd
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it's meant to address is somewhat rare.

But with 18 already stamped ...

I've shared these findings with the rest of the RMT, I'll see what
their thoughts are. Of course, other opinions/suggestions are welcome.

regards

--
Tomas Vondra
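To make the described behaviour easier to follow, here is a toy simulation of the balancing loop (an illustration built from the two formulas quoted earlier in the thread, not the actual PostgreSQL source; the loop structure is an assumption):

```c
/*
 * Toy simulation of the balancing loop described above: nbatch keeps
 * halving until nbatch <= hash_table_bytes / BLCKSZ, while the separate
 * space_allowed value keeps doubling because hash_table_bytes itself is
 * never adjusted.
 */
#include <stdio.h>

int main(void)
{
    const size_t BLCKSZ = 8192;
    size_t hash_table_bytes = 4 * 1024 * 1024;   /* pretend work_mem = 4MB */
    size_t space_allowed = hash_table_bytes;
    int    nbatch = 131072;                      /* deliberately large */

    while (nbatch > 0)
    {
        size_t current_space = hash_table_bytes + (2 * (size_t) nbatch * BLCKSZ);
        size_t new_space = hash_table_bytes * 2 + ((size_t) nbatch * BLCKSZ);

        if (new_space >= current_space)
            break;          /* halving the batches no longer reduces memory */

        nbatch /= 2;
        space_allowed *= 2; /* the limit grows, but hash_table_bytes does not */
    }

    /* prints: nbatch 512, space_allowed 1073741824, hash_table_bytes 4194304 */
    printf("nbatch %d, space_allowed %zu, hash_table_bytes %zu\n",
           nbatch, space_allowed, hash_table_bytes);
    return 0;
}
```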
Tomas Vondra <tomas@vondra.me> writes:
> The question is what to do about this. If we got this a week ago, I'd
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it's meant to address is somewhat rare.
> But with 18 already stamped ...

18.0 is what it is. If this is the most serious bug in it, I'll be
a bit surprised. Take your time, fix it properly.

			regards, tom lane
On Tue, Sep 23, 2025 at 11:34:56AM -0400, Tom Lane wrote:
> 18.0 is what it is. If this is the most serious bug in it, I'll be
> a bit surprised. Take your time, fix it properly.

+1

--
nathan
On 23/09/2025 6:11 PM, Tomas Vondra wrote:
> Hi,
>
> I kept looking at this, and unfortunately it seems a bit worse :-(
>
> Fixing the overflow is easy enough - adding the cast does the trick.
>
> But then there's the second issue I mentioned - the loop does not
> adjust the hash_table_bytes. It updates the *space_allowed, but that's
> not what the current_space/new_space formulas use.
>
> This breaks the "balancing", as nbatch gets decreased until
>
> nbatch <= (work_mem / BLCKSZ)
>
> while the hash table "space_allowed" keeps increasing. This may result
> in an *increased* memory usage :-(
>
> I also noticed the code does not clamp nbuckets properly as it should.
>
> The question is what to do about this. If we got this a week ago, I'd
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it's meant to address is somewhat rare.
>
> But with 18 already stamped ...
>
> I've shared these findings with the rest of the RMT, I'll see what
> their thoughts are. Of course, other opinions/suggestions are welcome.
>
> regards

Hi Tomas,

If you are going to investigate this problem more, can you also look at
the related problem:

https://www.postgresql.org/message-id/flat/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936

I proposed the patch but got no feedback.

Best regards,
Konstantin
On 9/23/25 21:13, Konstantin Knizhnik wrote:
> Hi Tomas,
>
> If you are going to investigate this problem more, can you also look at
> the related problem:
>
> https://www.postgresql.org/message-id/flat/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936
>
> I proposed the patch but got no feedback.

Thanks, I'll take a look. I wasn't aware of that thread (or rather I
didn't realize it might have been related to this).

And AFAICS the max_batches business is separate. But the last bit seems
very relevant:

-    while (nbatch > 0)
+    while (nbatch > 0 &&
+           nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */

I believe this is the missing "nbuckets clamping" that I mentioned
above. But I think it needs to work a bit differently. The max_pointers
is calculated from work_mem, but the whole point here is to grow the
hash table beyond that. I think it makes sense to relax that limit, and
allow up to MaxAllocSize, or something like that.

regards

--
Tomas Vondra
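For illustration only, the relaxed clamp described above might look roughly like the following hypothetical sketch (it reuses the MaxAllocSize / HashJoinTuple expression mentioned earlier in the thread; the actual patch may express this differently):

```c
/* Hypothetical sketch, not the actual patch: bound the nbuckets doubling
 * by what a single allocation can hold, instead of the work_mem-derived
 * max_pointers. */
max_buckets = MaxAllocSize / sizeof(HashJoinTuple);

while (nbatch > 0 && (size_t) nbuckets * 2 <= max_buckets)
{
    /* ... balancing loop body as before ... */
}
```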
Hi,

Here's a couple of draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow

- 0002 fixes the balancing, by adjusting the hash table size limit

- 0003 adds the missing overflow protection for nbuckets and the hash
  table limit

- 0004 rewords the comment explaining how the balancing works. Reading
  it after a couple months, I found it overly verbose / not very clear.
  I'm sure it could be improved even further.

0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).

At some point I started wondering whether it would have been simpler to
use the algebraic solution posted by James Hunter back in February [1].
It'd not need the loop, but it'd still need all this new overflow
protection etc.

I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:

  create table t (a bigint, b text);
  insert into t select i, md5(i::text)
    from generate_series(1,4000000000) s(i);

and I added this log message at the end of ExecChooseHashTableSize:

  elog(WARNING, "wm %d nbatch %d nbucket %d space %ld total %ld",
       work_mem, nbatch, nbuckets,
       (*space_allowed)/1024,
       (*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);

and I ran an explain on a self-join

  set enable_mergejoin = off;
  set max_parallel_workers_per_gather = 0;
  set work_mem = '...';

  explain select * from t t1 join t t2 on (t1.a = t2.a);

with work_mem set to values between 64kB and 1GB. On 18.0 I got this:

  wm 64      nbatch 8    nbucket 2097152  hash 131072  total 131200
  wm 128     nbatch 16   nbucket 4194304  hash 262144  total 262400
  wm 256     nbatch 32   nbucket 8388608  hash 524288  total 524800
  wm 512     nbatch 64   nbucket 16777216 hash 1048576 total 1049600
  wm 1024    nbatch 128  nbucket 33554432 hash 2097152 total 2099200
  wm 2048    nbatch 256  nbucket 33554432 hash 2097152 total 2101248
  wm 4096    nbatch 512  nbucket 16777216 hash 1048576 total 1056768
  wm 8192    nbatch 1024 nbucket 8388608  hash 524288  total 540672
  wm 16384   nbatch 2048 nbucket 4194304  hash 262144  total 294912
  wm 32768   nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 65536   nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 131072  nbatch 2048 nbucket 4194304  hash 262144  total 294912
  wm 262144  nbatch 1024 nbucket 8388608  hash 524288  total 540672
  wm 524288  nbatch 512  nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256  nbucket 33554432 hash 2097152 total 2101248

I wanted to know how serious the issue is, compared to what would
happen without the balancing.
I disabled the balancing (by skipping the loop), and then I get this:

  wm 64      nbatch 8192   nbucket 2048     hash 128     total 131200
  wm 128     nbatch 16384  nbucket 4096     hash 256     total 262400
  wm 256     nbatch 32768  nbucket 8192     hash 512     total 524800
  wm 512     nbatch 65536  nbucket 16384    hash 1024    total 1049600
  wm 1024    nbatch 131072 nbucket 32768    hash 2048    total 2099200
  wm 2048    nbatch 131072 nbucket 65536    hash 4096    total 2101248
  wm 4096    nbatch 65536  nbucket 131072   hash 8192    total 1056768
  wm 8192    nbatch 32768  nbucket 262144   hash 16384   total 540672
  wm 16384   nbatch 16384  nbucket 524288   hash 32768   total 294912
  wm 32768   nbatch 8192   nbucket 1048576  hash 65536   total 196608
  wm 65536   nbatch 4096   nbucket 2097152  hash 131072  total 196608
  wm 131072  nbatch 2048   nbucket 4194304  hash 262144  total 294912
  wm 262144  nbatch 1024   nbucket 8388608  hash 524288  total 540672
  wm 524288  nbatch 512    nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256    nbucket 33554432 hash 2097152 total 2101248

The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as for 18.0 with and
without balancing. IIUC this is due to the "stop" condition using the
initial hash table size. It makes me a bit less worried about this
triggering OOM crashes - it does not improve the behavior, but it
doesn't use more memory than before. Still an embarrassing bug, though.

With the attached patches, this looks like this:

  wm 64      nbatch 256  nbucket 65536    hash 4096    total 8192
  wm 128     nbatch 512  nbucket 131072   hash 8192    total 16384
  wm 256     nbatch 1024 nbucket 262144   hash 16384   total 32768
  wm 512     nbatch 2048 nbucket 524288   hash 32768   total 65536
  wm 1024    nbatch 4096 nbucket 1048576  hash 65536   total 131072
  wm 2048    nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 4096    nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 8192    nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 16384   nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 32768   nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 65536   nbatch 4096 nbucket 2097152  hash 131072  total 196608
  wm 131072  nbatch 2048 nbucket 4194304  hash 262144  total 294912
  wm 262144  nbatch 1024 nbucket 8388608  hash 524288  total 540672
  wm 524288  nbatch 512  nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256  nbucket 33554432 hash 2097152 total 2101248

So, this time it actually seems to work correctly and significantly
reduces the memory usage ...

There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.

The reason is that when calculating nbatch we do this:

  dbatch = Min(dbatch, max_pointers);

and max_pointers is calculated from work_mem (among other things).

It's a bit funny the logic worries about how many batch pointers we
have, and refuses to allow more. But at the same time it ignores the
BufFiles.

AFAICS it's harmless - we may pick a low number of batches initially,
but then later we'll ramp it up (and the balancing will work too). And
if you choose to run huge hash joins with tiny work_mem, I guess you're
in for the suffering anyway. In any case, it's unrelated to balancing.

regards

[1] https://www.postgresql.org/message-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_%2BzecoZYMU%2Bdn3BVVaZg%40mail.gmail.com

--
Tomas Vondra