Thread: Fix overflow of nbatch


Fix overflow of nbatch

From: Vaibhav Jain
Date:
Hi Everyone,

With a1b4f28, to compute current_space, nbatch is multiplied
by BLCKSZ. nbatch is an int, and when multiplied with BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.
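
For a standalone illustration (a minimal sketch, not taken from the patch):
with 8KB blocks, 2 * nbatch * BLCKSZ first exceeds INT_MAX at
nbatch = 131072, since 2 * 131072 * 8192 = 2^31.

```
#include <stdio.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

int main(void)
{
    int    nbatch = 131072;     /* 2 * 131072 * 8192 == 2^31 > INT_MAX */

    /* int arithmetic: the product exceeds INT_MAX; signed overflow is
     * undefined behavior in C and in practice wraps to a negative value */
    int    bad = 2 * nbatch * BLCKSZ;

    /* converting nbatch to size_t performs the whole product in 64 bits
     * (on typical 64-bit platforms) */
    size_t ok = 2 * (size_t) nbatch * BLCKSZ;

    printf("int: %d, size_t: %zu\n", bad, ok);
    return 0;
}
```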

Please find a patch for the same.

Thanks,
Vaibhav
Attachments

Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
> With a1b4f28, to compute current_space, nbatch is multiplied by
> BLCKSZ. nbatch is an int, and when multiplied with BLCKSZ it can
> easily overflow the int limit. To keep the calculation of
> current_space safe, convert nbatch to size_t.

Thanks for finding and reporting this.

I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
in the following two lines because the final result is a size_t:

size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);

I'd rather see this fixed by adding a cast, i.e. "(size_t) nbatch", on
the above two lines. All that code is new in a1b4f289b, and if you
change nbatch to a size_t, it affects the code that existed before
a1b4f289b. From looking over that code, it seems to ensure that nbatch
does not overflow int by doing "dbatch = Min(dbatch, max_pointers);",
where max_pointers is constrained by MaxAllocSize /
sizeof(HashJoinTuple).
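
With that cast, the two lines would look roughly like this:

size_t current_space = hash_table_bytes + (2 * (size_t) nbatch * BLCKSZ);
size_t new_space = hash_table_bytes * 2 + ((size_t) nbatch * BLCKSZ);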

Also, making nbatch a size_t makes the final line of "*numbatches =
nbatch;" somewhat questionable, since numbatches is an int pointer.

David



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/22/25 22:45, David Rowley wrote:
> On Tue, 23 Sept 2025 at 01:57, Vaibhav Jain <jainva@google.com> wrote:
>> With a1b4f28, to compute current_space, nbatch is multiplied by
>> BLCKSZ. nbatch is an int, and when multiplied with BLCKSZ it can
>> easily overflow the int limit. To keep the calculation of
>> current_space safe, convert nbatch to size_t.
> 
> Thanks for finding and reporting this.
> 
> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
> in the following two lines because the final result is a size_t:
> 
> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
> 

Yeah, I failed to notice this part of the formula can overflow.

> I'd rather see this fixed by adding a cast, i.e. "(size_t) nbatch", on
> the above two lines. All that code is new in a1b4f289b, and if you
> change nbatch to a size_t, it affects the code that existed before
> a1b4f289b. From looking over that code, it seems to ensure that nbatch
> does not overflow int by doing "dbatch = Min(dbatch, max_pointers);",
> where max_pointers is constrained by MaxAllocSize /
> sizeof(HashJoinTuple).
> 
> Also, making nbatch a size_t makes the final line of "*numbatches =
> nbatch;" somewhat questionable, since numbatches is an int pointer.
> 

It seems cleaner to add the cast to the formula, if only because of
the assignment to numbatches.


thanks

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Chao Li
Date:


On Sep 22, 2025, at 21:20, Vaibhav Jain <jainva@google.com> wrote:

Hi Everyone,

With a1b4f28, to compute current_space, nbatch is multiplied
by BLCKSZ. nbatch is an int, and when multiplied with BLCKSZ it can
easily overflow the int limit. To keep the calculation of
current_space safe, convert nbatch to size_t.

Please find a patch for the same.

Thanks,
Vaibhav
<0001-Fix-overflow-of-nbatch.patch>

I guess that's because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));
So, in practice, nbatch won't grow very big. But yes, if nbatch reaches, say, 1 million, it will overflow.

A simple program shows that changing nbatch to size_t prevents the overflow:

```
#include <stdio.h>

int main(void)
{
    size_t nbatch = 1000000; // 1 million
    int BLCKSZ = 8192;
    size_t result = 2 * nbatch * BLCKSZ;
    printf("%zu\n", result); // will output 16384000000
    return 0;
}
```
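
For contrast, if nbatch stays an int, the same product is computed in 32-bit
arithmetic: 16,384,000,000 is well above INT_MAX (2,147,483,647), so the
multiplication overflows (undefined behavior for signed int in C).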

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
> I guess that's because earlier in the function, nbatch is always clamped with:
>
> nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line could be construed as
clamping. Maybe you've confused Max with Min?

David



Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
> On 9/22/25 22:45, David Rowley wrote:
> > I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
> > in the following two lines because the final result is a size_t:
> >
> > size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
> > size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
> >
>
> Yeah, I failed to notice this part of the formula can overflow.

Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
over, should I take care of this, or do you want to?

David



Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
On 9/23/25 02:02, David Rowley wrote:
> On Tue, 23 Sept 2025 at 09:31, Tomas Vondra <tomas@vondra.me> wrote:
>> On 9/22/25 22:45, David Rowley wrote:
>>> I think a1b4f289b mistakenly thought that there'd be size_t arithmetic
>>> in the following two lines because the final result is a size_t:
>>>
>>> size_t current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
>>> size_t new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
>>>
>>
>> Yeah, I failed to notice this part of the formula can overflow.
> 
> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
> over, should I take care of this, or do you want to?
> 

Feel free to fix, but I can take care of it once 18 is out the door.
It's my bug, after all.

BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
8KB blocks), which is a lot. But with the balancing logic, it'd also
mean each batch is about ~2GB. So the whole "hash table" would be about
500GB. Possible, but unlikely.

However, looking at the code now, I think the code should adjust the
hash_table_bytes value, not just space_allowed. It's meant to be the
same thing. Will check tomorrow.


thanks

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: David Rowley
Date:
On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>
> On 9/23/25 02:02, David Rowley wrote:
> > Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
> > over, should I take care of this, or do you want to?
> >
>
> Feel free to fix, but I can take care of it once 18 is out the door.
> It's my bug, after all.
>
> BTW ExecHashIncreaseBatchSize needs the same fix, I think.

I think it's probably best you handle this. I didn't notice that one.
You know this area much better than I do.

> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
> 8KB blocks), which is a lot. But with the balancing logic, it'd also
> mean each batch is about ~2GB. So the whole "hash table" would be about
> 500GB. Possible, but unlikely.

I think no matter how low the chances of overflow are, the code isn't
written the way it was intended to be, so it should just be put the
way it was intended, regardless of the likelihood of overflow.
Otherwise, we'll just end up with harder-to-hit bugs which could take
much longer to [re]discover. Also, what's unlikely today may not be
unlikely in the future.

David



Re: Fix overflow of nbatch

From: Chao Li
Date:


On Sep 23, 2025, at 07:35, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 23 Sept 2025 at 11:21, Chao Li <li.evan.chao@gmail.com> wrote:
I guess that's because earlier in the function, nbatch is always clamped with:

nbatch = pg_nextpower2_32(Max(2, minbatch));

I don't follow which part of that line could be construed as
clamping. Maybe you've confused Max with Min?

David

Sorry for being misleading. I actually meant “minbatch”.

I have traced the function several times before. First, with a normal query (not much data involved), the condition

if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{

is hard to meet, so nbatch will be just 1.

With big data involved, it will enter the “if” clause, but minbatch is still unlikely to grow very high.

To clarify, I just created a test:

```
evantest=# SET enable_nestloop = off;
SET
evantest=# SET enable_mergejoin = off;
SET
evantest=# SET enable_hashjoin = on;
SET
evantest=# CREATE TEMP TABLE inner_tbl AS
evantest-# SELECT g AS id, repeat('x', 2000) AS filler
evantest-# FROM generate_series(1, 200000) g;
SELECT 200000
evantest=# CREATE TEMP TABLE outer_tbl AS
evantest-# SELECT g AS id FROM generate_series(1, 1000000) g;
SELECT 1000000
evantest=#
evantest=#
evantest=# EXPLAIN ANALYZE
evantest-# SELECT *
evantest-# FROM outer_tbl o
evantest-# JOIN inner_tbl i
evantest-#   ON o.id = i.id;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=34647.00..1270978635.52 rows=36306020352 width=40) (actual time=353.908..1355.735 rows=200000.00 loops=1)
   Hash Cond: (i.id = o.id)
   Buffers: local read=54528 dirtied=54425 written=54418, temp read=45853 written=45853
   ->  Seq Scan on inner_tbl i  (cost=0.00..113608.96 rows=6356096 width=36) (actual time=1.132..460.711 rows=200000.00 loops=1)
         Buffers: local read=50048 dirtied=50000 written=49993
   ->  Hash  (cost=15904.00..15904.00 rows=1142400 width=4) (actual time=351.280..351.282 rows=1000000.00 loops=1)
         Buckets: 262144  Batches: 8  Memory Usage: 6446kB
         Buffers: local read=4480 dirtied=4425 written=4425, temp written=2560
         ->  Seq Scan on outer_tbl o  (cost=0.00..15904.00 rows=1142400 width=4) (actual time=0.760..162.229 rows=1000000.00 loops=1)
               Buffers: local read=4480 dirtied=4425 written=4425
 Planning:
   Buffers: shared hit=14
 Planning Time: 389649.420 ms
 Execution Time: 1362.392 ms
(14 rows)
```

In this test, minbatch is just 64.

But I agree, I never tested with a large amount of data. I don't actually know how much data it takes to push nbatch to ~130K (the value that will lead to overflow if nbatch is of int type).

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Fix overflow of nbatch

From: Vaibhav Jain
Date:
Thank you everyone for the reviews, feedback and the test.

Please find attached a new version of the patch.

Attachments

Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/23/25 03:20, David Rowley wrote:
> On Tue, 23 Sept 2025 at 13:01, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 9/23/25 02:02, David Rowley wrote:
>>> Ok cool. We're just in the freeze for 18.0 at the moment. Once that's
>>> over, should I take care of this, or do you want to?
>>>
>>
>> Feel free to fix, but I can take care of it once 18 is out the door.
>> It's my bug, after all.
>>
>> BTW ExecHashIncreaseBatchSize needs the same fix, I think.
> 
> I think it's probably best you handle this. I didn't notice that one.
> You know this area much better than I do.
> 

OK, will do.

>> I wonder how likely the overflow is. AFAICS we'd need nbatch=256k (with
>> 8KB blocks), which is a lot. But with the balancing logic, it'd also
>> mean each batch is about ~2GB. So the whole "hash table" would be about
>> 500GB. Possible, but unlikely.
> 
> I think no matter how low the chances of overflow are, the code isn't
> written the way it was intended to be, so it should just be put the
> way it was intended, regardless of the likelihood of overflow.
> Otherwise, we'll just end up with harder-to-hit bugs which could take
> much longer to [re]discover. Also, what's unlikely today may not be
> unlikely in the future.
> 

I wasn't disputing the validity of the bug. I was just thinking aloud
about how likely it is to hit.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
Hi,

I kept looking at this, and unfortunately it seems a bit worse :-(

Fixing the overflow is easy enough - adding the cast does the trick.

But then there's the second issue I mentioned - the loop does not adjust
the hash_table_bytes. It updates the *space_allowed, but that's not what
the current_space/new_space formulas use.

This breaks the "balancing" as the nbatch gets decreased until

    nbatch <= (work_mem / BLCKSZ)

while the hash table "space_allowed" is increasing. This may result in
an *increased* memory usage :-(
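
To make it concrete, here's a simplified sketch of the loop, reconstructed
from the formulas quoted upthread (not the verbatim source):

    while (nbatch > 0)
    {
        size_t current_space = hash_table_bytes + (2 * (size_t) nbatch * BLCKSZ);
        size_t new_space = hash_table_bytes * 2 + ((size_t) nbatch * BLCKSZ);

        /* stop once halving nbatch would not reduce the total memory */
        if (current_space < new_space)
            break;

        nbatch /= 2;
        nbuckets *= 2;
        *space_allowed = (*space_allowed) * 2;
        /* hash_table_bytes is not doubled here, so both formulas above
         * keep using the original limit - that's the issue described above */
    }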

I also noticed the code does not clamp nbuckets properly as it should.


The question is what to do about this. If we'd got this a week ago, I'd
probably just revert a1b4f289, and then try again for PG19. After all,
the issue it meant to address is somewhat rare.

But with 18 already stamped ...

I've shared these findings with the rest of the RMT, I'll see what their
thoughts are. Of course, other opinions/suggestions are welcome.


regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> The question is what to do about this. If we'd got this a week ago, I'd
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it meant to address is somewhat rare.
> But with 18 already stamped ...

18.0 is what it is.  If this is the most serious bug in it, I'll be
a bit surprised.  Take your time, fix it properly.

            regards, tom lane



Re: Fix overflow of nbatch

From: Nathan Bossart
Date:
On Tue, Sep 23, 2025 at 11:34:56AM -0400, Tom Lane wrote:
> 18.0 is what it is.  If this is the most serious bug in it, I'll be
> a bit surprised.  Take your time, fix it properly.

+1

-- 
nathan



Re: Fix overflow of nbatch

From: Konstantin Knizhnik
Date:
On 23/09/2025 6:11 PM, Tomas Vondra wrote:

> Hi,
>
> I kept looking at this, and unfortunately it seems a bit worse :-(
>
> Fixing the overflow is easy enough - adding the cast does the trick.
>
> But then there's the second issue I mentioned - the loop does not adjust
> the hash_table_bytes. It updates the *space_allowed, but that's not what
> the current_space/new_space formulas use.
>
> This breaks the "balancing" as the nbatch gets decreased until
>
>      nbatch <= (work_mem / BLCKSZ)
>
> while the hash table "space_allowed" is increasing. This may result in
> an *increased* memory usage :-(
>
> I also noticed the code does not clamp nbuckets properly as it should.
>
>
> The question is what to do about this. If we'd got this a week ago, I'd
> probably just revert a1b4f289, and then try again for PG19. After all,
> the issue it meant to address is somewhat rare.
>
> But with 18 already stamped ...
>
> I've shared these findings with the rest of the RMT, I'll see what their
> thoughts are. Of course, other opinions/suggestions are welcome.
>
>
> regards
>


Hi Tomas,

If you are going to investigate this problem more, can you also look at 
the related problem:


https://www.postgresql.org/message-id/flat/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936

I proposed a patch but got no feedback.


Best regards,
Konstantin





Re: Fix overflow of nbatch

From: Tomas Vondra
Date:

On 9/23/25 21:13, Konstantin Knizhnik wrote:
> On 23/09/2025 6:11 PM, Tomas Vondra wrote:
> 
>> Hi,
>>
>> I kept looking at this, and unfortunately it seems a bit worse :-(
>>
>> Fixing the overflow is easy enough - adding the cast does the trick.
>>
>> But then there's the second issue I mentioned - the loop does not adjust
>> the hash_table_bytes. It updates the *space_allowed, but that's not what
>> the current_space/new_space formulas use.
>>
>> This breaks the "balancing" as the nbatch gets decreased until
>>
>>      nbatch <= (work_mem / BLCKSZ)
>>
>> while the hash table "space_allowed" is increasing. This may result in
>> an *increased* memory usage :-(
>>
>> I also noticed the code does not clamp nbuckets properly as it should.
>>
>>
>> The question is what to do about this. If we'd got this a week ago, I'd
>> probably just revert a1b4f289, and then try again for PG19. After all,
>> the issue it meant to address is somewhat rare.
>>
>> But with 18 already stamped ...
>>
>> I've shared these findings with the rest of the RMT, I'll see what their
>> thoughts are. Of course, other opinions/suggestions are welcome.
>>
>>
>> regards
>>
> 
> 
> Hi Tomas,
> 
> If you are going to investigate this problem more, can you also look at
> the related problem:
> 
> https://www.postgresql.org/message-id/flat/52b94d5b-a135-489d-9833-2991a69ec623%40garret.ru#ebe4151f1d505bbcc32cf93b2e8a1936
> 
> I proposed the patch but got no feedback.
> 

Thanks, I'll take a look. I wasn't aware of that thread (or rather I
didn't realize it might have been related to this). And AFAICS the
max_batches business is separate.

But the last bit seems very relevant:

-    while (nbatch > 0)
+    while (nbatch > 0 &&
+           nbuckets * 2 <= max_pointers) /* prevent allocation limit overflow */

I believe this is the missing "nbuckets clamping" that I mentioned above.
But I think it needs to work a bit differently. The max_pointers is
calculated from work_mem, but the whole point here is to grow the hash
table beyond that. I think it makes sense to relax that limit, and allow
up to MaxAllocSize, or something like that.
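
Something along these lines, perhaps (a hypothetical sketch, not the
committed fix):

    /* allow nbuckets to keep doubling past the work_mem-derived
     * max_pointers, but never past what a single palloc can hold */
    while (nbatch > 0 &&
           (size_t) nbuckets * 2 <= MaxAllocSize / sizeof(HashJoinTuple))
    {
        ...
    }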



regards

-- 
Tomas Vondra




Re: Fix overflow of nbatch

From: Tomas Vondra
Date:
Hi,

Here are a couple of draft patches fixing the bug:

- 0001 adds the missing size_t cast, to fix the overflow

- 0002 fixes the balancing, by adjusting the hash table size limit

- 0003 adds the missing overflow protection for nbuckets and the hash
table limit

- 0004 rewords the comment explaining how the balancing works. Reading
it after a couple months, I found it overly verbose / not very clear.
I'm sure it could be improved even further.


0001 and 0002 are pretty trivial, 0003 is a bit bigger, but most of the
code is simply how we clamp nbuckets elsewhere (more or less).

At some point I started wondering whether it would have been simpler to
use the algebraic solution posted by James Hunter back in February [1].
It'd not need the loop, but it'd still need all this new overflow
protection etc.


I wanted to make sure the patches actually make it work correctly, so I
created a table with 4B rows:

  create table t (a bigint, b text);
  insert into t select i, md5(i::text)
                  from generate_series(1,4000000000) s(i);

and I added this log message at the end of ExecChooseHashTableSize:

  elog(WARNING, "wm %d nbatch %d nbucket %d hash %ld total %ld",
       work_mem, nbatch, nbuckets, (*space_allowed)/1024,
       (*space_allowed + 2 * nbatch * (Size) BLCKSZ)/1024);

and I ran an explain on a self-join

  set enable_mergejoin = off;
  set max_parallel_workers_per_gather = 0;
  set work_mem = '...';

  explain select * from t t1 join t t2 on (t1.a = t2.a);

with work_mem set to values between 64kB and 1GB.

On 18.0 I got this:

  wm 64 nbatch 8 nbucket 2097152 hash 131072 total 131200
  wm 128 nbatch 16 nbucket 4194304 hash 262144 total 262400
  wm 256 nbatch 32 nbucket 8388608 hash 524288 total 524800
  wm 512 nbatch 64 nbucket 16777216 hash 1048576 total 1049600
  wm 1024 nbatch 128 nbucket 33554432 hash 2097152 total 2099200
  wm 2048 nbatch 256 nbucket 33554432 hash 2097152 total 2101248
  wm 4096 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 8192 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 16384 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

I wanted to know how serious the issue is, compared to what would happen
without the balancing. I disabled the balancing (by skipping the loop),
and then I got this:

  wm 64 nbatch 8192 nbucket 2048 hash 128 total 131200
  wm 128 nbatch 16384 nbucket 4096 hash 256 total 262400
  wm 256 nbatch 32768 nbucket 8192 hash 512 total 524800
  wm 512 nbatch 65536 nbucket 16384 hash 1024 total 1049600
  wm 1024 nbatch 131072 nbucket 32768 hash 2048 total 2099200
  wm 2048 nbatch 131072 nbucket 65536 hash 4096 total 2101248
  wm 4096 nbatch 65536 nbucket 131072 hash 8192 total 1056768
  wm 8192 nbatch 32768 nbucket 262144 hash 16384 total 540672
  wm 16384 nbatch 16384 nbucket 524288 hash 32768 total 294912
  wm 32768 nbatch 8192 nbucket 1048576 hash 65536 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

The interesting bit is that the expected total memory usage (the last
number in the log line) is exactly the same as for 18.0 with and without
balancing. IIUC this is due to the "stop" condition using the initial
hash table size. It makes me a bit less worried about this triggering
OOM crashes - it does not improve the behavior, but it doesn't use more
memory than before. Still an embarrassing bug, though.

With the attached patches, this looks like this:

  wm 64 nbatch 256 nbucket 65536 hash 4096 total 8192
  wm 128 nbatch 512 nbucket 131072 hash 8192 total 16384
  wm 256 nbatch 1024 nbucket 262144 hash 16384 total 32768
  wm 512 nbatch 2048 nbucket 524288 hash 32768 total 65536
  wm 1024 nbatch 4096 nbucket 1048576 hash 65536 total 131072
  wm 2048 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 4096 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 8192 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 16384 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 32768 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 65536 nbatch 4096 nbucket 2097152 hash 131072 total 196608
  wm 131072 nbatch 2048 nbucket 4194304 hash 262144 total 294912
  wm 262144 nbatch 1024 nbucket 8388608 hash 524288 total 540672
  wm 524288 nbatch 512 nbucket 16777216 hash 1048576 total 1056768
  wm 1048576 nbatch 256 nbucket 33554432 hash 2097152 total 2101248

So, this time it actually seems to work correctly and significantly
reduces the memory usage ...

There's one weird thing remaining - if you look at nbatch, it actually
increases for the first couple work_mem steps. That's weird, because
after increasing work_mem we should need *fewer* batches. But this has
nothing to do with the balancing, it happens even with it disabled.

The reason is that when calculating nbatch we do this:

    dbatch = Min(dbatch, max_pointers);

and max_pointers is calculated from work_mem (among other things). It's
a bit funny that the logic worries about how many batch pointers we have
and refuses to allow more, but at the same time it ignores the BufFiles.

AFAICS it's harmless - we may pick a low number of batches initially, but
then later we'll ramp it up (and the balancing will work too). And if
you choose to run huge hash joins with tiny work_mem, I guess you're in
for the suffering anyway. In any case, it's unrelated to balancing.
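
For reference, the computation being discussed looks roughly like this
(paraphrased, not the verbatim source):

    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
    ...
    dbatch = Min(dbatch, max_pointers);
    minbatch = (int) dbatch;
    nbatch = pg_nextpower2_32(Max(2, minbatch));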


regards

[1]
https://www.postgresql.org/message-id/CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_%2BzecoZYMU%2Bdn3BVVaZg%40mail.gmail.com

-- 
Tomas Vondra

Attachments