Обсуждение: Remove traces of long in dynahash.c

Поиск
Список
Период
Сортировка

Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
Hi all,

While looking at the recent business with dynahash.c in [1], I have
been reminded of the fact that this code still depends on long.  Based
on my lookup of the code, I don't think that there is any issue with
the current uses, but I've never been a fan of using a type that's 8
byte everywhere except on WIN32, and we have been bitten by that in
the past depending on the code paths where long is used, even if WIN32
should be niche these days.

So I have looked at this file, and finished with the attached.  The
result is nice, removing long from hsearch.h, as well as dynahash.c,
cleaning up some variable declarations on the way.

While monitoring the callers of the function signatures updated in
this patch, there is a mix of Size, int or int32 used in the variable
declarations used with the callers.  There were a couple of long
declared in a couple of places like pgss, locking, predicate, shmem
that were declared as such to map with dynahash.  These can be
replaced.

Thoughts?

[1]: https://www.postgresql.org/message-id/aKF8mPDqgQb3uBjG@paquier.xyz
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Chao Li
Дата:


On Aug 19, 2025, at 14:24, Michael Paquier <michael@paquier.xyz> wrote:


--
Michael
<0001-Replace-uses-of-long-by-uint64-in-dynahash.c-and-hse.patch>


-static long next_pow2_long(long num);
-static int next_pow2_int(long num);
+static uint64 next_pow2_uint64(uint64 num);
+static int next_pow2_int(uint64 num);

There are already pg_nextpower2_64() and pg_nextpower2_32() in pg_bitutils.h, feels like the new replacement functions are duplicate to the existing ones.


- * we no longer need to worry about registering hash_seq_search scans,
+ * we no uint64er need to worry about registering hash_seq_search scans,

Looks like a bad auto replacement: “no longer”, here “long" should be kept.

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Remove traces of long in dynahash.c

От
Tom Lane
Дата:
Chao Li <li.evan.chao@gmail.com> writes:
>> On Aug 19, 2025, at 14:24, Michael Paquier <michael@paquier.xyz> wrote:
>> <0001-Replace-uses-of-long-by-uint64-in-dynahash.c-and-hse.patch>

> There are already pg_nextpower2_64() and pg_nextpower2_32() in pg_bitutils.h, feels like the new replacement
functionsare duplicate to the existing ones. 

It always seemed weird to me that dynahash.c has its own bit-twiddling
functions.  (There are indications in the source code that it was once
a standalone test program, which perhaps explains that.)
+1 for getting rid of those while we're doing janitorial work here.
They're not *quite* duplicates though, for instance next_pow2_int has
different response to out-of-range values than pg_nextpower2_32.

            regards, tom lane



Re: Remove traces of long in dynahash.c

От
Peter Eisentraut
Дата:
On 19.08.25 08:24, Michael Paquier wrote:
> While looking at the recent business with dynahash.c in [1], I have
> been reminded of the fact that this code still depends on long.

It's definitely a good idea to get rid of "long" usage.  But you can 
also replace it with long long instead of int64.  I suppose this is a 
stylistic question, but I would tend to use the intNN types only when I 
need exactly that many bits.

Also, your patch changes from signed to unsigned types.  Maybe that's 
ok, but you didn't explain it.




Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Tue, Aug 19, 2025 at 10:46:58AM -0400, Tom Lane wrote:
> +1 for getting rid of those while we're doing janitorial work here.
> They're not *quite* duplicates though, for instance next_pow2_int has
> different response to out-of-range values than pg_nextpower2_32.

This would mean introducing more flavors in pg_bitutils.h with limit
checks.  That does not seem completely right to do in this file, which
is a wrapper for all the __builtin_*() calls?  A second point is on
the signedness but we could just cap the maximum at
(PG_UINT{32,64}_MAX / 2), I guess, with two new routines like:
uint64 pg_nextpower2_64_max(uint64 num);
uint32 pg_prevpower2_32_max(uint32 num);

And then cast the unsigned results back to signed in dynahash.c.

Without this point, I have switched the patch to use int64, keeping
the signedness the same as the original.  I have missed that there was
one spot where we relied on NO_MAX_DSIZE.
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Chao Li
Дата:


On Aug 20, 2025, at 15:40, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 19, 2025 at 10:46:58AM -0400, Tom Lane wrote:
+1 for getting rid of those while we're doing janitorial work here.
They're not *quite* duplicates though, for instance next_pow2_int has
different response to out-of-range values than pg_nextpower2_32.

This would mean introducing more flavors in pg_bitutils.h with limit
checks.  That does not seem completely right to do in this file, which
is a wrapper for all the __builtin_*() calls?  A second point is on
the signedness but we could just cap the maximum at
(PG_UINT{32,64}_MAX / 2), I guess, with two new routines like:
uint64 pg_nextpower2_64_max(uint64 num);
uint32 pg_prevpower2_32_max(uint32 num);


I wonder if we can keep the same naming style to make the new function name like next_pow2_64()?

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Wed, Aug 20, 2025 at 04:14:15PM +0800, Chao Li wrote:
> I wonder if we can keep the same naming style to make the new
> function name like next_pow2_64()?

I don't think that this would be a good idea to have new routines
published in pg_bitutils.h with names inconsistent with the existing
one.  next_pow2_long() and next_pow2_int() are now local to
dynahash.c, so we don't really have to follow their naming pattern.
It would be more important to me to choose a new name, rather in line
with the other ones.

After sleeping on it, I am not sure what to do with these routines.  I
don't deny that more refactoring can be done.  However, all that can
also happen outside the long -> int64 switch I am suggesting.

Any comments from others?
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Chao Li
Дата:
Hi Michael,

On Aug 21, 2025, at 07:07, Michael Paquier <michael@paquier.xyz> wrote:


After sleeping on it, I am not sure what to do with these routines.  I
don't deny that more refactoring can be done.  However, all that can
also happen outside the long -> int64 switch I am suggesting.

Any comments from others?
--
Michael

I don’t want to block you. Unless others have the same comment about naming, you can go ahead and move forward.  I agree further refactoring can belong to a separate commit. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Remove traces of long in dynahash.c

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Aug 20, 2025 at 04:14:15PM +0800, Chao Li wrote:
>> I wonder if we can keep the same naming style to make the new
>> function name like next_pow2_64()?

> I don't think that this would be a good idea to have new routines
> published in pg_bitutils.h with names inconsistent with the existing
> one.  next_pow2_long() and next_pow2_int() are now local to
> dynahash.c, so we don't really have to follow their naming pattern.
> It would be more important to me to choose a new name, rather in line
> with the other ones.

Yes, the precedent to follow here is the naming conventions in
pg_bitutils.h.

> After sleeping on it, I am not sure what to do with these routines.  I
> don't deny that more refactoring can be done.  However, all that can
> also happen outside the long -> int64 switch I am suggesting.

If you prefer to regard this as an independent issue, that's okay with
me ... but it's touching most of the same lines of code, so it seems
to me that it'd be about as easy to deal with both items at once.

            regards, tom lane



Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Thu, Aug 21, 2025 at 12:53:09AM -0400, Tom Lane wrote:
> If you prefer to regard this as an independent issue, that's okay with
> me ... but it's touching most of the same lines of code, so it seems
> to me that it'd be about as easy to deal with both items at once.

I'd rather do that after a second look at the whole picture as this
led to an accumulation of bullet points.  The long->int64 switch was
looking OK on its own and I have applied it.

Attached is the second piece of refactoring, where I have introduced a
couple more APIs in pg_bitutils.h (cross-checked the resulting
computations of the old and new routines with some quick hacks, in
case, and they matched):
pg_nextpower2_32_bound, replacing next_pow2_int
pg_nextpower2_64_bound, replacing next_pow2_int64
pg_ceil_log2_64_bound, replacing my_log2()
pg_ceil_log2_32_bound, not used, present for symmetry.

An extra thing is a suggested change for pg_nextpower2_32(), to use a
uint64 instead of a uint32 as argument, which is caused by
next_pow2_int64() and next_pow2_int(), that both used int64
previously.

There's likely some opinion differences according to one's taste;
that's my idea of the refactoring to remove the duplication.
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Peter Eisentraut
Дата:
On 22.08.25 07:09, Michael Paquier wrote:
> An extra thing is a suggested change for pg_nextpower2_32(), to use a
> uint64 instead of a uint32 as argument, which is caused by
> next_pow2_int64() and next_pow2_int(), that both used int64
> previously.

That seems highly confusing.  What is the meaning of the "32" then?

If you need 64-bit behavior, use the variant with "64" in the name.




Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Wed, Aug 27, 2025 at 10:00:16AM +0200, Peter Eisentraut wrote:
> That seems highly confusing.  What is the meaning of the "32" then?
>
> If you need 64-bit behavior, use the variant with "64" in the name.

static int
next_pow2_int(int64 num)
{
    if (num > INT_MAX / 2)
        num = INT_MAX / 2;
    return 1 << my_log2(num);
}

The pain point for me is the assumption of this routine on HEAD and
older branches, leading to a more protective overflow pattern for the
number of partitions calculated.  I don't see an elegant way to keep
the same calculations for the "next power" routines while making the
int32 flavor more compliant with the fact that it may have a int64
argument (long previously), because it would mean that we would
underestimate the number returned here each time "num" is higher than
(INT_MAX / 2).  That's quite dangerous when applied to dynahash.c,
which is a layer that extensions like.  That would lead to doubling
the number of "next power" routines in pg_bitutils.h, which is not
cool in the long-term because it would facilitate incorrect uses.

So, taking a step back, I don't know what would be a good fit for
these duplicates of the "next power" routines upper-bounded on input
when attached to pg_bitutils.h.  However, I do see that we can get rid
of pg_log2() and dynahash.h with a consistent interface in
pg_bitutils.h, by reducing my proposal to the introduction of
pg_ceil_log2_32_bound() and pg_ceil_log2_64_bound().

At the end, next_pow2_int64() and next_pow2_int() are a lesser deal to
me, being static to dynahash.c.  With that in mind, I am finishing
with the attached.  Less ambitious, still it's a nice cleanup IMO.

What do you think?
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Peter Eisentraut
Дата:
On 01.09.25 05:25, Michael Paquier wrote:
> So, taking a step back, I don't know what would be a good fit for
> these duplicates of the "next power" routines upper-bounded on input
> when attached to pg_bitutils.h.  However, I do see that we can get rid
> of pg_log2() and dynahash.h with a consistent interface in
> pg_bitutils.h, by reducing my proposal to the introduction of
> pg_ceil_log2_32_bound() and pg_ceil_log2_64_bound().
> 
> At the end, next_pow2_int64() and next_pow2_int() are a lesser deal to
> me, being static to dynahash.c.  With that in mind, I am finishing
> with the attached.  Less ambitious, still it's a nice cleanup IMO.

pg_bitutils.h is aligned with standard compiler intrinsics and in the 
long term C23 <stdbit.h>, so we shouldn't add our own custom stuff in 
there without considering that bigger picture.

I would agree with what I think you're saying, we can keep these custom 
variants with a particular error-checking behavior local to dynahash.c. 
Maybe a comment or two to explain this more clearly would be good.

Taking a look at your previous patch with the changes from long to 
int64, I think there is something that still doesn't fit.

For example, taking a look at the callers of hash_estimate_size(int64, 
Size), they pass either int as the first argument, or in a few cases 
long.  Looking around inside dynahash.c, do any of these places actually 
need the int64 range?  These are all just counters, the memory sizes use 
Size correctly it seems.  Do we want to support more than INT_MAX 
elements?  I wonder whether the right solution would be to turn the long 
uses into int instead.  Then you also don't need to deal with two 
next_pow2* variants.




Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Wed, Sep 03, 2025 at 02:48:40PM +0200, Peter Eisentraut wrote:
> Taking a look at your previous patch with the changes from long to int64, I
> think there is something that still doesn't fit.
>
> For example, taking a look at the callers of hash_estimate_size(int64,
> Size), they pass either int as the first argument, or in a few cases long.
> Looking around inside dynahash.c, do any of these places actually need the
> int64 range?  These are all just counters, the memory sizes use Size
> correctly it seems.  Do we want to support more than INT_MAX elements?  I
> wonder whether the right solution would be to turn the long uses into int
> instead.  Then you also don't need to deal with two next_pow2* variants.

In terms of in-core callers, I am not worried.  The highest cap that
can be reached would be I think PGSS, and we are capped by the
pgss_max GUC at (INT_MAX / 2).

hash_create() is too generic to offer hints at Postgres-related uses
outside of core.  hash_get_num_entries() and hash_select_dirsize()
offer a couple of more dedicated hints, but everything I am seeing
seems to point out to the argument that the tables are capped due to
integer GUCs being 4 bytes, or use some hardcoded values which are
much lower than the 2^32 limit, like some stuff in this repo:
https://github.com/jithinjose2004/postgres_cluster/

So I could agree with your argument of dropping the 8-byte part
altogether argument and restrict all the interfaces to 4 bytes.  Any
comments or opinions from others in favor or against taking this Leap
of Faith?
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Dean Rasheed
Дата:
On Mon, 1 Sept 2025 at 04:26, Michael Paquier <michael@paquier.xyz> wrote:
>
> So, taking a step back, I don't know what would be a good fit for
> these duplicates of the "next power" routines upper-bounded on input
> when attached to pg_bitutils.h.  However, I do see that we can get rid
> of pg_log2() and dynahash.h with a consistent interface in
> pg_bitutils.h, by reducing my proposal to the introduction of
> pg_ceil_log2_32_bound() and pg_ceil_log2_64_bound().
>
> At the end, next_pow2_int64() and next_pow2_int() are a lesser deal to
> me, being static to dynahash.c.  With that in mind, I am finishing
> with the attached.  Less ambitious, still it's a nice cleanup IMO.
>
> What do you think?

+/*
+ * pg_ceil_log2_32_bound
+ *        Returns equivalent of ceil(log2(num)), with overflow safeguard
+ *        for pg_leftmost_one_pos32.
+ */
+static inline uint32
+pg_ceil_log2_32_bound(uint32 num)
+{
+    if (num > PG_INT32_MAX / 2)
+        num = PG_INT32_MAX / 2;
+
+    if (num < 2)
+        return 0;
+    else
+        return pg_leftmost_one_pos32(num - 1) + 1;
+}

So this is capping the result to be at most 30 (for any input greater
than or equal to 2^30, it will return 30). That's not at all clear
from the comments.

Also, why not have it call pg_ceil_log2_32(), instead of duplicating that code?

Alternatively, why not just impose the upper bound at the call sites
if needed, so there may be no need for these functions at all. For
example, looking at nodeHash.c, it would seem much more logical to
have ExecChooseHashTableSize() put an upper bound on nbuckets, rather
than capping log2_nbuckets after nbuckets has been chosen, risking
them getting out-of-sync.

Regards,
Dean



Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Fri, Sep 05, 2025 at 10:40:46AM +0100, Dean Rasheed wrote:
> Alternatively, why not just impose the upper bound at the call sites
> if needed, so there may be no need for these functions at all. For
> example, looking at nodeHash.c, it would seem much more logical to
> have ExecChooseHashTableSize() put an upper bound on nbuckets, rather
> than capping log2_nbuckets after nbuckets has been chosen, risking
> them getting out-of-sync.

Yep, that may be the best course of action.  As far as I can see, this
is capped by palloc() and HashJoinTuple, so we should be OK with
putting a hard limit at (INT_MAX / 2) and call it a day, I guess?

The two other call sites of my_log2() are worker.c, for the number of
subxacts, which relies on int32.  The other call site is nodeAgg.c,
capped at HASHAGG_MAX_PARTITIONS (1024).

As of the attached, dynahash.h can be removed, which is the minimal
goal I had in mind.  I am not sure about the need to tweak more
dynahash.c, as we've relied on long in this file for many years.  We
could bite the bullet and do it, of course, but I am not sure..  So
I would be happy with only the attached changes.

What do you think?
--
Michael

Вложения

Re: Remove traces of long in dynahash.c

От
Dean Rasheed
Дата:
On Tue, 9 Sept 2025 at 04:58, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 05, 2025 at 10:40:46AM +0100, Dean Rasheed wrote:
> > Alternatively, why not just impose the upper bound at the call sites
> > if needed, so there may be no need for these functions at all. For
> > example, looking at nodeHash.c, it would seem much more logical to
> > have ExecChooseHashTableSize() put an upper bound on nbuckets, rather
> > than capping log2_nbuckets after nbuckets has been chosen, risking
> > them getting out-of-sync.
>
> Yep, that may be the best course of action.  As far as I can see, this
> is capped by palloc() and HashJoinTuple, so we should be OK with
> putting a hard limit at (INT_MAX / 2) and call it a day, I guess?

+1

In ExecChooseHashTableSize():

+    /*
+     * Cap nbuckets, for power of 2 calculations.  This maximum is safe
+     * for pg_ceil_log2_32().
+     */
+    if (nbuckets > PG_INT32_MAX / 2)
+        nbuckets = PG_INT32_MAX / 2;

That comment is not really right, because pg_ceil_log2_32() can accept
larger inputs than that. But in case, that cap is wrong because
nbuckets should always be a power of 2, and PG_INT32_MAX / 2 is 2^30 -
1.

Looking more closely, I don't think a cap is needed at all, given the
prior computations. Further up, there's this code:

    /* Also ensure we avoid integer overflow in nbatch and nbuckets */
    /* (this step is redundant given the current value of MaxAllocSize) */
    max_pointers = Min(max_pointers, INT_MAX / 2 + 1);

and in practice, it's constrained to be much less than that, based on
this earlier code:

    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));

So I think in theory that ensures that nbuckets can never get anywhere
near overflowing a 32-bit integer. Given that nbuckets is a 32-bit
signed integer, and it is a power of 2, it is automatically less than
or equal to 2^30, or else if somehow there was an error in the
preceding logic and an attempt had been made to make it larger than
that, integer wrap-around would have occurred (e.g., it might have
become -2^31), in which case the "Assert(nbuckets > 0)" would trap it.

So I think there's no point in adding that cap, or any additional
checks in ExecChooseHashTableSize().

Regards,
Dean



Re: Remove traces of long in dynahash.c

От
Michael Paquier
Дата:
On Tue, Sep 09, 2025 at 10:28:13AM +0100, Dean Rasheed wrote:
> So I think there's no point in adding that cap, or any additional
> checks in ExecChooseHashTableSize().

You are right that this hardcoded limit introduced in the previous
patch was useless.  So I have removed that, and applied the result.
Thanks for the suggestion.

Regarding removing the next power functions in dynahash.c, I am not
sure if it is worth bothering much.  Now that dynahash.h is removed,
all the code duplication that was in the backend, without the
hardcoded thresholds, is removed.
--
Michael

Вложения