Обсуждение: Safer hash table initialization macro
Hi hackers,
Currently to create a hash table we do things like:
A) create a struct, say:
typedef struct SeenRelsEntry
{
Oid rel_id;
int list_index;
} SeenRelsEntry;
where the first member is the hash key, and then later:
B)
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(SeenRelsEntry);
ctl.hcxt = CurrentMemoryContext;
seen_rels = hash_create("find_all_inheritors temporary table",
32, /* start small and extend */
&ctl,
I can see 2 possible issues:
1)
We manually specify the type for keysize, which could become incorrect (from the
start) or if the key member's type changes.
2)
It may be possible to remove the key member without the compiler noticing it.
Take this example and remove:
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 929bb53b620..eb11976afef 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -36,7 +36,6 @@
*/
typedef struct SeenRelsEntry
{
- Oid rel_id; /* relation oid */
int list_index; /* its position in output list(s) */
} SeenRelsEntry;
That would compile without any issues because this rel_id member is not
referenced in the code (for this particular example). That's rare but possible.
But then, on my machine, during make check:
TRAP: failed Assert("!found"), File: "nodeModifyTable.c", Line: 5157, PID: 140430
The reason is that the struct member access is done only for bytes level
operations (within the hash related macros). So it's easy to think that this
member is unused (because it is not referenced in the code).
I'm thinking about what kind of safety we could put in place to better deal with
1) and 2).
What about adding a macro that:
- requests the key member name
- ensures that it is at offset 0
- computes the key size based on the member
Something like:
"
#define HASH_ELEM_INIT(ctl, entrytype, keymember) \
do { \
StaticAssertStmt(offsetof(entrytype, keymember) == 0, \
#keymember " must be first member in " #entrytype); \
(ctl).keysize = sizeof(((entrytype *)0)->keymember); \
(ctl).entrysize = sizeof(entrytype); \
} while (0)
"
That way:
- The key member is explicitly referenced in the code (preventing "unused"
false positives)
- The key size is automatically computed from the actual member type (preventing
type mismatches)
- We enforce that the key is at offset 0
An additional benefit: it avoids repeating the "keysize =" followed by "entrysize ="
in a lot of places in the code (currently about 100 times).
If that sounds like a good idea, I could work on a patch doing so.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Thoughts?
I think the hashtable creation API in postgres is so terrible that it
actively discourages usage. At Citus we definitely had the problem
that we would use Lists for cases where a hashtable was preferable
perf wise, just because the API was so terrible.
That's why I eventually implemented some wrapper helper functions to
make it less verbose and error prone to use in by far the most common
patterns we had[1].
So I'm definitely in favor of improving this API (probably by adding a
few new functions). I have a few main thoughts on what could be
improved:
1. Automatically determine keysize and entrysize given a keymember and
entrytype (like you suggested).
2. Autodect most of the flags.
a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE,
HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all be simply be detected
based on the relevant fields from HASHCTL. Passing them in explicitly
is just duplication that causes code noise and is easy to forget
accidentally.
b. HASH_ELEM is useless noise because it is required
c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by
default if the keytype is char*)
3. Default to CurrentMemoryContext instead of TopMemoryContext. Like
all of the other functions that allocate, because right now it's way
too easy to accidentally use TopMemoryContext when you did not intend
to.
4. Have a creation function that doesn't require HASHCTL at all (just
takes entrytype and keymember and maybe a version of this that takes a
memorycontext).
[1]:
https://github.com/citusdata/citus/blob/ae2eb65be082d52db646b68a644474e24bc6cea1/src/include/distributed/hash_helpers.h#L74
Hi, On Mon, Dec 01, 2025 at 03:44:41PM +0100, Jelte Fennema-Nio wrote: > On Mon, 1 Dec 2025 at 14:45, Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Thoughts? > > I think the hashtable creation API in postgres is so terrible that it > actively discourages usage. Thanks for sharing your thoughts! > So I'm definitely in favor of improving this API (probably by adding a > few new functions). I have a few main thoughts on what could be > improved: > > 1. Automatically determine keysize and entrysize given a keymember and > entrytype (like you suggested). PFA a patch introducing and using the new macro. Note that it also introduces HASH_ELEM_INIT_FULL for the rare cases where the whole struct is the key. Also one case remains untouched: $ git grep "entrysize = sizeof" "*.c" src/backend/replication/logical/relation.c: ctl.entrysize = sizeof(LogicalRepRelMapEntry); That's because the key is a member of a nested struct so that the new macro can not be used. As there is only one occurrence of it, I think we can keep it as it is. But we could create a dedicated macro for those cases if we feel the need. Now that I'm writing this, that might be a better idea: that way we'd avoid any "entrysize/keysize = " in the .c files. Also a nice side effect of using the macros: 138 insertions(+), 203 deletions(-) > 2. Autodect most of the flags. > a. HASH_PARTITION, HASH_SEGMENT, HASH_FUNCTION, HASH_DIRSIZE, > HASH_COMPARE, HASH_KEYCOPY, HASH_ALLOC can all be simply be detected > based on the relevant fields from HASHCTL. Passing them in explicitly > is just duplication that causes code noise and is easy to forget > accidentally. > b. HASH_ELEM is useless noise because it is required > c. HASH_BLOBS could be the default (and maybe use HASH_STRINGS by > default if the keytype is char*) > 3. Default to CurrentMemoryContext instead of TopMemoryContext. Like > all of the other functions that allocate, because right now it's way > too easy to accidentally use TopMemoryContext when you did not intend > to. > 4. Have a creation function that doesn't require HASHCTL at all (just > takes entrytype and keymember and maybe a version of this that takes a > memorycontext). Thanks for the above suggestions! I did not think so deep as you did during your Citus time, but will think about those too. I suggest we move forward one step at a time, first step being the new macros. Does that make sense to you? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On Wed Dec 3, 2025 at 12:17 PM CET, Bertrand Drouvot wrote: > I suggest we move forward one > step at a time, first step being the new macros. Does that make sense to you? Normally I would agree, but in this case I think the new macros you proposing would become obsolete once we have the better hash table creation functions I have in mind. And if we're going to update all places where we create hash tables, I'd rather update them to something really nice than a small improvement. I couldn't let it go (nerd-sniped). So here's a patchset that adds some macros that I think are pretty nice. Including a foreach_hash macro. I'm a bit on the fence about the C11 _Generic code to determine whether we should use HASH_BLOBS or HASH_STRINGS based on the type of the key. It works really nicely in practice, but I'm worried it's a bit too much magic. Probably we should at least have an override to allow using HASH_BLOBS anyway for a char array (in case it's not null terminated).
Вложения
On Fri, Dec 5, 2025 at 4:30 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > I'm a bit on the fence about the C11 _Generic code to determine whether > we should use HASH_BLOBS or HASH_STRINGS based on the type of the key. > It works really nicely in practice, but I'm worried it's a bit too much > magic. Probably we should at least have an override to allow using > HASH_BLOBS anyway for a char array (in case it's not null terminated). How much of our header stuff is supposed to work from C++ too? I assume that if this is passing cpluspluscheck then it's only because those _Generic macros aren't being expanded. I suppose you could write the typeof-based version you already hinted at, but only use it for __cplusplus__ (where typeof exists as decltype). What I mean is, if something like DuckDB (C++ IIRC) wants to actually use these APIs with the new interfaces, they won't work if the macros expant to _Generic (not legal C++), but a typeof/decltype-based version should work just fine, but at the same time I don't think we'd want to use typeof just because it is available, or we'd never test the _Generic version: all 3 C compilers have something we can use for typeof, it's just not standard C before C23, and we want PostgreSQL to be a valid C11 program and actually have coverage of those code branches. Another consideration is what impact we have on the Rust world, and potentially other languages used for extensions that call C via FFI etc, if we start using _Generic in our public interfaces, as opposed to just using it in smaller ways as part of our internal assertions and C implementation code that doesn't affect extensions. Of course people using C APIs have to deal with the complexities of C evolution too, I'm not saying that's a blocker, I'm just trying to think through the impact of these sorts of API changes on the ecosystem.
On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
> How much of our header stuff is supposed to work from C++ too?
I think it's nice if it works, but it doesn't seem the most important.
Especially since C++ has its own hashmaps. And if it really needs to
create a hashmap it's still possible to call the.
> I suppose you could
> write the typeof-based version you already hinted at, but only use it
> for __cplusplus__ (where typeof exists as decltype).
I tried to figure something out that would work in C++ (with help of
Claude), but I wasn't able to create a version of the macros without
also needing to add:
#ifdef __cplusplus
}
#include <type_traits>
extern "C" {
#endif
It seems quite ugly to escape the extern "C" from the parent like that
and then re-enter it. Overall it doesn't seem worth the hassle to me
to make these macros work in C++.
> Another consideration is what impact we have on the Rust world, and
> potentially other languages used for extensions that call C via FFI
> etc
FFI generally cannot call macros anyway, only actual symbols.
On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote:
> > How much of our header stuff is supposed to work from C++ too?
>
> I think it's nice if it works, but it doesn't seem the most important.
> Especially since C++ has its own hashmaps. And if it really needs to
> create a hashmap it's still possible to call the.
... C functions without the helper macros. Yeah. That seems OK to me.
> > I suppose you could
> > write the typeof-based version you already hinted at, but only use it
> > for __cplusplus__ (where typeof exists as decltype).
>
> I tried to figure something out that would work in C++ (with help of
> Claude), but I wasn't able to create a version of the macros without
> also needing to add:
>
> #ifdef __cplusplus
> }
> #include <type_traits>
> extern "C" {
> #endif
>
> It seems quite ugly to escape the extern "C" from the parent like that
> and then re-enter it. Overall it doesn't seem worth the hassle to me
> to make these macros work in C++.
Yeah. I don't think we want that sort of thing all over the place.
We could eventually come up with a small set of tools in a central
place though, so people can work with this stuff without also known
C++ meta-programming voodoo. For example something like (untested, I
didn't think about char[size], just spitballing here...):
(pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *))
... given the definition I posted recently[1].
I take your point that it's not really important for this case though.
> > Another consideration is what impact we have on the Rust world, and
> > potentially other languages used for extensions that call C via FFI
> > etc
>
> FFI generally cannot call macros anyway, only actual symbols.
Sure, I was just thinking about how such cross-language usage would be
forced to unpick our macrology and call the underlying C functions
without them. Doesn't seem like the end of the world anyway, I was
just thinking out loud about the consequences of this phenomenon in
headers.
[1] https://www.postgresql.org/message-id/CA+hUKGL7trhWiJ4qxpksBztMMTWDyPnP1QN+Lq341V7QL775DA@mail.gmail.com
On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote: > On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: >> On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote: >> create a hashmap it's still possible to call the. > > ... C functions without the helper macros. Oops, forgot to finish that sentence. > > Yeah. I don't think we want that sort of thing all over the place. > We could eventually come up with a small set of tools in a central > place though, so people can work with this stuff without also known > C++ meta-programming voodoo. For example something like (untested, I > didn't think about char[size], just spitballing here...): > > (pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *)) > > ... given the definition I posted recently[1]. Ugh... It would have saved me some time if I'd seen that email before. I also had no clue that 'extern "C++"' was a thing. Attached is a new patchset where your proposed macro is used. I also needed a pg_nullptr_of macro, because the trick that worked in C didn't in C++. Also, it's probably worth checking out this thread[2]. Especially because it overlaps significantly with[1]. > Sure, I was just thinking about how such cross-language usage would be > forced to unpick our macrology and call the underlying C functions > without them. Doesn't seem like the end of the world anyway, I was > just thinking out loud about the consequences of this phenomenon in > headers. It's not great, but yeah that's the situation. Stuff like PG_TRY and PG_CATCH are especially painful to reimplement. Luckily those things can usually be re-implemented similarly in the target language, so the macros only need to be disected once. > [1] https://www.postgresql.org/message-id/CA+hUKGL7trhWiJ4qxpksBztMMTWDyPnP1QN+Lq341V7QL775DA@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg%40mail.gmail.com
Вложения
Hi, On Mon, Dec 08, 2025 at 11:53:02AM +0100, Jelte Fennema-Nio wrote: > On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote: > > On Sat, Dec 6, 2025 at 3:32 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > > On Fri, 5 Dec 2025 at 02:30, Thomas Munro <thomas.munro@gmail.com> wrote: > > > create a hashmap it's still possible to call the. > > > > ... C functions without the helper macros. > > Oops, forgot to finish that sentence. Thanks for this patch series! > > Yeah. I don't think we want that sort of thing all over the place. > > We could eventually come up with a small set of tools in a central > > place though, so people can work with this stuff without also known > > C++ meta-programming voodoo. For example something like (untested, I > > didn't think about char[size], just spitballing here...): > > > > (pg_expr_has_type_p(ptr, char *) || pg_expr_has_type_p(ptr, NameData *)) > > > > ... given the definition I posted recently[1]. +#if defined(__cplusplus) +#define pg_expr_has_type_p(expr, type) (std::is_same<decltype(expr), type>::value) +#else +#define pg_expr_has_type_p(expr, type) \ + _Generic((expr), type: 1, default: 0) +#endif What about relying on the existing __builtin_types_compatible_p() instead of _Generic() here? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Dec 9, 2025 at 8:27 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Mon, Dec 08, 2025 at 11:53:02AM +0100, Jelte Fennema-Nio wrote: > > On Sat Dec 6, 2025 at 1:56 AM CET, Thomas Munro wrote: > +#if defined(__cplusplus) > +#define pg_expr_has_type_p(expr, type) (std::is_same<decltype(expr), type>::value) > +#else > +#define pg_expr_has_type_p(expr, type) \ > + _Generic((expr), type: 1, default: 0) > +#endif > > What about relying on the existing __builtin_types_compatible_p() instead of > _Generic() here? If we used standard C/C++ it'd work on MSVC too.
On Tue, 9 Dec 2025 at 10:11, Thomas Munro <thomas.munro@gmail.com> wrote: > > What about relying on the existing __builtin_types_compatible_p() instead of > > _Generic() here? > > If we used standard C/C++ it'd work on MSVC too. And to be clear, that's important because the result of pg_expr_has_type_p fundamentally impacts the meaning of the code (i.e. it determines the hash function). Our existing usage of __builtin_types_compatible_p only adds some *optional* type checking, so for that it's not critical that it works on MSVC too.
On Tue, 9 Dec 2025 at 08:27, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Thanks for this patch series! To clarify: Does that mean +1 from you on the proposed API?
Hi, On Tue, Dec 09, 2025 at 11:45:11AM +0100, Jelte Fennema-Nio wrote: > On Tue, 9 Dec 2025 at 08:27, Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Thanks for this patch series! > > To clarify: Does that mean +1 from you on the proposed API? Yeah, I think the hash table API needs improvement. I did not look at all the details of your patches, but from what I have seen I think your API proposal make sense. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com