Discussion: RFC: Improve CPU cache locality of syscache searches
CPU caches have multiple levels, so I had an idea to use that concept in the syscaches.
Imagine if catcache buckets were cacheline-sized. In that case, we would store the most recently accessed hash values and pointers to catctup, in addition to the dlist_head:
typedef struct cc_bucket
{
    uint32      hashes[4];
    catctup    *ct[4];
    dlist_head  chain;
} cc_bucket;
You can think of the bucket here as a 4-way set associative L1 and the list walk as an inclusive L2. There might be an existing term for this scheme, but I don't know what it is offhand.
Creating entries:
Instead of calling dlist_push_head(), we call dlist_push_tail() and then stash the entry in the L1 so it's still quickly available if it's accessed in the near future. This way, older entries evicted from L1 will tend to remain toward the front of the list. Walking the list will always go from oldest to newest, which might have better prefetch behavior (not sure).
Search:
Buckets with <= 4 entries would only need to access a single cacheline to get the catctup pointer with the matching hash value. If we have to walk the list to find a match, we stash it in the L1, which is faster than calling dlist_move_head().
L1 Eviction:
When putting an entry here, we memmove() everything down in each array and place the pointer and the hash value in the front, evicting the fourth (possibly NULL) entry.
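To make that concrete, here is a rough sketch of what the stash/evict step might look like, assuming the cc_bucket layout above (the function name is only illustrative):

static inline void
cc_bucket_stash(cc_bucket *b, uint32 hash, catctup *ct)
{
    /* shift the existing L1 slots down by one, dropping the oldest (4th) slot */
    memmove(&b->hashes[1], &b->hashes[0], 3 * sizeof(uint32));
    memmove(&b->ct[1], &b->ct[0], 3 * sizeof(catctup *));

    /* the most recently accessed entry goes in front */
    b->hashes[0] = hash;
    b->ct[0] = ct;
}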
The buckets would now be 4 times larger on 64-bit machines, but that might not be a problem if we just divide the initial bucket size by four as well. If the minimum nbuckets is 2, then the smaller caches would waste a bit of space, but it doesn't seem too bad. As far as when to rehash the bucket array, the fill factor would logically jump from 2 to 8. The worst-case list walk would be longer, but with a better overall memory access pattern, it might be fine.
I think it would be straightforward to code this up -- the difficulty is testing and accounting for random effects due to binary layout changes. Thoughts?
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,

On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> CPU caches have multiple levels, so I had an idea to use that concept in
> the syscaches.

I do think we lose a good bit of performance to syscache inefficiency in real
workloads, so I think it's worth investing time into it. However:

> Imagine if catcache buckets were cacheline-sized. In that case, we would
> store the most recently accessed hash values and pointers to catctup, in
> addition to the dlist_head:
>
> typedef struct cc_bucket
> {
>     uint32 hashes[4];
>     catctup *ct[4];
>     dlist_head;
> };

I'm not convinced that the above is the right idea though. Even if the hash
matches, you're still going to need to fetch at least catctup->keys[0] from a
separate cacheline to be able to return the cache entry.

If the hashes we use are decent and we're resizing at reasonable thresholds,
we shouldn't constantly be looking up values that are later in a collision
chain -- particularly because we move cache hits to the head of the bucket.
So I don't think a scheme as you described would actually result in a really
better cache hit ratio?

ISTM that something like

struct cc_bucket_1
{
    uint32 hashes[3];       // 12
    // 4 bytes alignment padding
    Datum key0s[3];         // 24
    catctup *ct[3];         // 24
    // cacheline boundary
    dlist_head conflicts;   // 16
};

would be better for 1-key caches?

It's obviously annoying to need different bucket types for different key
counts, but given how much 3 unused key Datums waste, it seems worth paying
for?

One issue with stuff like this is that aset.c won't return cacheline-aligned
values...

It's possible that it'd be better to put the catctup pointers onto a
neighboring cacheline (so you still get TLB benefits, as well as making it
easy for prefetchers) and increase the number of hashes/keys that fit on one
cacheline.

If we stuffed four values into one bucket we could potentially SIMD the hash
and Datum comparisons ;)

Greetings,

Andres Freund
On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > typedef struct cc_bucket
> > {
> > uint32 hashes[4];
> > catctup *ct[4];
> > dlist_head;
> > };
>
> I'm not convinced that the above is the right idea though. Even if the hash
> matches, you're still going to need to fetch at least catctup->keys[0] from
> a separate cacheline to be able to return the cache entry.
I see your point. It doesn't make sense to inline only part of the information needed.
> struct cc_bucket_1
> {
> uint32 hashes[3]; // 12
> // 4 bytes alignment padding
> Datum key0s[3]; // 24
> catctup *ct[3]; // 24
> // cacheline boundary
> dlist_head conflicts; // 16
> };
>
> would be better for 1 key values?
>
> It's obviously annoying to need different bucket types for different key
> counts, but given how much 3 unused key Datums waste, it seems worth paying
> for?
Yeah, it's annoying, but it does make a big difference to keep out unused Datums:
keys    cachelines (3 values)    cachelines (4 values)
  1          1 1/4                    1 1/2
  2          1 5/8                    2
  3          2                        2 1/2
  4          2 3/8                    3
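(As a worked example, with 2 keys and 4 values per bucket: 4 * 4 bytes of hashes + 4 * 2 * 8 bytes of key Datums + 4 * 8 bytes of catctup pointers + 16 bytes for the dlist_head = 128 bytes, i.e. exactly 2 cachelines on a 64-bit machine.)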
Or, looking at it another way, limiting the bucket size to 2 cachelines, we can fit:
keys    values
  1        5
  2        4
  3        3
  4        2
Although I'm guessing inlining just two values in the 4-key case wouldn't buy much.
> If we stuffed four values into one bucket we could potentially SIMD the hash
> and Datum comparisons ;)
;-) That's an interesting future direction to consider when we support building with x86-64-v2. It'd be pretty easy to compare a vector of hashes and quickly get the array index for the key comparisons (ignoring for the moment how to handle the rare case of multiple identical hashes). However, we currently don't memcmp() the Datums and instead call an "eqfast" function, so I don't see how that part would work in a vector setting.
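Just to illustrate the hash-comparison half of that idea (not proposing this for the tree -- the intrinsic usage and helper name are only a sketch):

#include <emmintrin.h>

/* return the index of the slot whose hash matches, or -1 if none */
static inline int
cc_bucket_find_hash(const uint32 hashes[4], uint32 hash)
{
    __m128i     vals = _mm_loadu_si128((const __m128i *) hashes);
    __m128i     cmp = _mm_cmpeq_epi32(vals, _mm_set1_epi32((int) hash));
    int         mask = _mm_movemask_epi8(cmp);

    if (mask == 0)
        return -1;
    /* each 32-bit lane contributes 4 mask bits; take the first set lane */
    return __builtin_ctz(mask) / 4;     /* GCC/Clang builtin */
}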
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:
> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > > typedef struct cc_bucket
> > > {
> > >     uint32 hashes[4];
> > >     catctup *ct[4];
> > >     dlist_head;
> > > };
> >
> > I'm not convinced that the above is the right idea though. Even if the
> > hash matches, you're still going to need to fetch at least
> > catctup->keys[0] from a separate cacheline to be able to return the
> > cache entry.
>
> I see your point. It doesn't make sense to inline only part of the
> information needed.

At least not for the unconditionally needed information.

> Although I'm guessing inlining just two values in the 4-key case wouldn't
> buy much.

Not so sure about that. I'd guess that two key comparisons take more cycles
than a cacheline fetch for the further keys (perhaps not if we had inlined
key comparisons). I.e. I'd expect out-of-order + speculative execution to
hide the latency of fetching the second cacheline for later key values.

> > If we stuffed four values into one bucket we could potentially SIMD the
> > hash and Datum comparisons ;)
>
> ;-) That's an interesting future direction to consider when we support
> building with x86-64-v2. It'd be pretty easy to compare a vector of
> hashes and quickly get the array index for the key comparisons (ignoring
> for the moment how to handle the rare case of multiple identical hashes).
> However, we currently don't memcmp() the Datums and instead call an
> "eqfast" function, so I don't see how that part would work in a vector
> setting.

It definitely couldn't work unconditionally -- we have to deal with text and
oidvector comparisons after all. But we could use it for the other types.
However, looking at the syscaches, I think it'd not very often be applicable
for caches with enough columns.

I have wondered before whether we should have syscache definitions generate
code specific to each syscache definition. I do think that'd give a good bit
of performance boost. But I don't see a trivial way to get there without
notational overhead.

We could define syscaches in a header using a macro that's defined
differently in syscache.c than everywhere else. The header would declare a
set of functions for each syscache; syscache.c would define them to call an
always_inline function with the relevant constants.

Or perhaps we should move syscache definitions into the pg_*.h headers, and
generate the relevant code as part of their processing. That seems like it
could be nice from a modularity POV alone. And it could do better than the
current approach, because we could hardcode the types for columns in the
syscache definition without increasing notational overhead.

Greetings,

Andres Freund
Andres Freund wrote on 2021-08-05 23:12:
> Hi,
>
> On 2021-08-05 12:27:49 -0400, John Naylor wrote:
>> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:
>> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
>> > > typedef struct cc_bucket
>> > > {
>> > >     uint32 hashes[4];
>> > >     catctup *ct[4];
>> > >     dlist_head;
>> > > };
>> >
>> > I'm not convinced that the above is the right idea though. Even if the
>> > hash matches, you're still going to need to fetch at least
>> > catctup->keys[0] from a separate cacheline to be able to return the
>> > cache entry.
>>
>> I see your point. It doesn't make sense to inline only part of the
>> information needed.
>
> At least not for the unconditionally needed information.
>
>> Although I'm guessing inlining just two values in the 4-key case wouldn't
>> buy much.
>
> Not so sure about that. I'd guess that two key comparisons take more cycles
> than a cacheline fetch for the further keys (perhaps not if we had inlined
> key comparisons). I.e. I'd expect out-of-order + speculative execution to
> hide the latency of fetching the second cacheline for later key values.
>
>> > If we stuffed four values into one bucket we could potentially SIMD the
>> > hash and Datum comparisons ;)
>>
>> ;-) That's an interesting future direction to consider when we support
>> building with x86-64-v2. It'd be pretty easy to compare a vector of
>> hashes and quickly get the array index for the key comparisons (ignoring
>> for the moment how to handle the rare case of multiple identical hashes).
>> However, we currently don't memcmp() the Datums and instead call an
>> "eqfast" function, so I don't see how that part would work in a vector
>> setting.
>
> It definitely couldn't work unconditionally -- we have to deal with text
> and oidvector comparisons after all. But we could use it for the other
> types. However, looking at the syscaches, I think it'd not very often be
> applicable for caches with enough columns.
>
> I have wondered before whether we should have syscache definitions
> generate code specific to each syscache definition. I do think that'd give
> a good bit of performance boost. But I don't see a trivial way to get
> there without notational overhead.
>
> We could define syscaches in a header using a macro that's defined
> differently in syscache.c than everywhere else. The header would declare a
> set of functions for each syscache; syscache.c would define them to call
> an always_inline function with the relevant constants.
>
> Or perhaps we should move syscache definitions into the pg_*.h headers,
> and generate the relevant code as part of their processing. That seems
> like it could be nice from a modularity POV alone. And it could do better
> than the current approach, because we could hardcode the types for columns
> in the syscache definition without increasing notational overhead.

Why not use simplehash or something like that? Open-addressing schemes show
superior cache locality.

The two approaches could be combined: use hashValue as the key in the
simplehash and keep a dlist for handling collisions on hashValue. 99.99% of
the time there will be no collision on hashValue itself, so a search will
almost always be two or three memory lookups: one or two to find the
dlist_head in the simplehash by hashValue, and then one to fetch the first
element of the dlist. The code would remain almost the same; only the
"bucket" search would change a bit. (And I'd recommend a lower fill factor
for this simplehash -- at most 0.85.)

A simplehash entry would be 24 bytes this way. If the simplehash template
supported external key/element storage, it could be shrunk to 16 bytes, and
syscache entries would no longer need a dlist_node. (But it doesn't support
that at the moment.)

A custom open-addressing table could even get down to 8 bytes per element:
- an element is just a pointer to the entry,
- a missing entry is encoded as NULL,
- with a fill factor as low as 0.66 there will be few collisions, so a
  non-empty slot will be the matching entry most of the time, and the memory
  lookup for the key comparison is amortized free.

Note that an 8-byte entry at fill factor 0.66 consumes an amortized
12.12 bytes, while a 16-byte entry at fill factor 0.99 consumes an amortized
16.16 bytes.

regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
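For illustration, the two layouts being compared might look roughly like this (type and field names are only a sketch):

/* simplehash-style entry keyed by hash value, with a chain for collisions */
typedef struct cc_hash_entry
{
    uint32      hash;       /* 4 bytes (+ 4 bytes padding) */
    dlist_head  chain;      /* 16 bytes -> 24 bytes per entry */
} cc_hash_entry;

/* custom open-addressing table: one pointer per slot, NULL = empty */
typedef struct cc_ptr_table
{
    uint32      nslots;     /* power of two */
    catctup   **slots;      /* 8 bytes per slot */
} cc_ptr_table;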
Hi,

On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
> Why not use simplehash or something like that? Open-addressing schemes show
> superior cache locality.

I thought about that as well -- but it doesn't really resolve the question of
what we want to store in-line in the hashtable and what not. We can't store
the tuples themselves in the hashtable for a myriad of reasons (need pointer
stability, they're variably sized, way too large to move around frequently).

> A simplehash entry would be 24 bytes this way. If the simplehash template
> supported external key/element storage, it could be shrunk to 16 bytes, and
> syscache entries would no longer need a dlist_node. (But it doesn't support
> that at the moment.)

I think storing keys outside of the hashtable entry defeats the purpose of
the open addressing, given that they are always checked and that our conflict
ratio should be fairly low.

Greetings,

Andres Freund
Andres Freund wrote on 2021-08-06 06:49:
> Hi,
>
> On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
>> Why not use simplehash or something like that? Open-addressing schemes
>> show superior cache locality.
>
> I thought about that as well -- but it doesn't really resolve the question
> of what we want to store in-line in the hashtable and what not. We can't
> store the tuples themselves in the hashtable for a myriad of reasons (need
> pointer stability, they're variably sized, way too large to move around
> frequently).
>
>> A simplehash entry would be 24 bytes this way. If the simplehash template
>> supported external key/element storage, it could be shrunk to 16 bytes,
>> and syscache entries would no longer need a dlist_node. (But it doesn't
>> support that at the moment.)
>
> I think storing keys outside of the hashtable entry defeats the purpose of
> the open addressing, given that they are always checked and that our
> conflict ratio should be fairly low.

It's the opposite: if the conflict ratio were high, then keys outside the
hashtable would be expensive, since probing a non-matching slot would cost an
extra memory access. But with a low conflict ratio we will usually hit the
matching entry on the first probe. And since we're about to use the entry
anyway, it doesn't matter whether it is pulled into the CPU L1 cache during
the lookup or during actual use.

regards,
Yura Sokolov
Hi,

On Thu, Aug 5, 2021, at 22:20, Yura Sokolov wrote:
> Andres Freund wrote on 2021-08-06 06:49:
> > Hi,
> >
> > On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
> >> Why not use simplehash or something like that? Open-addressing schemes
> >> show superior cache locality.
> >
> > I thought about that as well -- but it doesn't really resolve the
> > question of what we want to store in-line in the hashtable and what not.
> > We can't store the tuples themselves in the hashtable for a myriad of
> > reasons (need pointer stability, they're variably sized, way too large
> > to move around frequently).
> >
> >> A simplehash entry would be 24 bytes this way. If the simplehash
> >> template supported external key/element storage, it could be shrunk to
> >> 16 bytes, and syscache entries would no longer need a dlist_node. (But
> >> it doesn't support that at the moment.)
> >
> > I think storing keys outside of the hashtable entry defeats the purpose
> > of the open addressing, given that they are always checked and that our
> > conflict ratio should be fairly low.
>
> It's the opposite: if the conflict ratio were high, then keys outside the
> hashtable would be expensive, since probing a non-matching slot would cost
> an extra memory access. But with a low conflict ratio we will usually hit
> the matching entry on the first probe. And since we're about to use the
> entry anyway, it doesn't matter whether it is pulled into the CPU L1 cache
> during the lookup or during actual use.

Often enough it does matter, because there will be earlier dependencies on
whether a lookup is a cache hit/miss than on the content of the cached tuple.

Regards,

Andres
On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> I have wondered before whether we should have syscache definitions generate
> code specific to each syscache definition. I do think that'd give a good bit
> of performance boost. But I don't see a trivial way to get there without
> notational overhead.
>
> We could define syscaches in a header using a macro that's defined differently
> in syscache.c than everywhere else. The header would declare a set of
> functions for each syscache, syscache.c would define them to call an
> always_inline function with the relevant constants.
>
> Or perhaps we should move syscache definitions into the pg_*.h headers, and
> generate the relevant code as part of their processing. That seems like it
> could be nice from a modularity POV alone. And it could do better than the
> current approach, because we could hardcode the types for columns in the
> syscache definition without increasing notational overhead.
I had code laying around to generate the info needed for initialization, but I apparently deleted it and never emailed it. :-(
If we went that far, I wonder if we could specialize the cc_bucket for the actual key types, rather than just the number of Datums, which would further save space. More complex, though. For instance, a bucket specialized for a single Oid key might look something like the sketch below.
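typedef struct cc_bucket_oid1
{
    uint32      hashes[4];      /* 16 bytes */
    Oid         key0s[4];       /* 16 bytes, vs. 32 for Datum keys */
    catctup    *ct[4];          /* 32 bytes */
    dlist_head  conflicts;      /* 16 bytes -> 80 bytes total */
} cc_bucket_oid1;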
Also, if comparisons got cheaper, maybe we could get away with not storing the full hash in the bucket in favor of a tag of the 16 highest bits. (The catctup would still need the full hash for invalidation, at least).
Another idea I came across while researching in-memory key-value stores is "bulk chaining" -- see [1] for an example. In that case, our example 2-cacheline bucket would have a dlist_head pointing not to the catctups, but to another bucket. So that throws away my L1/L2 analogy and uses a chain of buckets, each of which contain multiple sets of hashes/keys/pointers. That's closer to a normal collision chain, but with better cache locality. If we did that, I imagine the catctup could dispense with storing its Datum array and its dlist_node entirely.
[1] https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf
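For illustration, a bulk-chained bucket might look something like the following, simplifying the chain to a plain overflow pointer (names are hypothetical):

typedef struct cc_bucket_chained
{
    uint32      hashes[4];
    Datum       keys[4];                    /* single-key case */
    catctup    *ct[4];
    struct cc_bucket_chained *next;         /* overflow bucket, or NULL */
} cc_bucket_chained;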
--
John Naylor
EDB: http://www.enterprisedb.com
OK, here is a hackish WIP to see if we get anywhere with the L1 concept:
0001 is extracted from a patch from Horiguchi-san to remove the "dead" flag
0002 adds the large bucket, but doesn't use it for anything
0003 uses the new bucket for the L1 cache
0004 changes when to rehash
0005 is Horiguchi-san's v7 benchmark
0006 removes expiration stuff from the benchmark and prevents alloc errors with my patches while running the benchmark
This doesn't align the bucket array to a cacheline boundary, nor does it change the initial number of buckets.
make -C contrib/catcachebench/ install
psql -c 'create extension catcachebench'
# This is also from Horiguchi-san
perl gen_tbl.pl | psql > /dev/null
# warmup
psql -c 'select catcachebench(0)'
# measure median of 3
master:
psql -c 'select catcachebench(1)'
catcachebench
---------------
6084.992169
patch:
./inst/bin/psql -c 'select catcachebench(1)'
catcachebench
---------------
5508.072532
That's decent, but not exactly stellar. I get a huge slowdown in catcachebench(2), however, so I'll have to dig into why before going any further.
Some time I'll also try the function specialization Andres mentioned and see how big of a speedup that gives.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
- v1-0002-Specialize-bucket-types-by-number-of-keys.patch
- v1-0001-Remove-dead-flag-from-CatCTup.patch
- v1-0004-Rationalize-rehashing-threshold.patch
- v1-0005-catcachebench.patch
- v1-0003-Use-hash-bucket-as-a-level-one-cache-to-avoid-wal.patch
- v1-0006-Some-adjustments-to-catcachebench-and-catcache.c-.patch
- gen_tbl.pl
On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> I have wondered before whether we should have syscache definitions generate
> code specific to each syscache definition. I do think that'd give a good bit
> of performance boost. But I don't see a trivial way to get there without
> notational overhead.
I've made a small step in this direction in the attached. It uses a template approach to generate type-specific SearchCatCache* functions, for now only the 4-key ones. Since there's only a few of those, it's manageable to invoke the template and change the call sites by hand. To get this to scale up to all syscaches would require some scripting, but this much is enough to get feedback on the approach. One possible concern in starting down this path is that hundreds of call sites would have to be changed. (If there's a good way around that, it hasn't occurred to me.) It might also be possible to replace the per-key invocations with a single memcpy() for most types, but that's a bit more work.
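For example, instantiating the template for a 4-key cache such as AMOPSTRATEGY (three oid keys and an int2 key) might look something like this, reusing the existing "fast" hash/equality helpers in catcache.c (the generated function name is made up, and the attached patch may differ in detail):

#define CC_SEARCH    SearchCatCache4AMOPSTRATEGY
#define CC_NKEYS     4
#define HASHFUNC1    int4hashfast
#define FASTEQFUNC1  int4eqfast
#define HASHFUNC2    int4hashfast
#define FASTEQFUNC2  int4eqfast
#define HASHFUNC3    int4hashfast
#define FASTEQFUNC3  int4eqfast
#define HASHFUNC4    int2hashfast
#define FASTEQFUNC4  int2eqfast
#include "utils/catcache_search_template.h"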
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Hi,

On 2021-08-19 19:10:37 -0400, John Naylor wrote:
> I've made a small step in this direction in the attached. It uses a
> template approach to generate type-specific SearchCatCache* functions, for
> now only the 4-key ones. Since there's only a few of those, it's manageable
> to invoke the template and change the call sites by hand. To get this to
> scale up to all syscaches would require some scripting, but this much is
> enough to get feedback on the approach. One possible concern in starting
> down this path is that hundreds of call sites would have to be changed. (If
> there's a good way around that, it hasn't occurred to me.)

I was thinking of avoiding the need for that via a macro. I.e. have
SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care, and
we could continue to evolve the internals without continually doing
large-scale code changes?

> diff --git a/src/include/utils/catcache_search_template.h b/src/include/utils/catcache_search_template.h
> new file mode 100644
> index 0000000000..6f5dc2573c
> --- /dev/null
> +++ b/src/include/utils/catcache_search_template.h
> @@ -0,0 +1,176 @@
> +/*-------------------------------------------------------------------------
> + * catcache_search_template.h
> + *
> + * A template for type-specialized SearchCatCache functions
> + *
> + * Usage notes:
> + *
> + * To generate functions specialized for a set of catcache keys,
> + * the following parameter macros should be #define'd before this
> + * file is included.
> + *
> + * - CC_SEARCH - the name of the search function to be generated
> + * - CC_NKEYS - the number of search keys for the tuple
> + * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
> + * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)

It's not clear to me we need such a complicated interface here. Can't we just
have a pg_attribute_always_inline function with a bunch more parameters?

Greetings,

Andres Freund
On Fri, Aug 27, 2021 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-19 19:10:37 -0400, John Naylor wrote:
> > I've made a small step in this direction in the attached. It uses a
> > template approach to generate type-specific SearchCatCache* functions, for
> > now only the 4-key ones. Since there's only a few of those, it's manageable
> > to invoke the template and change the call sites by hand. To get this to
> > scale up to all syscaches would require some scripting, but this much is
> > enough to get feedback on the approach. One possible concern in starting
> > down this path is that hundreds of call sites would have to be changed. (If
> > there's a good way around that, it hasn't occurred to me.)
>
> I was thinking of avoiding the need for that via a macro. I.e. have
> SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
> SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care, and
> we could continue to evolve the internals without continually doing
> large-scale code changes?
Makes sense.
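(For instance, a token-pasting wrapper along these lines could do that mapping -- just a sketch, since SearchSysCache4() is currently a plain function:)

#define SearchSysCache4(cacheId, key1, key2, key3, key4) \
    SearchSysCache4##cacheId(key1, key2, key3, key4)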
> > + * To generate functions specialized for a set of catcache keys,
> > + * the following parameter macros should be #define'd before this
> > + * file is included.
> > + *
> > + * - CC_SEARCH - the name of the search function to be generated
> > + * - CC_NKEYS - the number of search keys for the tuple
> > + * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
> > + * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)
>
> It's not clear to me we need such a complicated interface here. Can't we just
> have a pg_attribute_always_inline function with a bunch more parameters?
Were you thinking in terms of passing the type oid in parameters, like this?
HeapTuple
SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
{
    return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
}
And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple() would likewise take types as parameters, which they could use to pick the right functions at compile time?
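For example, a rough sketch covering only two key types (murmurhash32 is from common/hashfn.h; the helper name is made up):

static pg_attribute_always_inline uint32
cc_hash_key(Datum v, Oid keytype)
{
    /*
     * 'keytype' would be a compile-time constant in each specialized
     * search function, so these branches fold away to one direct call.
     */
    if (keytype == OIDOID)
        return murmurhash32((uint32) DatumGetObjectId(v));
    if (keytype == INT4OID)
        return murmurhash32((uint32) DatumGetInt32(v));

    pg_unreachable();
}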
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,

On 2021-08-31 15:06:32 -0400, John Naylor wrote:
> Were you thinking in terms of passing the type oid in parameters, like
> this?
>
> HeapTuple
> SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
> {
>     return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
> }
>
> And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple()
> would likewise take types as parameters, which they could use to pick the
> right functions at compile time?

I was thinking that the script could emit something like

static const cachedesc syscachedesc_AGGFNOID = {
    .reloid = AggregateRelationId,
    .indoid = AggregateFnoidIndexId,
    .nkeys = 1,
    .keys = {
        {
            .varno = Anum_pg_aggregate_aggfnoid,
            .type = Oid,
            .comparator = ...
        }
    }
};

static CatCache syscache_AGGFNOID;

HeapTuple
SearchSysCacheAGGFNOID(Datum key1)
{
    return SearchSysCacheInternal(&syscachedesc_AGGFNOID, syscache_AGGFNOID,
                                  key1);
}

and SearchSysCacheInternal would have a pg_attribute_always_inline()
fastpath. That fastpath would basically be SearchCatCacheInternal(), with an
extra const cachedesc * parameter. Then the compiler should be able to
generate direct function calls to the hash functions in
CatalogCacheComputeHashValue() and direct calls to the equal function in
CatalogCacheCompareTuple().

One difficulty would be that I don't see how to make this work with
syscache.c being compiled separately from catcache.c -- but there's probably
no need to. The script could generate a syscache_impl.h that catcache.c
includes.

Greetings,

Andres Freund