Re: Use merge-based matching for MCVs in eqjoinsel

Поиск
Список
Период
Сортировка
От Ilia Evdokimov
Тема Re: Use merge-based matching for MCVs in eqjoinsel
Дата
Msg-id f0a2c63a-5776-462a-b9b7-434158297a9d@tantorlabs.com
обсуждение исходный текст
Ответ на Re: Use merge-based matching for MCVs in eqjoinsel  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Use merge-based matching for MCVs in eqjoinsel
Список pgsql-hackers


On 14.11.2025 01:21, Tom Lane wrote:
This didn't look very much like the refactorization that I had in
mind: I thought we should have one copy of the matching code, not two.
Also, after looking closer at your patch I realized you were just
punting for cross-type comparison operators, which I felt was kind
of sad.  It's a little bit tricky to get simplehash.h to go along
with cross-type hashing, because it wants to use just one hash and
one equality function.  But since those are interface routines we
are going to supply anyway, we can make them deal with the insert
and lookup cases differently.


I had considered the cross-type comparison operators and I didn’t see a clean way to support them, so I intentionally excluded cross-type cases from hash probing. Your suggestion to switch the hash function in probe mode is clearly a more correct approach than simply rejecting those cases. Thanks for the explanation.



So after a bit of hacking I ended up with the attached.  I split up
the refactorization into several steps to make it easier to review.
(But I'd anticipate squashing these into one commit in the end,
so I didn't spend a lot of time on the commit messages.)


I reviewed patches 0002-0004 with the refactoring, and I think the overall approach is excellent. However, I noticed one issue: in eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before the refactoring it was initialized by zero.


Also, 0001 in this series is not meant to be committed; what it
does is to add some debug logging to ease comparing runtimes of
different versions of eqjoinsel.  I was able to use that to
convince myself that the refactoring steps didn't cost anything
meaningful in performance.  Perhaps we could use it to investigate
the right hashing threshold more carefully, too.


With 0001 patch I tested the selectivity calculation time for SEMI JOIN after applying patches 0002-0004, and the time was cut in half. Thank you for the work on that.



There are still a couple of XXX comments in the attached, denoting
loose ends to look at.  In particular, I wondered whether the
hash threshold check
        if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)

should use Max() instead --- that is, it might be safer to hash
if either MCV list is long.  Or, holding one's head at a different
angle, perhaps the sum of the list lengths should be what's checked?
			regards, tom lane


Hmm… using the sum actually seems like a good idea for me. It may provide a smoother switch-over point between the two MCV-scanning algorithms when both list lengths are below the threshold. But this definitely needs to be validated by measuring different MCV lengths below the threshold using the 0001 patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

В списке pgsql-hackers по дате отправления: