On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:
>On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>> I am looking for stable hash functions producing 8-byte or 4-byte hashes
>> from long text values in Postgres 10 or later.
>>
>> There is md5(), the result of which can be cast to uuid. This reliably
>> produces practically unique, stable 16-byte values. I have usecases where
>> an 8-byte or even 4-byte hash would be good enough to make collisions
>> reasonably unlikely. (I can recheck on the full string) - and expression
>> indexes substantially smaller. I could truncate md5 and cast back and
>> forth, but that seems like a lot of wasted computation. Are there
>> suggestions for text hash functions that are
>> - fast
>> - keep collisions to a minimum
>> - stable across major Postgres versions (so expression indexes don't break)
>> - croptographic aspect is not needed (acceptable, but no benefit)
>
>What about a CRC32 function? It's fast, and an SSE4 instruction has been in
>Intel CPUs for about 10 years.
On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text. Given that the OP can check the actual string in the
event of a collision, it may work well enough.
One way to make it more discriminating is to compute a pair of CRCs
using different polynomials. Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial. You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.
I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.
YMMV,
George