Re: [PATCHES] updated hash functions for postgresql v1
От | Kenneth Marshall |
---|---|
Тема | Re: [PATCHES] updated hash functions for postgresql v1 |
Дата | |
Msg-id | 20081222194739.GA4930@it.is.rice.edu обсуждение исходный текст |
Ответы |
Re: [PATCHES] updated hash functions for postgresql v1
Re: [PATCHES] updated hash functions for postgresql v1 |
Список | pgsql-hackers |
Dear PostgreSQL developers, I am re-sending this to keep this last change to the internal hash function on the radar. Ken ............ Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch as well as simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description. Regards, Ken ---------------------------------------------------- Hi, I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash function. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated. hash input old new newv2 ---------- --- --- ----- cracklib 338 316 338 cracklib x 2 (i.e. clibclib) 305 319 300 cracklib x 3 (clibclibclib) 323 329 315 cracklib x 10 302 310 329 cracklib x 100 350 335 298 cracklib x 1000 314 309 315 cracklib x 100 truncated to char(100) 311 327 320 uint32 from 1-1648379 309 319 347 (uint32 1-1948379)*256 309 314 304 (uint32 1-1948379)*16 310 314 324 "a"uint32 (i.e. a00001,a0002...) 320 321 312 uint32uint32 (i.e. uint64) 321 287 309 The different result columns are old = Jenkins 1996 hash function(lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, spliting the mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment. I also ran a crude timing for a reindex of the following database: CREATE TABLE dict (word text); CREATE INDEX wordhash ON dict USING hash (word); INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); ... (21 times) REINDEX TABLE ... The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric.
Вложения
В списке pgsql-hackers по дате отправления: