Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From        | Tomas Vondra
Subject     | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date        |
Msg-id      | 263b03b1-3e1c-49ca-165a-8ac6751419c4@2ndquadrant.com
In reply to | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses   | Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
List        | pgsql-bugs
On 12/05/2017 11:01 PM, Tomas Vondra wrote:
>
> Based on the initial discussion, the problem here is twofold.
>
> Firstly, the code assumes that if it gets a certain number of bucket
> collisions (essentially, the initial bucket and a certain number of
> neighboring buckets already being full), making the table larger is
> guaranteed to reduce the number of collisions.
>
> Which is obviously untrue for duplicate hash values, but it may also
> happen for distinct hash values that form a chain (think a sequence of
> hash values 1,2,3,4,5,6,7,...,K - that is never gonna get fixed).
>

I've managed to construct (by sheer brute force) an example that breaks
simplehash in this way. It triggers the growth by having a sequence of
distinct but consecutive hash values, e.g. 1, 2, 3, 4, ..., 1000, and
then trying to insert a value with a duplicate hash (say, 4). This
triggers the "emptydist" condition (as opposed to "insertdist", as in
the other example).

This means there are very few collisions (no hash value matching more
than 2 distinct values), and all the hash values are "at the beginning
of the range", so growing the hash table can't possibly break the hash
sequences.

The generated datasets are quite large (16MB each), so I've placed them
here (along with the ugly C code to generate them):

    https://github.com/tvondra/simplehash-break

Each CSV file contains ~1M strings with distinct hashes between 0 and
1048576 (essentially covering about 97.5% of the range). There are
"chains" of up to ~300 consecutive hash values.

This works:

    CREATE TABLE hash_text (val text);
    COPY hash_text FROM '/tmp/data1.csv';
    SET work_mem = '1GB';
    SELECT DISTINCT val FROM hash_text;

but this does not:

    CREATE TABLE hash_text (val text);
    COPY hash_text FROM '/tmp/data1.csv';
    COPY hash_text FROM '/tmp/data2.csv';
    SET work_mem = '1GB';
    SELECT DISTINCT val FROM hash_text;

In fact, only a small subset of the second data set should be needed.
FWIW I first tried to do the same thing with bigint values, but no
matter what I did I've been unable to generate a dataset covering more
than ~60% of the range (essentially, hashint8 only ever produces ~60% of
the values from [0,1000000], which likely increases the collision rate).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services