hist boundary duplicates bug in head and 8.3
От | Nathan Boley |
---|---|
Тема | hist boundary duplicates bug in head and 8.3 |
Дата | |
Msg-id | 6fa3b6e20901051715p2a6b03dbt30ce14e9e2bc796c@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: hist boundary duplicates bug in head and 8.3
|
Список | pgsql-hackers |
For heavy tailed distributions, it is possible for analyze to duplicate histogram boundaries. Here is the output from a test against HEAD; I've attached the test data. =# create table bug(f float); COPY 100000 =# copy bug from '/tmp/test_data.txt'; COPY 100000 =# analyze bug; ANALYZE =# select histogram_bounds from pg_stats where tablename='bug'; histogram_bounds ------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------------------------------------------------------------------------------------------------------------------------------------------------ -------- {34,34,34,34,34,34,34,36,36,36,36,36,37,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,41,41,41,41,42,42,42,43,43,44,44,44,45,45,45,46,46,46,47,47,48,48,48,4 9,49,50,50,51,51,52,53,53,54,55,55,56,57,58,58,59,60,61,62,63,64,65,67,68,70,72,74,76,79,81,83,86,89,91,94,98,101,105,110,117,122,131,139,152,171,202,236,30 8,1457} (1 row) Analyze assumes that if a value has a sample count less than samplesize/num_buckets, ( maxmincount near line 2170 in commands/analyze.c) then it is safe to include it in a histogram. However, because the histogram only contains non mcv's, this is incorrect. As far as I can see, there are 4 solutions: 1) track all of the distinct values This wouldn't be *too* expensive in analyze, especially considering we are tracking all of the sampled values as it is. However, this opens up the possibility of having huge mcv's lists in the worst case.To see this, consider a distribution such that the most common value was 20% of the table, the next mcv was 20% of the remaining entries, etc. Clearly, for stats targets greater than 5, every value in the table would overrun a histogram boundary, leading to an mcv list that contained every distinct value in the sample. 2) reduce number_of_bins if values exist with frequency greater than 1/nbins This would fix the bug, but at the cost of reducing the utility of the histogram ( it would introduce a large skew to the ndistinct distribution, which is assumed to be uniform over non-mcvs ). 3) use variable width histogram bins over all values. This is probably the cleanest solution, but the most invasive. 4) Fix the binary search in ineqsel to correctly find the boundaries, even with duplicates This would also be relatively clean, but are the hist boundaries assumption of being strictly increasing being satisfied anywhere else besides ineqsel? I've attached a patch that is a compromise between 1 and 2. It puts a hard limit on the number of mcv's at 2x the stats target, and then, if there are still values with too high a frequency, it reduces the number of histogram buckets. -Nathan
Вложения
В списке pgsql-hackers по дате отправления: