Re: [HACKERS] A design for amcheck heapam verification
From:        Michael Paquier
Subject:     Re: [HACKERS] A design for amcheck heapam verification
Msg-id:      CAB7nPqSWdXuNzxfCz6HfHJDV4-rYWUJoxAr5JW3qYDZgQXi_NQ@mail.gmail.com
In reply to: Re: [HACKERS] A design for amcheck heapam verification (Peter Geoghegan <pg@bowt.ie>)
Responses:   Re: [HACKERS] A design for amcheck heapam verification (Peter Geoghegan <pg@bowt.ie>)
List:        pgsql-hackers
On Fri, Dec 8, 2017 at 4:37 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Here, we repeat the same test 3 times, varying only the seed value
> used for each run.

Thanks for the updated version!

Looking at 0001, I have run a coverage test and can see all code paths
covered, which is nice. It also makes it easier to check the correctness
of the library. There are many ways to shape such tests; you chose one
we can live with.

> The last message is a WARNING because we exceed the 1% threshold
> (hard-coded into test_bloomfilter.c), though only by a tiny margin,
> due only to random variations in seed value. We round up to 10 bits
> per element for the regression tests. That's where the *actual*
> "nelements" argument comes from within the tests, so pg_regress tests
> should never see the WARNING (if they do, that counts as a failure).
>
> I've experimentally observed that we get the 1% false positive rate
> with any possible bitset when "nelements" works out at 9.6 bitset bits
> per element. Inter-run variation is tiny. With 50 tests, I didn't
> observe these same Bloom filter parameters produce a false positive
> rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
> got.

Nice. That's close to what I would expect with the sizing this is doing.

> There is a fairly extensive README, which I hope will clear up the
> theory behind the bloomfilter.c strategy on bitset size and false
> positives. Also, there was a regression that I had to fix in
> bloomfilter.c, in seeding. It didn't reliably cause variation in the
> false positives. And, there was bitrot with the documentation that I
> fixed up.

+    /*
+     * Generate random seed, or use caller's.  Seed should always be a positive
+     * value less than or equal to PG_INT32_MAX, to ensure that any random seed
+     * can be recreated through callerseed if the need arises.  (Don't assume
+     * that RAND_MAX cannot exceed PG_INT32_MAX.)
+     */
+    seed = callerseed < 0 ?
+        random() % PG_INT32_MAX : callerseed;

This could use pg_backend_random() instead.

+--
+-- These tests don't produce any interesting output, unless they fail.  For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+    nelements => 838861,
+    seed => -1,
+    tests => 1);

This could also test the reproducibility of the results with a fixed seed
and at least two rounds; a lower number of elements could be more
appropriate to limit the run time.

+    /*
+     * Aim for two bytes per element; this is sufficient to get a false
+     * positive rate below 1%, independent of the size of the bitset or total
+     * number of elements.  Also, if rounding down the size of the bitset to
+     * the next lowest power of two turns out to be a significant drop, the
+     * false positive rate still won't exceed 2% in almost all cases.
+     */
+    bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+
+    /* Minimum allowable size is 1MB */
+    bitset_bytes = Max(1024L * 1024L, bitset_bytes);

I would vote for simplifying the initialization routine and letting the
caller decide the size directly. Are there implementation complications
if the size is not a power of two? I cannot spot any by looking at the
code. I think that the key point is just to document that a false
positive rate of 1% can be reached with just 9.6 bits per element, and
that this rate can be divided by 10 for each additional 4.8 bits per
element. I would expect people using this API to be smart enough to do
proper sizing. Note that some callers may accept an even larger false
positive rate. In short, this basically brings us back to Thomas' point
upthread about a size estimation routine plus a separate routine doing
the initialization:
https://www.postgresql.org/message-id/CAEepm=0Dy53X1hG5DmYzmpv_KN99CrXzQBTo8gmiosXNyrx7+Q@mail.gmail.com
Did you consider it? Splitting the size estimation and the actual
initialization has a lot of benefits in my opinion.
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ * [...]
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)

Instead of that, a function that reports direct information about the
filter's state would be enough: the real number of elements and the
number of bits set in the filter. I am not sure that using rates is a
good idea; rounding can sometimes cause confusion.
--
Michael