Discussion: Bloom filter improvements in PostgreSQL
Hi All,
I would like to propose some improvements to the Bloom filter implementation that exists in Postgres at the moment. I will describe the changes below, along with a patch that includes some tests. I wanted to get feedback before I spend more time on this, so the patch is not production ready (WIP) but should be sufficient for review. The patch compiles, the tests pass, and it doesn't break anything (to the best of my knowledge!).
Problem Statement
The existing Bloom filter uses an integer number of hash functions. This is completely fine, but we can do better. A recent paper describes how Bloom filters can be tuned to use a non-integer number of hash functions. This is particularly nice because the optimal number of hash functions is often not an integer. When the number of hash functions (k) is calculated, it could come out as something like 2.3. We would then round up and use k = 3 hash functions, which makes the filter roughly ceil(k)/k times the optimal size, i.e. up to 1/k larger. If we take floor(k) instead, the filter becomes denser, which increases the false positive rate and may not be desirable.
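As a quick aside (not part of the patch), here is a tiny standalone program showing the standard k = -log2(p) formula and the ceil(k)/k penalty described above for a few representative false positive rates:

#include <math.h>
#include <stdio.h>

/*
 * Illustration only: the optimal number of hash functions k = -log2(p) is
 * rarely an integer, and rounding it up costs roughly ceil(k)/k in filter
 * size, as described above.  Compile with -lm.
 */
int
main(void)
{
    const double fp_rates[] = {0.2, 0.1, 0.01, 0.001};

    for (int i = 0; i < 4; i++)
    {
        double      p = fp_rates[i];
        double      k = -log(p) / log(2.0);     /* optimal k, usually fractional */

        printf("p = %.3f  optimal k = %.2f  rounded up = %.0f  extra space ~%.0f%%\n",
               p, k, ceil(k), 100.0 * (ceil(k) / k - 1.0));
    }
    return 0;
}

For p = 0.2 the optimal k is about 2.32, so rounding up to 3 costs roughly 29% extra space; the penalty shrinks as p gets smaller.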
Proposed Improvements
Instead we can relax the assumption that k must be an integer. To do this we split the value of k into an integer part and a fractional part; call these k and r. So in the example above, where we calculated k to be 2.3, this would mean k = 2 and r = 0.3. We always apply k hash functions, and we apply a third hash function 30% of the time (corresponding to r = 0.3 in this case). This is done deterministically, so the same element always gets the same treatment, which keeps the Bloom filter promise of no false negatives. The paper has more detail, but this is the gist of the rational Bloom filter.
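To make the fractional-hash idea concrete, here is a minimal sketch of what an insert could look like. The struct, the splitmix64-style mixer and the double-hashing scheme are illustrative assumptions on my part, not the code in src/backend/lib/bloomfilter.c:

#include <stdint.h>

/*
 * Sketch only: floor(k) hash functions are always applied, plus one extra
 * hash function for a deterministic r-fraction of keys.
 */
typedef struct RationalBloom
{
    uint64_t   *bitmap;         /* nbits/64 words, zero-initialized */
    uint64_t    nbits;
    int         k_whole;        /* integer part of k */
    double      k_frac;         /* fractional part r in [0, 1) */
} RationalBloom;

static inline uint64_t
mix64(uint64_t x)
{
    /* splitmix64 finalizer, used here only as a stand-in hash */
    x += UINT64_C(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
    return x ^ (x >> 31);
}

static void
rational_bloom_add(RationalBloom *bf, uint64_t key)
{
    uint64_t    h1 = mix64(key);
    uint64_t    h2 = mix64(key ^ UINT64_C(0xA5A5A5A5A5A5A5A5)) | 1;
    int         k = bf->k_whole;

    /*
     * Deterministic fractional decision: the same key always produces the
     * same h2, so it either always or never gets the extra hash function.
     * The lookup path must make the identical decision, which is what keeps
     * the no-false-negatives guarantee.
     */
    if ((double) (h2 >> 11) / (double) (UINT64_C(1) << 53) < bf->k_frac)
        k++;

    /* Plain double hashing: bit_i = h1 + i * h2 (mod nbits). */
    for (int i = 0; i < k; i++)
    {
        uint64_t    bit = (h1 + (uint64_t) i * h2) % bf->nbits;

        bf->bitmap[bit / 64] |= UINT64_C(1) << (bit % 64);
    }
}

The corresponding lookup applies exactly the same rule, so a key that was inserted with the extra hash function is always probed with it as well.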
However, we can take this further and make the Bloom filter lossless. The fractional part described above gives us some wiggle room, so to speak. We can complement the 'lossy' rational Bloom filter with an additional array called a witness, which allows us to reconstruct the original input, i.e. a false positive rate of 0. The witness (W) is constructed alongside the Bloom filter bitmap and encodes one bit for every element that passes the Bloom filter test (both the true and the false positives). Clearly this establishes a direct relationship between the false positive rate and the size of the witness. Because everything is done deterministically, the decoding process is just the reverse of the encoding process: we can distinguish true positives from false positives by checking the witness, and by the same determinism there can be no false negatives.
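And a rough sketch of how the witness could be built, building on the RationalBloom sketch above. The Bitvec type, the assumed rational_bloom_test() lookup and the fixed enumeration order over candidate values are all illustrative assumptions, not code from the patch:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Bitvec
{
    uint8_t    *bytes;          /* caller-sized buffer */
    size_t      nbits;          /* bits appended so far */
} Bitvec;

static void
bitvec_append(Bitvec *w, int bit)
{
    if (bit)
        w->bytes[w->nbits / 8] |= (uint8_t) (1u << (w->nbits % 8));
    w->nbits++;
}

/* Assumed lookup counterpart of rational_bloom_add() from the sketch above. */
extern bool rational_bloom_test(const RationalBloom *bf, uint64_t key);

/*
 * Walk the candidate values in a fixed order that encoder and decoder agree
 * on.  Every candidate that passes the filter (true or false positive) gets
 * exactly one witness bit; is_member() is the ground truth available while
 * building, i.e. the set of actually inserted keys.
 */
static void
witness_build(const RationalBloom *bf,
              const uint64_t *candidates, size_t ncandidates,
              bool (*is_member)(uint64_t), Bitvec *witness)
{
    for (size_t i = 0; i < ncandidates; i++)
    {
        if (rational_bloom_test(bf, candidates[i]))
            bitvec_append(witness, is_member(candidates[i]) ? 1 : 0);
    }
}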
Tradeoffs and Considerations
As stated in the previous section, there is a direct relationship between the false positive rate and the size of the witness: the lossless capability comes at the price of extra memory for the witness. The expected witness size is t * (1 + r^2), where t is the number of true positives and r = 1/ln 2. Let's look at a concrete example:
t = 1,000,000 (a million items in the filter)
fp = 0.01
size of the filter = 9,585,059 bits (approx 1.14 MiB)
k = 6.64 (assuming we use a rational number of hash functions)
For a Bloom filter set up with these parameters we can expect a witness size of approximately 3,000,000 bits. The total space complexity is still O(N). The main tradeoff is, of course, the memory overhead of the witness. I have done some preliminary benchmarking and, as expected, using the witness slows down inserts and lookups: inserts are around 3x slower and lookups around 2x slower. I definitely think this can be made faster, but a 'lossless' option will always be slower than the traditional Bloom filter.
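For reference, a tiny standalone calculation that reproduces these numbers using the standard sizing formulas plus the t * (1 + r^2) witness estimate from above (illustration only):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double ln2 = log(2.0);
    double      t = 1e6;        /* inserted (true positive) elements */
    double      p = 0.01;       /* target false positive rate */

    double      m = ceil(-t * log(p) / (ln2 * ln2));    /* filter size in bits */
    double      k = (m / t) * ln2;                      /* optimal k, fractional */
    double      r = 1.0 / ln2;
    double      witness_bits = t * (1.0 + r * r);       /* expected witness size */

    printf("filter bits  = %.0f (~%.2f MiB)\n", m, m / 8.0 / (1024.0 * 1024.0));
    printf("optimal k    = %.2f\n", k);
    printf("witness bits = %.0f\n", witness_bits);
    return 0;
}

Compiled with -lm this prints roughly 9,585,059 filter bits (~1.14 MiB), k of about 6.64 and about 3.08 million witness bits.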
Final thoughts and comments
The rational Bloom filter would be a nice improvement to the existing Bloom filter implementation. The lossless option could be useful in some scenarios, at the expense of extra memory and slower lookup and insertion times. I have attached two patches showing a WIP of the changes (mainly to do with the lossless option). I would appreciate any feedback on whether any or all of this proposal would be considered. I'd be happy to make the changes myself.
Hi David,
I appreciate you taking the time to explore this. I plan to work on getting the patches into a more production-ready state and to do some more benchmarking. I also appreciate you highlighting that I did not reply-all in my previous email; this was an oversight on my part. I will paste my message below so that others can see it.
Your observation that the witness adds "around 25% of additional memory" is correct. However, the crucial distinction is that the witness data structure itself does not incur a false positive rate. Its fundamental purpose is to enable lossless reconstruction of the original input. A standard Bloom filter is inherently lossy due to its probabilistic nature; it can produce false positives (reporting an element as present when it is not) but never false negatives. The witness acts as a definitive resolver for these false positives. For any element that the Bloom filter might indicate as present (i.e., it passes the Bloom filter test), the witness stores a single bit that unequivocally declares whether that element is a true positive (originally in the set) or a false positive (not originally in the set).

The witness does not store information for every element in the entire universe. Instead, it only needs to store a bit for each element that passes the Bloom filter test. This set of elements comprises all the true positives (elements that were actually inserted) and all the false positives (elements not inserted but for which the Bloom filter incorrectly returns a positive). This establishes a direct correlation between the witness size and the false positive rate.
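To show the decoding direction concretely, here is a matching sketch with the same illustrative assumptions as the construction sketch in my first mail (a fixed candidate order, the hypothetical Bitvec and rational_bloom_test()):

/*
 * Walk the candidates in the same fixed order used while building; every
 * time the filter says "maybe", consume the next witness bit to tell a true
 * positive from a false positive.  Returns the number of reconstructed
 * members written to members_out.
 */
static size_t
witness_decode(const RationalBloom *bf, const Bitvec *witness,
               const uint64_t *candidates, size_t ncandidates,
               uint64_t *members_out)
{
    size_t      next_bit = 0;
    size_t      nout = 0;

    for (size_t i = 0; i < ncandidates; i++)
    {
        if (!rational_bloom_test(bf, candidates[i]))
            continue;           /* filter says definitely not present */

        /* One witness bit per filter-positive, in enumeration order. */
        if (witness->bytes[next_bit / 8] & (1u << (next_bit % 8)))
            members_out[nout++] = candidates[i];    /* true positive */
        next_bit++;
    }
    return nout;
}

Elements the filter rejects were never inserted, so there are no false negatives, and no filter-positive is reported without consulting the witness, so there are no false positives.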
In effect this is a compression scheme, and a pretty decent one at that. I have posted on Hacker News about this scheme, and I have a repo where I am using it to compress video (a different topic, not related to Postgres).
I'll publish this as a paper at some point. As with any compression scheme there are some caveats: the ratio of true positives to the universe of possible values must be below 0.32 for it to be effective. For example, suppose we use a Bloom filter in Postgres to query indexes. The universe of possible values would then be all possible 64-bit integers, and the true values would be the actual index values we wish to store in the Bloom filter. The requirement that the ratio must not exceed 0.32 therefore isn't a big deal, given the enormous number of values a 64-bit integer can represent.
I hope this explains the witness data structure a bit better. Feel free to ask me any further questions. I would be more than happy to collaborate on this patch or any future patches.
Regards,
Ross
On Fri, Jul 4, 2025 at 3:02 PM David E. Wheeler <david@justatheory.com> wrote:
Hi Ross,
I hope to put some time into exploring this in the next week, but wanted to point out here, in case you hadn’t noticed, that your reply went only to me and not pgsql-hackers. The convention on that list is to reply to all for most messages unless people need/want to sidebar on a topic. Not sure if that’s what you intended, but I suspect other people would find your reply interesting.
Best,
David
> On Jul 3, 2025, at 16:44, Ross Heaney <rheaney26@gmail.com> wrote:
>
> Hi,
>
> Your observation that the witness adds "around 25% of additional memory" is correct. However, the crucial distinction is that the witness data structure itself does not incur a false positive rate. Its fundamental purpose is to enable lossless reconstruction of the original input. A standard Bloom filter is inherently lossy due to its probabilistic nature; it can produce false positives (reporting an element as present when it is not) but never false negatives. The witness acts as a definitive resolver for these false positives. For any element that the Bloom filter might indicate as present (i.e., it passes the Bloom filter test), the witness stores a single bit that unequivocally declares whether that element is a true positive (originally in the set) or a false positive (not originally in the set). The witness does not store information for every element in the entire universe. Instead, it only needs to store a bit for each element that passes the Bloom filter test. This set of elements comprises all the true positives (elements that were actually inserted) and all the false positives (elements not inserted but for which the Bloom filter incorrectly returns a positive). This establishes a direct correlation between the witness size and the false positive rate.
>
> In effect this is a compression scheme and a pretty decent one at that. I have posted on hacker news about this scheme and I have a repo where I am using it to compress video(a different topic not related to postgres) .
> I'll publish this as a paper at some point. As with any compression scheme there are some caveats. The ratio of true positives to the universe of possible values must be below 0.32 for it to be effective. For example in postgres, suppose we use a bloom filter to query indexes. Therefore the universe of possible values would be all possible 64 bit integers. The true values would be the values of the actual indexes we wish to store in the bloom filter. Therefore the fact that the ratio must not exceed 0.32 isn't that big of a deal given the massive number a 64 bit integer can hold.
>
> I hope this maybe explains the witness data structure a bit better.
>
> Regards,
> Ross
>
> On Thu, Jul 3, 2025 at 4:59 PM David E. Wheeler <david@justatheory.com> wrote:
> Hello,
>
> This is an interesting proposal, thanks for posting.
>
> On Jul 3, 2025, at 11:00, Ross Heaney <rheaney26@gmail.com> wrote:
>
> > Proposed Improvements
> > Instead we can relax the assumption that it must be an integer value for k. To do this we split the k value into an integer and a rational part. Call this k and r. So in the example above where we calculated k to be equal to 2.3 this would mean k = 2 and r = 0.3. We always use k hash functions but we apply a third hash function 30% of the time(corresponding to r = 0.3 in this case). This is done deterministically so the same element always gets the same treatment(keeps the bloom filter promise of no false negatives). The paper has more information but this is the gist of the rational part of the bloom filter.
>
> Intuitively this makes sense to me. I skimmed the paper and my eyes rolled up into my head, but the basic idea certainly seems sound. I’m curious about the "Variably-Sized Block Bloom filter,” bit, though, which I don’t follow and doesn’t seem to be part of your proposal.
>
> > However we can take this further and look to make the bloom filter lossless. Using the rational part described above we've given ourselves some wriggle room so to speak. We can compliment the rational 'lossy' bloom filter with an additional array called a witness. This allows us to reconstruct the original input, which just means a false positive rate of 0.
>
> I have a bit of a harder time with the witness. I get that the size is limited, so it won’t grow to ginormous size based on the number of entries, but don’t understand how it could be so small (adding around 25% of additional memory in your example) unless it, too, accepts a certain false positivity rate. In which case I wonder whether there is a comparable reduction in the false positive rate by simply increasing the size of the bloom filter itself by 25%.
>
> But since you suggest that the false positive rate can be reduced to effectively zero, I must be missing something. I’d be curious to read more on this pattern.
>
> Best,
>
> David
>
On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheaney26@gmail.com> wrote:
> The witness does not store information for every element in the entire universe. Instead, it only needs to store a bit for each element that passes the Bloom filter test.

I think this has a major flaw, in that it is nearly impossible to guarantee zero false positives in practice using the scheme you describe. Assuming 1 bit for each element that passes the main Bloom filter test, the witness list needs to be sized to O(universe * fp_rate) to accommodate a bit for all passed elements (with some margin to either direction to account for variability in the practical Bloom filter false positive rate).

When you store 32-bit integers, this would be ~2^22 bits for a false positive rate of 1/1024. Storing 64-bit integers at the same fp_rate would require ~2^54 bits in this witness list. The bloom filter itself contains only N elements, and the size of a bloom filter is proportional to -log(fp_rate) * N, which for any complex data type is orders of magnitude smaller than the witness list.

The bloom filter is orders of magnitude smaller, because this witness list only works when we assume that there is no pigeonhole issue with the values stored, that we're only looking for small integer types, or hashes and don't mind hash collisions and the related false positives.

It is quite possible (guaranteed even!) that two different strings hash to exactly the same values when using a fixed-size hash output, like the hash methods used in PostgreSQL. A witness list that uses your described method to guarantee no false positives even for strings would have to be sized to the order of 2^(8 * 2^30), give or take a few zeroes, to account for all possible string values in PostgreSQL. That just isn't practical.

Actually, after looking at your 0001 patch, it looks like your 'witness' is not a list of both true and false positives, but a hashmap of only the true positive values, i.e. the list of inserted values. That does indeed make the combined data structure mostly "lossless", but
1.) this adds O(n_inserted) overhead with a large constant multiplier on n_inserted, as hash maps have about 24 B / 192 bits of overhead per entry, which is probably an order of magnitude more than the approximately O(-log(p)) of the bloom filter itself; and
2.) this won't solve the pigeonhole issue with trying to use a bloom filter on strings or other data types that might have more than the kept 64 bits of entropy.

So, I'm sorry, but I don't think this is a lossless Bloom filter.

Kind regards,

Matthias van de Meent
Databricks
Hi,
I appreciate you taking the time to leave feedback. I would like to address your points below.
> I think this has a major flaw, in that it is nearly impossible to
> guarantee zero false positives in practice using the scheme you
> describe. Assuming 1 bit for each element that passes the main Bloom
> filter test, the witness list needs to be sized to O(universe *
> fp_rate) to accomodate a bit for all passed elements (with some margin
> to either direction to account for variability in the practical Bloom
> filter false positive rate).
>
> When you store 32-bit integers, this would be ~2^22 bits for a false
> positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
> would require ~2^54 bits in this witness list. Assuming the bloom
> filter itself contains only N elements, and the size of a bloom filter
> is proportional to -log(fp_rate) * N, which for any complex data type
> is orders of magnitude smaller than the witness list.
You've highlighted an important point. I was quite lazy in how I defined the universe of possible values: I defined it to be all possible 64-bit values, so your calculation is correct. I have been primarily using this scheme to compress video, where the universe of possible values is much, much smaller, and I did not fully think through the implications of allowing the universe of possible values to be that large.
After thinking about this a bit more, I don't think it's practical at all to use this scheme in PostgreSQL. I still think the ideas from the paper around using a rational number of hash functions are worthwhile, and I would be happy to provide a patch for this.
Kind regards,
Ross
On Fri, Jul 4, 2025 at 6:26 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheaney26@gmail.com> wrote:
> The witness does not store information for every element in the entire universe. Instead, it only needs to store a bit for each element that passes the Bloom filter test.
I think this has a major flaw, in that it is nearly impossible to
guarantee zero false positives in practice using the scheme you
describe. Assuming 1 bit for each element that passes the main Bloom
filter test, the witness list needs to be sized to O(universe *
fp_rate) to accomodate a bit for all passed elements (with some margin
to either direction to account for variability in the practical Bloom
filter false positive rate).
When you store 32-bit integers, this would be ~2^22 bits for a false
positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
would require ~2^54 bits in this witness list. Assuming the bloom
filter itself contains only N elements, and the size of a bloom filter
is proportional to -log(fp_rate) * N, which for any complex data type
is orders of magnitude smaller than the witness list.
The bloom filter is orders of magnitude smaller, because this witness
list only works when we assume that there is no pidgeon hole issue
with the values stored, that we're only looking for small integer
types, or hashes and don't mind hash collisions and the related false
positives.
It is quite possible (guaranteed even!) that two different strings
hash to exactly the same values when using a fixed-size hash output,
like the hash methods used in PostgreSQL. A witness list that uses
your described method to guarantee no false positives even for strings
would have to be sized to the order of 2^(8 * 2^30), give or take a
few zeroes, to account for all possible string values in PostgreSQL.
That just isn't practical.
Actually, after looking at your 0001 patch, it looks like your
'witness' is not a list of both true and false positives, but a
hashmap of only the true positive values, i.e. the list of inserted
values. That does indeed make the combined data structure mostly
"lossless", but
1.) this adds O(n_inserted) overhead with a large constant multiplier
on n_inserted, as hash maps have about 24B/192bits of overhead per
entry, which is probably an order of magnitude more than the
approximately O(-log(p)) of the bloom filter itself; and
2.) this won't solve the pidgeon hole issue with trying to use a bloom
filter on strings or other data types that might have more than the
kept 64 bits of entropy.
So, I'm sorry, but I don't think this is a lossless Bloom filter.
Kind regards,
Matthias van de Meent
Databricks