Re: scoring differences between bitmasks
От | Todd A. Cook |
---|---|
Тема | Re: scoring differences between bitmasks |
Дата | |
Msg-id | 43407846.3080601@blackducksoftware.com обсуждение исходный текст |
Ответ на | Re: scoring differences between bitmasks (Ben <bench@silentmedia.com>) |
Список | pgsql-general |
Hi, Sorry for being so abtuse; my 9 year old has been helping me with my work today. :) The hamming distance between two bit vectors can also be found as the sum of distances between sub-vectors. Let's assume you're looking for all vectors less than d bits away from a search vector, and we'll divide the vectors into k-bit subvectors. The value of k will depend on d, how sparse you data is, and how much memory you have for lookup tables. The goal is to eliminate the vectors that are more than d bits away. (This assumes that most of the search vectors are more than d bits away; for the opposite case, just reverse the sense of the comparisons below.) Compute the set of k-bit vectors that <d away from the first k-bits of the search vector, using a lookup table if feasible. Find the set of full vectors whose initial k-vector is in this set. For those whose distance is d-1, the remaining bits have to be identical, which is easy to check. Repeat for the second set of k-vectors where their distance has to be less than d-1, then d-2, etc. You can do these searches using the linked tries I mentioned earlier. (I've described the searches as happening sequentially, but that needn't be the case. You can search the subvectors simultaneously and then filter the results at the client.) You don't have to use constant length subvectors. You can start with small subvectors to keep the size of the search sets down, and then increase the vector length as the set of remaining possibilities shrinks. Also, you don't have to work from one end of the full vector to the other; start with a subvector where you expect a lot of variability. I think this method is similar in spirit, if not detail, to what Martijn suggested. I've actually implemented this approach before, though using my own file-based lookups rather than a real database. I had a set of 128-bit vectors used to filter an input stream; most of the comparisons never looked past 64 bits. -- todd Ben wrote: > Sure, but unless I can figure out some way to choose a small number of > vectors, I'm left with computing the full N^2 set. Given that I'm > shooting for N to be 4 million or larger, that's a lot of data to > store..... > > On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote: > >> Ben wrote: >> >>> Just the number of bits, not which ones. Basically, the hamming >>> distance. >>> >> >> I see. Could you pre-compute the bit counts for the vectors in the >> table? >> You could count the bits in the search vector as Martijn suggested, >> and then >> do a lookup based on the count.
В списке pgsql-general по дате отправления: