Re: Creating large database of MD5 hash values
From: Florian Weimer
Subject: Re: Creating large database of MD5 hash values
Date:
Msg-id: 82skxsqq67.fsf@mid.bfk.de
In response to: Re: Creating large database of MD5 hash values ("Jon Stewart" <jonathan.l.stewart@gmail.com>)
List: pgsql-performance
* Jon Stewart:

>> BYTEA is slower to load and a bit inconvenient to use from DBI, but
>> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>> bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average instead of 2, and a complex
form of double-decoding is necessary.

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version of
$sth->execute (or things like $sth->selectarray, $sth->do); you must use
explicit binding by parameter index in order to specify the type
information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think. So it's better to denormalize
in this case (if that's what you were alluding to in your original
post).

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).

In this case, PostgreSQL's in-memory bitmap indices should give you most
of the effect of your hash <-> ID mapping anyway.

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it.
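For what it's worth, the 17-vs-33-byte figure can be sanity-checked with a quick sketch (Python here purely for illustration): an MD5 digest is 16 raw bytes versus 32 hex characters, and PostgreSQL 8.3's 1-byte short varlena header accounts for the extra byte on each.

```python
import hashlib

h = hashlib.md5(b"example")
raw = h.digest()       # binary form, as it would sit in a BYTEA column
hexed = h.hexdigest()  # hex form, as it would sit in TEXT/VARCHAR

# 16 raw bytes vs 32 hex characters; with the 1-byte short varlena
# header in PostgreSQL 8.3 this becomes 17 vs 33 bytes on disk.
print(len(raw), len(hexed))
```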
This type of query might benefit from your hash <-> ID mapping, however,
because the working set is smaller.

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99