Re: Postgresql as a dictionary coder backend?
From | Attila Nagy
---|---
Subject | Re: Postgresql as a dictionary coder backend?
Date |
Msg-id | 4D3DE255.7030605@fsn.hu
In reply to | Re: Postgresql as a dictionary coder backend? (Fredric Fredricson <Fredric.Fredricson@bonetmail.com>)
List | pgsql-general
On 01/24/2011 05:27 AM, Fredric Fredricson wrote:
>> I can easily do this with any RDBMS, with a table of three columns:
>> auto-incremented ID, word and refcount, with a unique index on word.
>> The challenge could be:
>> - that it should scale to several TBs of size and several (hundred)
>> billion records. One scenario would be to store about 40 TBs of
>> words, and the average word length would be about 50-60 bytes (that's
>> about 800*10^9 records). It should work well both for inserting and
>> searching (encoding and decoding) words.
> 50-60 bytes! Oh, so we are not talking about natural language here.
> Sorry, I just assumed that.
Yes, sorry.

> Still, I think performance will be a big issue here. I have never
> tried postgresql on anything faster than a fast PC, but in my humble
> experience an insert will take at least one ms. With this speed,
> 800*10^9 records would take 25 years to insert.
> I think you have to think bigger than a single server (ok, that was
> stating the obvious).
Well, to be honest, that would be for a single server. And I would like
to have more than a single server. :)
One ms is 1000 TPS, and that's only if you can't drive the DB harder with
multiple threads. I haven't really got a fast PC yet, but I could easily
achieve 12k TPS with some multicore hardware and BBWC RAID controllers.
It's only 771 days then. :)

>> - I need atomicity and durability, but having these at the word
>> (record) level takes too many IOPS and serves no purpose, so it would
>> be good to have an interface for inserting about 1000-500000 words in
>> one call, assigning a unique ID to each unique word and storing them
>> (if the word already has an ID, increment its refcount), and giving
>> back the IDs for each word. This could be committed as one
>> transaction, so the transactions could be big, sparing IOPS.
>> - I need concurrency, so when the above happens from two sources at
>> the same time, the same word in the two transactions must get the
>> same ID.
> Unless the ID is some kind of hash, you will have to serialize inserts
> of new words.
I'm starting to think that too, if the unique constraint is there...

>> Is postgresql a good choice for doing this and if yes, what would be
>> the optimal (for both time and space efficiency at encoding and
>> decoding) use case?
> I might be wrong, but this kind of project should probably not rely on
> a "standard" RDBMS. The data structure itself does not seem too
> complex, and the performance requirements are quite demanding.
I've done some experiments with BerkeleyDB and Tokyo/Kyoto Cabinet, but
I was curious about the opinion of the RDBMS masters.

Thanks for sharing.
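For concreteness, here is a minimal sketch of the kind of schema and batch interface described above. The table and function names are purely illustrative, and the upsert relies on INSERT ... ON CONFLICT, which requires PostgreSQL 9.5 or later, so treat it as an assumption rather than a drop-in answer.

```sql
-- Illustrative sketch only: table and function names are made up, and
-- INSERT ... ON CONFLICT requires PostgreSQL 9.5 or later.

CREATE TABLE words (
    id       BIGSERIAL PRIMARY KEY,      -- 64-bit key: ~800*10^9 rows expected
    word     BYTEA  NOT NULL UNIQUE,     -- the 50-60 byte "words"
    refcount BIGINT NOT NULL DEFAULT 1
);

-- Encode one batch (1000-500000 words) in a single transaction:
-- new words are inserted, already-known words get their refcount bumped,
-- and every distinct word in the batch comes back with its ID.
CREATE FUNCTION encode_words(p_words BYTEA[])
RETURNS TABLE (word_id BIGINT, word_value BYTEA)
LANGUAGE sql AS $$
    INSERT INTO words (word)
    SELECT DISTINCT w                    -- DISTINCT: ON CONFLICT must not touch
    FROM unnest(p_words) AS t(w)         -- the same row twice in one command
    ON CONFLICT (word)
        DO UPDATE SET refcount = words.refcount + 1
    RETURNING words.id, words.word;
$$;

-- Example call (hex bytea literals for "foo" and "bar"):
-- SELECT * FROM encode_words(ARRAY['\x666f6f', '\x626172']::bytea[]);
```

On the concurrency point: if two concurrent batches both contain a new word, one insert wins and the other takes the ON CONFLICT branch against the existing row, so both callers get the same ID back; the cost is an occasional skipped sequence value and the fact that updates to the same hot word still serialize on that row.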