Re: Adding a suffix array index
От | Hannu Krosing |
---|---|
Тема | Re: Adding a suffix array index |
Дата | |
Msg-id | 1100867900.3919.15.camel@fuji.krosing.net обсуждение исходный текст |
Ответ на | Adding a suffix array index (Troels Arvin <troels@arvin.dk>) |
Список | pgsql-hackers |
On R, 2004-11-19 at 12:42, Troels Arvin wrote: > The basic parts of the type are pretty much done. Those interested may > have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the > code organization needs some clean-up). The basic type implementation > should be improved by adding more string functions and by implementing a > set of specialized selectivity functions. I cant answer your immediate questions, just rant on general issues ;) > Part of my current code concerns packing DNA characters: As the alphabet > of DNA strings is very small (four characters), it seems like a > straigt-forward optimization to store each character in two bits. My advice would be to get it to work first, oprimize later. Thus I guess you could start by storing DNA sequences as character strings. > A warning: This is my first C project, so please > don't laugh too much (publicly) if you find strange constructs in my code... Then even more so - get the novel and generally interesting part (indexing huge arrays) right first, and optimise for space (compressing 4 chars into 1) later. You could do this 4->1 compression when storing the string into tuple, but I strongly recommend doing actual work (indexing/searching) at a level that C directly supports (i.e. bytes/characters) This enables you to get the basics right first without distraction from all bit-shifting inside bytes. A good tuned algorithm will almost certainly offset the 4 time space disadvantage. ... > My first and most immediate goal is to support efficient answering of a > question like "which rows contain the sequence TTGACCACTTG in column foo?". If you store your sequences as strings, you may try to use trigrams (or modify them to 4,5,6 or 7-grams ;) to get some feel how that works. trigram module is in contrib/pg_trgm. ------------ Hannu
В списке pgsql-hackers по дате отправления: