Questions about indexes?
From | Ryan Bradetich |
---|---|
Subject | Questions about indexes? |
Date | |
Msg-id | 1045462348.27140.54.camel@beavis.ybsoft.com |
Replies | Re: Questions about indexes? (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Questions about indexes? (Curt Sampson <cjs@cynic.net>) |
List | pgsql-hackers |
Hello postgres hackers,

Been a while since I have participated on this list ... but I have a new itch to scratch....

Although the table schema is immaterial, I will provide it so we have a common framework for this discussion:

    host_id    integer   (not null)
    timestamp  datetime  (not null)
    category   text      (not null) [<= 5 chars]
    anomaly    text      (not null) [<= 1024 chars]

This table is used to store archived data, so each row in the table must be unique. Currently I am using a primary key across all four columns to enforce this uniqueness. This table currently has ~86 million rows and is 16+ GB in size. The primary key index is also 16+ GB in size, because it appears all the data is duplicated in the index. (I have only done some preliminary looking at the database file with strings, etc., so this assumption is purely based on those observations.)

I am not sure why all the data is duplicated in the index, but I bet it has to do with performance, since it would save a lookup in the main table. Are there any benchmarks or papers related to this topic I should locate and read? I am curious about this because it seems the only advantage gained is searching the index for the specified values. Once the entry is found, the full entry needs to be pulled from the main table anyhow, since the index does not contain all the data. Also, with the increased size, it seems additional pressure would be put on the shared memory caches (no idea how this really works, just guessing! :)).

Since my only requirement is that the rows be unique, I have developed a custom MD5 function in C and created an index on the MD5 hash of the concatenation of all the fields. This has reduced the disk space usage considerably, as shown below against my test database (~6 million rows at 1+ GB).
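[Editor's note: a minimal sketch of the hashed-uniqueness idea described above, in Python rather than the author's custom C function (which is not shown in the post). The column names match the schema; the NUL separator is an added assumption, used to keep adjacent fields from running together and producing false duplicates.]

```python
import hashlib

def row_digest(host_id, timestamp, category, anomaly):
    """Collapse all four columns into one fixed-size MD5 digest.

    A NUL separator between fields avoids ambiguity such as
    ('ab' + 'c') colliding with ('a' + 'bc') when concatenated.
    The digest, not the raw columns, is what gets indexed.
    """
    joined = "\x00".join([str(host_id), str(timestamp), category, anomaly])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Rows differing in any single field hash to different digests,
# so a unique index on the digest enforces row uniqueness.
d1 = row_digest(42, "2003-02-17 00:12:28", "sec", "login failure: root")
d2 = row_digest(42, "2003-02-17 00:12:28", "sec", "login failure: bob")
assert d1 != d2
assert len(d1) == 32  # 16-byte MD5 digest, hex-encoded
```

The win is that the index entry is a fixed 16 bytes (32 as hex) instead of up to ~1 KB of column data, at the cost of a theoretical (astronomically unlikely) hash collision.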
All this data is based off the test database running 7.3.2:

    Type              Size (bytes)
    ------------------------------
    Database table     1188642816
    All-columns pkey   1510252544
    MD5-column pkey     370999296

Just using the MD5 hash data instead of all the columns is a considerable disk space win, going from 1.5 GB to 370 MB.

Has anyone else solved this problem? Has anyone else looked into something like this, and would you mind sharing so I do not have to re-invent the wheel? :)

Also (assuming there are no papers / benchmarks proving data in the index is a good idea), how difficult would it be to implement an index type that extracts the data from the main table?

Thanks for reading. I will be happy to field any questions that I can, or read any papers, research, etc. that relate to this topic.

- Ryan

P.S. The production database is running 7.2.4, if that makes a difference.

--
Ryan Bradetich <rbradetich@uswest.net>
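[Editor's note: a quick back-of-the-envelope check on the byte counts reported above, using only the figures given in the post.]

```python
# Byte counts reported for the 7.3.2 test database (~6 million rows).
table_size       = 1188642816  # the table itself
all_columns_pkey = 1510252544  # primary key over all four columns
md5_pkey         = 370999296   # unique index over the MD5 digest

# The all-columns index is actually larger than the table it indexes,
# consistent with every row's data being duplicated in the index
# (plus per-entry index overhead).
assert all_columns_pkey > table_size

# The MD5 index is roughly a quarter the size of the all-columns index.
ratio = md5_pkey / all_columns_pkey
print(f"MD5 index is {ratio:.1%} of the all-columns index")
```

This matches the rough "1.5 GB down to 370 MB" summary: about a 4x reduction in index size.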