Questions about indexes?

From: Ryan Bradetich
Subject: Questions about indexes?
Date:
Msg-id: 1045462348.27140.54.camel@beavis.ybsoft.com
Replies: Re: Questions about indexes?  (Tom Lane <tgl@sss.pgh.pa.us>)
         Re: Questions about indexes?  (Curt Sampson <cjs@cynic.net>)
List: pgsql-hackers
Hello postgres hackers,

It's been a while since I've participated on this list, but I have a
new itch to scratch.

Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:

host_id      integer    (not null)
timestamp    datetime   (not null)
category     text       (not null)   [<= 5 chars]
anomaly      text       (not null)   [<= 1024 chars]

This table is used to store archived data, so each row in the table
must be unique.  Currently I am using a primary key across all the
columns to enforce this uniqueness.  The table currently has ~86
million rows and is 16+ GB in size.  The primary key index is also
16+ GB, because it appears all the data is duplicated in the index.
(I have only done some preliminary poking at the database files with
strings, etc., so this assumption is purely based on those
observations.)
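
In DDL form the layout is roughly this (a sketch: the table name
"archive" is made up for illustration, and datetime is the old
spelling of what is now timestamp):

    CREATE TABLE archive (
        host_id     integer    NOT NULL,
        "timestamp" timestamp  NOT NULL,
        category    text       NOT NULL,  -- <= 5 chars
        anomaly     text       NOT NULL,  -- <= 1024 chars
        PRIMARY KEY (host_id, "timestamp", category, anomaly)
    );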

I am not sure why all the data is duplicated in the index, but I bet
it has to do with performance, since it saves a lookup in the main
table.  Are there any benchmarks or papers related to this topic I
should locate and read?  I am curious about this because it seems the
only advantage gained is searching the index for the specified
values.  Once the entry is found, the full row needs to be pulled
from the main table anyhow, since the index does not contain all the
data.  Also, with the increased size, it seems additional pressure
would be put on the shared memory caches (no idea how this really
works, just guessing! :))
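
A cleaner way to check the sizes than running strings over the files
is to ask the catalogs; a sketch, again assuming the made-up table
name "archive" (relpages counts 8 KB pages and is only accurate as
of the last VACUUM/ANALYZE):

    -- relkind 'r' = table, 'i' = index
    SELECT relname, relkind, relpages * 8192 AS bytes
      FROM pg_class
     WHERE relname LIKE 'archive%'
     ORDER BY bytes DESC;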


Since my only requirement is that the rows be unique, I have
developed a custom MD5 function in C and created an index on the MD5
hash of the concatenation of all the fields.  This has reduced the
disk space usage considerably, as shown below against my test
database (~6 million rows at 1+ GB).

All this data is based on the test database running 7.3.2:
Type                 Size (bytes)
-------------------------------------
Database table         1188642816
All-columns pkey       1510252544
MD5-column pkey         370999296

Just using the MD5 hash instead of all the columns is a considerable
disk-space win, going from 1.5 GB to 370 MB.
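
For reference, the index looks roughly like this (a sketch: the
table name "archive" and the '|' delimiter are mine; on 7.2/7.3 the
whole expression has to be wrapped in a single function, which is
what the custom C function does, since the md5() built-in and
free-form expression indexes only arrived in later releases):

    -- the '|' delimiter keeps distinct rows from concatenating to
    -- the same string (e.g. 'ab' || 'c' vs. 'a' || 'bc')
    CREATE UNIQUE INDEX archive_md5_key ON archive
        (md5(host_id::text || '|' || "timestamp"::text || '|' ||
             category || '|' || anomaly));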

Has anyone else solved this problem?  Has anyone else looked into
something like this and would be willing to share, so I do not have
to re-invent the wheel? :)  Also (assuming there are no papers /
benchmarks proving that data in the index is a good idea), how
difficult would it be to implement an index type that fetches the
data from the main table?


Thanks for reading.  I will be happy to field any questions that I
can, or read any papers, research, etc. that relate to this topic.

- Ryan

P.S. The production database is running 7.2.4, if that makes a
difference.

-- 
Ryan Bradetich <rbradetich@uswest.net>


