Обсуждение: Re: Multi field hash indexes

Поиск
Список
Период
Сортировка

Re: Multi field hash indexes

От
Hannu Krosing
Дата:
Ocie Mitchell wrote:

> I was playing around with my latest compile and tried to make a unique

> index on two columns, using a hash method.  Both of these (more than
> one column and unique) are currently not allowed for hash indexes.
>
> I thought about this for a bit and realized that making a NESTED hash
> index (index on a and b also serves as an index on a) would be a
> trick, but allowing the unique clause should not be a problem.

It can be complicated (especially for extensible hashing) but the
theory
for this seems to be on in the database handbooks under the name of
'segmented hash' or some like fancy name.

The trick is to hash each field separately and then have a concatenation

of the hash values.

so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
hash
to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'

then the index can be used not only when selecting  = a,b,c but also
when
selecting on _any_ of the fields in this index. For example when
selecting
for b='friday' one would examine only buckets where the middle is 'bb'

> Therefore, I would like to try implementing unique constraints on hash

> indexes.  Has this come up before?  Are there any reasons not to
> support this?  As far as I understand, specifying an index method is
> non standard (above and beyond standard) to begin with.

While you are at it could you please comment if the GIST indexes are
used or
at least easily usable?

I have asked this question once in each two months for about a year now,
but
it seems that noone is in the position to answer it.

Cheers,

Hannu Krosing


Re: [HACKERS] Re: Multi field hash indexes

От
Michael Meskes
Дата:
Hannu Krosing writes:
> It can be complicated (especially for extensible hashing) but the
> theory
> for this seems to be on in the database handbooks under the name of
> 'segmented hash' or some like fancy name.
>
> The trick is to hash each field separately and then have a concatenation
>
> of the hash values.

More or less.

> so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
> hash
> to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'

I had a research project on this several years ago and we found that all
these multidimensional hash trees have one or the other problem. But one
index type named Balanced Multidimensional Hash Tree (from the top of my
head) performed excellently by combining hash and tree structures.

Michael

--
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10

Re: [HACKERS] Re: Multi field hash indexes

От
ocie@paracel.com
Дата:
Hannu Krosing wrote:
>
> Ocie Mitchell wrote:
>
> > I was playing around with my latest compile and tried to make a unique
>
> > index on two columns, using a hash method.  Both of these (more than
> > one column and unique) are currently not allowed for hash indexes.
> >
> > I thought about this for a bit and realized that making a NESTED hash
> > index (index on a and b also serves as an index on a) would be a
> > trick, but allowing the unique clause should not be a problem.
>
> It can be complicated (especially for extensible hashing) but the
> theory
> for this seems to be on in the database handbooks under the name of
> 'segmented hash' or some like fancy name.
>
> The trick is to hash each field separately and then have a concatenation
>
> of the hash values.
>
> so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
> hash
> to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'
>
> then the index can be used not only when selecting  = a,b,c but also
> when
> selecting on _any_ of the fields in this index. For example when
> selecting
> for b='friday' one would examine only buckets where the middle is 'bb'
>

HMM, this doesn't feel right.  If I have an index on four int4s
a,b,c,d and I only know d, then it seems like searching for these in
the hash table could be as much work, or more work than a table scan.
Now if a hash on two columns implicitly make a seperate hash on each,
and then took their intersection, that might be fast.

The kind of thing I am thinking that these two-or-more-column hash
indexes would be use for supporting intersection tests, where the
order of the items doesn't matter, but it is good to be able to come
up with a quick answer to the questions:

select x from t where y=5;
select y from t where x in (1, 4, 6, 7);
select count(*) from t where x=1 and y=10;


I was originally thinking that this would be supported like the btree
indexes are now -- an index on (a,b,c,d) serves as in index on a,
(a,b), (a,b,c) and (a,b,c,d), but it doesn't serve as an index on b,
or (b,c), etc.  My original idea was that the first item in the index
would define a hash table whose entries were hash tables on the second
item, etc.  I now think that this would waste quite a bit of space,
and would have the same restriction as btrees, which is unnatural.


> > Therefore, I would like to try implementing unique constraints on hash
>
> > indexes.  Has this come up before?  Are there any reasons not to
> > support this?  As far as I understand, specifying an index method is
> > non standard (above and beyond standard) to begin with.
>
> While you are at it could you please comment if the GIST indexes are
> used or
> at least easily usable?

What are GIST indexes?


Ocie Mitchell

Re: [HACKERS] Re: Multi field hash indexes

От
Bruce Momjian
Дата:
> I was originally thinking that this would be supported like the btree
> indexes are now -- an index on (a,b,c,d) serves as in index on a,
> (a,b), (a,b,c) and (a,b,c,d), but it doesn't serve as an index on b,
> or (b,c), etc.  My original idea was that the first item in the index
> would define a hash table whose entries were hash tables on the second
> item, etc.  I now think that this would waste quite a bit of space,
> and would have the same restriction as btrees, which is unnatural.

This is a standard restriction.  If you need an index on a lower-level
field, create one.  I don't think you are going to be able to improve on
(a,b), (a,b,c).  If you allowed (b,c) that is another index.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: Multi field hash indexes

От
ocie@paracel.com
Дата:
Bruce Momjian wrote:
>
> > I was originally thinking that this would be supported like the btree
> > indexes are now -- an index on (a,b,c,d) serves as in index on a,
> > (a,b), (a,b,c) and (a,b,c,d), but it doesn't serve as an index on b,
> > or (b,c), etc.  My original idea was that the first item in the index
> > would define a hash table whose entries were hash tables on the second
> > item, etc.  I now think that this would waste quite a bit of space,
> > and would have the same restriction as btrees, which is unnatural.
>
> This is a standard restriction.  If you need an index on a lower-level
> field, create one.  I don't think you are going to be able to improve on
> (a,b), (a,b,c).  If you allowed (b,c) that is another index.

What I meant is that this is a perfectly reasonable restriction for
btree indexes.  If the index is built on (a,b,c,d), then finding the
nodes where a=5 gives you a few branches of the tree, and if you
further restrict this to b=10, then you need only look at these
branches.  The same could be said for ranges of a values.

Hash tables are different in that they do not have this tree structure
(unless we impose this structure on them.  The problem I see with
building this tree structure as I described earlier is that it has the
possibility for a lot of wasted space (even more than btrees).  Also,
if we can provide the added functionality that a hash index on (a,b)
is also an index on b, with equal or lesser memory usage than the
hash-tree, wouldn't this be a "good thing".

Anyway, I think this needs some thinking.  I will look at the
alternatives and post something a bit later.

Ocie

Re: [HACKERS] Re: Multi field hash indexes

От
Hannu Krosing
Дата:
ocie@paracel.com wrote:

> Hannu Krosing wrote:
> >
> > The trick is to hash each field separately and then have a concatenation
> >
> > of the hash values.
> >
> > so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
> > hash
> > to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'
> >
> > then the index can be used not only when selecting  = a,b,c but also
> > when
> > selecting on _any_ of the fields in this index. For example when
> > selecting
> > for b='friday' one would examine only buckets where the middle is 'bb'
> >
>
> HMM, this doesn't feel right.  If I have an index on four int4s
> a,b,c,d and I only know d, then it seems like searching for these in
> the hash table could be as much work, or more work than a table scan.

I was assuming that the hash table scan would be cheaper than the table scan

> > > Therefore, I would like to try implementing unique constraints on hash
> >
> > > indexes.  Has this come up before?  Are there any reasons not to
> > > support this?  As far as I understand, specifying an index method is
> > > non standard (above and beyond standard) to begin with.
> >
> > While you are at it could you please comment if the GIST indexes are
> > used or
> > at least easily usable?
>
> What are GIST indexes?

Some kind of generalised binary tree indexes that should make it easy to
define additional indexing strategies.

There is a directory access/gist in src/backend, and they are briefly
mentioned in the PostgreSQL programmers guide.

They seem to be offspring of some independent (of postgres) Berkeley project.

Hannu


Re: [HACKERS] Re: Multi field hash indexes

От
ocie@paracel.com
Дата:
Hannu Krosing wrote:
>
> ocie@paracel.com wrote:
>
> > Hannu Krosing wrote:
> > >
> > > The trick is to hash each field separately and then have a concatenation
> > >
> > > of the hash values.
> > >
> > > so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
> > > hash
> > > to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'
> > >
> > > then the index can be used not only when selecting  = a,b,c but also
> > > when
> > > selecting on _any_ of the fields in this index. For example when
> > > selecting
> > > for b='friday' one would examine only buckets where the middle is 'bb'
> > >
> >
> > HMM, this doesn't feel right.  If I have an index on four int4s
> > a,b,c,d and I only know d, then it seems like searching for these in
> > the hash table could be as much work, or more work than a table scan.
>
> I was assuming that the hash table scan would be cheaper than the table scan

I think I see what you're getting at.  The different parts are
separately hashed and then concatenated.  I was thinking the other way
concatenate and THEN hash, which would spread the values all over the
table.

One problem with this (don't know how severe) is that a hash index on
a can answer 'select blah from t where a=10' by computing one hash
value and scaning the list at that point, whereas with an index on
(a,b), several hash values must be computed (one for each possible
resulting hash value of b).

>
> > > > Therefore, I would like to try implementing unique constraints on hash
> > >
> > > > indexes.  Has this come up before?  Are there any reasons not to
> > > > support this?  As far as I understand, specifying an index method is
> > > > non standard (above and beyond standard) to begin with.
> > >
> > > While you are at it could you please comment if the GIST indexes are
> > > used or
> > > at least easily usable?
> >
> > What are GIST indexes?
>
> Some kind of generalised binary tree indexes that should make it easy to
> define additional indexing strategies.
>
> There is a directory access/gist in src/backend, and they are briefly
> mentioned in the PostgreSQL programmers guide.
>
> They seem to be offspring of some independent (of postgres) Berkeley project.

Interesting.  I saw something in Dr Dobbs Journal (the last bastion of
programming magazines) about trinary trees.  The idea is that there is
a less than branch for items less than the given node and one for
those greater and another for those equal.  This can be used to index
items by using the less than/ greater than links to find the first
letter, then taking the 'equal' link and looking for the second
letter, etc.  The problem is that these are not self balancing (like
red black trees or btrees), but maybe there is a way to make them so.

Ocie

Re: [HACKERS] Re: Multi field hash indexes

От
Hannu Krosing
Дата:
ocie@paracel.com wrote:

> Hannu Krosing wrote:
> >
> > ocie@paracel.com wrote:
> >
> > > Hannu Krosing wrote:
> > > >
> > > > The trick is to hash each field separately and then have a concatenation
> > > >
> > > > of the hash values.
> > > >
> > > > so assuming that for fields (a,b,c) values (120, 'friday', 3.1415927)
> > > > hash
> > > > to 'aa', 'bb', 'cc' the hash value of the whole tuple will be 'aabbcc'
> > > >
> > > > then the index can be used not only when selecting  = a,b,c but also
> > > > when
> > > > selecting on _any_ of the fields in this index. For example when
> > > > selecting
> > > > for b='friday' one would examine only buckets where the middle is 'bb'
> > > >
> > >
> > > HMM, this doesn't feel right.  If I have an index on four int4s
> > > a,b,c,d and I only know d, then it seems like searching for these in
> > > the hash table could be as much work, or more work than a table scan.
> >
> > I was assuming that the hash table scan would be cheaper than the table scan
>
> I think I see what you're getting at.  The different parts are
> separately hashed and then concatenated.  I was thinking the other way
> concatenate and THEN hash, which would spread the values all over the
> table.
>
> One problem with this (don't know how severe) is that a hash index on
> a can answer 'select blah from t where a=10' by computing one hash
> value and scaning the list at that point, whereas with an index on
> (a,b), several hash values must be computed (one for each possible
> resulting hash value of b).

The idea is that after computing the hash value you still have to find that value
and
that values are arranged in some way that makes finding consecutive values cheap
(like ordered or tree or matrix (using sparse files)), at least for values with the
same
start or possibly for all other kinds of same parts.

> > > > > indexes.  Has this come up before?  Are there any reasons not to
> > > > > support this?  As far as I understand, specifying an index method is
> > > > > non standard (above and beyond standard) to begin with.
> > > >
> > > > While you are at it could you please comment if the GIST indexes are
> > > > used or
> > > > at least easily usable?
> > >
> > > What are GIST indexes?
> >
> > Some kind of generalised binary tree indexes that should make it easy to
> > define additional indexing strategies.
> >
> > There is a directory access/gist in src/backend, and they are briefly
> > mentioned in the PostgreSQL programmers guide.
> >
> > They seem to be offspring of some independent (of postgres) Berkeley project.
>
> Interesting.  I saw something in Dr Dobbs Journal (the last bastion of
> programming magazines) about trinary trees.  The idea is that there is
> a less than branch for items less than the given node and one for
> those greater and another for those equal.  This can be used to index
> items by using the less than/ greater than links to find the first
> letter, then taking the 'equal' link and looking for the second
> letter, etc.  The problem is that these are not self balancing (like
> red black trees or btrees), but maybe there is a way to make them so.

The information about GIST is at:

http://GiST.CS.Berkeley.EDU:8000/gist/

with more on different indexing and sorting schemes at:

http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/

And there is more interesting reading at the Berkely databese site at :

http://epoch.cs.berkeley.edu:8000/


Hannu