Thread: optimizing queries and indexes...


optimizing queries and indexes...

From
"Robert J. Sanford, Jr."
Date:
i'm fairly new at this whole database design thing and my grasp of set
theory is not what it was when i was in college lo these many years
past. but i want to get a better idea of how to optimize sql
statements.

i have several friends who are DBAs by profession and work on oracle
and/or ms sql server. they have all told me that while there are some
general rules to follow, each database is different.

for example, one thing that one of my friends said is:

    select X
    from big_table
        ,little_table

    Generally speaking, Oracle optimizes better
    when the smaller/reducing tables are on the
    bottom and the larger tables are on the top.
    I believe SQLServer likes them in the opposite
    direction.

and also:

    Generally speaking indexes should be built
    with column names in the order of higher
    cardinality.  I frequently screw this up
    because it runs counter to the way you think
    about building your joins.

    An example might be as follows:
    An index with:
    Company_Id   (distinct count = 4)
    Dept_Id      (distinct count = 40)
    Employee_ID  (distinct count = 1000+)

    This index should probably be in the illogical
    order of:
       Employee_Id
       Dept_Id
       Company_Id

so, i am hopeful that there is some sort of postgresql performance faq
for queries.

also, is there a general rule as to what is considered expensive when
running explain on a query? i had one query running a join and
calculation across two unindexed tables. the initial plan cost about
800, with a majority of the cost being taken up by a sequential scan of
about 5000 rows costing 210 and then a hash join of 225 rows across
the two tables costing 585. after creating indexes, the costs
dropped by about half, to 106 and 299 respectively.

is 800 expensive? is 400 expensive? will the cost go up when the data
balloons to 100,000 rows?

many thanks!

rjsjr



Re: optimizing queries and indexes...

From
"Josh Berkus"
Date:
Robert,

I'd advise you to buy a book, but frankly I don't know a good one on DB
performance optimization.  The DB optimizers I've met tend to guard
their secrets closely.

Suggestions, anyone?

> i have several friends who are DBAs by profession and work on oracle
> and/or ms sql server. they have all told me that while there are some
> general rules to follow, each database is different.

Yup.  They told ya right.

> for example, one thing that one of my friends said is:
>    select X
>    from big_table
>        ,little_table
>
>    Generally speaking, Oracle optimizes better
>    when the smaller/reducing tables are on the
>    bottom and the larger tables are on the top.
>    I believe SQLServer likes them in the opposite
>    direction.

Generally speaking, in PG SQL it pays to leave the join order up to the
optimizer as much as possible.  Tom and Stephan have built a phenomenal
query optimizer, and you are much more likely to slow it down if you
limit its choices.  Also, for the implicit join style in Postgres the
order in which you give tables is largely ignored by the optimizer.
Order only matters in explicit joins.
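
For example (a sketch only; the table and column names are invented):

    -- implicit join: the optimizer is free to pick any join order
    SELECT *
    FROM big_table b, little_table l, other_table o
    WHERE b.little_id = l.id
      AND b.other_id = o.id;

    -- explicit JOIN: the optimizer is constrained to join b and l
    -- first, then o
    SELECT *
    FROM big_table b
        JOIN little_table l ON (b.little_id = l.id)
        JOIN other_table o ON (b.other_id = o.id);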

That being said, there are a few practices that can help:

1. When joining a limited subset of an exceptionally large table to
several smaller tables, consider using a subselect for the large table.
In some cases this will speed up query execution.  Sometimes it won't.
Example:
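(a sketch only; the table and column names here are invented)

    SELECT o.order_id, c.name
    FROM (SELECT order_id, customer_id
          FROM huge_orders
          WHERE order_date >= '2001-09-01') AS o,
         customers c
    WHERE o.customer_id = c.customer_id;

Here the subselect pares huge_orders down to the rows of interest before
the join, rather than joining the whole table and filtering afterwards.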

2. When doing several Inner (normal) Joins and several Outer Joins, do
the inner joins first and the outer joins second, as the inner joins
should limit the result set that is being matched for the outer joins.

3. All joins and where conditions, ideally, should be executed on
indexed columns.

4. EXISTS and NOT EXISTS are almost always faster than IN and NOT IN for
a sub-select in the WHERE clause.

5. Some Postgres-specific tips:
    a. VACUUM ANALYZE regularly!
    b. If you have indexes on tables with a large number of regular
deletions, you will need to drop and re-create the indexes during
periods of inactivity.  This is a Postgres defect that will be fixed in
ver. 7.3 or later.

6. Make sure to establish Foreign Key constraints wherever appropriate.
This will speed up joins on the constrained columns considerably, as the
parser does not have to worry about unmatched rows.

7. Set Unique indexes on all unique columns.  This also helps the
parser.
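
For 6 and 7, that means something like the following (a sketch; the
table, column, and constraint names are invented):

    -- item 6: declare the relationship explicitly
    -- (assumes departments.dept_id is that table's primary key)
    ALTER TABLE employees
        ADD CONSTRAINT employees_dept_fk
        FOREIGN KEY (dept_id) REFERENCES departments (dept_id);

    -- item 7: a unique index on a column that really is unique
    CREATE UNIQUE INDEX departments_name_ux ON departments (name);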

> and also:
>    Generally speaking indexes should be built
>    with column names in the order of higher
>    cardinality.  I frequently screw this up
>    because it runs counter to the way you think
>    about building your joins.

This may be true in Oracle (opinions?) but it is neither true in
Postgres nor true in SQL Server (or Frontbase).  If you build your
indexes according to this rule you will be disappointed in the results.

The real rule is:
    Indexes should always be built according to the manner and order in
which they will be queried.

Example:  Suppose you have a database in which EmployeeIDs are unique
within each Region, but not unique between Regions.  Thus all
joins to Employees join on both the regionID and the employeeID.
Further, you force the user to select a region before s/he can look up
an employee.  In this case, you would want to establish your index as
CREATE UNIQUE INDEX reg_emp_ix ON employees ( regionID, employeeID )
Create all your joins as:
JOIN employees e ON (t.regionID = e.regionID AND t.employeeID =
e.employeeID)
And your WHERE clauses as:
WHERE regionID = $region AND employeeID = $emp
For all of these, order is immensely important.

However, this index, while immensely useful for the above purposes,
would be useless if there are more than 2 regions and you wanted to
select on employeeID only for some reason.  In that case, you would need
to add a second index on employeeID alone, as the regionID, employeeID
index would not be used in a query that filtered only by employeeID and
ignored the regionID.
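
Something like (a sketch, reusing the names from the example above):

    -- a plain (non-unique) index just on employeeID; the index name is arbitrary
    CREATE INDEX emp_ix ON employees ( employeeID );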

One of my clients didn't know any of the above and established a number
of indexes on tables > 3,000,000 records using columns in alphabetical
order (!?!).  They then had to call me, and re-ordering the index
columns cut the delay in single-row queries (especially DELETE queries)
by 80%.

> so, i am hopeful that there is some sort of postgresql performance
> faq
> for queries.

Somebody wanna re-organize the above with more examples?

-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco


Re: optimizing queries and indexes...

From
Stephan Szabo
Date:
On Sun, 9 Sep 2001, Josh Berkus wrote:

> > i have several friends who are DBAs by profession and work on oracle
> > and/or ms sql server. they have all told me that while there are some
> > general rules to follow, each database is different.
> 
> Yup.  They told ya right.
> 
> > for example, one thing that one of my friends said is:
> >    select X
> >    from big_table
> >        ,little_table
> > 
> >    Generally speaking, Oracle optimizes better
> >    when the smaller/reducing tables are on the
> >    bottom and the larger tables are on the top.
> >    I believe SQLServer likes them in the opposite
> >    direction.
> 
> Generally speaking, in PG SQL it pays to leave the join order up to the
> optimizer as much as possible.  Tom and Stephan have built a phenomenal

Just wanted to jump in to say that I've actually had nothing to do with
the optimizer, I just often field questions. :)  

> query optimizer, and you are much more likely to slow it down if you
> limit its choices.  Also, for the implicit join style in Postgres the
> order in which you give tables is largely ignored by the optimizer.
> Order only matters in explicit joins.

This bit above is very important.  If explain (see below) is showing you a
plan you think is suboptimal, using explicit joins to force join order
is sometimes a win.  The downside is that if the situation changes, this
may no longer be an optimization.

> 4. EXISTS and NOT EXISTS are almost always faster than IN and NOT IN for
> a sub-select in the WHERE clause.  

As a note, a little info is given on this in FAQ entry 4.23, with a
simple example of a conversion from IN to EXISTS.
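
The general shape of that rewrite is something like this (a sketch, not
the FAQ's exact example; the table names are invented):

    -- IN form
    SELECT * FROM orders
    WHERE customer_id IN (SELECT customer_id FROM flagged_customers);

    -- equivalent EXISTS form, which the planner usually handles better
    SELECT * FROM orders o
    WHERE EXISTS (SELECT 1 FROM flagged_customers f
                  WHERE f.customer_id = o.customer_id);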

> 
> 5. Some Postgres-specific tips:
>     a. VACUUM ANALYZE regularly!
>     b. If you have indexes on tables with a large number of regular
> deletions, you will need to drop and re-create the indexes during
> periods of inactivity.  This is a Postgres defect that will be fixed in
> ver. 7.3 or later.

For b, I think you may also be able to use REINDEX to do that, but I'm not
sure.
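
i.e. something along the lines of (untested, and "bigtable" is a made-up
table name):

    REINDEX TABLE bigtable;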

> 6. Make sure to establish Foreign Key constraints wherever appropriate.
> This will speed up joins on the constrained columns considerably, as the
> parser does not have to worry about unmatched rows.

I don't actually think that the optimizer takes that into account
(although I could be wrong).  It's possible that it will in the future,
however.  As a note, foreign keys *do* make insert/update on the
fk table and update/delete on the pk table a little bit more
expensive.  It's a balancing act, just like indexes.

> The real rule is:
>     Indexes should always be built according to the manner and order in
> which they will be queried.
> 
> Example:  Suppose you have a database in which EmployeeIDs are unique
> within each Region, but not unique between Regions.  Thus all
> joins to Employees join on both the regionID and the employeeID.
> Further, you force the user to select a region before s/he can look up
> an employee.  In this case, you would want to establish your index as
> CREATE UNIQUE INDEX reg_emp_ix ON employees ( regionID, employeeID )
> Create all your joins as:
> JOIN employees e ON (t.regionID = e.regionID AND t.employeeID =
> e.employeeID)
> And your WHERE clauses as:
> WHERE regionID = $region AND employeeID = $emp
> For all of these, order is immensely important.
> 
> However, this index, while immensely useful for the above purposes,
> would be useless if there are more than 2 regions and you wanted to
> select on employeeID only for some reason.  In that case, you would need
> to add a second index on employeeID alone, as the regionID, employeeID
> index would not be used in a query that filtered only by employeeID and
> ignored the regionID.

The general rule for this, AFAIK, is that for multi-column indexes the
index can only be used up to the first missing column in its order of
definition.  An index on (a,b,c) can be used for a query on a, a & b,
a & b & c, or to a limited extent on a & c (it'll only look up a in the
index, however).
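
For example (a sketch; the table and column names are invented):

    CREATE INDEX abc_ix ON t (a, b, c);

    SELECT * FROM t WHERE a = 1 AND b = 2;   -- can use abc_ix on a & b
    SELECT * FROM t WHERE a = 1 AND c = 3;   -- abc_ix only helps for a
    SELECT * FROM t WHERE b = 2 AND c = 3;   -- abc_ix can't be used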


Additional things (in no particular order):

Explain is your friend.  Whenever you have a query that you expect to do
a lot on your dataset (well, except straight inserts) use explain to get
an idea of what the database thinks is necessary to do the query.  It's
simple: EXPLAIN <query>; and is not immediately understandable, but it's
enough info to start getting help from the mailing lists. :)

Sometimes a sequence scan across the heap will be faster in postgres than
an index scan.  If a reasonable % of rows are going to be returned, the
index scan will actually often be slower.  Sometimes postgres gets it
wrong, but you can test that by experimentation.

Indexes are a double-edged sword.  Too many indexes can be bad, especially
ones that aren't likely to actually be used in queries, since there is a
cost involved on updates and inserts for updating the indexes.

Aggregates like count(*) on a single table are not done via index
information or stored information and will require a sequence scan.  Often
people expect count to be very cheap, but it isn't always.

If you're using UNION but know that there will not be overlapping rows in
the two queries, use UNION ALL instead to prevent needing to do the
duplicate removal checking.

At least on 7.1 and below, if you have a dummy value that is very common
but doesn't really pass any information (like 'N/A' for example), consider
using NULL instead.  The optimizer statistics can often be thrown
off-kilter by values that are much more common than the real data.
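
For instance (a sketch; the table and column names are invented):

    -- EXPLAIN prints the plan and its estimated costs without running the query
    EXPLAIN
    SELECT a.id, b.val
    FROM a, b
    WHERE a.id = b.a_id;

The cost figures it reports are in arbitrary units (roughly comparable to
disk page fetches), so they're mostly useful for comparing plans against
each other rather than as absolute numbers.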




Re: optimizing queries and indexes...

From
"Josh Berkus"
Date:
Stephan,

Ah.  Well, you answer optimization questions so often that I'd assumed
that you had a hand in it.  Is the optimizer all Tom and Bruce's work?

>  At least on 7.1 and below, if you have a dummy value that is very
> common
> but doesn't really pass any information (like 'N/A' for example),
> consider
> using NULL instead.  The optimizer statistics can often be thrown
> off-kilter by values that are much more common than the real data.

Personally, I cannot recommend this.  There are a number of
normalization problems with using NULL instead of 'N/A' or 0 or another
"no" value.  Some database theorists (Fabian Pascal & co.) even propose
the elimination of NULL from the SQL spec on the grounds that it
encourages bad DB design.  As such, I have a hard time recommending any
course that involves adding *more* NULLs to the database, especially for
a marginal query performance gain.

To phrase it another way:  Optimization problems cost you seconds.  DB
Design and normalization problems cost you *days*.

-Josh Berkus


______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco


Re: optimizing queries and indexes...

From
Stephan Szabo
Date:
On Sun, 9 Sep 2001, Josh Berkus wrote:

> Ah.  Well, you answer optimization questions so often that I'd assumed
> that you had a hand in it.  Is the optimizer all Tom and Bruce's work?

Mostly Tom I believe.

> >  At least on 7.1 and below, if you have a dummy value that is very
> > common
> > but doesn't really pass any information (like 'N/A' for example),
> > consider
> > using NULL instead.  The optimizer statistics can often be thrown
> > off-kilter by values that are much more common than the real data.
> 
> Personally, I cannot recommend this.  There are a number of
> normalization problems with using NULL instead of 'N/A' or 0 or another
> "no" value.  Some database theorists (Fabian Pascal & co.) even propose
> the elimination of NULL from the SQL spec on the grounds that it
> encourages bad DB design.  As such, I have a hard time recommending any
> course that involves adding *more* NULLs to the database, especially for
> a marginal query performance gain.

NULLs are rather ugly, but the difference is often index scan vs sequence
scan, and on billion row tables that starts turning into those "I run this
query and it never comes back" messages.  This should become much less of
an issue in 7.2, however, with Tom's enhancements to the statistics that
are generated.




Re: optimizing queries and indexes...

From
Tom Lane
Date:
[ Sorry for slow response, I've been out of town ]

"Robert J. Sanford, Jr." <rsanford@nolimitsystems.com> writes:
> [ some questions already ably answered by Josh and Stephan ]

I just wanted to throw in one more tidbit:

> for example, one thing that one of my friends said is:
>    select X
>    from big_table
>        ,little_table
>    Generally speaking, Oracle optimizes better
>    when the smaller/reducing tables are on the
>    bottom and the larger tables are on the top.
>    I believe SQLServer likes them in the opposite
>    direction.

Postgres absolutely does not care: the optimizer will always consider
both A-join-B and B-join-A orders for every join it has to do.  As
Stephan and Josh noted, you can constrain the join pairs the optimizer
will consider if you use explicit-JOIN syntax --- but each pair will be
considered in both directions.
        regards, tom lane


Re: optimizing queries and indexes...

From
"Josh Berkus"
Date:
Tom,

> [ Sorry for slow response, I've been out of town ]

Taking a much-deserved vacation, hey?  Any new job plans?

> Postgres absolutely does not care: the optimizer will always consider
> both A-join-B and B-join-A orders for every join it has to do.  As
> Stephan and Josh noted, you can constrain the join pairs the
> optimizer
> will consider if you use explicit-JOIN syntax --- but each pair will
> be
> considered in both directions.

Fantastic!  You may want to point out to unbelievers that MS SQL Server
does not do this; if you fail to put your joins/where clauses in the
*exact* order of the indexes in SQL Server, it ignores them and does a
table scan.  This is especially deadly because table scans are about 1/2
as fast in SQL Server as they are in Postgres.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco
