Обсуждение: Make the qual cost on index Filter slightly higher than qual cost onindex Cond.

Поиск

Список

Период

Сортировка

Make the qual cost on index Filter slightly higher than qual cost onindex Cond.

От

Andy Fan

Дата:

26 мая 2020 г., 08:22:01

Consider the below example:

create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 10000000)i;
create index j1_i_im5 on j1(i, im5);
create index j1_i_im100 on j1(i, im100);
analyze j1;
explain select * from j1 where i = 100 and im5 = 5;

We may get the plan like this:

demo=# explain select * from j1 where i = 100 and im5 = 1;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using j1_i_im100 on j1 (cost=0.43..8.46 rows=1 width=16)
Index Cond: (i = 100)
Filter: (im5 = 1)
(3 rows)

At this case, optimizer can estimate there are only 1 row to return, so both
indexes have same cost, which one will be choose is un-controlable. This is
fine for above query based on the estimation is accurate. However estimation
can't be always accurate in real life. Some inaccurate estimation can cause an

wrong index choose. As an experience, j1_i_im5 index should always be choose

for above query.

This one line change is the best method I can think.

- cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+ cpu_per_tuple = cpu_tuple_cost + (qpqual_cost.per_tuple * 1.001);

We make the qual cost on index filter is slightly higher than qual cost in Index
Cond. This will also good for QUAL (i=x AND m=y AND n=z). Index are (i, m,
other_col1) and (i, other_col1, other_col2). But this change also
changed the relation between the qual cost on index scan and qual cost on seq
scan. However I think that impact is so tiny that I think we can ignore that (we

can choose a better factor between 1 and 1.001).

Even the root cause of this issue comes from an inaccurate estimation. but I
don't think that is an issue easy/possible to fix, however I'm open for
suggestion on that as well.

Any suggestions?

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Ashutosh Bapat

Дата:

26 мая 2020 г., 13:59:39

On Tue, May 26, 2020 at 1:52 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> Consider the below example:
>
> create table j1(i int, im5 int,  im100 int, im1000 int);
> insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 10000000)i;
> create index j1_i_im5 on j1(i, im5);
> create index j1_i_im100 on j1(i, im100);
> analyze j1;
> explain select * from j1 where i = 100 and im5 = 5;
>
> We may get the plan like this:
>
> demo=# explain select  * from  j1 where i = 100 and im5 = 1;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Index Scan using j1_i_im100 on j1  (cost=0.43..8.46 rows=1 width=16)
>    Index Cond: (i = 100)
>    Filter: (im5 = 1)
> (3 rows)
>
> At this case, optimizer can estimate there are only 1 row to return, so both
> indexes have same cost, which one will be choose is un-controlable. This is
> fine for above query based on the estimation is accurate. However estimation
> can't be always accurate in real life. Some inaccurate estimation can cause an
> wrong index choose. As an experience, j1_i_im5 index should always be choose
> for above query.

I think we need a better example where choosing an index makes a difference.

An index can be chosen just because it's path was created before some
other more appropriate index but the cost difference was within fuzzy
limit. Purely based on the order in which index paths are created.

>
> This one line change is the best method I can think.
>
> -       cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
> +      cpu_per_tuple = cpu_tuple_cost + (qpqual_cost.per_tuple * 1.001);
>
> We make the qual cost on index filter is slightly higher than qual cost in Index
> Cond. This will also good for QUAL (i=x AND m=y AND n=z). Index are (i, m,
> other_col1) and (i, other_col1, other_col2).  But this change also
> changed the relation between the qual cost on index scan and qual cost on seq
> scan. However I think that impact is so tiny that I think we can ignore that (we
> can choose a better factor between 1 and 1.001).
>
> Even the root cause of this issue comes from an inaccurate estimation. but I
> don't think that is an issue easy/possible to fix, however I'm open for
> suggestion on that as well.
>
> Any suggestions?
>
> --
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

26 мая 2020 г., 22:49:40

On Tue, May 26, 2020 at 9:59 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Tue, May 26, 2020 at 1:52 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> Consider the below example:
>
> create table j1(i int, im5 int, im100 int, im1000 int);
> insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 10000000)i;
> create index j1_i_im5 on j1(i, im5);
> create index j1_i_im100 on j1(i, im100);
> analyze j1;
> explain select * from j1 where i = 100 and im5 = 5;
>
> We may get the plan like this:
>
> demo=# explain select * from j1 where i = 100 and im5 = 1;
> QUERY PLAN
> ----------------------------------------------------------------------
> Index Scan using j1_i_im100 on j1 (cost=0.43..8.46 rows=1 width=16)
> Index Cond: (i = 100)
> Filter: (im5 = 1)
> (3 rows)
>
> At this case, optimizer can estimate there are only 1 row to return, so both
> indexes have same cost, which one will be choose is un-controlable. This is
> fine for above query based on the estimation is accurate. However estimation
> can't be always accurate in real life. Some inaccurate estimation can cause an
> wrong index choose. As an experience, j1_i_im5 index should always be choose
> for above query.

I think we need a better example where choosing an index makes a difference.

An index can be chosen just because it's path was created before some
other more appropriate index but the cost difference was within fuzzy
limit. Purely based on the order in which index paths are created.

Here is an further example with the above case:

demo=# insert into j1 select 1, 1, 1, 1 from generate_series(1, 100000)i;

INSERT 0 100000

With the current implementation, it is

demo=# explain analyze select * from j1 where i = 1 and im5 = 2;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using j1_i_im100 on j1 (cost=0.43..8.44 rows=1 width=16) (actual time=63.431..63.431 rows=0 loops=1)
Index Cond: (i = 1)
Filter: (im5 = 2)
Rows Removed by Filter: 100001
Planning Time: 0.183 ms
Execution Time: 63.484 ms
(6 rows)

With the patch above, it can always choose a correct index even the statistics is inaccurate:

demo=# explain analyze select * from j1 where i = 1 and im5 = 2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Index Scan using j1_i_im5 on j1 (cost=0.43..8.46 rows=1 width=16) (actual time=0.030..0.030 rows=0 loops=1)
Index Cond: ((i = 1) AND (im5 = 2))
Planning Time: 1.087 ms
Execution Time: 0.077 ms
(4 rows)

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

26 мая 2020 г., 23:12:52

You can use the attached sql to reproduce this issue, but I'm not sure you can

get the above result at the first time that is because when optimizer think the

2 index scan have the same cost, it will choose the first one it found, the order

depends on RelationGetIndexList. If so, you may try drop and create j1_i_im5 index.

The sense behind this patch is we still use the cost based optimizer, just when we

we find out the 2 index scans have the same cost, we prefer to use the index which

have more qual filter on Index Cond. This is implemented by adjust the qual cost

on index filter slightly higher.

The issue here is not so uncommon in real life. consider a log based application, which

has serval indexes on with create_date as a leading column, when the create_date

first load the for the given day but before the new statistics is gathered, that probably run

into this issue.

Best Regards

Andy Fan

Вложения

index_choose.sql

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Ashutosh Bapat

Дата:

27 мая 2020 г., 12:00:55

On Wed, 27 May 2020 at 04:43, Andy Fan <zhihui.fan1213@gmail.com> wrote:

You can use the attached sql to reproduce this issue, but I'm not sure you can
get the above result at the first time that is because when optimizer think the
2 index scan have the same cost, it will choose the first one it found, the order
depends on RelationGetIndexList. If so, you may try drop and create j1_i_im5 index.

The sense behind this patch is we still use the cost based optimizer, just when we
we find out the 2 index scans have the same cost, we prefer to use the index which
have more qual filter on Index Cond. This is implemented by adjust the qual cost
on index filter slightly higher.

Thanks for the example and the explanation.

The execution time difference in your example is pretty high to account for executing the filter on so many rows. My guess is this has to do with the heap access. For applying the filter the entire row needs to be fetched from the heap. So we should investigate this case from that angle. Another guess I have is the statistics is not correct and hence the cost is wrong.

Best Wishes,

Ashutosh

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

27 мая 2020 г., 13:58:04

On Wed, May 27, 2020 at 8:01 PM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote:

On Wed, 27 May 2020 at 04:43, Andy Fan <zhihui.fan1213@gmail.com> wrote:
You can use the attached sql to reproduce this issue, but I'm not sure you can
get the above result at the first time that is because when optimizer think the
2 index scan have the same cost, it will choose the first one it found, the order
depends on RelationGetIndexList. If so, you may try drop and create j1_i_im5 index.

The sense behind this patch is we still use the cost based optimizer, just when we
we find out the 2 index scans have the same cost, we prefer to use the index which
have more qual filter on Index Cond. This is implemented by adjust the qual cost
on index filter slightly higher.

Thanks for the example and the explanation.

The execution time difference in your example is pretty high to account for executing the filter on so many rows. My guess is this has to do with the heap access. For applying the filter the entire row needs to be fetched from the heap. So we should investigate this case from that angle. Another guess I have is the statistics is not correct and hence the cost is wrong.

I believe this is a statistics issue and then the cost is wrong. More characters of this

issue are: 1). If a data is out of range in the old statistics, optimizer will given an 1 row

assumption. 2). based on the 1 row assumption, for query "col1=out_of_range_val AND

col2 = any_value" Index (col1, col2) and (col1, col3) will have exactly same cost for current

cost model. 3). If the statistics was wrong, (col1, col3) maybe a very bad plan as shown

above, but index (col1, col2) should always better/no worse than (col1, col3) in any case.

4). To expand the rule, for query "col1 = out_of_range_val AND col2 = any_value AND col3 = any_val",

index are (col1, col2, col_m) and (col1, col_m, col_n), the former index will aways has better/no worse

than the later one. 5). an statistics issue like this is not uncommon, for example

an log based application, creation_date is very easy to out of range in statistics.

so we need to optimize the cost model for such case, the method is the patch I mentioned above.

I can't have a solid data to prove oracle did something similar, but based on the talk with my

customer, oracle is likely did something like this.

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qualcost on index Cond.

От

Tomas Vondra

Дата:

29 мая 2020 г., 00:16:02

On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:
>On Wed, May 27, 2020 at 8:01 PM Ashutosh Bapat <
>ashutosh.bapat@2ndquadrant.com> wrote:
>
>>
>>
>> On Wed, 27 May 2020 at 04:43, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>>> You can use the attached sql to reproduce this issue, but I'm not sure
>>> you can
>>> get the above result at the first time that is because when optimizer
>>> think the
>>> 2 index scan have the same cost, it will choose the first one it found,
>>> the order
>>> depends on RelationGetIndexList.  If so,  you may try drop and create
>>> j1_i_im5 index.
>>>
>>> The sense behind this patch is we still use the cost based optimizer,
>>> just when we
>>> we find out the 2 index scans have the same cost,  we prefer to use the
>>> index which
>>> have more qual filter on Index Cond.  This is implemented by adjust the
>>> qual cost
>>> on index filter slightly higher.
>>>
>>
>> Thanks for the example and the explanation.
>>
>> The execution time difference in your example is pretty high to account
>> for executing the filter on so many rows. My guess is this has to do with
>> the heap access. For applying the filter the entire row needs to be fetched
>> from the heap. So we should investigate this case from that angle. Another
>> guess I have is the statistics is not correct and hence the cost is wrong.
>>
>>
>I believe this is a statistics issue and then the cost is wrong.

I think you're both right. Most of the time probably comes from the
heap accesses, but the dabatabase has no chance to account for that
as there was no analyze after inseting the data causing that. So it's
very difficult to account for this when computing the cost.

>More characters of this issue are:  1).  If a data is out of range in
>the old statistics, optimizer will given an 1 row assumption.  2).
>based on the 1 row assumption,  for query "col1=out_of_range_val AND
>col2 = any_value"   Index (col1, col2) and (col1, col3) will have
>exactly same cost for current cost model. 3).  If the statistics was
>wrong, (col1, col3) maybe a very bad plan as shown above, but index
>(col1, col2) should  always better/no worse than (col1, col3) in any
>case.  4). To expand the rule, for query "col1 = out_of_range_val AND
>col2 = any_value AND col3 = any_val", index are (col1, col2, col_m) and
>(col1, col_m, col_n),  the former index will aways has better/no worse
>than the later one.  5). an statistics issue like this is not
>uncommon, for example an log based application, creation_date is very
>easy to out of range in statistics.
>

Right. There are many ways to cause issues like this.

>so we need to optimize the cost model for such case, the method is the
>patch I mentioned above.

Making the planner more robust w.r.t. to estimation errors is nice, but
I wouldn't go as far saying we should optimize for such cases. The stats
can be arbitrarily off, so should we expect the error to be 10%, 100% or
1000000%? We'd probably end up with plans that handle worst cases well,
but the average performance would end up being way worse :-(

Anyway, I kinda doubt making the conditions 1.001 more expensive is a
way to make the planning more robust. I'm pretty sure we could construct
examples in the opposite direction, in which case this change make it
more likely we use the wrong index.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

От

Tom Lane

Дата:

29 мая 2020 г., 01:04:06

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:
>> so we need to optimize the cost model for such case, the method is the
>> patch I mentioned above.

> Making the planner more robust w.r.t. to estimation errors is nice, but
> I wouldn't go as far saying we should optimize for such cases.

Yeah, it's a serious mistake to try to "optimize" for cases where we have
no data or wrong data.  By definition, we don't know what we're doing,
so who's to say whether we've made it better or worse?  And the possible
side effects on cases where we do have good data are not to be ignored.

> Anyway, I kinda doubt making the conditions 1.001 more expensive is a
> way to make the planning more robust. I'm pretty sure we could construct
> examples in the opposite direction, in which case this change make it
> more likely we use the wrong index.

The other serious error we could be making here is to change things on
the basis of just a few examples.  You really need a pretty wide range
of test cases to be sure that you're not making things worse, any time
you're twiddling basic parameters like these.

            regards, tom lane

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

29 мая 2020 г., 01:10:40

>so we need to optimize the cost model for such case, the method is the
>patch I mentioned above.

Making the planner more robust w.r.t. to estimation errors is nice, but
I wouldn't go as far saying we should optimize for such cases. The stats
can be arbitrarily off, so should we expect the error to be 10%, 100% or
1000000%?

I don't think my patch relay on anything like that. My patch doesn't fix the

statistics issue, just adding the extra cost on qual cost on Index Filter part.

Assume the query pattern are where col1= X and col2 = Y. The impacts are :

1). Make the cost of (col1, other_column) is higher than (col1, col2)

2). The relationship between seqscan and index scan on index (col1, other_column)

is changed, (this is something I don't want). However my cost difference between

index scan & seq scan usually very huge, so the change above should has

nearly no impact on that choice. 3). Make the cost higher index scan for

Index (col1) only. Overall I think nothing will make thing worse.

We'd probably end up with plans that handle worst cases well,
but the average performance would end up being way worse :-(

That's possible, that's why I hope to get some feedback on that. Actually I

can't think out such case. can you have anything like that in mind?

----

I'm feeling that (qpqual_cost.per_tuple * 1.001) is not good enough since user

may have some where expensive_func(col1) = X. we may change it

cpu_tuple_cost + qpqual_cost.per_tuple + (0.0001) * list_lenght(qpquals).

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

29 мая 2020 г., 01:20:37

Thanks all of you for your feedback.

On Fri, May 29, 2020 at 9:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, May 27, 2020 at 09:58:04PM +0800, Andy Fan wrote:
>> so we need to optimize the cost model for such case, the method is the
>> patch I mentioned above.

> Making the planner more robust w.r.t. to estimation errors is nice, but
> I wouldn't go as far saying we should optimize for such cases.

Yeah, it's a serious mistake to try to "optimize" for cases where we have
no data or wrong data. By definition, we don't know what we're doing,
so who's to say whether we've made it better or worse?

Actually I think it is a more robust way.. the patch can't fix think all the impact

of bad statistics(That is impossible I think), but it will make some simple things

better and make others no worse. By definition I think I know what we are doing

here, like what I replied to Tomas above. But it is possible my think is wrong.

The other serious error we could be making here is to change things on
the basis of just a few examples. You really need a pretty wide range
of test cases to be sure that you're not making things worse, any time
you're twiddling basic parameters like these.

I will try more thing with this direction, thanks for suggestion.

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Ashutosh Bapat

Дата:

29 мая 2020 г., 13:37:03

On Fri, May 29, 2020 at 6:40 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>>
>> >so we need to optimize the cost model for such case, the method is the
>> >patch I mentioned above.
>>
>> Making the planner more robust w.r.t. to estimation errors is nice, but
>> I wouldn't go as far saying we should optimize for such cases. The stats
>> can be arbitrarily off, so should we expect the error to be 10%, 100% or
>> 1000000%?
>
>
> I don't think my patch relay on anything like that.   My patch doesn't fix the
> statistics issue,  just adding the extra cost on qual cost on Index Filter part.
> Assume the query pattern are where col1= X and col2 = Y. The impacts are :
> 1).  Make the cost of (col1, other_column) is higher than (col1, col2)
> 2). The relationship between seqscan and index scan on index (col1, other_column)
> is changed, (this is something I don't want).  However my cost difference between
> index scan & seq scan usually very huge, so the change above should has
> nearly no impact on that choice.   3). Make the cost higher index scan for
> Index (col1) only.  Overall I think nothing will make thing worse.

When the statistics is almost correct (or better than what you have in
your example), the index which does not cover all the columns in all
the conditions will be expensive anyways because of extra cost to
access heap for the extra rows not filtered by that index. An index
covering all the conditions would have its scan cost cheaper since
there will be fewer rows and hence fewer heap page accesses because of
more filtering. So I don't think we need any change in the current
costing model.

>
>>
>> We'd probably end up with plans that handle worst cases well,
>> but the average performance would end up being way worse :-(
>>
>
> That's possible,  that's why I hope to get some feedback on that.  Actually I
> can't think out such case.   can you have anything like that in mind?
>
> ----
> I'm feeling that (qpqual_cost.per_tuple * 1.001) is not good enough since user
> may have some where expensive_func(col1) = X.   we may change it
> cpu_tuple_cost + qpqual_cost.per_tuple  + (0.0001) * list_lenght(qpquals).
>
> --
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

29 мая 2020 г., 13:58:49

On Fri, May 29, 2020 at 9:37 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Fri, May 29, 2020 at 6:40 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>>
>> >so we need to optimize the cost model for such case, the method is the
>> >patch I mentioned above.
>>
>> Making the planner more robust w.r.t. to estimation errors is nice, but
>> I wouldn't go as far saying we should optimize for such cases. The stats
>> can be arbitrarily off, so should we expect the error to be 10%, 100% or
>> 1000000%?
>
>
> I don't think my patch relay on anything like that. My patch doesn't fix the
> statistics issue, just adding the extra cost on qual cost on Index Filter part.
> Assume the query pattern are where col1= X and col2 = Y. The impacts are :
> 1). Make the cost of (col1, other_column) is higher than (col1, col2)
> 2). The relationship between seqscan and index scan on index (col1, other_column)
> is changed, (this is something I don't want). However my cost difference between
> index scan & seq scan usually very huge, so the change above should has
> nearly no impact on that choice. 3). Make the cost higher index scan for
> Index (col1) only. Overall I think nothing will make thing worse.

When the statistics is almost correct (or better than what you have in
your example), the index which does not cover all the columns in all
the conditions will be expensive anyways because of extra cost to
access heap for the extra rows not filtered by that index. An index
covering all the conditions would have its scan cost cheaper since
there will be fewer rows and hence fewer heap page accesses because of
more filtering. So I don't think we need any change in the current

costing model.

Thank you for your reply. Looks you comments is based on the statistics

is almost correct (or better than what I have in my example), That is true.

However my goal is to figure out a way which can generate better plan even

the statistics is not correct (the statistics with such issue is not very uncommon,

I just run into one such case and spend 1 week to handle some non-technology

stuff after that). I think the current issue is even my patch can make the worst case

better, we need to make sure the average performance not worse.

Best Regards

Andy Fan

Re: Make the qual cost on index Filter slightly higher than qual coston index Cond.

От

Andy Fan

Дата:

29 мая 2020 г., 14:11:52

The other serious error we could be making here is to change things on
the basis of just a few examples. You really need a pretty wide range
of test cases to be sure that you're not making things worse, any time
you're twiddling basic parameters like these.

I will try more thing with this direction, thanks for suggestion.

I choose TPC-H for this purpose and the data and index setup based on [1],

the attached normal.log is the plan without this patch, and patched.log is the

plan with the patch. In general, the best path doesn't change due to this patch,

All the plans whose cost changed has the following patten, which is expected.

Index Scan ...

Index Cond: ...

Filter: ...

If you diff the two file, you may find the cost of "Index Scan" doesn't change,

that is mainly because it only show 2 digits in cost, which is not accurate enough

to show the difference. However with a nest loop, the overall plan shows the cost

difference.

[1] https://ankane.org/tpc-h

Best Regards

Andy Fan

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Make the qual cost on index Filter slightly higher than qual cost onindex Cond.

Вложения

Вложения