Re: PoC: Partial sort

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: PoC: Partial sort
Дата
Msg-id CAPpHfds7j4_=aAQixSDbWyim1588kboB8LOB8wj+Ei0RbcvozQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PoC: Partial sort  (Marti Raudsepp <marti@juffo.org>)
Ответы Re: PoC: Partial sort  (Marti Raudsepp <marti@juffo.org>)
Список pgsql-hackers
On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, sounds a little steep.  Why is it so expensive?  I'm probably
> missing something here, because I would have thought that planner
> support for partial sorts would consist mostly of considering the same
> sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
 * Avoid rebuilding clause list if we already made one;
 * saves memory in big join trees...
 */

This is not only place that worry me about planning overhead. See get_cheapest_fractional_path_for_pathkeys. I had to estimate number of groups for each sorting column in order to get right fractional path. For partial sort path, cost of first batch should be included into initial cost. 
If don't do so, optimizer can pick up strange plans basing on assumption that it need only few rows from inner node. See an example.

create table test1 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id, 
(random()*100)::int as v1, 
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=198956893.20..198956913.33 rows=10 width=24)
   ->  Partial sort  (cost=198956893.20..19909637942.82 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Nested Loop  (cost=0.42..19883065506.84 rows=9791031169 width=24)
               Join Filter: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
                     ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=3699913.43..3699913.60 rows=10 width=24)
   ->  Partial sort  (cost=3699913.43..173638549.67 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Merge Join  (cost=150444.79..147066113.70 rows=9791031169 width=24)
               Merge Cond: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=149244.84..154244.84 rows=1000000 width=12)
                     ->  Sort  (cost=149244.84..151744.84 rows=1000000 width=12)
                           Sort Key: t2.v1
                           ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait until first plan execution ends up :-)
But anyway costs are extraordinary and inner sequential scan of 1000000 rows is odd.

------
With best regards,
Alexander Korotkov. 

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: specifying repeatable read in PGOPTIONS
Следующее
От: Tom Lane
Дата:
Сообщение: Re: specifying repeatable read in PGOPTIONS