join to view over custom aggregate seems like it should be faster

Поиск
Список
Период
Сортировка
От Merlin Moncure
Тема join to view over custom aggregate seems like it should be faster
Дата
Msg-id b42b73150704091205o668a084y1693520c974f3c6b@mail.gmail.com
обсуждение исходный текст
Ответы Re: join to view over custom aggregate seems like it should be faster  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-performance
I have an odd performance issue on 8.2 that I'd thought I'd document
here.  I have a workaround, but I'm if there is something that I'm not
seeing.

ok, for starters:
I have a large table that is basically organized like this:
create table big
(
  key1 int,
  key2 int,
  ts timestamp
  [other fields]
);

and a view most_recent_big which lists for each combination of key1
and key2, the '[other fields]' that are behind the highest (most
recent) timestamp.  The original view implementation involved a self
join which is the classic sql approach to pulling values from a
denormalized table (the real solution of course is to normalize the
data but I can't do that for various reasons).  This wasn't very fast,
so I wrote a custom aggregate to optimize the view (there are usuallly
very small #s of records for key1, key2 pair:

create view latest_big_view as
  select key1, key2, max_other_fields[other fields]
  from big
  group by key1, key2;

This worked very well, but sometimes the index on key1, key2 does not
get utilized when joining against latest_big_view.  Let's say I have a
number of key1, key2 pairs in another table:

for example:
select * from foo, latest_big_view using (key1, key2);
breaks down.

here is a example of the 'breakdown' plan on real tables. selecting a
single record from the view is very fast...1ms or less.  The join
can't 'see through' the view to filter the index.

dev20400=# explain analyze select * from foo join latest_download
using (host_id, software_binary_id);

     QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=15.35..4616.65 rows=25 width=90) (actual
time=229.623..10601.317 rows=494 loops=1)
   Hash Cond: ((latest_download.host_id = foo.host_id) AND
(latest_download.software_binary_id = foo.software_binary_id))
   ->  GroupAggregate  (cost=0.00..4499.01 rows=4535 width=94) (actual
time=0.346..10370.383 rows=37247 loops=1)
         ->  Index Scan using software_download_idx on
software_download  (cost=0.00..2526.53 rows=45342 width=94) (actual
time=0.028..344.591
         SubPlan
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.010 rows=1 loops=37247)
   ->  Hash  (cost=7.94..7.94 rows=494 width=8) (actual
time=5.568..5.568 rows=494 loops=1)
         ->  Seq Scan on foo  (cost=0.00..7.94 rows=494 width=8)
(actual time=0.018..2.686 rows=494 loops=1)
 Total runtime: 10604.260 ms
(18 rows)


Here is the same query but on the root table, instead of the view:
dev20400=# explain analyze select * from foo join software_download
using (host_id, software_binary_id);
                                                                    QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1521.60 rows=19 width=94) (actual
time=0.084..24.992 rows=607 loops=1)
   ->  Seq Scan on foo  (cost=0.00..7.94 rows=494 width=8) (actual
time=0.044..2.753 rows=494 loops=1)
   ->  Index Scan using software_download_idx on software_download
(cost=0.00..3.05 rows=1 width=94) (actual time=0.011..0.019 rows=1
loops=49
         Index Cond: ((foo.host_id = software_download.host_id) AND
(foo.software_binary_id = software_download.software_binary_id))
 Total runtime: 28.385 ms
(5 rows)

I can use a trick with a function to make the view give out reasonalbe results:

create function foo(int, int) returns latest_download as
$$ select * from latest_download where software_binary_id = $1 and
host_id = $2; $$ language sql;

dev20400=# explain analyze select (v).* from (select
foo(software_binary_id, host_id) as v from foo) q;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Subquery Scan q  (cost=0.00..14.12 rows=494 width=32) (actual
time=1.436..139.644 rows=494 loops=1)
   ->  Seq Scan on foo  (cost=0.00..9.18 rows=494 width=8) (actual
time=1.414..131.144 rows=494 loops=1)
 Total runtime: 142.887 ms
(3 rows)

Time: 144.306 ms

merlin

В списке pgsql-performance по дате отправления:

Предыдущее
От: "Alex Deucher"
Дата:
Сообщение: Re: postgres 8.2 seems to prefer Seq Scan
Следующее
От: "Alex Deucher"
Дата:
Сообщение: Re: postgres 8.2 seems to prefer Seq Scan