Re: Select in subselect vs select = any array

Поиск
Список
Период
Сортировка
От	Adam Tistler
Тема	Re: Select in subselect vs select = any array
Дата	21 марта 2011 г. 03:17:13
Msg-id	035B29F0-1441-4FC7-8B24-FAC8DB745C89@gmail.com обсуждение исходный текст
Ответ на	Re: Select in subselect vs select = any array (Pavel Stehule <pavel.stehule@gmail.com>)
Ответы	Re: Select in subselect vs select = any array Re: Select in subselect vs select = any array
Список	pgsql-performance
Дерево обсуждения
Pavel, thanks for the help.

I increased work_mem from 16MB to 64MB, no difference.  The queries are really just a test case.  My actual queries are
actualjust large number of primary keys that I am selecting from the db: 

For example:
   select * from nodes where node_id in ( 1, 2, 3 ..... )

I found that even for small queries, the following is faster:
   select * from nodes where node_in = any (array[1,2,3 .... ])


Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if
perhapsthis is an issue that I should point out to postgres-dev. 


Results below:

logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 );
                                                               QUERY PLAN
                

----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1)
   ->  Nested Loop  (cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1)
         ->  HashAggregate  (cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000 loops=1)
               ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1)
                     ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428
rows=100000loops=1) 
         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006
rows=1loops=100000) 
               Index Cond: (public.nodes.node_id = public.nodes.node_id)
 Total runtime: 1017.794 ms
(8 rows)

logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit
100000)); 
                                                               QUERY PLAN
                 

-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1)
           ->  Seq Scan on nodes  (cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000
loops=1)
   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000
loops=1)
         Recheck Cond: (node_id = ANY ($0))
         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810
rows=100000loops=1) 
               Index Cond: (node_id = ANY ($0))
 Total runtime: 485.638 ms
(9 rows)

On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote:

> Hello
>
> I think so HashAggregate goes out of memory - you can try to increase
> a work_mem.
>
> There are better queries for counting duplicit then cross join
>
> Regards
>
> Pavel Stehule
>
> 2011/3/21 Adam Tistler <atistler@gmail.com>:
>> logicops2=# explain analyze select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit
100000)); 
>>                                                               QUERY PLAN
>>
-----------------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>>   InitPlan 1 (returns $0)
>>     ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>>           ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000
loops=1)
>>   ->  Bitmap Heap Scan on nodes  (cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000
loops=1)
>>         Recheck Cond: (node_id = ANY ($0))
>>         ->  Bitmap Index Scan on n_node_id_index  (cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074
rows=100000loops=1) 
>>               Index Cond: (node_id = ANY ($0))
>>  Total runtime: 509.209 ms
>> (9 rows)
>>
>> Time: 510.009 ms
>>
>>
>> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>>                                                               QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>>   ->  Nested Loop  (cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>>         ->  HashAggregate  (cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000
loops=1)
>>               ->  Limit  (cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>>                     ->  Seq Scan on nodes  (cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566
rows=100000loops=1) 
>>         ->  Index Scan using n_node_id_index on nodes  (cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007
rows=1loops=100000) 
>>               Index Cond: (public.nodes.node_id = public.nodes.node_id)
>>  Total runtime: 1053.523 ms
>> (8 rows)
>>
>> Time: 1054.864 ms
>>
>>
>>
>> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:
>>
>>> Hello
>>>
>>> 2011/3/20 Adam Tistler <atistler@gmail.com>:
>>>> I have noticed that SELECT ... = ANY(ARRAY(...))  is about twice as fast as SELECT IN ( ... ).
>>>> Can anyone explain a reason for this?  Results are the bottom and are reproducible.  I can test with other
versionsif that is necessary. 
>>>>
>>>
>>> send a result of EXPLAIN ANALYZE SELECT ..., please
>>>
>>> The reasons can be different - less seq scans, indexes
>>>
>>> Regards
>>>
>>> Pavel Stehule
>>>
>>>
>>>
>>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>>>> CentOS release 5.4 (Final)
>>>> psql (PostgreSQL) 8.4.1
>>>>
>>>> prompt2=# select count(*) from nodes;
>>>>  count
>>>> --------
>>>>  754734
>>>> (1 row)
>>>>
>>>>
>>>> prompt2=# \d nodes
>>>>                                        Table "public.nodes"
>>>>    Column    |           Type           |                         Modifiers
>>>> --------------+--------------------------+-----------------------------------------------------------
>>>>  node_id      | integer                  | not null default nextval(('node_id_seq'::text)::regclass)
>>>>  node_type_id | integer                  | not null
>>>>  template_id  | integer                  | not null
>>>>  timestamp    | timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>>>> Indexes:
>>>>    "nodes_pkey" PRIMARY KEY, btree (node_id)
>>>>    "n_node_id_index" btree (node_id)
>>>>    "n_node_type_id_index" btree (node_type_id)
>>>>    "n_template_id_index" btree (template_id)
>>>>
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 404.530 ms
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 407.316 ms
>>>> prompt2=# select count(*) from nodes where node_id = any(  Array(select node_id from nodes limit 100000) );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 408.728 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 793.840 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 779.137 ms
>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>  count
>>>> --------
>>>>  100000
>>>> (1 row)
>>>>
>>>> Time: 781.820 ms
>>>>
>>>>
>>>> --
>>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>
>>
>>
В списке pgsql-performance по дате отправления:
Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Select in subselect vs select = any array