Обсуждение: query planner woes
hi all,
I've constructed a simple query which takes too long
to finish. EXPLAIN ANALYZE ( all data shown below )
reports that at one stage there are >40K rows ( on a toy database,
>2M on the real one), and while I undertstand that this is the
most probable cause of the slowness, I cannot think of any change to
the query to aviod this. Any help?
Thanks,
Dmitry
mts=# \d addresses
Table "public.addresses"
Column | Type |
Modifiers
---------+--------------------------+-----------------------------------------------------------
id | integer | not null default
nextval('public.addresses_id_seq'::text)
address | text | not null
Indexes:
"addresses_pkey" primary key, btree (id)
"addresses_idx_address" btree (address)
mts=# \d recipients
Table "public.recipients"
Column | Type |
Modifiers
---------+--------------------------+------------------------------------------------------------
id | integer | not null default
nextval('public.recipients_id_seq'::text)
msgto | integer | not null
Indexes:
"recipients_pkey" primary key, btree (id)
"recipients_idx_msgto" btree (msgto)
mts=# explain analyze select *
mts-# from recipients,addresses
mts-# where addresses.address ~ '@cat' and
mts-# recipients.msgto = addresses.id;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=535.70..1603.96 rows=352 width=63)
(actual time=88.387..422.922 rows=576 loops=1)
Hash Cond: ("outer".msgto = "inner".id)
-> Seq Scan on recipients (cost=0.00..737.42 rows=43642 width=24)
(actual time=0.012..178.258 rows=43642
loops=1)
^^^^^
-> Hash (cost=535.34..535.34 rows=147 width=39)
(actual time=87.866..87.866 rows=0 loops=1)
-> Seq Scan on addresses (cost=0.00..535.34 rows=147 width=39)
(actual time=0.958..87.806 rows=3 loops=1)
Filter: (address ~ '@cat'::text)
Total runtime: 425.061 ms
(7 rows)
mts# select count(*) from recipients,addresses where
addresses.address ~ '@cat' and recipients.msgto = addresses.id;
count
-----
576
Dmitry Karasik <dmitry@karasik.eu.org> writes:
> I've constructed a simple query which takes too long
> to finish.
> mts=# explain analyze select *
> mts-# from recipients,addresses
> mts-# where addresses.address ~ '@cat' and
> mts-# recipients.msgto = addresses.id;
> QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=535.70..1603.96 rows=352 width=63)
> (actual time=88.387..422.922 rows=576 loops=1)
> Hash Cond: ("outer".msgto = "inner".id)
> -> Seq Scan on recipients (cost=0.00..737.42 rows=43642 width=24)
> (actual time=0.012..178.258 rows=43642
> loops=1)
> ^^^^^
> -> Hash (cost=535.34..535.34 rows=147 width=39)
> (actual time=87.866..87.866 rows=0 loops=1)
> -> Seq Scan on addresses (cost=0.00..535.34 rows=147 width=39)
> (actual time=0.958..87.806 rows=3 loops=1)
> Filter: (address ~ '@cat'::text)
> Total runtime: 425.061 ms
> (7 rows)
Given that there are actually only 3 rows in addresses matching '@cat',
it would have been best to use a plan like
Nestloop
Seq Scan on addresses
Index Scan on recipients
Index Cond: msgto = outer.id
ie, use the index to visit each of those three recipients. However,
for the 147 rows that the planner was expecting, it's not clear that
the hash plan choice is wrong; and with an even-less-selective address
filter, the hash plan will considerably beat the nestloop. So ISTM that
the real problem is the inaccuracy of the estimate about how many
rows will be pulled from addresses.
Unfortunately there's not a lot you can do to really solve that problem
for arbitrary pattern-match queries :-(. The planner has no stats that
would let it derive a non-guess estimate.
Depending on what your real workload will be like, you might be able to
formulate things so that common cases go faster. For instance, I gather
that what you're doing here is looking at the site portion of email
addresses? If you do that a lot, it might be worth splitting the site
portion out and storing it as a separate column. Then the search
becomes an anchored-left match:
... where addresses.site ~ '^cat'
which the planner has significantly better chance of handling well.
regards, tom lane