Improving NOT IN
From | Simon Riggs |
---|---|
Subject | Improving NOT IN |
Date | |
Msg-id | 1170194592.3681.271.camel@silverbirch.site |
Replies | Re: Improving NOT IN (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
It's a fairly common case to want to improve a query along the lines of TableA intersect ~TableB. We can write this as

    select * from tableA where key not in (select key from tableB)

or we can get more fancy:

    select tableA.*
    from tableA left outer join tableB on tableA.key = tableB.key
    where tableB.key is null;

I've worked out a new join method that will improve performance over and above the second case. This is effective where the referenced table (tableB) is fairly large and the join columns are discrete; currently that mostly means they're integers.

The plan seeks to *prove* that there are no matches, rather than taking the exhaustive join approach taken currently. First we need to show that the referenced table's PK values are a fully contiguous sequence of integers with no gaps. Once this has been proved, we can then use that fact to scan the FK table using the values of the min and max PK to see if any outliers exist. There is no actual comparison of the values, just a proof that none is required.

I'll describe this using SQL statements, which execute as SeqScans of the PK and then FK tables. There is no preparatory step - no building a sort table or preparing a hash table - so these SQL statements always execute faster than the fastest current plan. Most importantly, there is no step that consumes large amounts of memory, so the case where both tables are very large performs much, much better.

1. Scan referenced table

   a) select max(aid), min(aid) from accounts;
   b) select count(*) from accounts;

   Sometimes this is faster as two queries when the table has a PK.

2. Decision step

   If max - min + 1 - count == 0 then we have a contiguous range, and because we know we have a discrete datatype we can now *prove* that there are no missing values from the set bounded by the min and the max. We can then use that directly in a new query:

3. a) Scan referencing table

      select aid from history where aid > ?
      or aid < ?;

   using parameters of max and min from step 1

   b) normal query

Steps 1 & 2 can fail to find a contiguous range, in which case we need to fall back to an existing query plan. So there is only a small overhead in the case where we run the first query but fail to use the optimisation at all and need to fall back to an existing query. We can estimate whether this is the case by estimating the row count of the table and seeing if that compares favourably with the expected number of values if the whole min-max range of values is actually present. The min/max query uses the Primary Key index (which must always be present), so it takes very little time. So overall this looks like a win in certain common cases, and not a particular loss in any case.

Try this SQL:

1. select max(aid), min(aid) from accounts;
2. select count(*) from accounts;
3. select aid from history
   where aid > (select max(aid) from accounts)
      or aid < (select min(aid) from accounts)
   limit 1;

against

    alter table history add foreign key (aid) references accounts;

I get (1) about 0.2 secs, (2) 6 secs, (3) 9 secs, against 27 secs for the ALTER TABLE, using work_mem = 64MB and data that fits in memory.

We could implement the new SQL directly within ALTER TABLE, or we could actually create this as a new plan type that would then allow existing SQL to perform better in specific cases. I've not seen such a plan on any other RDBMS and think it might be completely new; I'm calling it a Proof Join, for want of a better description. The preparatory steps are completely excluded, hence the x2 speedup. For larger referenced tables the performance improvement could be much more.

ISTM that even though this is a special case, it is actually a common one, so it would be worth optimising for.

Ideas stage at the moment: thoughts?

-- 
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
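The three steps above can be sketched end to end. This is only a toy illustration in Python over an in-memory SQLite database, not the proposed executor change: the table and column names follow the accounts/history example from the post, while the helper function name and the sample data are made up for the sketch.

```python
import sqlite3

# Hypothetical toy schema mirroring the accounts/history example;
# the data is illustrative only. history.aid = 7 violates the FK.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (aid INTEGER PRIMARY KEY);
    CREATE TABLE history  (aid INTEGER);
    INSERT INTO accounts (aid) VALUES (1), (2), (3), (4), (5);
    INSERT INTO history  (aid) VALUES (2), (4), (7);
""")

def fk_violations_proof_join(con):
    """Return history.aid values with no match in accounts, using the
    'Proof Join' idea: prove accounts.aid is a gap-free integer range,
    then only probe for outliers below the min or above the max."""
    # Step 1: scan the referenced table for min, max and row count.
    lo, hi, n = con.execute(
        "SELECT min(aid), max(aid), count(*) FROM accounts").fetchone()
    # Step 2: a contiguous integer range [lo, hi] holds exactly
    # hi - lo + 1 distinct values, so equality with the row count
    # proves there are no gaps (aid is a PK, hence distinct).
    if lo is not None and hi - lo + 1 == n:
        # Step 3a: any violating row must lie outside [lo, hi];
        # no per-row membership test is needed.
        rows = con.execute(
            "SELECT aid FROM history WHERE aid < ? OR aid > ?",
            (lo, hi)).fetchall()
    else:
        # Fallback: the range has gaps, so use an ordinary anti-join.
        rows = con.execute("""
            SELECT h.aid FROM history h
            LEFT OUTER JOIN accounts a ON h.aid = a.aid
            WHERE a.aid IS NULL""").fetchall()
    return [aid for (aid,) in rows]

print(fk_violations_proof_join(con))  # prints [7]
```

Deleting a row from accounts punches a hole in the range, so the contiguity proof fails and the function drops back to the ordinary anti-join, mirroring the fallback described in step 2.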