Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
От | David Rowley |
---|---|
Тема | Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant |
Дата | |
Msg-id | CAApHDvqS0j8RUWRUSgCAXxOqnYjHUXmKwspRj4GzVfOO25ByHA@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant |
Список | pgsql-hackers |
Over on [1], Klint highlights a query with a DISTINCT which is a little sub-optimal in PostgreSQL. ISTM that in cases where all DISTINCT pathkeys have been marked as redundant due to constants existing in all of the EquivalenceClasses of the DISTINCT columns, then it looks like it should be okay not to bother using a Unique node to remove duplicate results. When all the distinct pathkeys are redundant then there can only be, at most, 1 single distinct value. There may be many rows with that value, but we can remove those extra ones with a LIMIT 1 rather than troubling over needlessly uniquifing them. This might not be a hugely common case, but; 1) it is very cheap to detect and 2) the speedups are likely to be *very* good. With the attached we get: regression=# explain (analyze, costs off, timing off) SELECT DISTINCT four,1,2,3 FROM tenk1 WHERE four = 0; QUERY PLAN ------------------------------------------------- Limit (actual rows=1 loops=1) -> Seq Scan on tenk1 (actual rows=1 loops=1) Filter: (four = 0) Planning Time: 0.215 ms Execution Time: 0.071 ms naturally, if we removed the WHERE four = 0, we can't optimise this plan using this method. I see no reason why this also can't work for DISTINCT ON too. regression=# explain (analyze, costs off, timing off) SELECT DISTINCT ON (four,two) four,two FROM tenk1 WHERE four = 0 order by 1,2; QUERY PLAN ---------------------------------------------------------- Unique (actual rows=1 loops=1) -> Sort (actual rows=2500 loops=1) Sort Key: two Sort Method: quicksort Memory: 175kB -> Seq Scan on tenk1 (actual rows=2500 loops=1) Filter: (four = 0) Rows Removed by Filter: 7500 Planning Time: 0.123 ms Execution Time: 4.251 ms (9 rows) then, of course, if we introduce some column that the pathkey is not redundant for then we must do the distinct operation as normal. regression=# explain (analyze, costs off, timing off) SELECT DISTINCT four,two FROM tenk1 WHERE four = 0 order by 1,2; QUERY PLAN ---------------------------------------------------------- Sort (actual rows=1 loops=1) Sort Key: two Sort Method: quicksort Memory: 25kB -> HashAggregate (actual rows=1 loops=1) Group Key: four, two Batches: 1 Memory Usage: 24kB -> Seq Scan on tenk1 (actual rows=2500 loops=1) Filter: (four = 0) Rows Removed by Filter: 7500 Planning Time: 0.137 ms Execution Time: 4.274 ms (11 rows) Does this seem like something we'd want to do? Patch attached. David [1] https://postgr.es/m/MEYPR01MB7101CD5DA0A07C9DE2B74850A4239@MEYPR01MB7101.ausprd01.prod.outlook.com
Вложения
В списке pgsql-hackers по дате отправления: