Re: Group by more efficient than distinct?
| From | Mark Mielke |
|---|---|
| Subject | Re: Group by more efficient than distinct? |
| Date | |
| Msg-id | 480D283E.1060602@mark.mielke.cc |
| In reply to | Re: Group by more efficient than distinct? (PFC <lists@peufeu.com>) |
| Responses | Re: Group by more efficient than distinct?, Re: Group by more efficient than distinct? |
| List | pgsql-performance |
PFC wrote:
> Actually, the memory used by the hash depends on the number of
> distinct values, not the number of rows which are processed...
> Consider:
>
> SELECT a GROUP BY a
> SELECT a, count(*) GROUP BY a
>
> In both cases the hash only holds distinct values. So if you have
> 1 million rows to process but only 10 distinct values of "a", the hash
> will only contain those 10 values (and the counts), so it will be very
> small and fast, and it will absorb a huge seq scan without problem. If,
> however, you have (say) 100 million distinct values for "a", using a
> hash would be a bad idea. As usual, divide the size of your RAM by the
> number of concurrent connections or something.
> Note that "a" could be a column, several columns, anything; the
> size of the hash will be proportional to the number of distinct
> values, i.e. the number of rows returned by the query, not the number
> of rows processed (read) by the query. Same with hash joins etc.;
> that's why, when you join a very small table to a large one, Postgres
> likes to use a seq scan + hash join on the small table.

This surprises me - hash values are lossy, so it must still need to confirm matches against the real list of values, which at a minimum should require references to the rows to check against? Is PostgreSQL doing something beyond my imagination? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
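A minimal sketch of the experiment PFC describes, for anyone who wants to see the planner's choice directly. The table name t, the row count, and the use of generate_series are illustrative assumptions, not anything from the thread; whether a HashAggregate actually appears depends on the PostgreSQL version and settings such as work_mem (releases from the era of this thread hashed GROUP BY but not DISTINCT, which is what prompted the original question).

```sql
-- Illustrative scratch setup: 1 million rows, but only 10 distinct values of "a".
CREATE TEMP TABLE t AS
SELECT (i % 10) AS a
FROM generate_series(1, 1000000) AS i;

ANALYZE t;

-- If the planner picks a HashAggregate, its hash table holds only the
-- 10 distinct values of "a" (plus the counts), regardless of how many
-- rows are scanned.
EXPLAIN ANALYZE SELECT DISTINCT a FROM t;
EXPLAIN ANALYZE SELECT a, count(*) FROM t GROUP BY a;
```

Comparing the two EXPLAIN ANALYZE outputs shows whether DISTINCT and GROUP BY get the same plan on the server being tested.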