Re: Group by more efficient than distinct?
От | Mark Mielke |
---|---|
Тема | Re: Group by more efficient than distinct? |
Дата | |
Msg-id | 480D41C3.3070608@mark.mielke.cc обсуждение исходный текст |
Ответ на | Re: Group by more efficient than distinct? (Mark Mielke <mark@mark.mielke.cc>) |
Список | pgsql-performance |
Mark Mielke wrote: > PFC wrote: >> Actually, the memory used by the hash depends on the number of >> distinct values, not the number of rows which are processed... >> Consider : >> >> SELECT a GROUP BY a >> SELECT a,count(*) GROUP BY a >> >> In both cases the hash only holds discinct values. So if you have >> 1 million rows to process but only 10 distinct values of "a", the >> hash will only contain those 10 values (and the counts), so it will >> be very small and fast, it will absorb a huge seq scan without >> problem. If however, you have (say) 100 million distinct values for >> a, using a hash would be a bad idea. As usual, divide the size of >> your RAM by the number of concurrent connections or something. >> Note that "a" could be a column, several columns, anything, the >> size of the hash will be proportional to the number of distinct >> values, ie. the number of rows returned by the query, not the number >> of rows processed (read) by the query. Same with hash joins etc, >> that's why when you join a very small table to a large one Postgres >> likes to use seq scan + hash join on the small table. > > This surprises me - hash values are lossy, so it must still need to > confirm against the real list of values, which at a minimum should > require references to the rows to check against? > > Is PostgreSQL doing something beyond my imagination? :-) Hmmm... You did say distinct values, so I can see how that would work for distinct. What about seq scan + hash join, though? To complete the join, wouldn't it need to have a reference to each of the rows to join against? If there is 20 distinct values and 200 rows in the small table - wouldn't it need 200 references to be stored? Cheers, mark -- Mark Mielke <mark@mielke.cc>
В списке pgsql-performance по дате отправления: