Re: Large Scale Aggregation (HashAgg Enhancement)
From: Greg Stark
Subject: Re: Large Scale Aggregation (HashAgg Enhancement)
Date:
Msg-id: 87y81fwiq5.fsf@stark.xeocode.com
In reply to: Re: Large Scale Aggregation (HashAgg Enhancement) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Large Scale Aggregation (HashAgg Enhancement)
List: pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
>
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time. For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).

For a hash aggregate, would it be possible to rescan the original table
instead of spilling to temporary files? When you run out of working memory
you simply throw out half the hash table and ignore subsequent tuples that
fall in those hash buckets, then rescan the table for the discarded bucket
regions. This avoids doing any disk writes at the expense of possibly some
additional reads. In terms of I/O I think it would be much faster in most
cases.

The downsides are:

a) Volatile aggregates, or aggregates with side effects, would be confused
by being executed twice. I'm not sure volatile aggregate functions make
much sense anyway, though.

b) I'm unclear whether rescanning the table could find tuples in a
different state than the previous scan did. If so, the idea doesn't work
at all, but I don't think that's possible, is it?

c) This is the main problem: it may lose in terms of I/O when the
cardinality is low (i.e. it's overflowing despite low cardinality because
the table is really, really big). But most cases spill because the
cardinality is high, so while the table may be big, the spill files are
nearly as big anyway, and having to write and then read them means double
the I/O.

The upside of not having to write out temporary files is big. I find that
queries which need temporary sort files take a *huge* performance penalty,
often an order of magnitude. Part of that could probably be mitigated by
putting the temp files on a separate spindle, but I think it will always
hurt, especially if multiple operations are spilling to disk
simultaneously.
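To make the idea concrete, here's a toy sketch in plain C of the kind of
loop I have in mind: a COUNT(*)-per-key aggregate where a tiny fixed-size
group array stands in for work_mem, and an in-memory array of keys stands
in for the outer scan (which we simply re-read instead of writing batches
out). None of this is real executor code; every name in it is invented,
and it ignores degenerate cases like more colliding keys than slots.

#include <stdio.h>
#include <stdint.h>

#define MAXGROUPS 4   /* stand-in for work_mem: max in-memory groups */

typedef struct { int64_t key; int64_t count; } Group;

static uint32_t hashkey(int64_t key)
{
    return (uint32_t) (key * 2654435761u);   /* toy multiplicative hash */
}

/* Aggregate COUNT(*) per key for tuples whose hash falls in [lo, hi). */
static void hashagg_pass(const int64_t *tuples, int ntuples,
                         uint32_t lo, uint32_t hi)
{
    Group    groups[MAXGROUPS];
    int      ngroups = 0;
    uint32_t keep_hi = hi;          /* shrinks if "memory" fills up */

    for (int t = 0; t < ntuples; t++)
    {
        uint32_t h = hashkey(tuples[t]);
        int      g;

        if (h < lo || h >= keep_hi)
            continue;               /* not in the range we keep this pass */

        for (g = 0; g < ngroups; g++)       /* linear lookup: it's a toy */
            if (groups[g].key == tuples[t])
                break;
        if (g < ngroups)
        {
            groups[g].count++;      /* advance an existing group */
            continue;
        }

        if (ngroups == MAXGROUPS)
        {
            /*
             * Out of memory: halve the hash range we keep, throwing away
             * the groups above the cut, until either this tuple is
             * excluded or a slot frees up.  Discarding partial states is
             * fine because the rescan below recomputes that range from
             * scratch.
             */
            while (ngroups == MAXGROUPS && h < keep_hi)
            {
                int kept = 0;

                keep_hi = lo + (keep_hi - lo) / 2;
                for (g = 0; g < ngroups; g++)
                    if (hashkey(groups[g].key) < keep_hi)
                        groups[kept++] = groups[g];
                ngroups = kept;
            }
            if (h >= keep_hi)
                continue;           /* this tuple waits for the rescan */
        }

        groups[ngroups].key = tuples[t];    /* start a new group */
        groups[ngroups].count = 1;
        ngroups++;
    }

    for (int g = 0; g < ngroups; g++)
        printf("key=%lld count=%lld\n",
               (long long) groups[g].key, (long long) groups[g].count);

    if (keep_hi < hi)               /* rescan the discarded range: reads, */
        hashagg_pass(tuples, ntuples, keep_hi, hi);     /* never writes   */
}

int main(void)
{
    int64_t tuples[] = {1, 2, 3, 4, 5, 1, 2, 6, 7, 8, 5, 9, 1};

    hashagg_pass(tuples, (int) (sizeof(tuples) / sizeof(tuples[0])),
                 0, UINT32_MAX);
    return 0;
}

The point is the shape of the I/O: when memory runs out, the only thing
that ever happens is another sequential read of the input, never a write.

--
greg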