Re: parallel distinct union and aggregate support patch
От | Dilip Kumar |
---|---|
Тема | Re: parallel distinct union and aggregate support patch |
Дата | |
Msg-id | CAFiTN-vaVErT4s_XC4ZeAbN9pP2g_BeKwo2uzD_sUohdU2dg=g@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: parallel distinct union and aggregate support patch (Heikki Linnakangas <hlinnaka@iki.fi>) |
Список | pgsql-hackers |
On Fri, Nov 27, 2020 at 9:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I also had a quick look at the patch and the comments made so far. Summary: > > 1. The performance results are promising. > > 2. The code needs comments. > > Regarding the design: > > Thomas Munro mentioned the idea of a "Parallel Repartition" node that > would redistribute tuples like this. As I understand it, the difference > is that this BatchSort implementation collects all tuples in a tuplesort > or a tuplestore, while a Parallel Repartition node would just > redistribute the tuples to the workers, without buffering. I think the advantage of the "Parallel BatchSort" is that it give flexibility to pick the batches dynamically by the worker after the repartition. OTOH if we distribute batches directly based on the worker number the advantage is that the operator will be quite flexible, e.g. if we want to implement the merge join we can just place the "Parallel Repartition" node above both side of the scan node and we will simply get the batch wise merge join because each worker knows their batch. Whereas if we allow workers to dynamically pick the batch the right side node needs to know which batch to pick because it is dynamically picked, I mean it is not as difficult because it is the same worker but it seems less flexible. The receiving > worker could put the tuples to a tuplestore or sort if needed. If we are using it without buffering then the sending worker can directly put the tuple into the respective sort/tuplestore node. > I think a non-buffering Reparttion node would be simpler, and thus > better. In these patches, you have a BatchSort node, and batchstore, but > a simple Parallel Repartition node could do both. For example, to > implement distinct: > > Gather > - > Unique > -> Sort > -> Parallel Redistribute > -> Parallel Seq Scan > > And a Hash Agg would look like this: > > Gather > - > Hash Agg > -> Parallel Redistribute > -> Parallel Seq Scan > > > I'm marking this as Waiting on Author in the commitfest. I agree that the simple parallel redistribute/repartition node will be flexible and could do both, but I see one problem. Basically, if we use the common operator then first the Parallel Redistribute operator will use the tuplestore for redistributing the data as per the worker and then each worker might use the disk again to sort their respective data. Instead of that while redistributing the data itself we can use the parallel sort so that each worker gets their respective batch in form of sorted tapes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
В списке pgsql-hackers по дате отправления: