Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
From | Lawrence, Ramon |
---|---|
Subject | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date | |
Msg-id | 6EEA43D22289484890D119821101B1DF2C1885@exchange20.mercury.ad.ubc.ca |
In response to | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets ("Bryce Cutt" <pandasuit@gmail.com>) |
Responses | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
List | pgsql-hackers |

> Here is a cleaned-up version. I fixed a number of whitespace issues,
> improved a few comments, and rearranged one set of nested if-else
> statements (hopefully without breaking anything in the process).
>
> Josh / eggyknap -
>
> Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for testing. The patch is the same as Robert's version except that it tracks and prints out statistics during the join on how many tuples are affected, and it has the enable_hashjoin_usestatmcvs variable defined so that it is easy to turn skew handling on and off. This is useful because, although the patch reduces the number of I/Os performed, the improvement may not be visible in queries that are dominated by other cost factors (non-skew joins, CPU time, time to scan input relations, etc.).

The sample output looks like this:

    LI-P Values: 100  Skew: 0.27  Est. tuples: 59986052.00  Batches: 512  Est. Save: 16114709.99
    Total Inner Tuples: 2000000
    IM Inner Tuples: 83
    Batch Zero Inner Tuples: 3941
    Batch Zero Potential Inner Tuples: 3941
    Total Outer Tuples: 59986052
    IM Outer Tuples: 16074146
    Batch Zero Outer Tuples: 98778
    Batch Zero Potential Outer Tuples: 98778
    Total Output Tuples: 59986052
    IM Output Tuples: 16074146
    Batch Zero Output Tuples: 98778
    Batch Zero Potential Output Tuples: 98778
    Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPC-H 10 GB data set (skew factors z=1 and z=2). For the Customer-Orders-Lineitem-Part query that Josh was testing, we see no overall time difference that is significant compared to experimental error (although there is an I/O benefit for the Lineitem-Part join). The cost of this query is dominated by the non-skew joins of Customer-Orders and Orders-Lineitem and by output tuple construction.
The joins with skew, Lineitem-Supplier and Lineitem-Part, show significantly improved performance. Note how the statistics show that the percentage I/O savings is directly proportional to the skew. However, the overall query time savings is always less than this as there are other costs such as reading the relations, performing the hash comparisons, building the output tuples, etc. that are unaffected by the optimization.

At this point, we await further feedback on what is necessary to get this patch accepted. We would also like to thank Josh and Robert again for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt
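
The counters in the sample output quoted in the message can be cross-checked with a little arithmetic. The sketch below is not taken from the patch. It assumes that a plain hybrid hash join (HHJ) spills every inner and outer tuple that falls outside batch zero, and that each tuple handled by the in-memory (IM) skew partition avoids that spill. Under that assumption the reported 25.98 figure is reproduced. The function names, the dictionary keys, and the reading of `Skew: 0.27` as a fraction compared against a 0.01 cutoff are all hypothetical.

```python
# Counters copied from the sample output in the message.
stats = {
    "total_inner": 2_000_000,
    "im_inner": 83,
    "batch_zero_inner": 3_941,
    "total_outer": 59_986_052,
    "im_outer": 16_074_146,
    "batch_zero_outer": 98_778,
}


def io_savings_pct(s):
    """Percentage of tuple I/Os avoided relative to a plain hybrid hash join.

    Assumption (not from the patch source): HHJ spills every tuple that is
    not in batch zero, and every IM tuple is one spill avoided.
    """
    hhj_spilled = (s["total_inner"] + s["total_outer"]
                   - s["batch_zero_inner"] - s["batch_zero_outer"])
    saved = s["im_inner"] + s["im_outer"]
    return 100.0 * saved / hhj_spilled


def use_skew_partition(skew, threshold=0.01):
    """Hypothetical form of the 1% rule: skip the IM partition below the cutoff."""
    return skew >= threshold


print(f"{io_savings_pct(stats):.2f}")   # percentage fewer tuple I/Os than HHJ
print(use_skew_partition(0.27))         # skew value from the sample output
```

If the patch counts I/Os differently (for example, per page rather than per tuple), the formula would need adjusting. The match with the reported value only shows that this reading is consistent with the numbers in the message.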