Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
From | Joshua Tolley |
---|---|
Subject | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date | |
Msg-id | e7e0a2570811011541x28612963w1f17dcb6d2fe846a@mail.gmail.com |
In reply to | Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>) |
Responses | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
List | pgsql-hackers |
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon.lawrence@ubc.ca> wrote:
> We propose a patch that improves hybrid hash join's performance for large
> multi-batch joins where the probe relation has skew.
>
> Project name: Histojoin
> Patch file: histojoin_v1.patch
>
> This patch implements the Histojoin join algorithm as an optional feature
> added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
> disable the Histojoin features. When Histojoin is disabled, HHJ acts as
> normal. The Histojoin features allow HHJ to use PostgreSQL's statistics to
> do skew aware partitioning. The basic idea is to keep build relation tuples
> in a small in-memory hash table that have join values that are frequently
> occurring in the probe relation. This improves performance of HHJ when
> multiple batches are used by 10% to 50% for skewed data sets. The
> performance improvements of this patch can be seen in the paper (pages
> 25-30) at:
>
> http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
>
> All generators and materials needed to verify these results can be provided.
>
> This is a patch against the HEAD of the repository.
>
> This patch does not contain platform specific code. It compiles and has
> been tested on our machines in both Windows (MSVC++) and Linux (GCC).
>
> Currently the Histojoin feature is enabled by default and is used whenever
> HHJ is used and there are Most Common Value (MCV) statistics available on
> the probe side base relation of the join. To disable this feature simply
> set the enable_hashjoin_usestatmcvs flag to off in the database
> configuration file or at run time with the 'set' command.
>
> One potential improvement not included in the patch is that Most Common
> Value (MCV) statistics are only determined when the probe relation is
> produced by a scan operator. There is a benefit to using MCVs even when the
> probe relation is not a base scan, but we were unable to determine how to
> find statistics from a base relation after other operators are performed.
>
> This patch was created by Bryce Cutt as part of his work on his M.Sc.
> thesis.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of British
> Columbia Okanagan
> E-mail: ramon.lawrence@ubc.ca

I'm interested in trying to review this patch. Having not done patch review
before, I can't exactly promise grand results, but could you provide me with
the data to check your results? In the meantime I'll go read the paper.

- Josh / eggyknap
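
For readers following the thread, the "basic idea" in the quoted proposal (keep build tuples whose join values are frequent in the probe relation in a small in-memory hash table) can be illustrated with a toy sketch. This is not the Histojoin patch or PostgreSQL code: the function name `skew_aware_hash_join`, its parameters, and the list-based "batches" are all hypothetical, and the sketch simplifies by treating every non-MCV tuple as deferred to a batch, whereas a real hybrid hash join also keeps its first batch in memory.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, mcv_keys, num_batches=4):
    """Toy equi-join of two lists of (key, payload) tuples.

    mcv_keys stands in for the probe side's most-common-value statistics.
    Returns (results, deferred), where deferred counts the probe tuples that
    were routed to a batch instead of being joined immediately.
    """
    mcv_keys = set(mcv_keys)

    # Build phase: MCV build tuples go to a small in-memory table,
    # everything else is partitioned into batches by hash of the join key.
    skew_table = defaultdict(list)
    build_batches = [defaultdict(list) for _ in range(num_batches)]
    for key, payload in build:
        if key in mcv_keys:
            skew_table[key].append(payload)
        else:
            build_batches[hash(key) % num_batches][key].append(payload)

    # Probe phase: tuples matching the in-memory table are joined at once;
    # the rest are deferred to the batch with the same hash partition.
    results = []
    probe_batches = [[] for _ in range(num_batches)]
    deferred = 0
    for key, payload in probe:
        if key in skew_table:
            results.extend((key, b, payload) for b in skew_table[key])
        else:
            probe_batches[hash(key) % num_batches].append((key, payload))
            deferred += 1

    # Batch phase: join each deferred probe batch against its build batch.
    for build_batch, probe_batch in zip(build_batches, probe_batches):
        for key, payload in probe_batch:
            for b in build_batch.get(key, ()):
                results.append((key, b, payload))

    return results, deferred
```

With a probe input in which one key dominates, most probe tuples are answered from the in-memory table and never reach a batch, which is the effect the proposal attributes to skew-aware partitioning. The sketch says nothing about the 10% to 50% figure quoted above; that comes from the authors' measurements.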