Обсуждение: Partition-wise join for join between (declaratively) partitioned tables
Amit Langote is working on supporting declarative partitioning in PostgreSQL [1]. I have started working on supporting partition-wise join. This mail describes very high level design and some insight into the performance improvements.
An equi-join between two partitioned tables can be broken down into pair-wise join between their partitions. This technique is called partition-wise join. Partition-wise joins between similarly partitioned tables with equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join smaller. All such joins collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes, their storage etc.
3. An N-way partition-wise join may have different efficient join orders compared to the efficient join order between the parent tables.
An equi-join between two partitioned tables can be broken down into pair-wise join between their partitions. This technique is called partition-wise join. Partition-wise joins between similarly partitioned tables with equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join smaller. All such joins collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes, their storage etc.
3. An N-way partition-wise join may have different efficient join orders compared to the efficient join order between the parent tables.
A partition-wise join is processed in following stages [2], [3].
1. Applicability testing: This phase checks if the join conditions match the partitioning scheme. A partition-wise join is efficient if there is an equi-join on the partition keys. E.g. join between tables R and S partitioned by columns a and b resp. can be broken down into partition-wise joins if there exists a join condition is R.a = S.b. Or in other words the number of provably non-empty partition-wise joins is O(N) where N is the number of partitions.
2. Matching: This phase determines which joins between the partitions of R and S can potentially produce tuples in the join and prunes empty joins between partitions.
3. Clustering: This phase aims at reducing the number of partition-wise joins by clubbing together partitions from joining relations. E.g. clubbing multiple partitions from either of the partitioned relations which can join to a single partition from the other partitioned relation.
4. Path/plan creation: This phase creates multiple paths for each partition-wise join. It also creates Append path/s representing the union of partition-wise joins.
The work here focuses on a subset of use-cases discussed in [2]. It only considers partition-wise join for join between similarly partitioned tables with same number of partitions with same properties, thus producing at most as many partition-wise joins as there are partitions. It should be possible to apply partition-wise join technique (with some special handling for OUTER joins) if both relations have some extra partitions with non-overlapping partition conditions, apart from the matching partitions. But I am not planning to implement this optimization in the first cut.
The attached patch is a POC implementation of partition-wise join. It is is based on the set of patches posted on 23rd May 2016 by Amit Langote for declarative partitioning. The patch gives an idea about the approach used. It has several TODOs, which I am working on.
Attached is a script with output which measures potential performance improvement because of partition-wise join. The script uses a GUC enable_partition_wise_join to disable/enable this feature for performance measurement. The scripts measures performance improvement of a join between two tables partitioned by range on integer column. Each table contains 50K rows. Each table has an integer and a varchar column. It shows around 10-15% reduction in execution time when partition-wise join is used. Accompanied with parallel query and FDWs, it opens up avenues for further improvements for joins between partitioned tables.
[1]. https://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp
[2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
[3]. https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Вложения
On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Amit Langote is working on supporting declarative partitioning in PostgreSQL > [1]. I have started working on supporting partition-wise join. This mail > describes very high level design and some insight into the performance > improvements. > > An equi-join between two partitioned tables can be broken down into > pair-wise join between their partitions. This technique is called > partition-wise join. Partition-wise joins between similarly partitioned > tables with equi-join condition can be efficient because [2] > 1. Each provably non-empty partition-wise join smaller. All such joins > collectively might be more efficient than the join between their parent. > 2. Such joins are able to exploit properties of partitions like indexes, > their storage etc. > 3. An N-way partition-wise join may have different efficient join orders > compared to the efficient join order between the parent tables. > > A partition-wise join is processed in following stages [2], [3]. > 1. Applicability testing: This phase checks if the join conditions match the > partitioning scheme. A partition-wise join is efficient if there is an > equi-join on the partition keys. E.g. join between tables R and S > partitioned by columns a and b resp. can be broken down into partition-wise > joins if there exists a join condition is R.a = S.b. Or in other words the > number of provably non-empty partition-wise joins is O(N) where N is the > number of partitions. > > 2. Matching: This phase determines which joins between the partitions of R > and S can potentially produce tuples in the join and prunes empty joins > between partitions. > > 3. Clustering: This phase aims at reducing the number of partition-wise > joins by clubbing together partitions from joining relations. E.g. clubbing > multiple partitions from either of the partitioned relations which can join > to a single partition from the other partitioned relation. > > 4. Path/plan creation: This phase creates multiple paths for each > partition-wise join. It also creates Append path/s representing the union of > partition-wise joins. > > The work here focuses on a subset of use-cases discussed in [2]. It only > considers partition-wise join for join between similarly partitioned tables > with same number of partitions with same properties, thus producing at most > as many partition-wise joins as there are partitions. It should be possible > to apply partition-wise join technique (with some special handling for OUTER > joins) if both relations have some extra partitions with non-overlapping > partition conditions, apart from the matching partitions. But I am not > planning to implement this optimization in the first cut. I haven't reviewed this code yet due to being busy with 9.6, but I think this is a very important query planner improvement with the potential for big wins on queries involving large amounts of data. Suppose we have a pair of equi-partitioned tables. Right now, if we choose to perform a hash join, we'll have to build a giant hash table with all of the rows from every inner partition and then probe it for every row in every outer partition. If there are few enough inner rows that the resultant hash table still fits in work_mem, this is somewhat inefficient but not terrible - but if it causes us to have to batch the hash join where we otherwise would not need to do so, then it really sucks. Similarly, if we decide to merge-join each pair of partitions, a partitionwise join may be able to use an internal sort on some or all partitions whereas if we had to deal with all of the data at the same time we'd need an external sort, possibly multi-pass. And if we choose a nested loop, say over an innerindex-scan, we do O(outer rows) index probes with this optimization but O(outer rows * inner partitions) index probes without it. In addition, parallel query can benefit significantly from this kind of optimization. Tom recently raised the case of an appendrel where every child has a parallel-safe path but not every child has a partial path; currently, we can't go parallel in that case, but it's easy to see that we could handle it by scheduling the appendrel's children across a pool of workers. If we had this optimization, that sort of thing would be much more likely to be useful, because it could create appendrels where each member is an N-way join between equipartitioned tables. That's particularly important right now because of the restriction that a partial path must be driven by a Parallel SeqScan, but even after that restriction is lifted it's easy to imagine that the effective degree of parallelism for a single index scan may be limited - so this kind of thing may significantly increase the number of workers that a given query can use productively. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
+1.
The attached patch implements the logic to assess whether two partitionedI haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.
Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
Or we might be able to use indexes directly without need of a MergeAppend.
And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.
In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.
+1.
tables can be joined using partition-wise join technique described in my last
mail on this thread.
Two partitioned relations are considered for partition-wise join if following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.
Partition-wise join technique can be applied under more lenient constraints [1]
e.g. joins between tables with different number of partitions but having same
bounds/lists for the common partitions. I am planning to defer that to a later
version of this feature.
A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.
A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.
Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the joins
can and can not use partition-wise join technique e.g.
"NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the query.
OR
"NOTICE: join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path creation
logic is implemented.
The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Вложения
Sorry forgot to mention: this patch applies on top of the v7 patches posted by Amit Langote on 27th June (https://www.postgresql.org/message-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7%40lab.ntt.co.jp).
On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:The attached patch implements the logic to assess whether two partitionedI haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.
Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.Or we might be able to use indexes directly without need of a MergeAppend.And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.
In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.
+1.
tables can be joined using partition-wise join technique described in my last
mail on this thread.
Two partitioned relations are considered for partition-wise join if following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.
Partition-wise join technique can be applied under more lenient constraints [1]
e.g. joins between tables with different number of partitions but having same
bounds/lists for the common partitions. I am planning to defer that to a later
version of this feature.
A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.
A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo) structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.
Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the joins
can and can not use partition-wise join technique e.g.
"NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the query.
OR
"NOTICE: join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path creation
logic is implemented.
The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.
--Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi All,
PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016. The previous patch added support to assess whether two
tables can be joined using partition-wise join technique, but did not have
complete support to create plans which used partition-wise technique. This
patch implements three important pieces for supporting partition-wise join
1. Logic to assess whether join between two partitioned tables can be executed
using partition-wise join technique.
2. Construct RelOptInfo's representating join between matching partitions of
the joining relations and add join paths to those RelOptInfo's
3. Add append paths to the RelOptInfo representing the join between partitioned
tables. Rest of the planner code chooses the optimal path for join.
make_join_rel() now calls try_partition_wise_join(), which executes all of the
steps listed above. If the joining partitioned relations are deemed fit for
partition-wise join, we create one RelOptInfo (if not already present)
representing a join between every pair of partitions to be joined. Since the
join between parents is deemed legal, the join between the partitions is also
legal, hence legality of the join is not checked again. RelOptInfo representing
the join between partitions is constructed by translating the relevant members
of RelOptInfo of the parent join relation. Similarly SpecialJoinInfo,
restrictlist (for given join order) are constructed by translating those for
the parent join.
make_join_rel() is split into two portions, a. that deals with constructing
restrictlist and RelOptInfo for join relation b. that creates paths for the
join. The second portion is separated into a function
populate_joinrel_with_paths(), which is reused in try_partition_wise_join() to
create paths for join between matching partitions.
set_append_rel_pathlist() generates paths for child relations, marks the empty
children as dummy relations and creates append paths by collecting paths with
similar properties (parameterization and pathkeys) from non-empty children. It
then adds append paths to the parent relation. This patch divides
set_append_rel_pathlist() into two parts a. marking empty child relations as
dummy and generating paths for non-empty children. b. collecting children paths
into append paths for parent. Part b is separate into a function
add_paths_to_append_rel() which is reused for collecting paths from
partition-wise join child relations to construct append paths for join between
partitioned tables.
For an N-way join between partitioned tables, make_join_rel() is called as many
times as the number of valid join orders exist. For each such call, we will add
paths to join between partitions for corresponding join order between those
partitions. We can generate the append paths for parent joinrel only after all
such join orders have been considered. Hence before setting cheapest path forx
parent join relation, we set the cheapest path for each join relation between
partitions, followed by creating append paths for the parent joinrel. This
method needs some readjustment for multi-level partitions (TODO item 2 below).
A GUC enable_partition_wise_join is added to enable or disable partition-wise
join technique. I think the GUC is useful similar to other join related GUCs
like enable_hashjoin.
parameterized paths: While creating parameterized paths for child relations of
a partitioned tables, we do not have an idea as to whether we will be able to
use partition-wise join technique or not. Also we do not know the child
partition of the other partitioned table, to which a given partition would
join. Hence we do not create paths parameterized by child partitions of other
partitioned relations. But path for child of a partitioned relation
parameterized by other parent relation, can be considered to be parameterised
by any child relation of the other partitioned relation by replacing the parent
parameters by corresponding child parameters. This condition is used to
eliminate parameterized paths while creating merge and hash joins, to decide
the resultant parameterization of a join between child partitions and to create
nested loop paths with inner path parameterized by outer relation where inner
and outer relations are child partitions. While creating such nest loop join
paths we translate the path parameterized by other parent partitioned relation,
to that parameterized by the required child.
Functions like select_outer_pathkeys_for_merge(), make_sort_from_pathkeys(),
find_ec_member_for_tle() which did not expect to be called for a child
relation, are now used for child partition relations for joins. These functions
are adjusted for that usage.
Testing:
I have added partition_join.sql testcase to test partition-wise join feature.
That file has extensive tests for list, range, multi-level partitioning schemes
and various kinds of joins including nested loop join with inner relation
parameterized by outer relationThat file has extensive tests for list, range,
multi-level partitioning schemes and various kinds of joins including nested
loop join with inner relation parameterized by outer relation.
make check passes clean.
TODOs:
1. Instead of storing partitioning information in RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer to
the member of list representing the partitioning scheme of corresponding
relation. RelOptInfo's of all similarly partitioned relations get the same
pointer thus making it easy to match the partitioning schemes by comparing the
pointers. While we are supporting only exact partition matching scheme now,
it's possible to extend this method to match compatible partitioning schemes by
maintaining a list of compatible partitioning schemes.
Right now, I have moved some partition related structures from partition.c to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep that
structure and its final name.
2. Multi-level partitioned tables: For some reason path created for joining
partitions are not being picked up as the cheapest paths. I think, we need to
finalize the lower level paths before moving upwards in the partition
hierarchy. But I am yet to investigate the issue here. RelOptInfo::parent_relid
should point to top parents rather than immediate parents.
3. Testing: need more tests for testing partition-wise join with foreign tables
as partitions. More tests for parameterized joins for multi-level partitioned
joins.
4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.
5. In add_paths_to_append_rel() to find the possible set of outer relations for
generating parameterized paths for a given join. This code needs to be adjusted
to eliminate the parent relations possible set of outer relations for a join
between child partitions.
6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path for
child partition.
7. The patch adds make_joinrel() (name needs to be changed because of its
similariy with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide to
retain it.
8. Few small TODOs related to code reorganization, proper function,
variable naming etc. are in the patch. pg_indent run.
PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016. The previous patch added support to assess whether two
tables can be joined using partition-wise join technique, but did not have
complete support to create plans which used partition-wise technique. This
patch implements three important pieces for supporting partition-wise join
1. Logic to assess whether join between two partitioned tables can be executed
using partition-wise join technique.
2. Construct RelOptInfo's representating join between matching partitions of
the joining relations and add join paths to those RelOptInfo's
3. Add append paths to the RelOptInfo representing the join between partitioned
tables. Rest of the planner code chooses the optimal path for join.
make_join_rel() now calls try_partition_wise_join(), which executes all of the
steps listed above. If the joining partitioned relations are deemed fit for
partition-wise join, we create one RelOptInfo (if not already present)
representing a join between every pair of partitions to be joined. Since the
join between parents is deemed legal, the join between the partitions is also
legal, hence legality of the join is not checked again. RelOptInfo representing
the join between partitions is constructed by translating the relevant members
of RelOptInfo of the parent join relation. Similarly SpecialJoinInfo,
restrictlist (for given join order) are constructed by translating those for
the parent join.
make_join_rel() is split into two portions, a. that deals with constructing
restrictlist and RelOptInfo for join relation b. that creates paths for the
join. The second portion is separated into a function
populate_joinrel_with_paths(), which is reused in try_partition_wise_join() to
create paths for join between matching partitions.
set_append_rel_pathlist() generates paths for child relations, marks the empty
children as dummy relations and creates append paths by collecting paths with
similar properties (parameterization and pathkeys) from non-empty children. It
then adds append paths to the parent relation. This patch divides
set_append_rel_pathlist() into two parts a. marking empty child relations as
dummy and generating paths for non-empty children. b. collecting children paths
into append paths for parent. Part b is separate into a function
add_paths_to_append_rel() which is reused for collecting paths from
partition-wise join child relations to construct append paths for join between
partitioned tables.
For an N-way join between partitioned tables, make_join_rel() is called as many
times as the number of valid join orders exist. For each such call, we will add
paths to join between partitions for corresponding join order between those
partitions. We can generate the append paths for parent joinrel only after all
such join orders have been considered. Hence before setting cheapest path forx
parent join relation, we set the cheapest path for each join relation between
partitions, followed by creating append paths for the parent joinrel. This
method needs some readjustment for multi-level partitions (TODO item 2 below).
A GUC enable_partition_wise_join is added to enable or disable partition-wise
join technique. I think the GUC is useful similar to other join related GUCs
like enable_hashjoin.
parameterized paths: While creating parameterized paths for child relations of
a partitioned tables, we do not have an idea as to whether we will be able to
use partition-wise join technique or not. Also we do not know the child
partition of the other partitioned table, to which a given partition would
join. Hence we do not create paths parameterized by child partitions of other
partitioned relations. But path for child of a partitioned relation
parameterized by other parent relation, can be considered to be parameterised
by any child relation of the other partitioned relation by replacing the parent
parameters by corresponding child parameters. This condition is used to
eliminate parameterized paths while creating merge and hash joins, to decide
the resultant parameterization of a join between child partitions and to create
nested loop paths with inner path parameterized by outer relation where inner
and outer relations are child partitions. While creating such nest loop join
paths we translate the path parameterized by other parent partitioned relation,
to that parameterized by the required child.
Functions like select_outer_pathkeys_for_merge(), make_sort_from_pathkeys(),
find_ec_member_for_tle() which did not expect to be called for a child
relation, are now used for child partition relations for joins. These functions
are adjusted for that usage.
Testing:
I have added partition_join.sql testcase to test partition-wise join feature.
That file has extensive tests for list, range, multi-level partitioning schemes
and various kinds of joins including nested loop join with inner relation
parameterized by outer relationThat file has extensive tests for list, range,
multi-level partitioning schemes and various kinds of joins including nested
loop join with inner relation parameterized by outer relation.
make check passes clean.
TODOs:
1. Instead of storing partitioning information in RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer to
the member of list representing the partitioning scheme of corresponding
relation. RelOptInfo's of all similarly partitioned relations get the same
pointer thus making it easy to match the partitioning schemes by comparing the
pointers. While we are supporting only exact partition matching scheme now,
it's possible to extend this method to match compatible partitioning schemes by
maintaining a list of compatible partitioning schemes.
Right now, I have moved some partition related structures from partition.c to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep that
structure and its final name.
2. Multi-level partitioned tables: For some reason path created for joining
partitions are not being picked up as the cheapest paths. I think, we need to
finalize the lower level paths before moving upwards in the partition
hierarchy. But I am yet to investigate the issue here. RelOptInfo::parent_relid
should point to top parents rather than immediate parents.
3. Testing: need more tests for testing partition-wise join with foreign tables
as partitions. More tests for parameterized joins for multi-level partitioned
joins.
4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.
5. In add_paths_to_append_rel() to find the possible set of outer relations for
generating parameterized paths for a given join. This code needs to be adjusted
to eliminate the parent relations possible set of outer relations for a join
between child partitions.
6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path for
child partition.
7. The patch adds make_joinrel() (name needs to be changed because of its
similariy with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide to
retain it.
8. Few small TODOs related to code reorganization, proper function,
variable naming etc. are in the patch. pg_indent run.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Вложения
On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > 4. Remove bms_to_char(): I have added this function to print Relids in the > debugger. I have found it very useful to quickly examine Relids in debugger, > which otherwise wasn't so easy. If others find it useful too, I can create a > separate patch to be considered for a separate commit. > +1 to have such a function. I often need something like that whenever I debug the optimizer code. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2016/09/09 18:47, Ashutosh Bapat wrote: > A related change is renaming RangeBound structure in Amit > Langote's patches to PartitionRangeBound to avoid name conflict with > rangetypes.h. That change too should vanish once we decide where to keep > that structure and its final name. This change has been incorporated into the latest patch I posted on Sep 9 [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/28ee345c-1278-700e-39a7-36a71f9a3b43@lab.ntt.co.jp
Re: Partition-wise join for join between (declaratively) partitioned tables
От
Rajkumar Raghuwanshi
Дата:
On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.c om> wrote:
I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then partition-wise-join patch, getting below error while make install.
../../../../src/include/nodes/ relation.h:706: error: redefinition of typedef ‘PartitionOptInfo’
../../../../src/include/nodes/ relation.h:490: note: previous declaration of ‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WO RKDB/PG/postgresql/src/backend /access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WO RKDB/PG/postgresql/src/backend /access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WO RKDB/PG/postgresql/src/backend '
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WO RKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2
PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
Attached the patch for the fix of above error.
Hi All,
PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016.
I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then partition-wise-join patch, getting below error while make install.
../../../../src/include/nodes/
../../../../src/include/nodes/
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WO
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WO
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WO
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WO
make: *** [all-src-recurse] Error 2
PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
Attached the patch for the fix of above error.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Вложения
On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > > On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> Hi All, >> >> PFA the patch to support partition-wise joins for partitioned tables. The >> patch >> is based on the declarative parition support patches provided by Amit >> Langote >> on 26th August 2016. > > > I have applied declarative partitioning patches posted by Amit Langote on 26 > Aug 2016 and then partition-wise-join patch, getting below error while make > install. > > ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef > ‘PartitionOptInfo’ > ../../../../src/include/nodes/relation.h:490: note: previous declaration of > ‘PartitionOptInfo’ was here > make[4]: *** [gistbuild.o] Error 1 > make[4]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist' > make[3]: *** [gist-recursive] Error 2 > make[3]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access' > make[2]: *** [access-recursive] Error 2 > make[2]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend' > make[1]: *** [all-backend-recurse] Error 2 > make[1]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src' > make: *** [all-src-recurse] Error 2 > > PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17) > > Attached the patch for the fix of above error. Thanks for the report. I will fix this in the next patch. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
PFA patch which takes care of some of the TODOs mentioned in my previous mail. The patch is based on the set of patches supporting declarative partitioning by Amit Langoted posted on 26th August. > > TODOs: > > 1. Instead of storing partitioning information in RelOptInfo of each of the > partitioned relations (base and join relations), we can keep a list of > canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer > to > the member of list representing the partitioning scheme of corresponding > relation. RelOptInfo's of all similarly partitioned relations get the same > pointer thus making it easy to match the partitioning schemes by comparing > the > pointers. While we are supporting only exact partition matching scheme now, > it's possible to extend this method to match compatible partitioning schemes > by > maintaining a list of compatible partitioning schemes. > > Right now, I have moved some partition related structures from partition.c > to > partition.h. These structures are still being reviewed and might change when > Amit Langote improves his patches. Having canonical partitioning scheme in > PlannerInfo may not require moving those structures out. So, that code is > still > under development. A related change is renaming RangeBound structure in Amit > Langote's patches to PartitionRangeBound to avoid name conflict with > rangetypes.h. That change too should vanish once we decide where to keep > that > structure and its final name. Done. > > 2. Multi-level partitioned tables: For some reason path created for joining > partitions are not being picked up as the cheapest paths. I think, we need > to > finalize the lower level paths before moving upwards in the partition > hierarchy. But I am yet to investigate the issue here. > RelOptInfo::parent_relid > should point to top parents rather than immediate parents. Done > > 3. Testing: need more tests for testing partition-wise join with foreign > tables > as partitions. More tests for parameterized joins for multi-level > partitioned > joins. Needs to be done. > > 4. Remove bms_to_char(): I have added this function to print Relids in the > debugger. I have found it very useful to quickly examine Relids in debugger, > which otherwise wasn't so easy. If others find it useful too, I can create a > separate patch to be considered for a separate commit. I will take care of this after rebasing the patch on the latest sources and latest set of patches by Amit Langote. > > 5. In add_paths_to_append_rel() to find the possible set of outer relations > for > generating parameterized paths for a given join. This code needs to be > adjusted > to eliminate the parent relations possible set of outer relations for a join > between child partitions. Done. > > 6. Add support to reparameterize more types of paths for child relations. I > will add this once we finalize the method to reparameterize a parent path > for > child partition. Will wait for reviewer's opinion. > > 7. The patch adds make_joinrel() (name needs to be changed because of its > similariy with make_join_rel()) to construct an empty RelOptInfo for a join > between partitions. The function copies code doing the same from > build_join_rel(). build_join_rel() too can use this function, if we decide > to > retain it. This will be done as a separate cleanup patch. > > 8. Few small TODOs related to code reorganization, proper function, > variable naming etc. are in the patch. pg_indent run. I have taken care of most of the TODOs. But there are still some TODOs remaining. I will take care of those in the next version of patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
Hi Rajkumar, On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: > > On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> Hi All, >> >> PFA the patch to support partition-wise joins for partitioned tables. The >> patch >> is based on the declarative parition support patches provided by Amit >> Langote >> on 26th August 2016. > > > I have applied declarative partitioning patches posted by Amit Langote on 26 > Aug 2016 and then partition-wise-join patch, getting below error while make > install. > > ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef > ‘PartitionOptInfo’ > ../../../../src/include/nodes/relation.h:490: note: previous declaration of > ‘PartitionOptInfo’ was here > make[4]: *** [gistbuild.o] Error 1 > make[4]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist' > make[3]: *** [gist-recursive] Error 2 > make[3]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access' > make[2]: *** [access-recursive] Error 2 > make[2]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend' > make[1]: *** [all-backend-recurse] Error 2 > make[1]: Leaving directory > `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src' > make: *** [all-src-recurse] Error 2 > > PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17) > Thanks for the report and the patch. This is fixed by the patch posted with https://www.postgresql.org/message-id/CAFjFpRdRFWMc4zNjeJB6p1Ncpznc9DMdXfZJmVK5X_us5zeD9Q%40mail.gmail.com. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Partition-wise join for join between (declaratively) partitioned tables
От
Rajkumar Raghuwanshi
Дата:
On Tue, Sep 20, 2016 at 4:26 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then latest partition-wise-join patch, getting below error while make install.
../../../../src/include/catalog/partition.h:37: error: redefinition of typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of ‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2
PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
I have commented below statement in src/include/catalog/partition.h file and then tried to install, it worked fine.
/* typedef struct PartitionSchemeData *PartitionScheme; */
PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.
I have applied declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then latest partition-wise-join patch, getting below error while make install.
../../../../src/include/catalog/partition.h:37: error: redefinition of typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of ‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2
PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
I have commented below statement in src/include/catalog/partition.h file and then tried to install, it worked fine.
/* typedef struct PartitionSchemeData *PartitionScheme; */
Thanks & Regards,
Rajkumar Raghuwanshi
> ../../../../src/include/catalog/partition.h:37: error: redefinition of > typedef ‘PartitionScheme’ > ../../../../src/include/nodes/relation.h:492: note: previous declaration of > ‘PartitionScheme’ was here [...] > > PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17) Thanks for the report. For some reason, I am not getting these errors with my compiler [ashutosh@ubuntu regress]gcc --version gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 Anyway, I have fixed it in the attached patch. The patch is based on sources upto commit commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Date: Wed Sep 21 13:24:13 2016 +0300 and Amit Langote's set of patches posted on 15th Sept. 2016 [1] There are few implementation details that need to be worked out like 1. adjust_partitionrel_attrs() calls adjust_appendrel_attrs() as many times as the number of base relations in the join, possibly producing a new expression tree in every call. It can be optimized to call adjust_appendrel_attrs() only once. I will work on that if reviewers agree that adjust_partitionrel_attrs() is needed and should be optimized. 2. As mentioned in earlier mails, the paths parameterized by parent partitioned table are translated to be parameterized by child partitions. That code needs to support more kinds of paths. I will work on that, if reviewers agree that the approach of translating paths is acceptable. 3. Because of an issue with declarative partitioning patch [2] multi-level partition table tests are failing in partition_join.sql. Those were not failing with an earlier set of patches supporting declarative partitions. Those will be fixed based on the discussion in that thread. 4. More tests for foreign tables as partitions and for multi-level partitioned tables. 5. The tests use unpartitioned tables for verifying results. Those tables and corresponding SQL statements will be removed once the tests are finalised. [1]. https://www.postgresql.org/message-id/e5c1c9cf-3f5a-c4d7-6047-7351147aaef9%40lab.ntt.co.jp [2]. https://www.postgresql.org/message-id/CAFjFpRc%3DT%2BCjpGNkNSdOkHza8VAPb35bngaCdAzPgBkhijmJhg%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
Re: Partition-wise join for join between (declaratively) partitioned tables
От
Rajkumar Raghuwanshi
Дата:
On Thu, Sep 22, 2016 at 4:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb. com> wrote:
Attached patch created on top to Ashutosh's patch posted on 22 Sept 2016.
The patch is based on sources upto commit
commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Sep 21 13:24:13 2016 +0300
and Amit Langote's set of patches posted on 15th Sept. 2016 [1]
I have applied your patch on top of Amit patches posted on 15th Sept. 2016, and tried to create some test cases on list and multi-level partition based on test cases written for range partition.
I got some server crash and errors which I have mentioned as comment in expected output file, which need to be updated once these issues will get fix. also for these issue expected output is generated by creating same query for non-partition table with same data.
Attached patch created on top to Ashutosh's patch posted on 22 Sept 2016.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Вложения
On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > [ new patch ] This should probably get updated since Rajkumar reported a crash. Meanwhile, here are some comments from an initial read-through: + * Multiple relations may be partitioned in the same way. The relations + * resulting from joining such relations may be partitioned in the same way as + * the joining relations. Similarly, relations derived from such relations by + * grouping, sorting be partitioned in the same as the underlying relations. I think you should change "may be partitioned in the same way" to "are partitioned in the same way" or "can be regarded as partitioned in the same way". The sentence that begins with "Similarly," is not grammatical; it should say something like: ...by grouping or sorting are partitioned in the same way as the underlying relations. @@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel) result->bounds->rangeinfo = rangeinfo; break; } } } MemoryContextSwitchTo(oldcxt); rel->rd_partdesc = result;} +/* * Are two partition bound collections logically equal? * * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).* This is also useful when b1 and b2 are bound collections of two separate * relations, respectively,because BoundCollection is a canonical * representation of a set partition bounds (for given partitioning strategy).*/boolpartition_bounds_equal(PartitionKey key, Spurious hunk. + * For an umpartitioned table, it returns NULL. Spelling. + * two arguemnts and returns boolean. For types, it suffices to match Spelling. + * partition key expression is stored as a single member list to accomodate Spelling. + * For a base relation, construct an array of partition key expressions. Each + * partition key expression is stored as a single member list to accomodate + * more partition keys when relations are joined. How would joining relations result in more partitioning keys getting added? Especially given the comment for the preceding function, which says that a new PartitionScheme gets created unless an exact match is found. + if (!lc) Test lc == NIL instead of !lc. +extern int +PartitionSchemeGetNumParts(PartitionScheme part_scheme) +{ + return part_scheme ? part_scheme->nparts : 0; +} I'm not convinced it's a very good idea for this function to have special handling for when part_scheme is NULL. In try_partition_wise_join() that checks is not needed because it's already been done, and in generate_partition_wise_join_paths it is needed but only because you are initializing nparts too early. If you move this initialization down below the IS_DUMMY_REL() check you won't need the NULL guard. I would ditch this function and let the callers access the structure member directly. +extern int +PartitionSchemeGetNumKeys(PartitionScheme part_scheme) +{ + return part_scheme ? part_scheme->partnatts : 0; +} Similarly here. have_partkey_equi_join should probably have a quick-exit path when part_scheme is NULL, and then num_pks can be set afterwards unconditionally. Same for match_expr_to_partition_keys. build_joinrel_partition_info already has it and doesn't need this double-check. +extern Oid * +PartitionDescGetPartOids(PartitionDesc part_desc) +{ + Oid *part_oids; + int cnt_parts; + + if (!part_desc || part_desc->nparts <= 0) + return NULL; + + part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts); + for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++) + part_oids[cnt_parts] = part_desc->oids[cnt_parts]; + + return part_oids; +} I may be missing something, but this looks like a bad idea in multiple ways. First, you've got checks for part_desc's validity here that should be in the caller, as noted above. Second, you're copying an array by looping instead of using memcpy(). Third, the one and only caller is set_append_rel_size, which doesn't seem to have any need to copy this data in the first place. If there is any possibility that the PartitionDesc is going to change under us while that function is running, something is deeply broken. Nothing in the planner is going to cope with the table structure changing under us, so it had better not. + /* + * For a partitioned relation, we will save the child RelOptInfos in parent + * RelOptInfo in the same the order as corresponding bounds/lists are + * stored in the partition scheme. + */ This comment seems misplaced; shouldn't it be next to the code that is actually doing this, rather than the code that is merely setting up for it? And, also, the comment implies that we're doing this instead of what we'd normally do, whereas I think we are actually doing something additional. + /* + * Save topmost parent's relid. If the parent itself is a child of some + * other relation, use parent's topmost parent relids. + */ + if (rel->top_parent_relids) + childrel->top_parent_relids = rel->top_parent_relids; + else + childrel->top_parent_relids = bms_copy(rel->relids); Comment should explain why we're doing it, not what we're doing. The comment as written just restates what anybody who's likely to be looking at this can already see to be true from looking at the code that follows. The question is why do it. + /* Set only for "other" base or join relations. */ + Relids top_parent_relids; Comment should say what it is, not just when it's set. + /* Should have found all the childrels of a partitioned relation. */ + if (rel->part_scheme) + { + int cnt_parts; + for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++) + Assert(rel->part_rels[cnt_parts]); + } A block that does nothing but Assert() should be guarded by #ifdef USE_ASSERT_CHECKING. Although, actually, maybe this should be an elog(), just in case? + } + + add_paths_to_append_rel(root, rel, live_childrels); +} + +static void +add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, + List *live_childrels) The new function should have a header comment, which should include an explanation of why this is now separate from set_append_rel_pathlist(). + if (!live_childrels) As before, I think live_childrels == NIL is better style. + generate_partition_wise_join_paths(root, rel); Needs an update to the comment earlier in the hunk. It's important to explain why this has to be done here and not within join_search_one_level. + /* Recursively collect the paths from child joinrel. */ + generate_partition_wise_join_paths(root, child_rel); Given the recursion, check_stack_depth() at top of function is probably appropriate. Same for try_partition_wise_join(). + if (live_children) + pfree(live_children); Given that none of the substructure, including ListCells, will be freed, this seems utterly pointless. If it's necessary to recover memory here at all, we probably need to be more aggressive about it. Have you tested the effect of this patch on planner memory consumption with multi-way joins between tables with many partitions? If you haven't, you probably should. (Testing runtime would be good, too.) Does it grow linearly? Quadratically? Exponentially? Minor leaks don't matter, but if we're generating too much garbage we'll have to make sure it gets cleaned up soon enough to prevent runaway memory usage. /* + * An inner path parameterized by the parent relation of outer + * relation needs to be reparameterized by the outer relation to be used + * for parameterized nested loop join. + */ No doubt, but I think the comment is missing the bigger picture -- it doesn't say anything about this being here to support partition-wise joins, which seems like a key point. + /* If we could not translate the path, don't produce nest loop path. */ + if (!inner_path) + return; Why would that ever happen? +/* + * If the join between the given two relations can be executed as + * partition-wise join create the join relations for partition-wise join, + * create paths for those and then create append paths to combine + * partition-wise join results. + */ +static void +try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, + RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo, + List *parent_restrictlist) This comment doesn't accurately describe what the function does. No append paths are created here; that happens at a much later stage. I think this comment needs quite a bit more work, and maybe the function should be renamed, too. There are really two steps involved here: first, we create paths for each child, attached to a new RelOptInfo flagged as RELOPT_OTHER_JOINREL paths; later, we create additional paths for the parent RelOptInfo by appending a path for each child. Broadly, I think there's a lack of adequate documentation of the overall theory of operation of this patch. I believe that an update to the optimizer README would be appropriate, probably with a new section but maybe incorporating the new material into an existing section. In addition, the comments for individual comments and chunks of code need to do a better job explaining how each part of the patch contributes to the overall picture. I also think we need to do a better join hammering out the terminology. I don't particularly like the term "partition-wise join" in the first place, although I don't know what would be better, but we certainly need to avoid confusing a partition-wise join -- which is a join performed by joining each partition of one partitioned rel to the corresponding partition of a similarly partitioned rel rather than by the usual execution strategy of joining the parent rels -- with the concept of an other-join-rel, which an other-member-rel analogue for joins. I don't think the patch is currently very clear about this right now, either in the code or in the comments. Maybe this function ought to be named something like make_child_joins() or make_child_join_paths(), and we could use "child joins" and/or "child join paths" as standard terminology throughout the patch. + rel1_desc = makeStringInfo(); + rel2_desc = makeStringInfo(); + + /* TODO: remove this notice when finalising the patch. */ + outBitmapset(rel1_desc, rel1->relids); + outBitmapset(rel2_desc, rel2->relids); + elog(NOTICE, "join between relations %s and %s is considered for partition-wise join.", + rel1_desc->data, rel2_desc->data); Please remove your debugging cruft before submitting patches to pgsql-hackers, or at least put #ifdef NOT_USED or something around it. + * We allocate the array for child RelOptInfos till we find at least one + * join order which can use partition-wise join technique. If no join order + * can use partition-wise join technique, there are no child relations. This comment has problems. I think "till" is supposed to be "until", and there's supposed to be a "don't" in there somewhere. But really, I think what you're going for is just /* Allocate when first needed */ which would be a lot shorter and also more clear. + * Create join relations for the partition relations, if they do not exist + * already. Add paths to those for the given pair of joining relations. I think the comment could be a bit more explanatory here. Something like: "This joinrel is partitioned, so iterate over the partitions and create paths for each one, allowing us to eventually build an append-of-joins path for the parent. Since this routine may be called multiple times for various join orders, the RelOptInfo needed for each child join may or may not already exist, but the paths for this join order definitely do not. Note that we don't create any actual AppendPath at this stage; it only makes sense to do that at the end, after each possible join order has been considered for each child join. The best join order may differ from child to child." + * partiticipating in the given partition relations. We need them Spelling. +/* + * Construct the SpecialJoinInfo for the partition-wise join using parents' + * special join info. Also, instead of + * constructing an sjinfo everytime, we should probably save it in + * root->join_info_list and search within it like join_is_legal? + */ The lines here are of very different lengths for no particularly good reason, and it should end with a period, not a question mark. On the substance of the issue, it seems like the way you're doing this right now could allocate a very large number of SpecialJoinInfo structures. For every join relation, you'll create one SpecialJoinInfo per legal join order per partition. That seems like it could get to be a big number. I don't know if that's going to be a problem from a memory-usage standpoint, but it seems like it might. It's not just the SpecialJoinInfo itself; all of the substructure gets duplicated, too. + SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo); + sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand, + append_rel_infos1); Missing a blank line here. + AppendRelInfo *ari = lfirst(lc); Standard naming convention for an AppendRelInfo variable seems to be appinfo, not ari. (I just did "git grep AppendRelInfo".) + /* Skip non-equi-join clauses. */ + if (!rinfo->can_join || + rinfo->hashjoinoperator == InvalidOid || + !rinfo->mergeopfamilies) + continue; There's definitely something ugly about this. If rinfo->can_join is false, then we're done. But suppose one of mergeopfamilies == NIL and rinfo->hashoperator == InvalidOid is true and the other is false. Are we really precluded from doing a partiion-wise join in that case, or are we just prohibited from using certain join strategies? In most places where we make similar tests, we're careful not to require more than we need. I also think that these tests need to consider the partitioning operator in use. Suppose that the partition key is of a type T which has two operator classes X and Y. Both relations are partitioned using an operator from opfamily X, but the join condition mentions opfamily Y. I'm pretty sure this precludes a partitionwise join. If the join condition used opfamily X, then we would have a guarantee that two rows which compared as equal would be in the same partition, but because it uses opfamily Y, that's not guaranteed. For example, if T is a text type, X might test for exact equality using "C" collation rules, while Y might test for equality using some case-insensitive set of rules. If the partition boundaries are such that "foo" and "FOO" are in different partitions, a partitionwise join using the case-insensitive operator will produce wrong results. You can also imagine this happening with numeric, if you have one opclass (like the default one) that considers 5.0 and 5.00 to be equal, but another opclass that thinks they are different; if the latter is used to set the partition bounds, 5.0 and 5.00 could end up in different partitions - which will be fine if an operator from that opclass is used for the join, but not if an operator from the regular opclass is used. After thinking this over a bit, I think the right way to think about this is: 1. Amit's patch currently only ever uses btree opfamilies for partitioning. It uses those for both range partitioning and list partitioning. If we ever support hash partitioning, we would presumably use hash opfamilies for that purpose, but right now it's all about btree opfamilies. 2. Therefore, if A and B are partitioned but the btree opfamilies don't match, they don't have the same partitioning scheme and this code should never be reached. Similarly, if they use the same opfamily but different collations, the partitioning schemes shouldn't match and therefore this code should not be reached. 3. If A and B are partitioned and the partitioning opfamilies - which are necessarily btree opfamilies - do match, then the operator which appears in the query needs to be from the same opfamily and have amopstrategy of BTEqualStrategyNumber within that opfamily. If not, then a partition-wise join is not possible. 4. Assuming the above conditions are met, have_partkey_equi_join doesn't need to care whether the operator chosen has mergeopfamilies or a valid hashjoinoperator. Those factors will control which join methods are legal, but not whether a partitionwise join is possible in principle. Let me know whether that seems right. + * RelabelType node; eval_const_expressions() will have simplied if more Spelling. /* + * Code below scores equivalence classes by how many equivalence members + * can produce join clauses for this join relation. Equivalence members + * which do not cover the parents of a partition-wise join relation, can + * produce join clauses for partition-wise join relation. + */ I don't know what that means. The comma in the second sentence doesn't belong there. + /* + * TODO: Instead of copying and mutating the trees one child relation at a + * time, we should be able to do this en-masse for all the partitions + * involved. + */ I don't see how that would be possible, but if it's a TODO, you'd better do it (or decide not to do it and remove or change the comment). /* * Create explicit sort nodes for the outer and inner paths if necessary. */ if (best_path->outersortkeys) { + Relids outer_relids = outer_path->parent->relids; Sort *sort = make_sort_from_pathkeys(outer_plan, - best_path->outersortkeys); + best_path->outersortkeys, + outer_relids); The changes related to make_sort_from_pathkeys() are pretty opaque to me. Can you explain? + * Change parameterization of sub paths recursively. Also carry out any "sub paths" should not be two words, here or anywhere. +reparameterize_path_for_child(PlannerInfo *root, Path *path, + RelOptInfo *child_rel) This is suspiciously unlike reparameterize_path. Why? + /* Computer information relevant to the foreign relations. */ + set_foreign_rel_properties(joinrel, outer_rel, inner_rel); Perhaps this refactoring could be split out into a preliminary patch, which would then simplify this patch. And same for add_join_rel(). + * Produce partition-wise joinrel's targetlist by translating the parent + * joinrel's targetlist. This will also include the required placeholder Again the confusion between a "child" join and a partition-wise join... + /* + * Nothing to do if + * a. partition-wise join is disabled. + * b. joining relations are not partitioned. + * c. partitioning schemes do not match. + */ + I don't think that's going to survive pgindent. + * are not considered equal, an equi-join involing inner partition keys Spelling. + * Collect the partition key expressions. An OUTER join will produce rows + * where the partition key columns of inner side are NULL and may not fit + * the partitioning scheme with inner partition keys. Since two NULL values + * are not considered equal, an equi-join involing inner partition keys + * still prohibits cross-partition joins while joining with another + * similarly partitioned relation. I can't figure out what this comment is trying to tell me. Possibly I just need more caffeine. + * Adding these two join_rel_level list also means that top level list has more + * than one join relation, which is symantically incorrect. I don't understand this, either; also, spelling. As a general comment, the ratio of tests-to-code in this patch is way out of line with PostgreSQL's normal practices. The total patch file is 10965 lines. The test cases begin at line 3047, meaning that in round figures you've got about one-quarter code and about three-quarters test cases. I suspect that a large fraction of those test cases aren't adding any meaningful code coverage and will just take work to maintain. That needs to be slimmed down substantially in any version of this considered for commit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 28, 2016 at 2:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> [ new patch ] > > This should probably get updated since Rajkumar reported a crash. > Meanwhile, here are some comments from an initial read-through: Done. Fixed those crashes. Also fixed some crashes in foreign table code and postgres_fdw. The tests were provided by Rajkumar. I am working on including those in my patch. The attached patch is still based on Amit's patches set of patches posted on 15th Sept. 2016. He is addressing your comments on his patches. So, I am expecting a more stable version arrive soon. I will rebase my patches then. Because of a bug in those patches related to multi-level partitioned tables and lateral joins and also a restriction on sharing partition keys across levels of partitions, the testcase is still failing. I will work on that while rebasing the patch. > > + * Multiple relations may be partitioned in the same way. The relations > + * resulting from joining such relations may be partitioned in the same way as > + * the joining relations. Similarly, relations derived from such relations by > + * grouping, sorting be partitioned in the same as the underlying relations. > > I think you should change "may be partitioned in the same way" to "are > partitioned in the same way" or "can be regarded as partitioned in the > same way". The relations resulting from joining partitioned relations are partitioned in the same way, if there exist equi-join condition/s between their partition keys. If such equi-joins do not exist, the join is *not* partitioned. Hence I did not use "are" or "can be" which indicate a certainty. Instead I used "may" which indicates "uncertainty". I am not sure whether that's a good place to explain the conditions under which such relations are partitioned. Those conditions will change as we implement more and more partition-wise join strategies. But that comment conveys two things 1. partition scheme makes sense for all kinds of relations 2. multiple relations (of any kind) may share partition scheme. I have slightly changed the wording to make this point clear. Please let me know if it looks better. > The sentence that begins with "Similarly," is not > grammatical; it should say something like: ...by grouping or sorting > are partitioned in the same way as the underlying relations. Done. Instead of "are" I have used "may" for the same reason as above. > > @@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel) > result->bounds->rangeinfo = rangeinfo; > break; > } > } > } > > MemoryContextSwitchTo(oldcxt); > rel->rd_partdesc = result; > } > > + > /* > * Are two partition bound collections logically equal? > * > * Used in the keep logic of relcache.c (ie, in RelationClearRelation()). > * This is also useful when b1 and b2 are bound collections of two separate > * relations, respectively, because BoundCollection is a canonical > * representation of a set partition bounds (for given partitioning strategy). > */ > bool > partition_bounds_equal(PartitionKey key, > > Spurious hunk. > Thanks. Done. > + * For an umpartitioned table, it returns NULL. > > Spelling. Done. Thanks. > > + * two arguemnts and returns boolean. For types, it > suffices to match > > Spelling. Thanks. Done. > > + * partition key expression is stored as a single member list to accomodate > > Spelling. Thanks. Done. > > + * For a base relation, construct an array of partition key expressions. Each > + * partition key expression is stored as a single member list to accomodate > + * more partition keys when relations are joined. > > How would joining relations result in more partitioning keys getting > added? Especially given the comment for the preceding function, which > says that a new PartitionScheme gets created unless an exact match is > found. Let's assume that relation A and B are partitioned by columns a and b resp. and have same partitioning scheme. This means that the datatypes of a and b as well as the opclass used for comparing partition key values of A and B are same. A join between A and B with condition A.a = B.b is partitioned by both A.a and B.b. We need to keep track of both the keys in case AB joins with C which is partitioned in the same manner. I guess, the confusion is with the term "partition keys" - which is being used to indicate the class of partition key as well as instance of partition key. In the above example, the datatype of partition key and the opclass together indicate partition key class whereas A.a and B.b are instances of that class. Increase in partition keys may mean both increase in the number of classes or increase in the number of instances. In the above comment I used to mean number of instances. May be we should use "partition key expressions" to indicate the partition key instances and "partition key" to indicate partition key class. I have changed the comments to use partition keys and partition key expressions appropriately. Please let me know if the comments are worded correctly. PartitionScheme does not hold the actual partition key expressions. It holds the partition key type and opclass used for comparison, which should be same for all the relations sharing the partition scheme. > > + if (!lc) > > Test lc == NIL instead of !lc. NIL is defined as (List *) NULL and lc is ListCell *. So changed the test to lc == NULL instead of !lc. > > +extern int > +PartitionSchemeGetNumParts(PartitionScheme part_scheme) > +{ > + return part_scheme ? part_scheme->nparts : 0; > +} > > I'm not convinced it's a very good idea for this function to have > special handling for when part_scheme is NULL. In > try_partition_wise_join() that checks is not needed because it's > already been done, and in generate_partition_wise_join_paths it is > needed but only because you are initializing nparts too early. If you > move this initialization down below the IS_DUMMY_REL() check you won't > need the NULL guard. I would ditch this function and let the callers > access the structure member directly. > > +extern int > +PartitionSchemeGetNumKeys(PartitionScheme part_scheme) > +{ > + return part_scheme ? part_scheme->partnatts : 0; > +} > > Similarly here. have_partkey_equi_join should probably have a > quick-exit path when part_scheme is NULL, and then num_pks can be set > afterwards unconditionally. Same for match_expr_to_partition_keys. > build_joinrel_partition_info already has it and doesn't need this > double-check. > > +extern Oid * > +PartitionDescGetPartOids(PartitionDesc part_desc) > +{ > + Oid *part_oids; > + int cnt_parts; > + > + if (!part_desc || part_desc->nparts <= 0) > + return NULL; > + > + part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts); > + for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++) > + part_oids[cnt_parts] = part_desc->oids[cnt_parts]; > + > + return part_oids; > +} > > I may be missing something, but this looks like a bad idea in multiple > ways. First, you've got checks for part_desc's validity here that > should be in the caller, as noted above. Second, you're copying an > array by looping instead of using memcpy(). Third, the one and only > caller is set_append_rel_size, which doesn't seem to have any need to > copy this data in the first place. If there is any possibility that > the PartitionDesc is going to change under us while that function is > running, something is deeply broken. Nothing in the planner is going > to cope with the table structure changing under us, so it had better > not. These three functions were written based on Amit Langote's patches which did not expose partition related structures outside partition.c. Hence they required wrappers. I have moved PartitionSchemeData to partition.h and removed these functions. Instead the members are accessed directly. > > + /* > + * For a partitioned relation, we will save the child RelOptInfos in parent > + * RelOptInfo in the same the order as corresponding bounds/lists are > + * stored in the partition scheme. > + */ > > This comment seems misplaced; shouldn't it be next to the code that is > actually doing this, rather than the code that is merely setting up > for it? And, also, the comment implies that we're doing this instead > of what we'd normally do, whereas I think we are actually doing > something additional. > Ok. I have moved the comment few line below, near the code which saves the partition RelOptInfos. > + /* > + * Save topmost parent's relid. If the parent itself is a child of some > + * other relation, use parent's topmost parent relids. > + */ > + if (rel->top_parent_relids) > + childrel->top_parent_relids = rel->top_parent_relids; > + else > + childrel->top_parent_relids = bms_copy(rel->relids); > > Comment should explain why we're doing it, not what we're doing. The > comment as written just restates what anybody who's likely to be > looking at this can already see to be true from looking at the code > that follows. The question is why do it. > The point of that comment is to explain how it percolates down the hierarchy, which is not so clear from the code. I have changed it to read /* * Recursively save topmost parent's relid in RelOptInfos of * partitions. */ Or you are expecting that the comment to explain the purpose of top_parent_relids? I don't think that's a good idea, since the purpose will change over the time and the comment will soon be out of sync with the actual code, unless the developers expanding the usage remember to update the comment. I have not seen the comments, explaining purpose, next to the assignments. Take for example RelOptInfo::relids. > + /* Set only for "other" base or join relations. */ > + Relids top_parent_relids; > > Comment should say what it is, not just when it's set. Done. Check if it looks good. > > + /* Should have found all the childrels of a partitioned relation. */ > + if (rel->part_scheme) > + { > + int cnt_parts; > + for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++) > + Assert(rel->part_rels[cnt_parts]); > + } > > A block that does nothing but Assert() should be guarded by #ifdef > USE_ASSERT_CHECKING. Although, actually, maybe this should be an > elog(), just in case? Changed it to elog(). > > + } > + > + add_paths_to_append_rel(root, rel, live_childrels); > +} > + > +static void > +add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, > + List *live_childrels) > > The new function should have a header comment, which should include an > explanation of why this is now separate from > set_append_rel_pathlist(). Sorry for missing it. Added the prologue. Let me know, if it looks good. I have made sure that all functions have a prologue and tried to match the style with surrounding functions. Let me know if I have still missed any or the styles do not match. > > + if (!live_childrels) > > As before, I think live_childrels == NIL is better style. Fixed. > > + generate_partition_wise_join_paths(root, rel); > > Needs an update to the comment earlier in the hunk. It's important to > explain why this has to be done here and not within > join_search_one_level. Thanks for pointing that out. Similar to generate_gather_paths(), we need to add explanation in standard_join_search() as well as in the function prologue. Did that. Let me know if it looks good. > > + /* Recursively collect the paths from child joinrel. */ > + generate_partition_wise_join_paths(root, child_rel); > > Given the recursion, check_stack_depth() at top of function is > probably appropriate. Same for try_partition_wise_join(). Done. I wouldn't imagine a user creating that many levels of partitions, but it's good to guard against some automated script that has gone berserk. > > + if (live_children) > + pfree(live_children); > > Given that none of the substructure, including ListCells, will be > freed, this seems utterly pointless. If it's necessary to recover > memory here at all, we probably need to be more aggressive about it. I intended to use list_free() instead of pfree(). Fixed that. > Have you tested the effect of this patch on planner memory consumption > with multi-way joins between tables with many partitions? If you > haven't, you probably should. (Testing runtime would be good, too.) > Does it grow linearly? Quadratically? Exponentially? Minor leaks > don't matter, but if we're generating too much garbage we'll have to > make sure it gets cleaned up soon enough to prevent runaway memory > usage. I tried to check memory usage with various combinations of number of partitions and number of relations being joined. For higher number of relations being joined like 10 with 100 partitions, OOM killer kicked in during the planning phase. I am suspecting adjust_partitionrel_attrs() (changed that name to adjust_join_appendrel_attrs() to be in sync with adjust_appendrel_attrs()) to be the culprit. It copies expression trees every time for joining two children. That's an exponentially increasing number as the number of legal joins increases exponentially. I am still investigating this. As a side question, do we have a function to free an expression tree? I didn't find any. > > /* > + * An inner path parameterized by the parent relation of outer > + * relation needs to be reparameterized by the outer relation to be used > + * for parameterized nested loop join. > + */ > > No doubt, but I think the comment is missing the bigger picture -- it > doesn't say anything about this being here to support partition-wise > joins, which seems like a key point. I have tried to explain the partition-wise join context. Let me know if it looks good. > > + /* If we could not translate the path, don't produce nest loop path. */ > + if (!inner_path) > + return; > > Why would that ever happen? Right now, reparameterize_path_for_child() does not support all kinds of paths. So I have added that condition. I will add support for more path types there once we agree that this is the right way to translate the paths and that the path translation is required. > > +/* > + * If the join between the given two relations can be executed as > + * partition-wise join create the join relations for partition-wise join, > + * create paths for those and then create append paths to combine > + * partition-wise join results. > + */ > +static void > +try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, > + RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo, > + List *parent_restrictlist) > > This comment doesn't accurately describe what the function does. No > append paths are created here; that happens at a much later stage. Removed reference to the append paths. Sorry for leaving it there, when I moved the append path creation to a later stage. > I > think this comment needs quite a bit more work, and maybe the function > should be renamed, too. Improved the comments in the prologue and inside the function. Please let me know, if they look good. > There are really two steps involved here: > first, we create paths for each child, attached to a new RelOptInfo > flagged as RELOPT_OTHER_JOINREL paths; later, we create additional > paths for the parent RelOptInfo by appending a path for each child. > Right, the first one is done in try_partition_wise_join() and the later is done in generate_partition_wise_join_paths() > Broadly, I think there's a lack of adequate documentation of the > overall theory of operation of this patch. I believe that an update > to the optimizer README would be appropriate, probably with a new > section but maybe incorporating the new material into an existing > section. Done. I have added a separate section to optimizer/README > In addition, the comments for individual comments and chunks > of code need to do a better job explaining how each part of the patch > contributes to the overall picture. > I also think we need to do a > better join hammering out the terminology. I don't particularly like > the term "partition-wise join" in the first place, although I don't > know what would be better, but we certainly need to avoid confusing a > partition-wise join -- which is a join performed by joining each > partition of one partitioned rel to the corresponding partition of a > similarly partitioned rel rather than by the usual execution strategy > of joining the parent rels -- with the concept of an other-join-rel, > which an other-member-rel analogue for joins. I don't think the patch > is currently very clear about this right now, either in the code or in > the comments. Maybe this function ought to be named something like > make_child_joins() or make_child_join_paths(), and we could use "child > joins" and/or "child join paths" as standard terminology throughout > the patch. Partition-wise join is widely used term in the literature. Other DBMSes use the same term as well. So, I think we should stick with "partition-wise join". Partition-wise join as you have described is a join performed by joining each partition of one partitioned rel to the corresponding partition of a similarly partitioned rel rather than by the usual execution strategy of joining the parent rels. I have usually used the term "partition-wise join technique" to refer to this method. I have changed the other usages of this term to use wording like child joins or join between partiitions or join between child relations as appropriate. Also, I have changed the names of functions dealing with joins between partitions to use child_join instead of partition_join or partition_wise_join. Since partition-wise join is a method to join two relations just like other methods, try_partition_wise_join() fits into the naming convention try_<join technique> like try_nestloop_join. > > + rel1_desc = makeStringInfo(); > + rel2_desc = makeStringInfo(); > + > + /* TODO: remove this notice when finalising the patch. */ > + outBitmapset(rel1_desc, rel1->relids); > + outBitmapset(rel2_desc, rel2->relids); > + elog(NOTICE, "join between relations %s and %s is considered for > partition-wise join.", > + rel1_desc->data, rel2_desc->data); > > Please remove your debugging cruft before submitting patches to > pgsql-hackers, or at least put #ifdef NOT_USED or something around it. I kept this one intentionally. But as the TODO comment says, I do intend to remove it once testing is over. Those messages make it very easy to know whether partition-wise join was considered for a given join or not. Without those messages, one has to break into try_partition_wise_join() to figure out whether partition-wise join was used or not. The final plan may not come out to be partition-wise join plan even if partition-wise join was considered. Although, I have now used DEBUG3 instead of NOTICE and removed those lines from the expected output. > > + * We allocate the array for child RelOptInfos till we find at least one > + * join order which can use partition-wise join technique. If no join order > + * can use partition-wise join technique, there are no child relations. > > This comment has problems. I think "till" is supposed to be "until", > and there's supposed to be a "don't" in there somewhere. But really, > I think what you're going for is just /* Allocate when first needed */ > which would be a lot shorter and also more clear. Sorry for those mistakes. Yes, shorter version is better. Fixed the comment as per your suggestion. > > + * Create join relations for the partition relations, if they do not exist > + * already. Add paths to those for the given pair of joining relations. > > I think the comment could be a bit more explanatory here. Something > like: "This joinrel is partitioned, so iterate over the partitions and > create paths for each one, allowing us to eventually build an > append-of-joins path for the parent. Since this routine may be called > multiple times for various join orders, the RelOptInfo needed for each > child join may or may not already exist, but the paths for this join > order definitely do not. Note that we don't create any actual > AppendPath at this stage; it only makes sense to do that at the end, > after each possible join order has been considered for each child > join. The best join order may differ from child to child." > Copied verbatim. Thanks for the detailed comment. > + * partiticipating in the given partition relations. We need them > > Spelling. > Done. Also fixed other grammatical mistakes and typos in that comment. > +/* > + * Construct the SpecialJoinInfo for the partition-wise join using parents' > + * special join info. Also, instead of > + * constructing an sjinfo everytime, we should probably save it in > + * root->join_info_list and search within it like join_is_legal? > + */ > > The lines here are of very different lengths for no particularly good > reason, and it should end with a period, not a question mark. My bad. Sorry. Fixed. > > On the substance of the issue, it seems like the way you're doing this > right now could allocate a very large number of SpecialJoinInfo > structures. For every join relation, you'll create one > SpecialJoinInfo per legal join order per partition. That seems like > it could get to be a big number. I don't know if that's going to be a > problem from a memory-usage standpoint, but it seems like it might. > It's not just the SpecialJoinInfo itself; all of the substructure gets > duplicated, too. > Yes. We need the SpecialJoinInfo structures for the existing path creation to work. The code will be complicated if we try to use parent SpecialJoinInfo instead of creating those for children. We may free memory allocated in SpecialJoinInfo to save some memory. SpecialJoinInfos are not needed once the paths are created. Still we will waste some memory for semi_rhs_exprs, which are reused for unique paths. But otherwise we will reclaim the rest of the memory. Memory wastage in adjust_partition_relids() may be minimized by modifying adjust_appendrel_attrs() to accept list of AppendRelInfos and mutating the tree only once rather than doing it N times for an N-way join. > + SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo); > + sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand, > + append_rel_infos1); > > Missing a blank line here. Done. > > + AppendRelInfo *ari = lfirst(lc); > > Standard naming convention for an AppendRelInfo variable seems to be > appinfo, not ari. (I just did "git grep AppendRelInfo".) Done. > > + /* Skip non-equi-join clauses. */ > + if (!rinfo->can_join || > + rinfo->hashjoinoperator == InvalidOid || > + !rinfo->mergeopfamilies) > + continue; > > There's definitely something ugly about this. If rinfo->can_join is > false, then we're done. But suppose one of mergeopfamilies == NIL and > rinfo->hashoperator == InvalidOid is true and the other is false. Are > we really precluded from doing a partiion-wise join in that case, or > are we just prohibited from using certain join strategies? In most > places where we make similar tests, we're careful not to require more > than we need. Right. That condition is flawed. Corrected it. > > I also think that these tests need to consider the partitioning > operator in use. Suppose that the partition key is of a type T which > has two operator classes X and Y. Both relations are partitioned > using an operator from opfamily X, but the join condition mentions > opfamily Y. I'm pretty sure this precludes a partitionwise join. If > the join condition used opfamily X, then we would have a guarantee > that two rows which compared as equal would be in the same partition, > but because it uses opfamily Y, that's not guaranteed. For example, > if T is a text type, X might test for exact equality using "C" > collation rules, while Y might test for equality using some > case-insensitive set of rules. If the partition boundaries are such > that "foo" and "FOO" are in different partitions, a partitionwise join > using the case-insensitive operator will produce wrong results. You > can also imagine this happening with numeric, if you have one opclass > (like the default one) that considers 5.0 and 5.00 to be equal, but > another opclass that thinks they are different; if the latter is used > to set the partition bounds, 5.0 and 5.00 could end up in different > partitions - which will be fine if an operator from that opclass is > used for the join, but not if an operator from the regular opclass is > used. Your description above uses opfamily and opclass interchangeably. It starts saying X and Y are classed but then also refers to them as families. But I got the point. I guess, similar to relation_has_unique_index_for(), I have to check whether the operator family specified in the partition scheme is present in the mergeopfamilies in RestrictInfo for matching partition key. I have added that check and restructured that portion of code to be readable. > > After thinking this over a bit, I think the right way to think about this is: > > 1. Amit's patch currently only ever uses btree opfamilies for > partitioning. It uses those for both range partitioning and list > partitioning. If we ever support hash partitioning, we would > presumably use hash opfamilies for that purpose, but right now it's > all about btree opfamilies. > > 2. Therefore, if A and B are partitioned but the btree opfamilies > don't match, they don't have the same partitioning scheme and this > code should never be reached. Similarly, if they use the same > opfamily but different collations, the partitioning schemes shouldn't > match and therefore this code should not be reached. That's right. > > 3. If A and B are partitioned and the partitioning opfamilies - which > are necessarily btree opfamilies - do match, then the operator which > appears in the query needs to be from the same opfamily and have > amopstrategy of BTEqualStrategyNumber within that opfamily. If not, > then a partition-wise join is not possible. I think this is achieved by checking whether the opfamily for given partition key is present in the mergeopfamilies of corresponding RestrictInfo, as stated above. > > 4. Assuming the above conditions are met, have_partkey_equi_join > doesn't need to care whether the operator chosen has mergeopfamilies > or a valid hashjoinoperator. Those factors will control which join > methods are legal, but not whether a partitionwise join is possible in > principle. If mergeopfamilies is NIL, above check will fail anyway. But skipping a clause which has mergeopfamilies NIL will save some cycles in matching expressions. There is something strange happening with Amit's patch. When we create a table partitioned by range on a column of type int2vector, it somehow gets a btree operator family, but doesn't have mergeopfamilies set in RestrictInfo of equality condition on that column. Instead the RestrictInfo has hashjoinoperator. In this case if we ignore hashjoinoperator, we won't be able to apply partition-wise join. I guess, in such case we want to play safe and not apply partition-wise join, even though applying it will give the correct result. > > + * RelabelType node; eval_const_expressions() will have simplied if more > > Spelling. > Thanks. Done. > > /* > + * Code below scores equivalence classes by how many equivalence members > + * can produce join clauses for this join relation. Equivalence members > + * which do not cover the parents of a partition-wise join relation, can > + * produce join clauses for partition-wise join relation. > + */ > > I don't know what that means. The comma in the second sentence > doesn't belong there. Sorry for that construction. I have changed the comment to be something more meaningful. > > + /* > + * TODO: Instead of copying and mutating the trees one child relation at a > + * time, we should be able to do this en-masse for all the partitions > + * involved. > + */ > > I don't see how that would be possible, but if it's a TODO, you'd > better do it (or decide not to do it and remove or change the > comment). That should be doable by passing a list of AppendRelInfo structures to adjust_appendrel_attrs_mutator(). In the mutator, we have to check each appinfo instead of just one. But that's a lot of refactoring. May be done as a separate patch, if we are consuming too much memory. I have removed TODO for now. > > /* > * Create explicit sort nodes for the outer and inner paths if necessary. > */ > if (best_path->outersortkeys) > { > + Relids outer_relids = outer_path->parent->relids; > Sort *sort = make_sort_from_pathkeys(outer_plan, > - best_path->outersortkeys); > + best_path->outersortkeys, > + outer_relids); > > The changes related to make_sort_from_pathkeys() are pretty opaque to > me. Can you explain? prepare_sort_from_pathkeys() accepts Relids as one of the argument to find equivalence members belonging to child relations. The function does not expect relids when searching equivalence members for parent relations. Before this patch, make_sort_from_pathkeys() passed NULL to this function, because it didn't expect child relations before. Because of partition-wise joins, we need to sort child relations for merge join or to create unique paths. So, make_sort_from_pathkeys() is required to pass relids to prepare_sort_from_pathkeys() when processing child relations, so that the later does not skip child members. > > + * Change parameterization of sub paths recursively. Also carry out any > > "sub paths" should not be two words, here or anywhere. Fixed. > > +reparameterize_path_for_child(PlannerInfo *root, Path *path, > + RelOptInfo *child_rel) > > This is suspiciously unlike reparameterize_path. Why? reparameterize_path() tries to create path with new parameterization from an existing parameterized path. So, it looks for additional conditions to expand the parameterization. But this functions translates a path parameterized by parent to be parameterized by its child. That does not involve looking for any extra conditions, but involves translating the existing ones so that they can be used with a child. A right name would be translate_parampath_to_child() or something which uses word "translate" instead of "reparameterize". But every name like that is getting too long. For now I have renamed it as reparameterize_path_by_child(). Also added a comment in the function prologue about cost, rows, width etc. > > + /* Computer information relevant to the foreign relations. */ > + set_foreign_rel_properties(joinrel, outer_rel, inner_rel); > > Perhaps this refactoring could be split out into a preliminary patch, > which would then simplify this patch. And same for add_join_rel(). > Yes, that's better. I will separate the code out in a separate patch. There's code in build_join_rel() and build_partition_join_rel() (I will change that name) which creates a joinrel RelOptInfo. Most of that code simply sets NULL or 0 fields and is duplicated in both the functions. Do you see any value in separating it out in its own function? Also, makeNode() uses palloc0(), thus makeNode(RelOptInfo) would set most of the fields to 0 or NULL. Why do we then again set those fields as NULL or 0? Should I try to remove unnecessary assignments? > + * Produce partition-wise joinrel's targetlist by translating the parent > + * joinrel's targetlist. This will also include the required placeholder > > Again the confusion between a "child" join and a partition-wise join... > > + /* > + * Nothing to do if > + * a. partition-wise join is disabled. > + * b. joining relations are not partitioned. > + * c. partitioning schemes do not match. > + */ > + > > I don't think that's going to survive pgindent. Changed this code a bit. > > + * are not considered equal, an equi-join involing inner partition keys > > Spelling. > > + * Collect the partition key expressions. An OUTER join will produce rows > + * where the partition key columns of inner side are NULL and may not fit > + * the partitioning scheme with inner partition keys. Since two NULL values > + * are not considered equal, an equi-join involing inner partition keys > + * still prohibits cross-partition joins while joining with another > + * similarly partitioned relation. > > I can't figure out what this comment is trying to tell me. Possibly I > just need more caffeine. Re-wrote the comment with examples and detailed explanation. The comment talks about whether inner partition key expressions should be considered as the partition key expressions for the join, given that for an OUTER join the inner partition key expressions may go NULL. The comment explains why it's safe to do so. If we don't do that, any FULL OUTER join will have no partition expressions and thus partition-wise join technique will be useless for a N-way FULL OUTER join even if it's safe to use it. > > + * Adding these two join_rel_level list also means that top level list has more > + * than one join relation, which is symantically incorrect. > > I don't understand this, either; also, spelling. I think, that sentence is not required. Removed it. > > As a general comment, the ratio of tests-to-code in this patch is way > out of line with PostgreSQL's normal practices. The total patch file > is 10965 lines. The test cases begin at line 3047, meaning that in > round figures you've got about one-quarter code and about > three-quarters test cases. I suspect that a large fraction of those > test cases aren't adding any meaningful code coverage and will just > take work to maintain. That needs to be slimmed down substantially in > any version of this considered for commit. I agree. We require two kinds of tests 1. those which test partition scheme matching 2. those test the planner code, which deals with path creation. I have added both kinds of testcases for all kinds of partitioning schemes (range, list, multi-level, partition key being expressions, columns). That's not required. We need 1st kind of tests for all partitioning schemes and 2nd kind of testcases only for one of the partitioning schemes. So, definitely the number of tests will reduce. A possible extreme would be to use a single multi-level partitioned tests, which includes all kinds of partitioning schemes at various partition levels. But that kind of testcase will be highly unreadable and harder to maintain. Let me know what do you think. I will work on that in the next version of patch. The test still fails because of a bug in Amit's earlier set of patches -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> Have you tested the effect of this patch on planner memory consumption >> with multi-way joins between tables with many partitions? If you >> haven't, you probably should. (Testing runtime would be good, too.) >> Does it grow linearly? Quadratically? Exponentially? Minor leaks >> don't matter, but if we're generating too much garbage we'll have to >> make sure it gets cleaned up soon enough to prevent runaway memory >> usage. > > I tried to check memory usage with various combinations of number of > partitions and number of relations being joined. For higher number of > relations being joined like 10 with 100 partitions, OOM killer kicked > in during the planning phase. I am suspecting > adjust_partitionrel_attrs() (changed that name to > adjust_join_appendrel_attrs() to be in sync with > adjust_appendrel_attrs()) to be the culprit. It copies expression > trees every time for joining two children. That's an exponentially > increasing number as the number of legal joins increases > exponentially. I am still investigating this. I think the root of this problem is that the existing paths shares a lot more substructure than the ones created by the new code. Without a partition-wise join, the incremental memory usage for a joinrel isn't any different whether the underlying rel is partitioned or not. If it's partitioned, we'll be pointing to an AppendPath; if not, we'll be pointing to some kind of Scan. But the join itself creates exactly the same amount of new stuff regardless of what's underneath it. With partitionwise join, that ceases to be true. Every joinrel - and the number of those grows exponentially in the number of baserels, IICU - needs its own list of paths for every member rel. So if a non-partition-wise join created X paths, and there are K partitions, a partition-wise join creates X * K paths. That's a lot. Although we might be able to save some memory by tightening things up here and there - for example, right now the planner isn't real smart about recycling paths that are evicted by add_path(), and there's probably other wastage as well - I suspect that what this shows is that the basic design of this patch is not going to be viable. Intuitively, it's often going to be the case that we want the "same plan" for every partition-set. That is, if we have A JOIN B ON A.x = B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility partitioned, then the result should be an Append plan with 100 join plans under it, and all 100 of those plans should be basically mirror images of each other. Of course, that's not really right in general: for example, it could be that A1 is big and A2 is small while B1 is small and B2 is big, so that the right plan for (A1 JOIN B1) and for (A2 JOIN B2) are totally different from each other. But in many practical cases we'll want to end up with a plan of precisely the same shape for all children, and the current design ignores this, expending both memory and CPU time to compute essentially-equivalent paths across all children. One way of attacking this problem is to gang together partitions which are equivalent for planning purposes, as discussed in the paper "Join Optimization Techniques for Partitioned Tables" by Herodotou, Borisov, and Babu. However, it's not exactly clear how to do this: we could gang together partitions that have the same index definitions, but the sizes of the heaps, the sizes of their indexes, and the row counts will vary from one partition to the next, and any of those things could cause the plan choice to be different for one partition vs. the next. We could try to come up with heuristics for when those things are likely to be true. For example, suppose we compute the set of partitions such that all joined relations have matching index definitions on all tables; then, we take the biggest table in the set and consider all tables more than half that size as part of one gang. The biggest table becomes the leader and we compute partition-wise paths for just that partition; the other members of the gang will eventually get a plan that is of the same shape, but we don't actually create it that plan until after scan/join planning is concluded. Another idea is to try to reduce peak memory usage by performing planning separately for each partition-set. For example, suppose we decide to do a partition-wise join of A, B, and C. Initially, this gets represented as a PartitionJoinPath tree, like this: PartitionJoinPath -> AppendPath for A -> PartitionJoinPath -> AppendPath for B -> AppendPath for C Because we haven't created individual join paths for the members, this doesn't use much memory. Somehow, we come up with a cost for the PartitionJoinPath; it probably won't be entirely accurate. Once scan/join planning is concluded, if our final path contains a PartitionJoinPath, we go back and loop over the partitions. For each partition, we switch to a new memory context, perform planning, copy the best path and its substructure back to the parent context, and then reset the context. In that way, peak memory usage only grows by about a factor of 2 rather than a factor equal to the partition count, because we don't need to keep every possibly-useful path for every partition all at the same time, but rather every possibly-useful path for a single partition. Maybe there are other ideas but I have a feeling any way you slice it this is going to be a lot of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> Have you tested the effect of this patch on planner memory consumption >>> with multi-way joins between tables with many partitions? If you >>> haven't, you probably should. (Testing runtime would be good, too.) >>> Does it grow linearly? Quadratically? Exponentially? Minor leaks >>> don't matter, but if we're generating too much garbage we'll have to >>> make sure it gets cleaned up soon enough to prevent runaway memory >>> usage. >> >> I tried to check memory usage with various combinations of number of >> partitions and number of relations being joined. For higher number of >> relations being joined like 10 with 100 partitions, OOM killer kicked >> in during the planning phase. I am suspecting >> adjust_partitionrel_attrs() (changed that name to >> adjust_join_appendrel_attrs() to be in sync with >> adjust_appendrel_attrs()) to be the culprit. It copies expression >> trees every time for joining two children. That's an exponentially >> increasing number as the number of legal joins increases >> exponentially. I am still investigating this. > > I think the root of this problem is that the existing paths shares a > lot more substructure than the ones created by the new code. Without > a partition-wise join, the incremental memory usage for a joinrel > isn't any different whether the underlying rel is partitioned or not. > If it's partitioned, we'll be pointing to an AppendPath; if not, we'll > be pointing to some kind of Scan. But the join itself creates exactly > the same amount of new stuff regardless of what's underneath it. With > partitionwise join, that ceases to be true. Every joinrel - and the > number of those grows exponentially in the number of baserels, IICU - > needs its own list of paths for every member rel. So if a > non-partition-wise join created X paths, and there are K partitions, a > partition-wise join creates X * K paths. That's a lot. > > Although we might be able to save some memory by tightening things up > here and there - for example, right now the planner isn't real smart > about recycling paths that are evicted by add_path(), and there's > probably other wastage as well - I suspect that what this shows is > that the basic design of this patch is not going to be viable. > Intuitively, it's often going to be the case that we want the "same > plan" for every partition-set. That is, if we have A JOIN B ON A.x = > B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility > partitioned, then the result should be an Append plan with 100 join > plans under it, and all 100 of those plans should be basically mirror > images of each other. Of course, that's not really right in general: > for example, it could be that A1 is big and A2 is small while B1 is > small and B2 is big, so that the right plan for (A1 JOIN B1) and for > (A2 JOIN B2) are totally different from each other. But in many > practical cases we'll want to end up with a plan of precisely the same > shape for all children, and the current design ignores this, expending > both memory and CPU time to compute essentially-equivalent paths > across all children. I think there are going to be two kinds of partitioning use-cases. First, carefully hand-crafted by DBAs so that every partition is different from other and so is every join between two partitions. There will be lesser number of partitions, but creating paths for each join between partitions will be crucial from performance point of view. Consider, for example, systems which use partitions to consolidate results from different sources for analytical purposes or sharding. If we consider various points you have listed in [1] as to why a partition is equivalent to a table, each join between partitions is going to have very different characteristics and thus deserves a set of paths for its own. Add to that possibility of partition pruning or certain conditions affecting particular partitions, the need for detailed planning evident. The other usage of partitioning is to distribute the data and/or quickly eliminate the data by partition pruning. In such case, all partitions of a given table will have very similar properties. There is a large chance that we will end up having same plans for every partition and for joins between partitions. In such cases, I think it suffices to create paths for just one or may be a handful partitions of join and repeat that plan for other partitions of join. But in such cases it also makes sense to have a light-weight representation for partitions as compared to partitions being a full-fledged tables. If we have such a light-weight representation, we may not even create RelOptInfos representing joins between partitions, and different paths for each join between partitions. > > One way of attacking this problem is to gang together partitions which > are equivalent for planning purposes, as discussed in the paper "Join > Optimization Techniques for Partitioned Tables" by Herodotou, Borisov, > and Babu. However, it's not exactly clear how to do this: we could > gang together partitions that have the same index definitions, but the > sizes of the heaps, the sizes of their indexes, and the row counts > will vary from one partition to the next, and any of those things > could cause the plan choice to be different for one partition vs. the > next. We could try to come up with heuristics for when those things > are likely to be true. For example, suppose we compute the set of > partitions such that all joined relations have matching index > definitions on all tables; then, we take the biggest table in the set > and consider all tables more than half that size as part of one gang. > The biggest table becomes the leader and we compute partition-wise > paths for just that partition; the other members of the gang will > eventually get a plan that is of the same shape, but we don't actually > create it that plan until after scan/join planning is concluded. Section 5 of that paper talks about clustering partitions together for joining, only when there is 1:m OR n:1 partition matching for join. In such a case, it clusters all the partitions from one relation that are all joined with a single partition of the other relation. But I think your idea to gang up partitions with similar properties may reduce the number of paths we create but as you have mentioned how to gang them up is not very clear. There are just too many factors like availability of the indexes, sizes of tables, size of intermediate results etc. which make it difficult to identify the properties used for ganging up. Even after we do that, in the worst case, we will still end up creating paths for all partitions of all joins, thus causing increase in paths proportionate to the number of partitions. In the section 6.3, the paper mentions that the number of paths retained are linear in the number of child joins per parent join. So, it's clear that the paper never considered linear increase in the paths to be a problem or at least a problem that that work had to solve. Now, it's surprising that their memory usage increased by 7% to 10%. But 1. they might be measuring total memory and not the memory used by the planner and they experimented with PostgreSQL 8.3.7, which probably tried much less number of paths than the current optimizer. > > Another idea is to try to reduce peak memory usage by performing > planning separately for each partition-set. For example, suppose we > decide to do a partition-wise join of A, B, and C. Initially, this > gets represented as a PartitionJoinPath tree, like this: > > PartitionJoinPath > -> AppendPath for A > -> PartitionJoinPath > -> AppendPath for B > -> AppendPath for C > > Because we haven't created individual join paths for the members, this > doesn't use much memory. Somehow, we come up with a cost for the > PartitionJoinPath; it probably won't be entirely accurate. Once > scan/join planning is concluded, if our final path contains a > PartitionJoinPath, we go back and loop over the partitions. A typical join tree will be composite: some portion partitioned and some portion unpartitioned or different portions partitioned by different partition schemes. In such case, inaccurate costs for PartitionJoinPath, can affect the plan heavily, causing a suboptimal path to be picked. Assuming that partitioning will be useful for large sets of data, choosing a suboptimal plan can be more dangerous than consuming memory for creating paths. If we could come up with costs for PartitionJoinPath using some methods of interpolation, say by sampling few partitions and then extrapolating their costs for entire PartitionJoinPath, we can use this method. But unless the partitions have very similar characteristics or have such characteristics that costs can be guessed based on the differences between the characteristics, I do not see how that can happen. For example, while costing a PartitionJoinPath with pathkeys, the costs will change a lot based on whether underlying relations have indexes, or which join methods are used, which in turn is based on properties on the partitions. Same is the case for paths with parameterization. All such paths are important when a partitioned join relation joins with other unpartitioned relation or a partitioned relation with different partitioning scheme. When each partition of base relation being joined has different properties, the cost for join between one set of partitions can differ from join between other set of partitions. Not only that, the costs for various properties of resultant paths like pathkeys, parameterization can vary a lot, depending upon the available indexes and estimates of rows for each join. So, we need to come up with these cost estimates separately for each join between partitions to come up with cost of each PartitionJoinPath. If we have to calculate those costs to create PartitionJoinPath, we better save them in paths rather than recalculating them in the second round of planning for joins between partitions. > For each > partition, we switch to a new memory context, perform planning, copy > the best path and its substructure back to the parent context, and > then reset the context. This could be rather tricky. It assumes that all the code that creates paths for joins, should not allocate any memory which is linked to some object in a context that lives longer than the path creation context. There is some code like create_join_clause() or make_canonical_pathkey(), which carefully chooses which memory context to allocate memory in. But can we ensure it always? postgres_fdw for example allocates memory for PgFdwRelationInfo in current memory context and attaches it in RelOptInfo, which should be in the planner's original context. So, if we create a new memory context for each partition, fpinfos would be invalidated when those contexts are released. Not that, we can not enforce some restriction on the memory usage while planning, it's hard to enforce it and bugs arising from it may go unnoticed. GEQO planner might have its own problems with this approach. Third party FDWs will pose a problem. A possible solution would be to keep the track of used paths using a reference count. Once the paths for given join tree are created, free up the unused paths by traversing pathlist in each of the RelOptInfos. Attached patch has a prototype implementation for the same. There are some paths which are not linked to RelOptInfos, which need a bit different treatment, but they can be handled too. > In that way, peak memory usage only grows by > about a factor of 2 rather than a factor equal to the partition count, > because we don't need to keep every possibly-useful path for every > partition all at the same time, but rather every possibly-useful path > for a single partition. > > Maybe there are other ideas but I have a feeling any way you slice it > this is going to be a lot of work. For the case of carefully hand-crafted partitions, I think, users would expect the planner to use really the best plan and thus may be willing to accommodate for increased memory usage. Going by any approach that does not create the paths for joins between partitions is not guaranteed to give the best plan. Users willing to provide increased memory will be unhappy if we do not give them the best path. The user who creates hundreds of partitions, will ideally be using pretty powerful servers with a lot of memory. On such servers, the linear increase in memory for paths may not be as bad as you are portraying above, as long as its producing the best plan. Just joining partitioned tables with hundreds of partitions does not increase the number of paths. Number of paths increases when two partitioned tables with similar partitioning scheme are joined with equality condition on partition key. Unless we consider repartitioning, how many of the joining relations share same partitioning scheme? Section 8.6 mentions, "no TPC-H query plan, regardless of the partitioning scheme, contains n-way child joins for n >= 4. Maximum partitions that the paper mentions is 168 (Table 3). My VM which has 8GB RAM and 4 cores handled that case pretty well. We may add logic to free up space used by useless paths post-join to free up some memory for next stages of query execution. There will still be users, for whom the increase in the memory usage is unexpected. Those will need to be educated or for them we might take heuristic PartitionJoinPath based approach discussed above. But I don't think that heuristic approach should be the default case. May be we should supply a GUC which can switch between the approaches. Some ideas for GUCs are 1. delay_partition_wise_join - when ON uses the heuristic approach of PartitionJoinPath. 2. A GUC similar to join_collapse_limit may be used to limit the number of partitioned relations being joined using partition-wise join technique. A value of 1, indicates enable_partition_wise_join = false. So, we may replace enable_partition_wise_join withe this GUC. 3. A GUC max_joinable_partitions (open to suggestions for name) may specify the maximum number of partitions that two relations may have to be eligible for partition-wise join. I guess, using these GUCs allows a user handle the trade-off between getting the best plan and memory usage consciously. I think, users would like to accept a suboptimal plans consciously than being thrown a suboptimal plan without choice. [1] http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I think there are going to be two kinds of partitioning use-cases. > First, carefully hand-crafted by DBAs so that every partition is > different from other and so is every join between two partitions. > There will be lesser number of partitions, but creating paths for each > join between partitions will be crucial from performance point of > view. Consider, for example, systems which use partitions to > consolidate results from different sources for analytical purposes or > sharding. If we consider various points you have listed in [1] as to > why a partition is equivalent to a table, each join between partitions > is going to have very different characteristics and thus deserves a > set of paths for its own. Add to that possibility of partition pruning > or certain conditions affecting particular partitions, the need for > detailed planning evident. > > The other usage of partitioning is to distribute the data and/or > quickly eliminate the data by partition pruning. In such case, all > partitions of a given table will have very similar properties. There > is a large chance that we will end up having same plans for every > partition and for joins between partitions. In such cases, I think it > suffices to create paths for just one or may be a handful partitions > of join and repeat that plan for other partitions of join. But in such > cases it also makes sense to have a light-weight representation for > partitions as compared to partitions being a full-fledged tables. If > we have such a light-weight representation, we may not even create > RelOptInfos representing joins between partitions, and different paths > for each join between partitions. I'm not sure I see a real distinction between these two use cases. I think that the problem of differing data distribution between partitions is almost always going to be an issue. Take the simple case of an "orders" table which is partitioned by month. First, the month that's currently in progress may be much smaller than a typical completed month. Second, many businesses are seasonal and may have many more orders at certain times of year. For example, in American retail, many businesses have large spikes in December. I think some businesses may do four times as much business in December as any other month, for example. So you will have that sort of variation, at least. > A typical join tree will be composite: some portion partitioned and > some portion unpartitioned or different portions partitioned by > different partition schemes. In such case, inaccurate costs for > PartitionJoinPath, can affect the plan heavily, causing a suboptimal > path to be picked. Assuming that partitioning will be useful for large > sets of data, choosing a suboptimal plan can be more dangerous than > consuming memory for creating paths. Well, sure. But, I mean, every simplifying assumption which the planner makes to limit resource consumption could have that effect. join_collapse_limit, for example, can cause horrible plans. However, we have it anyway, because the alternative of having planning take far too long is unpalatable. Planning is always, at some level, guesswork. >> For each >> partition, we switch to a new memory context, perform planning, copy >> the best path and its substructure back to the parent context, and >> then reset the context. > > This could be rather tricky. It assumes that all the code that creates > paths for joins, should not allocate any memory which is linked to > some object in a context that lives longer than the path creation > context. There is some code like create_join_clause() or > make_canonical_pathkey(), which carefully chooses which memory context > to allocate memory in. But can we ensure it always? postgres_fdw for > example allocates memory for PgFdwRelationInfo in current memory > context and attaches it in RelOptInfo, which should be in the > planner's original context. So, if we create a new memory context for > each partition, fpinfos would be invalidated when those contexts are > released. Not that, we can not enforce some restriction on the memory > usage while planning, it's hard to enforce it and bugs arising from it > may go unnoticed. GEQO planner might have its own problems with this > approach. Third party FDWs will pose a problem. Yep, there are problems. :-) > A possible solution would be to keep the track of used paths using a > reference count. Once the paths for given join tree are created, free > up the unused paths by traversing pathlist in each of the RelOptInfos. > Attached patch has a prototype implementation for the same. There are > some paths which are not linked to RelOptInfos, which need a bit > different treatment, but they can be handled too. So, if you apply this with your previous patch, how much does it cut down memory consumption? >> In that way, peak memory usage only grows by >> about a factor of 2 rather than a factor equal to the partition count, >> because we don't need to keep every possibly-useful path for every >> partition all at the same time, but rather every possibly-useful path >> for a single partition. >> >> Maybe there are other ideas but I have a feeling any way you slice it >> this is going to be a lot of work. > > For the case of carefully hand-crafted partitions, I think, users > would expect the planner to use really the best plan and thus may be > willing to accommodate for increased memory usage. Going by any > approach that does not create the paths for joins between partitions > is not guaranteed to give the best plan. Users willing to provide > increased memory will be unhappy if we do not give them the best path. > > The user who creates hundreds of partitions, will ideally be using > pretty powerful servers with a lot of memory. On such servers, the > linear increase in memory for paths may not be as bad as you are > portraying above, as long as its producing the best plan. No, I don't agree. We should be trying to build something that scales well. I've heard reports of customers with hundreds or even thousands of partitions; I think it is quite reasonable to think that we need to scale to 1000 partitions. If we use 3MB of memory to plan a query involving unpartitioned, using 3GB to plan a query where the main tables have been partitioned 1000 ways does not seem reasonable to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 31, 2016 at 6:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I think there are going to be two kinds of partitioning use-cases. >> First, carefully hand-crafted by DBAs so that every partition is >> different from other and so is every join between two partitions. >> There will be lesser number of partitions, but creating paths for each >> join between partitions will be crucial from performance point of >> view. Consider, for example, systems which use partitions to >> consolidate results from different sources for analytical purposes or >> sharding. If we consider various points you have listed in [1] as to >> why a partition is equivalent to a table, each join between partitions >> is going to have very different characteristics and thus deserves a >> set of paths for its own. Add to that possibility of partition pruning >> or certain conditions affecting particular partitions, the need for >> detailed planning evident. >> >> The other usage of partitioning is to distribute the data and/or >> quickly eliminate the data by partition pruning. In such case, all >> partitions of a given table will have very similar properties. There >> is a large chance that we will end up having same plans for every >> partition and for joins between partitions. In such cases, I think it >> suffices to create paths for just one or may be a handful partitions >> of join and repeat that plan for other partitions of join. But in such >> cases it also makes sense to have a light-weight representation for >> partitions as compared to partitions being a full-fledged tables. If >> we have such a light-weight representation, we may not even create >> RelOptInfos representing joins between partitions, and different paths >> for each join between partitions. > > I'm not sure I see a real distinction between these two use cases. I > think that the problem of differing data distribution between > partitions is almost always going to be an issue. Take the simple > case of an "orders" table which is partitioned by month. First, the > month that's currently in progress may be much smaller than a typical > completed month. Second, many businesses are seasonal and may have > many more orders at certain times of year. For example, in American > retail, many businesses have large spikes in December. I think some > businesses may do four times as much business in December as any other > month, for example. So you will have that sort of variation, at > least. > >> A typical join tree will be composite: some portion partitioned and >> some portion unpartitioned or different portions partitioned by >> different partition schemes. In such case, inaccurate costs for >> PartitionJoinPath, can affect the plan heavily, causing a suboptimal >> path to be picked. Assuming that partitioning will be useful for large >> sets of data, choosing a suboptimal plan can be more dangerous than >> consuming memory for creating paths. > > Well, sure. But, I mean, every simplifying assumption which the > planner makes to limit resource consumption could have that effect. > join_collapse_limit, for example, can cause horrible plans. However, > we have it anyway, because the alternative of having planning take far > too long is unpalatable. Planning is always, at some level, > guesswork. My point is, this behaviour is configurable. Users who are ready to spend time and resources to get the best plan are still able to do so, by choosing a higher limit on join_collapse_limit. Those who can not afford to do so, are ready to use inferior plans willingly by setting join_collapse_limit to a lower number. > >> A possible solution would be to keep the track of used paths using a >> reference count. Once the paths for given join tree are created, free >> up the unused paths by traversing pathlist in each of the RelOptInfos. >> Attached patch has a prototype implementation for the same. There are >> some paths which are not linked to RelOptInfos, which need a bit >> different treatment, but they can be handled too. > > So, if you apply this with your previous patch, how much does it cut > down memory consumption? Answered this below: > >>> In that way, peak memory usage only grows by >>> about a factor of 2 rather than a factor equal to the partition count, >>> because we don't need to keep every possibly-useful path for every >>> partition all at the same time, but rather every possibly-useful path >>> for a single partition. >>> >>> Maybe there are other ideas but I have a feeling any way you slice it >>> this is going to be a lot of work. >> >> For the case of carefully hand-crafted partitions, I think, users >> would expect the planner to use really the best plan and thus may be >> willing to accommodate for increased memory usage. Going by any >> approach that does not create the paths for joins between partitions >> is not guaranteed to give the best plan. Users willing to provide >> increased memory will be unhappy if we do not give them the best path. >> >> The user who creates hundreds of partitions, will ideally be using >> pretty powerful servers with a lot of memory. On such servers, the >> linear increase in memory for paths may not be as bad as you are >> portraying above, as long as its producing the best plan. > > No, I don't agree. We should be trying to build something that scales > well. I've heard reports of customers with hundreds or even thousands > of partitions; I think it is quite reasonable to think that we need to > scale to 1000 partitions. If we use 3MB of memory to plan a query > involving unpartitioned, using 3GB to plan a query where the main > tables have been partitioned 1000 ways does not seem reasonable to me. Here are memory consumption numbers. For a simple query "select * from v5_prt100", where v5_prt100 is a view on a 5 way self join of table prt100, which is a plain table with 100 partitions without any indexes. postgres=# \d+ v5_prt100 View "part_mem_usage.v5_prt100"Column | Type | Modifiers | Storage | Description --------+--------+-----------+----------+-------------t1 | prt100 | | extended |t2 | prt100 | | extended |t3 | prt100 | | extended |t4 | prt100 | | extended |t5 | prt100 | | extended | View definition:SELECT t1.*::prt100 AS t1, t2.*::prt100 AS t2, t3.*::prt100 AS t3, t4.*::prt100 AS t4, t5.*::prt100AS t5 FROM prt100 t1, prt100 t2, prt100 t3, prt100 t4, prt100 t5 WHERE t1.a = t2.a AND t2.a = t3.a ANDt3.a = t4.a AND t4.a = t5.a; postgres=# \d prt100 Table "part_mem_usage.prt100"Column | Type | Modifiers --------+-------------------+-----------a | integer |b | integer |c | character varying| Partition Key: RANGE (a) Number of partitions: 100 (Use \d+ to list them.) Without partition-wise join the standard_planner() consumes 4311 kB memory of which 150 kB is consumed in add_paths_to_joinrel(). With partition-wise join standard_planner() consumes 65MB memory, which is 16 times more (not 100 times more as you suspected above). Of this bloat 16MB is consumed for creating child join paths whereas 651kB is consumed in creating append paths. That's 100 times bloat for path creation. Rest of the memory bloat is broken down as 9MB to create child join RelOptInfos, 29MB to translate restrict clauses, 8MB to translate target lists. 2MB for creating special join info for children, 2MB goes into creating plans. If we apply logic to free unused paths, the memory consumption reduces as follows Without partition-wise join standard_planner() consumes 4268 kB (against 4311kB earlier) of which 123kB (against 150kB earlier) is consumed in add_paths_to_joinrel(). With partition-wise join, standard_planner() consumes 63MB (against 65MB earlier). Child join paths still consume 13 MB (against 16MB earlier), which is still 100 times that without using partition-wise join. We may shave off some memory consumption by using better methods than translating expressions, but we will continue to have bloats introduced by paths, RelOptInfos for child joins etc. So, I am thinking about your approach of creating PartitionJoinPaths without actually creating child paths and then at a later stage actually plan the child joins. Here's rough sketch of how that may be done. At the time of creating regular paths, we identify the join orders which can use partition-wise join and save those in the RelOptInfo of the parent table. If no such join order exists, we do not create PartitionJoinPaths for that relation. Otherwise, once we have considered all the join orders i.e. in generate_partition_wise_join_paths(), we create one PartitionJoinPath for every path that has survived in the parent or at least for every path that has distinct properties like pathkeys or parameterisation, with those properties. At the time of creating plans, if PartitionJoinPath is chosen, we actually create paths for every partition of that relation recursively. The path creation logic is carried out in a different memory context. Amongst the paths that survive, we choose the best path that has the same properties as PartitionJoinPath. We would expect all parameterized paths to be retained and any unparameterized path can be sorted to match the pathkeys of reference PartitionJoinPath. We then create the plan out of this path and copy it into the outer memory context and release the memory context used for path creation. This is similar to how prepared statements save their plans. Once we have the plan, the memory consumed by paths won't be referenced, and hence can not create problems. At the end we create an Append/MergeAppend plan with all the child plans and return it. Costing PartitionJoinPath needs more thought so that we don't end up with bad overall plans. Here's an idea. Partition-wise joins are better compared to the unpartitioned ones, because of the smaller sizes of partitions. If we think of join as O(MN) operation where M and N are sizes of unpartitioned tables being joined, partition-wise join computes P joins each with average O(M/P * N/P) order where P is the number of partitions, which is still O(MN) with constant factor reduced by P times. I think, we need to apply similar logic to costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) where S and R are setup cost and joining cost (for M, N rows) resp. Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) + P * R(M/P, N/P). Each of the join methods will have different S and R functions and may not be linear on the number of rows. So, PartitionJoinPath costs are obtained from corresponding regular path costs subjected to above transformation. This way, we will be protected from choosing a PartitionJoinPath when it's not optimal. Take example of a join where the joining relations are very small in size, thus hash join on full relation is optimal compared to hash join of each partition because of setup cost. In such a case, the function which calculates the cost of hash table setup, would result in almost same cost for full table as well as each of the partitions, thus increasing P * S(M/P, N/P) as compared to S(M, N). Let me know your comments. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> > So, I am thinking about your approach of creating PartitionJoinPaths > without actually creating child paths and then at a later stage > actually plan the child joins. Here's rough sketch of how that may be > done. > > At the time of creating regular paths, we identify the join orders > which can use partition-wise join and save those in the RelOptInfo of > the parent table. If no such join order exists, we do not create > PartitionJoinPaths for that relation. Otherwise, once we have > considered all the join orders i.e. in > generate_partition_wise_join_paths(), we create one PartitionJoinPath > for every path that has survived in the parent or at least for every > path that has distinct properties like pathkeys or parameterisation, > with those properties. > > At the time of creating plans, if PartitionJoinPath is chosen, we > actually create paths for every partition of that relation > recursively. The path creation logic is carried out in a different > memory context. Amongst the paths that survive, we choose the best > path that has the same properties as PartitionJoinPath. We would > expect all parameterized paths to be retained and any unparameterized > path can be sorted to match the pathkeys of reference > PartitionJoinPath. We then create the plan out of this path and copy > it into the outer memory context and release the memory context used > for path creation. This is similar to how prepared statements save > their plans. Once we have the plan, the memory consumed by paths won't > be referenced, and hence can not create problems. At the end we create > an Append/MergeAppend plan with all the child plans and return it. > > Costing PartitionJoinPath needs more thought so that we don't end up > with bad overall plans. Here's an idea. Partition-wise joins are > better compared to the unpartitioned ones, because of the smaller > sizes of partitions. If we think of join as O(MN) operation where M > and N are sizes of unpartitioned tables being joined, partition-wise > join computes P joins each with average O(M/P * N/P) order where P is > the number of partitions, which is still O(MN) with constant factor > reduced by P times. I think, we need to apply similar logic to > costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) > where S and R are setup cost and joining cost (for M, N rows) resp. > Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) > + P * R(M/P, N/P). Each of the join methods will have different S and > R functions and may not be linear on the number of rows. So, > PartitionJoinPath costs are obtained from corresponding regular path > costs subjected to above transformation. This way, we will be > protected from choosing a PartitionJoinPath when it's not optimal. > Take example of a join where the joining relations are very small in > size, thus hash join on full relation is optimal compared to hash join > of each partition because of setup cost. In such a case, the function > which calculates the cost of hash table setup, would result in almost > same cost for full table as well as each of the partitions, thus > increasing P * S(M/P, N/P) as compared to S(M, N). > > Let me know your comments. I tried to measure the impact of having a memory context reset 1000 times (once for each partition) with the attached patch. Without this patch make check in regress/ takes about 24 seconds on my laptop and with this patch it takes 26 seconds. This is almost 10% increase in time. I hope that's fine. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Costing PartitionJoinPath needs more thought so that we don't end up > with bad overall plans. Here's an idea. Partition-wise joins are > better compared to the unpartitioned ones, because of the smaller > sizes of partitions. If we think of join as O(MN) operation where M > and N are sizes of unpartitioned tables being joined, partition-wise > join computes P joins each with average O(M/P * N/P) order where P is > the number of partitions, which is still O(MN) with constant factor > reduced by P times. I think, we need to apply similar logic to > costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) > where S and R are setup cost and joining cost (for M, N rows) resp. > Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) > + P * R(M/P, N/P). Each of the join methods will have different S and > R functions and may not be linear on the number of rows. So, > PartitionJoinPath costs are obtained from corresponding regular path > costs subjected to above transformation. This way, we will be > protected from choosing a PartitionJoinPath when it's not optimal. I'm not sure that I really understand the stuff with big-O notation and M, N, and P. But I think what you are saying is that we could cost a PartitionJoinPath by costing some of the partitions (it might be a good idea to choose the biggest ones) and assuming the cost for the remaining ones will be roughly proportional. That does seem like a reasonable strategy to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Costing PartitionJoinPath needs more thought so that we don't end up >> with bad overall plans. Here's an idea. Partition-wise joins are >> better compared to the unpartitioned ones, because of the smaller >> sizes of partitions. If we think of join as O(MN) operation where M >> and N are sizes of unpartitioned tables being joined, partition-wise >> join computes P joins each with average O(M/P * N/P) order where P is >> the number of partitions, which is still O(MN) with constant factor >> reduced by P times. I think, we need to apply similar logic to >> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) >> where S and R are setup cost and joining cost (for M, N rows) resp. >> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) >> + P * R(M/P, N/P). Each of the join methods will have different S and >> R functions and may not be linear on the number of rows. So, >> PartitionJoinPath costs are obtained from corresponding regular path >> costs subjected to above transformation. This way, we will be >> protected from choosing a PartitionJoinPath when it's not optimal. > I'm not sure that I really understand the stuff with big-O notation > and M, N, and P. But I think what you are saying is that we could > cost a PartitionJoinPath by costing some of the partitions (it might > be a good idea to choose the biggest ones) and assuming the cost for > the remaining ones will be roughly proportional. That does seem like > a reasonable strategy to me. I'm not sure to what extent the above argument depends on the assumption that join is O(MN), but I will point out that in no case of practical interest for large tables is it actually O(MN). That would be true only for the stupidest possible nested-loop join method. It would be wise to convince ourselves that the argument holds for more realistic big-O costs, eg hash join is more like O(M+N) if all goes well. regards, tom lane
On Mon, Nov 14, 2016 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> Costing PartitionJoinPath needs more thought so that we don't end up >>> with bad overall plans. Here's an idea. Partition-wise joins are >>> better compared to the unpartitioned ones, because of the smaller >>> sizes of partitions. If we think of join as O(MN) operation where M >>> and N are sizes of unpartitioned tables being joined, partition-wise >>> join computes P joins each with average O(M/P * N/P) order where P is >>> the number of partitions, which is still O(MN) with constant factor >>> reduced by P times. I think, we need to apply similar logic to >>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N) >>> where S and R are setup cost and joining cost (for M, N rows) resp. >>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P) >>> + P * R(M/P, N/P). Each of the join methods will have different S and >>> R functions and may not be linear on the number of rows. So, >>> PartitionJoinPath costs are obtained from corresponding regular path >>> costs subjected to above transformation. This way, we will be >>> protected from choosing a PartitionJoinPath when it's not optimal. > >> I'm not sure that I really understand the stuff with big-O notation >> and M, N, and P. But I think what you are saying is that we could >> cost a PartitionJoinPath by costing some of the partitions (it might >> be a good idea to choose the biggest ones) and assuming the cost for >> the remaining ones will be roughly proportional. That does seem like >> a reasonable strategy to me. > > I'm not sure to what extent the above argument depends on the assumption > that join is O(MN), but I will point out that in no case of practical > interest for large tables is it actually O(MN). That would be true > only for the stupidest possible nested-loop join method. It would be > wise to convince ourselves that the argument holds for more realistic > big-O costs, eg hash join is more like O(M+N) if all goes well. Yeah, I agree. To recap briefly, the problem we're trying to solve here is how to build a path for a partitionwise join without an explosion in the amount of memory the planner uses or the number of paths created. In the initial design, if there are N partitions per relation, the total number of paths generated by the planner increases by a factor of N+1, which gets ugly if, say, N = 1000, or even N = 100. To reign that in, we want to do a rough cut at costing the partitionwise join that will be good enough to let us throw away obviously inferior paths, and then work out the exact paths we're going to use only for partitionwise joins that are actually selected. I think costing one or a few of the larger sub-joins and assuming those costs are representative is probably a reasonable approach to that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Robert, Sorry for delayed response. The attached patch implements following ideas: 1. At the time of creating paths - If the joining relations are both partitioned and join can use partition-wise join, we create paths for few child-joins. Similar to inheritance relations (set_append_rel_pathlist()), we collect paths with similar properties from all sampled child-joins and create one PartitionJoinPath with each set of paths. The cost of the PartitionJoinPath is obtained by multiplying the sum of costs of paths in the given set by the ratio of (number of rows estimated in the parent-join/sum of rows in child-joins). 2. If the PartitionJoinPath emerges as the best path, we create paths for each of the remaining child-joins. Then we collect paths with properties same as the given PartitionJoinPath, one from each child-join. These paths are converted into plans and a Merge/Append plan is created combing these plans. The paths and plans for child-join are created in a temporary memory context. The final plan for each child-join is copied into planner's context and the temporary memory context is reset. Right now, we choose 1% or 1 (whichever is higher) child-joins to base PartitionJoinPath costs on. Memory consumption ----------------------------- I tested a 5-way self-join for a table with 1000 partitions, each partition having 1M rows. The memory consumed in standard_planner() was measured with some granular tracking (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join consumed total of 289MB memory which is approx 6.6 times more than non-partition-wise join which consumed 44MB. That's much better than the earlier 16 times consumption for 5-way join with 100 partitions. The extra 245MB memory was consumed by child-join RelOptInfos (48MB), SpecialJoinInfos for child-joins (64MB), restrictlist translation (92MB), paths for sampled child-joins (1.5MB), building targetlists for child-joins (7MB). In order to choose representative child-joins based on the sizes of child-joins, we need to create all the child-join RelOptInfos. In order to estimate sizes of child-joins, we need to create SpecialJoinInfos and restrictlists for at least one join order for all child-joins. For every representative child-join, we need to create SpecialJoinInfo and restrictlist for all join orders for that child-join. We might be able to save of restrictlist translation, if we create restrict lists from joininfo similar to parent joins. I haven't tried that yet. Choosing representative child-joins: -------------------------------------------------- There's another angle to choosing representative child joins. In a partitioned N-way join, different joins covering different subsets of N relations, will have different size distributions across the partitions. This means that the child-joins costed for (N-k) joins, may be different for those required for (N-k+1) joins. With a factor of 1% sampling, N is such that a child-join participates in 100 joins, we will end up creating paths for all partitions before creating PartitionJoinPaths for the final N-way join. Hopefully that will be a rare case and usually we will end up using paths already created. We can not avoid creating PartitionJoinPaths for subset joins, as there might be cases when partition-wise join will be optimal for an N-k way join but not for N-way join. We may avoid this if we choose representative child-joins based on their positions, in which case, we may end up with some or all of those being empty and thus skewing the costs heavily. Partial paths ----------------- AFAIU, we create partial paths for append relation, when all the children have partial paths. Unlike parameterized paths or path with pathkeys, there is no way to create a partial path for a normal path. This means that unless we create paths for all child-joins, we can not create partial paths for appendrel comprising of child-joins, and thus can not use parallel query right now. This may not be that bad, since it would be more efficient to run each child-join in a separate worker, rather than using multiple workers for a single child-join. regression tests ---------------------- I observed that for small relations (1000 rows in each partition and 100 partitions), the size estimates in append relations and sum of those in child relations are very different. As a result, the extrapolated costs for PartitionJoinPaths as described above, are way higher than costs of join of appends (or even append of joins if we are to create paths for all child-joins). Thus with this approach, we choose partition-wise join for large number of partitions with large data (e.g. 1000 partitions with 1M rows each). These are certainly the cases when partition-wise join is a big win. I have not tried to find out a threshold above which partition-wise join gets chosen with above approach, but it's going to be a larger threshold. That makes writing regression tests difficult, as those will require large data. So, we have to find a way so that we can test partition-wise join with smaller data. There are few possibilities like 1. convert the fraction of representative child-joins into GUC and setting it to 100% would start choosing partition-wise joins for tables with a few hundred rows per partition, like it did in earlier approach, 2. provide a way to force partition-wise join whenever possible, by say costing partition-wise joins much lesser than non-partition-wise join when a GUC is set (e.g. enable_partition_wise_join with values always, never, optimal or something like that). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
PFA patch rebased after partitioning code was committed. On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Hi Robert, > Sorry for delayed response. > > The attached patch implements following ideas: > 1. At the time of creating paths - If the joining relations are both > partitioned and join can use partition-wise join, we create paths for > few child-joins. Similar to inheritance relations > (set_append_rel_pathlist()), we collect paths with similar properties > from all sampled child-joins and create one PartitionJoinPath with > each set of paths. The cost of the PartitionJoinPath is obtained by > multiplying the sum of costs of paths in the given set by the ratio of > (number of rows estimated in the parent-join/sum of rows in > child-joins). > > 2. If the PartitionJoinPath emerges as the best path, we create paths > for each of the remaining child-joins. Then we collect paths with > properties same as the given PartitionJoinPath, one from each > child-join. These paths are converted into plans and a Merge/Append > plan is created combing these plans. The paths and plans for > child-join are created in a temporary memory context. The final plan > for each child-join is copied into planner's context and the temporary > memory context is reset. > > Right now, we choose 1% or 1 (whichever is higher) child-joins to base > PartitionJoinPath costs on. > > Memory consumption > ----------------------------- > I tested a 5-way self-join for a table with 1000 partitions, each > partition having 1M rows. The memory consumed in standard_planner() > was measured with some granular tracking > (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join > consumed total of 289MB memory which is approx 6.6 times more than > non-partition-wise join which consumed 44MB. That's much better than > the earlier 16 times consumption for 5-way join with 100 partitions. > > The extra 245MB memory was consumed by child-join RelOptInfos (48MB), > SpecialJoinInfos for child-joins (64MB), restrictlist translation > (92MB), paths for sampled child-joins (1.5MB), building targetlists > for child-joins (7MB). > In the earlier implementation, a given clause which was applicable to multiple join orders was getting translated as many times as the join orders it was applicable in. I changed RestrictInfo for parent to store a list of RestrictInfos applicable to children to avoid multiple translations. My earlier patch created the child-join plans in a temporary context and then copied them into planner context since the translated clauses were allocated memory in temporary memory context then. Now that they are stored in planner's context, we can directly create the plan in the planner's context. Third, I added code to free up child SpecialJoinInfos after using those. As a result the total memory consumption now is 192MB, which is approx 4.4 times the memory consumed during planning in case of non-partition-wise join. > > Choosing representative child-joins: > -------------------------------------------------- > There's another angle to choosing representative child joins. In a > partitioned N-way join, different joins covering different subsets of > N relations, will have different size distributions across the > partitions. This means that the child-joins costed for (N-k) joins, > may be different for those required for (N-k+1) joins. With a factor > of 1% sampling, N is such that a child-join participates in 100 joins, > we will end up creating paths for all partitions before creating > PartitionJoinPaths for the final N-way join. Hopefully that will be a > rare case and usually we will end up using paths already created. We > can not avoid creating PartitionJoinPaths for subset joins, as there > might be cases when partition-wise join will be optimal for an N-k way > join but not for N-way join. We may avoid this if we choose > representative child-joins based on their positions, in which case, we > may end up with some or all of those being empty and thus skewing the > costs heavily. > > Partial paths > ----------------- > AFAIU, we create partial paths for append relation, when all the > children have partial paths. Unlike parameterized paths or path with > pathkeys, there is no way to create a partial path for a normal path. > This means that unless we create paths for all child-joins, we can not > create partial paths for appendrel comprising of child-joins, and thus > can not use parallel query right now. This may not be that bad, since > it would be more efficient to run each child-join in a separate > worker, rather than using multiple workers for a single child-join. This still applies. > > regression tests > ---------------------- > I observed that for small relations (1000 rows in each partition and > 100 partitions), the size estimates in append relations and sum of > those in child relations are very different. As a result, the > extrapolated costs for PartitionJoinPaths as described above, are way > higher than costs of join of appends (or even append of joins if we > are to create paths for all child-joins). Thus with this approach, we > choose partition-wise join for large number of partitions with large > data (e.g. 1000 partitions with 1M rows each). These are certainly the > cases when partition-wise join is a big win. I have not tried to find > out a threshold above which partition-wise join gets chosen with above > approach, but it's going to be a larger threshold. That makes writing > regression tests difficult, as those will require large data. So, we > have to find a way so that we can test partition-wise join with > smaller data. There are few possibilities like 1. convert the fraction > of representative child-joins into GUC and setting it to 100% would > start choosing partition-wise joins for tables with a few hundred rows > per partition, like it did in earlier approach, 2. provide a way to > force partition-wise join whenever possible, by say costing > partition-wise joins much lesser than non-partition-wise join when a > GUC is set (e.g. enable_partition_wise_join with values always, never, > optimal or something like that). > For now I have added a float GUC partition_wise_plan_weight. The partition-wise join cost derived from the samples is multiplied by this GUC and set as the cost of ParitionJoinPath. A value of 1 means that the cost derived from the samples are used as is. A value higher than 1 discourages use of partition-wise join and that lower than 1 encourages use of partition-wise join. I am not very keen on keeping this GUC, in this form. But we need some way to run regression with smaller data. For now I have disabled partition-wise join for multi-level partitions. I will post a patch soon with that enabled. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Dec 27, 2016 at 11:01 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > PFA patch rebased after partitioning code was committed. > > On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Hi Robert, >> Sorry for delayed response. >> >> The attached patch implements following ideas: >> 1. At the time of creating paths - If the joining relations are both >> partitioned and join can use partition-wise join, we create paths for >> few child-joins. Similar to inheritance relations >> (set_append_rel_pathlist()), we collect paths with similar properties >> from all sampled child-joins and create one PartitionJoinPath with >> each set of paths. The cost of the PartitionJoinPath is obtained by >> multiplying the sum of costs of paths in the given set by the ratio of >> (number of rows estimated in the parent-join/sum of rows in >> child-joins). >> >> 2. If the PartitionJoinPath emerges as the best path, we create paths >> for each of the remaining child-joins. Then we collect paths with >> properties same as the given PartitionJoinPath, one from each >> child-join. These paths are converted into plans and a Merge/Append >> plan is created combing these plans. The paths and plans for >> child-join are created in a temporary memory context. The final plan >> for each child-join is copied into planner's context and the temporary >> memory context is reset. >> >> Right now, we choose 1% or 1 (whichever is higher) child-joins to base >> PartitionJoinPath costs on. >> >> Memory consumption >> ----------------------------- >> I tested a 5-way self-join for a table with 1000 partitions, each >> partition having 1M rows. The memory consumed in standard_planner() >> was measured with some granular tracking >> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join >> consumed total of 289MB memory which is approx 6.6 times more than >> non-partition-wise join which consumed 44MB. That's much better than >> the earlier 16 times consumption for 5-way join with 100 partitions. >> >> The extra 245MB memory was consumed by child-join RelOptInfos (48MB), >> SpecialJoinInfos for child-joins (64MB), restrictlist translation >> (92MB), paths for sampled child-joins (1.5MB), building targetlists >> for child-joins (7MB). >> > > In the earlier implementation, a given clause which was applicable to > multiple join orders was getting translated as many times as the join > orders it was applicable in. I changed RestrictInfo for parent to > store a list of RestrictInfos applicable to children to avoid multiple > translations. > > My earlier patch created the child-join plans in a temporary context > and then copied them into planner context since the translated clauses > were allocated memory in temporary memory context then. Now that they > are stored in planner's context, we can directly create the plan in > the planner's context. > > Third, I added code to free up child SpecialJoinInfos after using those. > > As a result the total memory consumption now is 192MB, which is approx > 4.4 times the memory consumed during planning in case of > non-partition-wise join. > >> >> Choosing representative child-joins: >> -------------------------------------------------- >> There's another angle to choosing representative child joins. In a >> partitioned N-way join, different joins covering different subsets of >> N relations, will have different size distributions across the >> partitions. This means that the child-joins costed for (N-k) joins, >> may be different for those required for (N-k+1) joins. With a factor >> of 1% sampling, N is such that a child-join participates in 100 joins, >> we will end up creating paths for all partitions before creating >> PartitionJoinPaths for the final N-way join. Hopefully that will be a >> rare case and usually we will end up using paths already created. We >> can not avoid creating PartitionJoinPaths for subset joins, as there >> might be cases when partition-wise join will be optimal for an N-k way >> join but not for N-way join. We may avoid this if we choose >> representative child-joins based on their positions, in which case, we >> may end up with some or all of those being empty and thus skewing the >> costs heavily. >> >> Partial paths >> ----------------- >> AFAIU, we create partial paths for append relation, when all the >> children have partial paths. Unlike parameterized paths or path with >> pathkeys, there is no way to create a partial path for a normal path. >> This means that unless we create paths for all child-joins, we can not >> create partial paths for appendrel comprising of child-joins, and thus >> can not use parallel query right now. This may not be that bad, since >> it would be more efficient to run each child-join in a separate >> worker, rather than using multiple workers for a single child-join. > > This still applies. > >> >> regression tests >> ---------------------- >> I observed that for small relations (1000 rows in each partition and >> 100 partitions), the size estimates in append relations and sum of >> those in child relations are very different. As a result, the >> extrapolated costs for PartitionJoinPaths as described above, are way >> higher than costs of join of appends (or even append of joins if we >> are to create paths for all child-joins). Thus with this approach, we >> choose partition-wise join for large number of partitions with large >> data (e.g. 1000 partitions with 1M rows each). These are certainly the >> cases when partition-wise join is a big win. I have not tried to find >> out a threshold above which partition-wise join gets chosen with above >> approach, but it's going to be a larger threshold. That makes writing >> regression tests difficult, as those will require large data. So, we >> have to find a way so that we can test partition-wise join with >> smaller data. There are few possibilities like 1. convert the fraction >> of representative child-joins into GUC and setting it to 100% would >> start choosing partition-wise joins for tables with a few hundred rows >> per partition, like it did in earlier approach, 2. provide a way to >> force partition-wise join whenever possible, by say costing >> partition-wise joins much lesser than non-partition-wise join when a >> GUC is set (e.g. enable_partition_wise_join with values always, never, >> optimal or something like that). >> > > For now I have added a float GUC partition_wise_plan_weight. The > partition-wise join cost derived from the samples is multiplied by > this GUC and set as the cost of ParitionJoinPath. A value of 1 means > that the cost derived from the samples are used as is. A value higher > than 1 discourages use of partition-wise join and that lower than 1 > encourages use of partition-wise join. I am not very keen on keeping > this GUC, in this form. But we need some way to run regression with > smaller data. > > For now I have disabled partition-wise join for multi-level > partitions. I will post a patch soon with that enabled. PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased on the latest code. Also, PFA patch to support partition-wise join between multi-level partitioned tables. I copied the Amit Langote's patch for translating partition hierarchy into inheritance hierarchy and added code to support partition-wise join. You had expressed some concerns about Amit's approach in [1], but that discussion is still open. So, I haven't merged those changes to partition-wise join patch. We may continue to work on it as separate patch or I can include it in partition-wise join main patch. BTW, INSERT into multi-level partitioned tables is crashing with latest head. The issue was reported in [2]. Because of that multi_level_partition_join test crashes in pg_dp_join_v6.patch. Intestingly the crash vanishes when we apply patch supporting mult-level partition-wise join. [1] https://www.postgresql.org/message-id/CA%2BTgmoaEU10Kmdy44izcqJYLh1fkh58_6sbGGu0Q4b7PPE46eA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased > on the latest code. Maybe not surprisingly given how fast things are moving around here these days, this needs a rebase. Apart from that, my overall comment on this patch is that it's huge: 37 files changed, 7993 insertions(+), 287 deletions(-) Now, more than half of that is regression test cases and their output, which you will certainly be asked to pare down in any version of this intended for commit. But even excluding those, it's still a fairly patch: 30 files changed, 2783 insertions(+), 272 deletions(-) I think the reason this is so large is because there's a fair amount of refactoring work that has been done as a precondition of the actual meat of the patch, and no attempt has been made to separate the refactoring work from the main body of the patch. I think that's something that needs to be done. If you look at the way Amit Langote submitted the partitioning patches and the follow-up bug fixes, he had a series of patches 0001-blah, 0002-quux, etc. generated using format-patch. Each patch had its own commit message written by him explaining the purpose of that patch, links to relevant discussion, etc. If you can separate this into more digestible chunks it will be easier to get committed. Other questions/comments: Why does find_partition_scheme need to copy the partition bound information instead of just pointing to it? Amit went to some trouble to make sure that this can't change under us while we hold a lock on the relation, and we'd better hold a lock on the relation if we're planning a query against it. I think the PartitionScheme stuff should live in the optimizer rather that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we eventually need a new file in the optimizer just for partitioning stuff, but I'm not sure about that yet. The fact that set_append_rel_size needs to reopen the relation to extract a few more bits of information is not desirable. You need to fish this information through in some other way; for example, you could have get_relation_info() stash the needed bits in the RelOptInfo. + * For two partitioned tables with the same partitioning scheme, it is + * assumed that the Oids of matching partitions from both the tables + * are placed at the same position in the array of partition oids in Rather than saying that we assume this, you should say why it has to be true. (If it doesn't have to be true, we shouldn't assume it.) + * join relations. Partition tables should have same layout as the + * parent table and hence should not need any translation. But rest of The same attributes have to be present with the same types, but they can be rearranged. This comment seems to imply the contrary. FRACTION_PARTS_TO_PLAN seems like it should be a GUC. + /* + * Add this relation to the list of samples ordered by the increasing + * number of rows at appropriate place. + */ + foreach (lc, ordered_child_nos) + { + int child_no = lfirst_int(lc); + RelOptInfo *other_childrel = rel->part_rels[child_no]; + + /* + * Keep track of child with lowest number of rows but higher than the + * that of the child being inserted. Insert the child before a + * child with highest number of rows lesser than it. + */ + if (child_rel->rows <= other_childrel->rows) + insert_after = lc; + else + break; + } Can we use quicksort instead of a hand-coded insertion sort? + if (bms_num_members(outer_relids) > 1) Seems like bms_get_singleton_member could be used. + * Partitioning scheme in join relation indicates a possibilty that the Spelling. There seems to be no reason for create_partition_plan to be separated from create_plan_recurse. You can just add another case for the new path type. Why does create_partition_join_path need to be separate from create_partition_join_path_with_pathkeys? Couldn't that be combined into a single function with a pathkeys argument that might sometimes be NIL? I assume most of the logic is common. From a sort of theoretical standpoint, the biggest danger of this patch seems to be that by deferring path creation until a later stage than normal, we could miss some important processing. subquery_planner() does a lot of stuff after expand_inherited_tables(); if any of those things, especially the ones that happen AFTER path generation, have an effect on the paths, then this code needs to compensate for those changes somehow. It seems like having the planning of unsampled children get deferred until create_plan() time is awfully surprising; here we are creating the plan and suddenly what used to be a straightforward path->plan translation is running around doing major planning work. I can't entirely justify it, but I somehow have a feeling that work ought to be moved earlier. Not sure exactly where. This is not really a full review, mostly because I can't easily figure out the motivation for all of the changes the patch makes. It makes a lot of changes in a lot of places, and it's not really very easy to understand why those changes are necessary. My comments above about splitting the patch into a series of patches that can potentially be reviewed and applied independently, with the main patch being the last in the series, are a suggestion as to how to tackle that. There might be some work that needs to or could be done on the comments, too. For example, the patch splits out add_paths_to_append_rel from set_append_rel_pathlist, but the comments don't say anything helpful like "we need to do X after Y, because Z". They just say that we do it. To some extent I think the comments in the optimizer have that problem generally, so it's not entirely the fault of this patch; still, the lack of those explanations makes the code reorganization harder to follow, and might confuse future patch authors, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Feb 2, 2017 at 2:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased >> on the latest code. > > Maybe not surprisingly given how fast things are moving around here > these days, this needs a rebase. > > Apart from that, my overall comment on this patch is that it's huge: > > 37 files changed, 7993 insertions(+), 287 deletions(-) > > Now, more than half of that is regression test cases and their output, > which you will certainly be asked to pare down in any version of this > intended for commit. Yes. I will work on that once the design and implementation is in acceptable state. I have already toned down testcases compared to the previous patch. > But even excluding those, it's still a fairly > patch: > > 30 files changed, 2783 insertions(+), 272 deletions(-) > > I think the reason this is so large is because there's a fair amount > of refactoring work that has been done as a precondition of the actual > meat of the patch, and no attempt has been made to separate the > refactoring work from the main body of the patch. I think that's > something that needs to be done. If you look at the way Amit Langote > submitted the partitioning patches and the follow-up bug fixes, he had > a series of patches 0001-blah, 0002-quux, etc. generated using > format-patch. Each patch had its own commit message written by him > explaining the purpose of that patch, links to relevant discussion, > etc. If you can separate this into more digestible chunks it will be > easier to get committed. I will try to break down the patch into smaller, easy-to-review, logically cohesive patches. > > Other questions/comments: > > Why does find_partition_scheme need to copy the partition bound > information instead of just pointing to it? Amit went to some trouble > to make sure that this can't change under us while we hold a lock on > the relation, and we'd better hold a lock on the relation if we're > planning a query against it. PartitionScheme is shared across multiple relations, join or base, partitioned similarly. Obviously it can't and does not need to point partition bound informations (which should all be same) of all those base relations. O the the face of it, it looks weird that it points to only one of them, mostly the one which it encounters first. But, since it's going to be the same partition bound information, it doesn't matter which one. So, I think, we can point of any one of those. Do you agree? > > I think the PartitionScheme stuff should live in the optimizer rather > that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we > eventually need a new file in the optimizer just for partitioning > stuff, but I'm not sure about that yet. I placed PartitionScheme stuff in partition.c because most of the functions and structures in partition.c are not visible outside that file. But I will try again to locate PartitionScheme to optimizer. > > The fact that set_append_rel_size needs to reopen the relation to > extract a few more bits of information is not desirable. You need to > fish this information through in some other way; for example, you > could have get_relation_info() stash the needed bits in the > RelOptInfo. I considered this option and discarded it, since not all partitioned relations will have OIDs for partitions e.g. partitioned joins will not have OIDs for their partitions. But now that I think of it, we should probably store those OIDs just for the base relation and leave them unused for non-base relations just like other base relation specific fields in RelOptInfo. > > + * For two partitioned tables with the same > partitioning scheme, it is > + * assumed that the Oids of matching partitions from > both the tables > + * are placed at the same position in the array of > partition oids in > > Rather than saying that we assume this, you should say why it has to > be true. (If it doesn't have to be true, we shouldn't assume it.) Will take care of this. > > + * join relations. Partition tables should have same > layout as the > + * parent table and hence should not need any > translation. But rest of > > The same attributes have to be present with the same types, but they > can be rearranged. This comment seems to imply the contrary. Hmm, will take care of this. > > FRACTION_PARTS_TO_PLAN seems like it should be a GUC. +1. Will take care of this. Does "representative_partitions_fraction" or "sample_partition_fraction" look like a good GUC name? Any other suggestions? > > + /* > + * Add this relation to the list of samples ordered by > the increasing > + * number of rows at appropriate place. > + */ > + foreach (lc, ordered_child_nos) > + { > + int child_no = lfirst_int(lc); > + RelOptInfo *other_childrel = rel->part_rels[child_no]; > + > + /* > + * Keep track of child with lowest number of > rows but higher than the > + * that of the child being inserted. Insert > the child before a > + * child with highest number of rows lesser than it. > + */ > + if (child_rel->rows <= other_childrel->rows) > + insert_after = lc; > + else > + break; > + } > > Can we use quicksort instead of a hand-coded insertion sort? I guess so, if I write comparison functions, which shouldn't be a problem. Will try that. > > + if (bms_num_members(outer_relids) > 1) > > Seems like bms_get_singleton_member could be used. > > + * Partitioning scheme in join relation indicates a possibilty that the > > Spelling. > > There seems to be no reason for create_partition_plan to be separated > from create_plan_recurse. You can just add another case for the new > path type. > > Why does create_partition_join_path need to be separate from > create_partition_join_path_with_pathkeys? Couldn't that be combined > into a single function with a pathkeys argument that might sometimes > be NIL? I assume most of the logic is common. > > From a sort of theoretical standpoint, the biggest danger of this > patch seems to be that by deferring path creation until a later stage > than normal, we could miss some important processing. > subquery_planner() does a lot of stuff after > expand_inherited_tables(); if any of those things, especially the ones > that happen AFTER path generation, have an effect on the paths, then > this code needs to compensate for those changes somehow. It seems > like having the planning of unsampled children get deferred until > create_plan() time is awfully surprising; here we are creating the > plan and suddenly what used to be a straightforward path->plan > translation is running around doing major planning work. I can't > entirely justify it, but I somehow have a feeling that work ought to > be moved earlier. Not sure exactly where. > > This is not really a full review, mostly because I can't easily figure > out the motivation for all of the changes the patch makes. It makes a > lot of changes in a lot of places, and it's not really very easy to > understand why those changes are necessary. My comments above about > splitting the patch into a series of patches that can potentially be > reviewed and applied independently, with the main patch being the last > in the series, are a suggestion as to how to tackle that. There might > be some work that needs to or could be done on the comments, too. For > example, the patch splits out add_paths_to_append_rel from > set_append_rel_pathlist, but the comments don't say anything helpful > like "we need to do X after Y, because Z". They just say that we do > it. To some extent I think the comments in the optimizer have that > problem generally, so it's not entirely the fault of this patch; > still, the lack of those explanations makes the code reorganization > harder to follow, and might confuse future patch authors, too. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
sent the previous mail before completing my reply. Sorry for that. Here's the rest of the reply. >> >> + if (bms_num_members(outer_relids) > 1) >> >> Seems like bms_get_singleton_member could be used. >> >> + * Partitioning scheme in join relation indicates a possibilty that the >> >> Spelling. Will take care of this. >> >> There seems to be no reason for create_partition_plan to be separated >> from create_plan_recurse. You can just add another case for the new >> path type. Will take care of this. >> >> Why does create_partition_join_path need to be separate from >> create_partition_join_path_with_pathkeys? Couldn't that be combined >> into a single function with a pathkeys argument that might sometimes >> be NIL? I assume most of the logic is common. Agreed. will take care of this. >> >> From a sort of theoretical standpoint, the biggest danger of this >> patch seems to be that by deferring path creation until a later stage >> than normal, we could miss some important processing. >> subquery_planner() does a lot of stuff after >> expand_inherited_tables(); if any of those things, especially the ones >> that happen AFTER path generation, have an effect on the paths, then >> this code needs to compensate for those changes somehow. It seems >> like having the planning of unsampled children get deferred until >> create_plan() time is awfully surprising; here we are creating the >> plan and suddenly what used to be a straightforward path->plan >> translation is running around doing major planning work. I can't >> entirely justify it, but I somehow have a feeling that work ought to >> be moved earlier. Not sure exactly where. I agree with this. Probably we should add a path tree mutator before SS_identify_outer_params() to replace any Partition*Paths with Merge/Append paths. The mutator will create paths for child-joins within temporary memory context, copy the relevant paths and create Merge/Append paths. There are two problems there 1. We have to write code to copy paths; most of the paths would be flat copy but custom scan paths might have some unexpected problems. 2. There will be many surviving PartitionPaths, and all the corresponding child paths would need copying and consume memory. In order to reduce that consumption, we have run this mutator after set_cheapest() in subquery_planner(); but then nothing interesting happens between that and create_plan(). Expanding PartitionPaths during create_plan() does not need any path copying and we expand only the PartitionPaths which will be converted to plans. That does save a lot of memory; the reason why we defer creating paths for child-joins. >> >> This is not really a full review, mostly because I can't easily figure >> out the motivation for all of the changes the patch makes. It makes a >> lot of changes in a lot of places, and it's not really very easy to >> understand why those changes are necessary. My comments above about >> splitting the patch into a series of patches that can potentially be >> reviewed and applied independently, with the main patch being the last >> in the series, are a suggestion as to how to tackle that. There might >> be some work that needs to or could be done on the comments, too. For >> example, the patch splits out add_paths_to_append_rel from >> set_append_rel_pathlist, but the comments don't say anything helpful >> like "we need to do X after Y, because Z". They just say that we do >> it. To some extent I think the comments in the optimizer have that >> problem generally, so it's not entirely the fault of this patch; >> still, the lack of those explanations makes the code reorganization >> harder to follow, and might confuse future patch authors, too. Specifically about add_paths_to_append_rel(), what do you expect the comment to say? It would be obvious why we split that functionality into a separate function: in fact, we don't necessarily explain why certain code resides in a separate function in the comments. I think, that particular comment (or for that matter other such comments in the optimizer) can be removed altogether, since it just writes the function names as an "English" sentence. I sometimes find those comments useful, because I can read just those comments and forget about the code, making comprehension easy. If highlighting is ON, your brain habitually ignores the non-comment portions when required. I am open to suggestions. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Per your suggestion I have split the patch into many smaller patches. 0001-Refactor-set_append_rel_pathlist.patch 0002-Refactor-make_join_rel.patch 0003-Refactor-adjust_appendrel_attrs.patch 0004-Refactor-build_join_rel.patch 0005-Add-function-find_param_path_info.patch These four refactor existing code. 0006-Canonical-partition-scheme.patch 0007-Partition-wise-join-tests.patch -- just tests, they fail 0008-Partition-wise-join.patch -- actual patch implementing partition-wise join, still some tests fail\ 0009-Adjust-join-related-to-code-to-accept-child-relation.patch 0010-Parameterized-path-fixes.patch 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch The last three patches change existing code to expect child(-join) relations where they were not expected earlier. Each patch has summary of the changes. Partition-wise join for multi-level partitioned tables is not covered by these patches. I will post those patches soon. > >> >> Other questions/comments: >> >> Why does find_partition_scheme need to copy the partition bound >> information instead of just pointing to it? Amit went to some trouble >> to make sure that this can't change under us while we hold a lock on >> the relation, and we'd better hold a lock on the relation if we're >> planning a query against it. > > PartitionScheme is shared across multiple relations, join or base, > partitioned similarly. Obviously it can't and does not need to point > partition bound informations (which should all be same) of all those > base relations. O the the face of it, it looks weird that it points to > only one of them, mostly the one which it encounters first. But, since > it's going to be the same partition bound information, it doesn't > matter which one. So, I think, we can point of any one of those. Do > you agree? Instead of copying PartitionBoundInfo, used pointer of the first encountered one. > >> >> I think the PartitionScheme stuff should live in the optimizer rather >> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we >> eventually need a new file in the optimizer just for partitioning >> stuff, but I'm not sure about that yet. > > I placed PartitionScheme stuff in partition.c because most of the > functions and structures in partition.c are not visible outside that > file. But I will try again to locate PartitionScheme to optimizer. Moved the code as per your suggestion. > >> >> The fact that set_append_rel_size needs to reopen the relation to >> extract a few more bits of information is not desirable. You need to >> fish this information through in some other way; for example, you >> could have get_relation_info() stash the needed bits in the >> RelOptInfo. > > I considered this option and discarded it, since not all partitioned > relations will have OIDs for partitions e.g. partitioned joins will > not have OIDs for their partitions. But now that I think of it, we > should probably store those OIDs just for the base relation and leave > them unused for non-base relations just like other base relation > specific fields in RelOptInfo. Changed as per your suggestions. > >> >> + * For two partitioned tables with the same >> partitioning scheme, it is >> + * assumed that the Oids of matching partitions from >> both the tables >> + * are placed at the same position in the array of >> partition oids in >> >> Rather than saying that we assume this, you should say why it has to >> be true. (If it doesn't have to be true, we shouldn't assume it.) > > Will take care of this. Done. Please check. > >> >> + * join relations. Partition tables should have same >> layout as the >> + * parent table and hence should not need any >> translation. But rest of >> >> The same attributes have to be present with the same types, but they >> can be rearranged. This comment seems to imply the contrary. > > Hmm, will take care of this. Done. > >> >> FRACTION_PARTS_TO_PLAN seems like it should be a GUC. > > +1. Will take care of this. Does "representative_partitions_fraction" > or "sample_partition_fraction" look like a good GUC name? Any other > suggestions? used "sample_partition_fraction" for now. Suggestions are welcome. > >> >> + /* >> + * Add this relation to the list of samples ordered by >> the increasing >> + * number of rows at appropriate place. >> + */ >> + foreach (lc, ordered_child_nos) >> + { >> + int child_no = lfirst_int(lc); >> + RelOptInfo *other_childrel = rel->part_rels[child_no]; >> + >> + /* >> + * Keep track of child with lowest number of >> rows but higher than the >> + * that of the child being inserted. Insert >> the child before a >> + * child with highest number of rows lesser than it. >> + */ >> + if (child_rel->rows <= other_childrel->rows) >> + insert_after = lc; >> + else >> + break; >> + } >> >> Can we use quicksort instead of a hand-coded insertion sort? > > I guess so, if I write comparison functions, which shouldn't be a > problem. Will try that. Done. > >> >> + if (bms_num_members(outer_relids) > 1) >> >> Seems like bms_get_singleton_member could be used. That code is not required any more. >> >> + * Partitioning scheme in join relation indicates a possibilty that the >> >> Spelling. Done. >> >> There seems to be no reason for create_partition_plan to be separated >> from create_plan_recurse. You can just add another case for the new >> path type. Done. >> >> Why does create_partition_join_path need to be separate from >> create_partition_join_path_with_pathkeys? Couldn't that be combined >> into a single function with a pathkeys argument that might sometimes >> be NIL? I assume most of the logic is common. Combined those into a single function. >> >> From a sort of theoretical standpoint, the biggest danger of this >> patch seems to be that by deferring path creation until a later stage >> than normal, we could miss some important processing. >> subquery_planner() does a lot of stuff after >> expand_inherited_tables(); if any of those things, especially the ones >> that happen AFTER path generation, have an effect on the paths, then >> this code needs to compensate for those changes somehow. It seems >> like having the planning of unsampled children get deferred until >> create_plan() time is awfully surprising; here we are creating the >> plan and suddenly what used to be a straightforward path->plan >> translation is running around doing major planning work. I can't >> entirely justify it, but I somehow have a feeling that work ought to >> be moved earlier. Not sure exactly where. Pasting my previous replies here to keep everything in one mail. I agree with this. Probably we should add a path tree mutator before SS_identify_outer_params() to replace any Partition*Paths with Merge/Append paths. The mutator will create paths for child-joins within temporary memory context, copy the relevant paths and create Merge/Append paths. There are two problems there 1. We have to write code to copy paths; most of the paths would be flat copy but custom scan paths might have some unexpected problems. 2. There will be many surviving PartitionPaths, and all the corresponding child paths would need copying and consume memory. In order to reduce that consumption, we have run this mutator after set_cheapest() in subquery_planner(); but then nothing interesting happens between that and create_plan(). Expanding PartitionPaths during create_plan() does not need any path copying and we expand only the PartitionPaths which will be converted to plans. That does save a lot of memory; the reason why we defer creating paths for child-joins. >> >> This is not really a full review, mostly because I can't easily figure >> out the motivation for all of the changes the patch makes. It makes a >> lot of changes in a lot of places, and it's not really very easy to >> understand why those changes are necessary. My comments above about >> splitting the patch into a series of patches that can potentially be >> reviewed and applied independently, with the main patch being the last >> in the series, are a suggestion as to how to tackle that. There might >> be some work that needs to or could be done on the comments, too. For >> example, the patch splits out add_paths_to_append_rel from >> set_append_rel_pathlist, but the comments don't say anything helpful >> like "we need to do X after Y, because Z". They just say that we do >> it. To some extent I think the comments in the optimizer have that >> problem generally, so it's not entirely the fault of this patch; >> still, the lack of those explanations makes the code reorganization >> harder to follow, and might confuse future patch authors, too. >> Specifically about add_paths_to_append_rel(), what do you expect the comment to say? It would be obvious why we split that functionality into a separate function: in fact, we don't necessarily explain why certain code resides in a separate function in the comments. I think, that particular comment (or for that matter other such comments in the optimizer) can be removed altogether, since it just writes the function names as an "English" sentence. I sometimes find those comments useful, because I can read just those comments and forget about the code, making comprehension easy. If highlighting is ON, your brain habitually ignores the non-comment portions when required. I am open to suggestions. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
- 0001-Refactor-set_append_rel_pathlist.patch
- 0002-Refactor-make_join_rel.patch
- 0003-Refactor-adjust_appendrel_attrs.patch
- 0004-Refactor-build_join_rel.patch
- 0005-Add-function-find_param_path_info.patch
- 0006-Canonical-partition-scheme.patch
- 0007-Partition-wise-join-tests.patch
- 0008-Partition-wise-join.patch
- 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
- 0010-Parameterized-path-fixes.patch
- 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Fixed a problem with the way qsort was being used in the earlier set of patches. Attached PFA the set of patches with that fixed. On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Per your suggestion I have split the patch into many smaller patches. > > 0001-Refactor-set_append_rel_pathlist.patch > 0002-Refactor-make_join_rel.patch > 0003-Refactor-adjust_appendrel_attrs.patch > 0004-Refactor-build_join_rel.patch > 0005-Add-function-find_param_path_info.patch > > These four refactor existing code. > > 0006-Canonical-partition-scheme.patch > 0007-Partition-wise-join-tests.patch -- just tests, they fail > 0008-Partition-wise-join.patch -- actual patch implementing > partition-wise join, still some tests fail\ > > 0009-Adjust-join-related-to-code-to-accept-child-relation.patch > 0010-Parameterized-path-fixes.patch > 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch > > The last three patches change existing code to expect child(-join) > relations where they were not expected earlier. > > Each patch has summary of the changes. > > Partition-wise join for multi-level partitioned tables is not covered > by these patches. I will post those patches soon. > >> >>> >>> Other questions/comments: >>> >>> Why does find_partition_scheme need to copy the partition bound >>> information instead of just pointing to it? Amit went to some trouble >>> to make sure that this can't change under us while we hold a lock on >>> the relation, and we'd better hold a lock on the relation if we're >>> planning a query against it. >> >> PartitionScheme is shared across multiple relations, join or base, >> partitioned similarly. Obviously it can't and does not need to point >> partition bound informations (which should all be same) of all those >> base relations. O the the face of it, it looks weird that it points to >> only one of them, mostly the one which it encounters first. But, since >> it's going to be the same partition bound information, it doesn't >> matter which one. So, I think, we can point of any one of those. Do >> you agree? > > Instead of copying PartitionBoundInfo, used pointer of the first > encountered one. > >> >>> >>> I think the PartitionScheme stuff should live in the optimizer rather >>> that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we >>> eventually need a new file in the optimizer just for partitioning >>> stuff, but I'm not sure about that yet. >> >> I placed PartitionScheme stuff in partition.c because most of the >> functions and structures in partition.c are not visible outside that >> file. But I will try again to locate PartitionScheme to optimizer. > > Moved the code as per your suggestion. > >> >>> >>> The fact that set_append_rel_size needs to reopen the relation to >>> extract a few more bits of information is not desirable. You need to >>> fish this information through in some other way; for example, you >>> could have get_relation_info() stash the needed bits in the >>> RelOptInfo. >> >> I considered this option and discarded it, since not all partitioned >> relations will have OIDs for partitions e.g. partitioned joins will >> not have OIDs for their partitions. But now that I think of it, we >> should probably store those OIDs just for the base relation and leave >> them unused for non-base relations just like other base relation >> specific fields in RelOptInfo. > > Changed as per your suggestions. > >> >>> >>> + * For two partitioned tables with the same >>> partitioning scheme, it is >>> + * assumed that the Oids of matching partitions from >>> both the tables >>> + * are placed at the same position in the array of >>> partition oids in >>> >>> Rather than saying that we assume this, you should say why it has to >>> be true. (If it doesn't have to be true, we shouldn't assume it.) >> >> Will take care of this. > > Done. Please check. > >> >>> >>> + * join relations. Partition tables should have same >>> layout as the >>> + * parent table and hence should not need any >>> translation. But rest of >>> >>> The same attributes have to be present with the same types, but they >>> can be rearranged. This comment seems to imply the contrary. >> >> Hmm, will take care of this. > > Done. > >> >>> >>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC. >> >> +1. Will take care of this. Does "representative_partitions_fraction" >> or "sample_partition_fraction" look like a good GUC name? Any other >> suggestions? > > used "sample_partition_fraction" for now. Suggestions are welcome. > >> >>> >>> + /* >>> + * Add this relation to the list of samples ordered by >>> the increasing >>> + * number of rows at appropriate place. >>> + */ >>> + foreach (lc, ordered_child_nos) >>> + { >>> + int child_no = lfirst_int(lc); >>> + RelOptInfo *other_childrel = rel->part_rels[child_no]; >>> + >>> + /* >>> + * Keep track of child with lowest number of >>> rows but higher than the >>> + * that of the child being inserted. Insert >>> the child before a >>> + * child with highest number of rows lesser than it. >>> + */ >>> + if (child_rel->rows <= other_childrel->rows) >>> + insert_after = lc; >>> + else >>> + break; >>> + } >>> >>> Can we use quicksort instead of a hand-coded insertion sort? >> >> I guess so, if I write comparison functions, which shouldn't be a >> problem. Will try that. > > Done. > >> >>> >>> + if (bms_num_members(outer_relids) > 1) >>> >>> Seems like bms_get_singleton_member could be used. > > That code is not required any more. > >>> >>> + * Partitioning scheme in join relation indicates a possibilty that the >>> >>> Spelling. > > Done. > >>> >>> There seems to be no reason for create_partition_plan to be separated >>> from create_plan_recurse. You can just add another case for the new >>> path type. > > Done. > >>> >>> Why does create_partition_join_path need to be separate from >>> create_partition_join_path_with_pathkeys? Couldn't that be combined >>> into a single function with a pathkeys argument that might sometimes >>> be NIL? I assume most of the logic is common. > > Combined those into a single function. > >>> >>> From a sort of theoretical standpoint, the biggest danger of this >>> patch seems to be that by deferring path creation until a later stage >>> than normal, we could miss some important processing. >>> subquery_planner() does a lot of stuff after >>> expand_inherited_tables(); if any of those things, especially the ones >>> that happen AFTER path generation, have an effect on the paths, then >>> this code needs to compensate for those changes somehow. It seems >>> like having the planning of unsampled children get deferred until >>> create_plan() time is awfully surprising; here we are creating the >>> plan and suddenly what used to be a straightforward path->plan >>> translation is running around doing major planning work. I can't >>> entirely justify it, but I somehow have a feeling that work ought to >>> be moved earlier. Not sure exactly where. > > Pasting my previous replies here to keep everything in one mail. > > I agree with this. Probably we should add a path tree mutator before > SS_identify_outer_params() to replace any Partition*Paths with > Merge/Append paths. The mutator will create paths for child-joins > within temporary memory context, copy the relevant paths and create > Merge/Append paths. There are two problems there 1. We have to write > code to copy paths; most of the paths would be flat copy but custom > scan paths might have some unexpected problems. 2. There will be many > surviving PartitionPaths, and all the corresponding child paths would > need copying and consume memory. In order to reduce that consumption, > we have run this mutator after set_cheapest() in subquery_planner(); > but then nothing interesting happens between that and create_plan(). > Expanding PartitionPaths during create_plan() does not need any path > copying and we expand only the PartitionPaths which will be converted > to plans. That does save a lot of memory; the reason why we defer > creating paths for child-joins. > >>> >>> This is not really a full review, mostly because I can't easily figure >>> out the motivation for all of the changes the patch makes. It makes a >>> lot of changes in a lot of places, and it's not really very easy to >>> understand why those changes are necessary. My comments above about >>> splitting the patch into a series of patches that can potentially be >>> reviewed and applied independently, with the main patch being the last >>> in the series, are a suggestion as to how to tackle that. There might >>> be some work that needs to or could be done on the comments, too. For >>> example, the patch splits out add_paths_to_append_rel from >>> set_append_rel_pathlist, but the comments don't say anything helpful >>> like "we need to do X after Y, because Z". They just say that we do >>> it. To some extent I think the comments in the optimizer have that >>> problem generally, so it's not entirely the fault of this patch; >>> still, the lack of those explanations makes the code reorganization >>> harder to follow, and might confuse future patch authors, too. >>> > > Specifically about add_paths_to_append_rel(), what do you expect the > comment to say? It would be obvious why we split that functionality > into a separate function: in fact, we don't necessarily explain why > certain code resides in a separate function in the comments. I think, > that particular comment (or for that matter other such comments in the > optimizer) can be removed altogether, since it just writes the > function names as an "English" sentence. I sometimes find those > comments useful, because I can read just those comments and forget > about the code, making comprehension easy. If highlighting is ON, your > brain habitually ignores the non-comment portions when required. I am > open to suggestions. > > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
- 0001-Refactor-set_append_rel_pathlist.patch
- 0002-Refactor-make_join_rel.patch
- 0003-Refactor-adjust_appendrel_attrs.patch
- 0004-Refactor-build_join_rel.patch
- 0005-Add-function-find_param_path_info.patch
- 0006-Canonical-partition-scheme.patch
- 0007-Partition-wise-join-tests.patch
- 0008-Partition-wise-join.patch
- 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
- 0010-Parameterized-path-fixes.patch
- 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Here is set of patches with support for partition-wise join between multi-level partitioned tables. On Fri, Feb 10, 2017 at 11:19 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Fixed a problem with the way qsort was being used in the earlier set > of patches. Attached PFA the set of patches with that fixed. This fix is included. > > On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Per your suggestion I have split the patch into many smaller patches. >> >> 0001-Refactor-set_append_rel_pathlist.patch >> 0002-Refactor-make_join_rel.patch >> 0003-Refactor-adjust_appendrel_attrs.patch >> 0004-Refactor-build_join_rel.patch >> 0005-Add-function-find_param_path_info.patch >> >> These four refactor existing code. >> >> 0006-Canonical-partition-scheme.patch >> 0007-Partition-wise-join-tests.patch -- just tests, they fail >> 0008-Partition-wise-join.patch -- actual patch implementing >> partition-wise join, still some tests fail\ >> >> 0009-Adjust-join-related-to-code-to-accept-child-relation.patch >> 0010-Parameterized-path-fixes.patch >> 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch >> patch to translate partition hierarchy into inheritance hierarchy without flattening 0012-Multi-level-partitioned-table-expansion.patch patches for multi-level partition-wise join support 0013-Multi-level-partition-wise-join-tests.patch 0014-Multi-level-partition-wise-join-support.patch -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
- 0001-Refactor-set_append_rel_pathlist.patch
- 0002-Refactor-make_join_rel.patch
- 0003-Refactor-adjust_appendrel_attrs.patch
- 0004-Refactor-build_join_rel.patch
- 0005-Add-function-find_param_path_info.patch
- 0006-Canonical-partition-scheme.patch
- 0007-Partition-wise-join-tests.patch
- 0008-Partition-wise-join-implementation.patch
- 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
- 0010-Parameterized-path-fixes.patch
- 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
- 0012-Multi-level-partitioned-table-expansion.patch
- 0013-Multi-level-partition-wise-join-tests.patch
- 0014-Multi-level-partition-wise-join-support.patch
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Feb 6, 2017 at 3:34 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > PartitionScheme is shared across multiple relations, join or base, > partitioned similarly. Obviously it can't and does not need to point > partition bound informations (which should all be same) of all those > base relations. O the the face of it, it looks weird that it points to > only one of them, mostly the one which it encounters first. But, since > it's going to be the same partition bound information, it doesn't > matter which one. So, I think, we can point of any one of those. Do > you agree? Yes. >> The fact that set_append_rel_size needs to reopen the relation to >> extract a few more bits of information is not desirable. You need to >> fish this information through in some other way; for example, you >> could have get_relation_info() stash the needed bits in the >> RelOptInfo. > > I considered this option and discarded it, since not all partitioned > relations will have OIDs for partitions e.g. partitioned joins will > not have OIDs for their partitions. But now that I think of it, we > should probably store those OIDs just for the base relation and leave > them unused for non-base relations just like other base relation > specific fields in RelOptInfo. Right. >> FRACTION_PARTS_TO_PLAN seems like it should be a GUC. > > +1. Will take care of this. Does "representative_partitions_fraction" > or "sample_partition_fraction" look like a good GUC name? Any other > suggestions? I like the second one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> > 2. If the PartitionJoinPath emerges as the best path, we create paths > for each of the remaining child-joins. Then we collect paths with > properties same as the given PartitionJoinPath, one from each > child-join. These paths are converted into plans and a Merge/Append > plan is created combing these plans. The paths and plans for > child-join are created in a temporary memory context. The final plan > for each child-join is copied into planner's context and the temporary > memory context is reset. > Robert and I discussed this in more detail. Path creation code may allocate objects other than paths. postgres_fdw, for example, allocates character array to hold the name of relation being pushed-down. When the temporary context gets zapped after creating paths for a given child-join, those other objects also gets thrown away. Attached patch has implemented the idea that came out of the discussion. We create a memory context for holding paths at the time of creating PlannerGlobal and save it in PlannerGlobal. The patch introduces a new macro makePathNode() which allocates the memory for given type of path from this context. Every create_*_path function has been changed to use this macro instead of makeNode(). In standard_planner(), at the end of planning we destroy the memory context freeing all the paths allocated. While creating a plan node, planner copies everything required by the plan from the path, so the path is not needed any more. So, freeing corresponding memory should not have any adverse effects. Most of the create_*_path() functions accept root as an argument, thus the temporary path context is available through root->glob everywhere. An exception is create_append_path() which does not accept root as an argument. The patch changes create_append_path() and its callers like set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an argument. Ideally paths are not required after creating plan, so we should be able to free the context right after the call to create_plan(). But we need dummy paths while creating flat rtable in set_plan_references()->add_rtes_to_flat_rtable(). We used to So free the path context at the end of planning cycle. Now that we are allocating all the paths in a different memory context, it doesn't make sense to switch context in mark_dummy_rel(). 0001 patch implements the idea described above. 0002 patch adds instrumentation to measure memory consumed in standard_planner() call. 0003 patch adds a GUC zap_paths to enable/disable destroying path context. The last two patches are for testing only. Attached also find the SQL script and its output showing the memory saved. For a 5 way self-join of pg_class, the total memory consumed in standard_planner() is 760K without patch and with patch it comes down to 713K, saving 47K memory otherwise occupied by paths. It looks like something useful even without partition-wise joins. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Updated 0001 patch with some more comments. Attaching all the patches for quick access. On Wed, Mar 1, 2017 at 2:26 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> >> 2. If the PartitionJoinPath emerges as the best path, we create paths >> for each of the remaining child-joins. Then we collect paths with >> properties same as the given PartitionJoinPath, one from each >> child-join. These paths are converted into plans and a Merge/Append >> plan is created combing these plans. The paths and plans for >> child-join are created in a temporary memory context. The final plan >> for each child-join is copied into planner's context and the temporary >> memory context is reset. >> > > Robert and I discussed this in more detail. Path creation code may > allocate objects other than paths. postgres_fdw, for example, > allocates character array to hold the name of relation being > pushed-down. When the temporary context gets zapped after creating > paths for a given child-join, those other objects also gets thrown > away. Attached patch has implemented the idea that came out of the > discussion. > > We create a memory context for holding paths at the time of creating > PlannerGlobal and save it in PlannerGlobal. The patch introduces a new > macro makePathNode() which allocates the memory for given type of path > from this context. Every create_*_path function has been changed to > use this macro instead of makeNode(). In standard_planner(), at the > end of planning we destroy the memory context freeing all the paths > allocated. While creating a plan node, planner copies everything > required by the plan from the path, so the path is not needed any > more. So, freeing corresponding memory should not have any adverse > effects. > > Most of the create_*_path() functions accept root as an argument, thus > the temporary path context is available through root->glob everywhere. > An exception is create_append_path() which does not accept root as an > argument. The patch changes create_append_path() and its callers like > set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an > argument. Ideally paths are not required after creating plan, so we > should be > able to free the context right after the call to create_plan(). But we > need dummy paths while creating flat rtable in > set_plan_references()->add_rtes_to_flat_rtable(). We used to So free > the path context at the end of planning cycle. Now that we are > allocating all the paths in a different memory context, it doesn't > make sense to switch context in mark_dummy_rel(). > > 0001 patch implements the idea described above. > 0002 patch adds instrumentation to measure memory consumed in > standard_planner() call. > 0003 patch adds a GUC zap_paths to enable/disable destroying path context. > The last two patches are for testing only. > > Attached also find the SQL script and its output showing the memory > saved. For a 5 way self-join of pg_class, the total memory consumed in > standard_planner() is 760K without patch and with patch it comes down > to 713K, saving 47K memory otherwise occupied by paths. It looks like > something useful even without partition-wise joins. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> 2. If the PartitionJoinPath emerges as the best path, we create paths >> for each of the remaining child-joins. Then we collect paths with >> properties same as the given PartitionJoinPath, one from each >> child-join. These paths are converted into plans and a Merge/Append >> plan is created combing these plans. The paths and plans for >> child-join are created in a temporary memory context. The final plan >> for each child-join is copied into planner's context and the temporary >> memory context is reset. >> > > Robert and I discussed this in more detail. Path creation code may > allocate objects other than paths. postgres_fdw, for example, > allocates character array to hold the name of relation being > pushed-down. When the temporary context gets zapped after creating > paths for a given child-join, those other objects also gets thrown > away. Attached patch has implemented the idea that came out of the > discussion. > > We create a memory context for holding paths at the time of creating > PlannerGlobal and save it in PlannerGlobal. The patch introduces a new > macro makePathNode() which allocates the memory for given type of path > from this context. Every create_*_path function has been changed to > use this macro instead of makeNode(). In standard_planner(), at the > end of planning we destroy the memory context freeing all the paths > allocated. While creating a plan node, planner copies everything > required by the plan from the path, so the path is not needed any > more. So, freeing corresponding memory should not have any adverse > effects. > > Most of the create_*_path() functions accept root as an argument, thus > the temporary path context is available through root->glob everywhere. > An exception is create_append_path() which does not accept root as an > argument. The patch changes create_append_path() and its callers like > set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an > argument. Ideally paths are not required after creating plan, so we > should be > able to free the context right after the call to create_plan(). But we > need dummy paths while creating flat rtable in > set_plan_references()->add_rtes_to_flat_rtable(). We used to So free > the path context at the end of planning cycle. Now that we are > allocating all the paths in a different memory context, it doesn't > make sense to switch context in mark_dummy_rel(). > > 0001 patch implements the idea described above. > 0002 patch adds instrumentation to measure memory consumed in > standard_planner() call. > 0003 patch adds a GUC zap_paths to enable/disable destroying path context. > The last two patches are for testing only. > > Attached also find the SQL script and its output showing the memory > saved. For a 5 way self-join of pg_class, the total memory consumed in > standard_planner() is 760K without patch and with patch it comes down > to 713K, saving 47K memory otherwise occupied by paths. It looks like > something useful even without partition-wise joins. Of course, that's not a lot, but the savings will be a lot better for partition-wise joins. Do you have a set of patches for that feature that apply on top of 0001? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
PFA the zip containing all the patches rebased on 56018bf26eec1a0b4bf20303c98065a8eb1b0c5d and contain the patch to free memory consumed by paths using a separate path context. There are some more changes wrt earlier set of patches 1. Since we don't need a separate context for planning for each child_join, changed code in create_partition_join_plan() to not do that. The function collects all child_join paths into merge/append path and calls create_plan_recurse() on that path instead of converting each child_join path to plan one at a time. 2. Changed optimizer/README and some comments referring to temporary memory context, since we do not use that anymore. 3. reparameterize_path_by_child() is fixed to translate the merge and hash clause in Hash/Merge path. On Thu, Mar 9, 2017 at 6:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> 2. If the PartitionJoinPath emerges as the best path, we create paths >>> for each of the remaining child-joins. Then we collect paths with >>> properties same as the given PartitionJoinPath, one from each >>> child-join. These paths are converted into plans and a Merge/Append >>> plan is created combing these plans. The paths and plans for >>> child-join are created in a temporary memory context. The final plan >>> for each child-join is copied into planner's context and the temporary >>> memory context is reset. >>> >> >> Robert and I discussed this in more detail. Path creation code may >> allocate objects other than paths. postgres_fdw, for example, >> allocates character array to hold the name of relation being >> pushed-down. When the temporary context gets zapped after creating >> paths for a given child-join, those other objects also gets thrown >> away. Attached patch has implemented the idea that came out of the >> discussion. >> >> We create a memory context for holding paths at the time of creating >> PlannerGlobal and save it in PlannerGlobal. The patch introduces a new >> macro makePathNode() which allocates the memory for given type of path >> from this context. Every create_*_path function has been changed to >> use this macro instead of makeNode(). In standard_planner(), at the >> end of planning we destroy the memory context freeing all the paths >> allocated. While creating a plan node, planner copies everything >> required by the plan from the path, so the path is not needed any >> more. So, freeing corresponding memory should not have any adverse >> effects. >> >> Most of the create_*_path() functions accept root as an argument, thus >> the temporary path context is available through root->glob everywhere. >> An exception is create_append_path() which does not accept root as an >> argument. The patch changes create_append_path() and its callers like >> set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an >> argument. Ideally paths are not required after creating plan, so we >> should be >> able to free the context right after the call to create_plan(). But we >> need dummy paths while creating flat rtable in >> set_plan_references()->add_rtes_to_flat_rtable(). We used to So free >> the path context at the end of planning cycle. Now that we are >> allocating all the paths in a different memory context, it doesn't >> make sense to switch context in mark_dummy_rel(). >> >> 0001 patch implements the idea described above. >> 0002 patch adds instrumentation to measure memory consumed in >> standard_planner() call. >> 0003 patch adds a GUC zap_paths to enable/disable destroying path context. >> The last two patches are for testing only. >> >> Attached also find the SQL script and its output showing the memory >> saved. For a 5 way self-join of pg_class, the total memory consumed in >> standard_planner() is 760K without patch and with patch it comes down >> to 713K, saving 47K memory otherwise occupied by paths. It looks like >> something useful even without partition-wise joins. > > Of course, that's not a lot, but the savings will be a lot better for > partition-wise joins. Do you have a set of patches for that feature > that apply on top of 0001? > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Mar 10, 2017 at 5:43 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > PFA the zip containing all the patches rebased on > 56018bf26eec1a0b4bf20303c98065a8eb1b0c5d and contain the patch to free > memory consumed by paths using a separate path context. Some very high-level thoughts based on a look through these patches: In 0001, you've removed a comment about how GEQO needs special handling, but it doesn't look as if you've made any compensating change elsewhere. That seems unlikely to be correct. If GEQO needs some paths to survive longer than others, how can it be right for this code to create them all in the same context? Incidentally, geqo_eval() seems to be an existing precedent for the idea of throwing away paths and RelOptInfos, so we might want to use similar code for partitionwise join. 0002 and 0003 look OK. Probably 0004 is OK too, although that seems to be adding some overhead to existing callers for the benefit of new ones. Might be insignificant, though. 0005 looks OK, except that add_join_rel's definition is missing a "static" qualifier. That's not just cosmetic; based on previous expereince, this will break the BF. 0006 seems to be unnecessary; the new function isn't used in later patches. Haven't looked at 0007 yet. 0008 is, as previously mentioned, more than we probably want to commit. Haven't looked at 0009 yet. 0010 - 0012 seem to be various fixes which would need to be done before or along with 0009, rather than afterward, so I am confused about the ordering of those patches in the patch series. The commit message for 0013 is a bit unclear about what it's doing, although I can guess, a bit, based on the commit message for 0007. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Haven't looked at 0007 yet. + if (rel->part_scheme) + { + int cnt_parts; + + for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++) + { + if (rel->part_oids[cnt_parts] == childRTE->relid) + { + Assert(!rel->part_rels[cnt_parts]); + rel->part_rels[cnt_parts] = childrel; + } + } + } It's not very appealing to use an O(n^2) algorithm here. I wonder if we could arrange things so that inheritance expansion expands partitions in the right order, and then we could just match them up one-to-one. This would probably require an alternate version of find_all_inheritors() that expand_inherited_rtentry() would call only for partitioned tables. Failing that, another idea would be to use qsort() or qsort_arg() to put the partitions in the right order. + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || + !inhparent || + !(rel->part_scheme = find_partition_scheme(root, relation))) Maybe just don't call this function in the first place in the !inhparent case, instead of passing down an argument that must always be true. + /* Match the partition key types. */ + for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++) + { + /* + * For types, it suffices to match the type id, mod and collation; + * len, byval and align are depedent on the first two. + */ + if (part_key->partopfamily[cnt_pks] != part_scheme->partopfamily[cnt_pks] || + part_key->partopcintype[cnt_pks] != part_scheme->partopcintype[cnt_pks] || + part_key->parttypid[cnt_pks] != part_scheme->key_types[cnt_pks] || + part_key->parttypmod[cnt_pks] != part_scheme->key_typmods[cnt_pks] || + part_key->parttypcoll[cnt_pks] != part_scheme->key_collations[cnt_pks]) + break; + } I think memcmp() might be better than a for-loop. Overall this one looks pretty good and straightforward. Of course, I haven't looked at the main act (0009) yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Amit Langote
Дата:
On 2017/03/14 9:17, Robert Haas wrote: > On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Haven't looked at 0007 yet. > > Overall this one looks pretty good and straightforward. In the following code of find_partition_scheme(): + /* Did not find matching partition scheme. Create one. */ + part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData)); + + /* Copy partition bounds/lists. */ + part_scheme->nparts = part_desc->nparts; + part_scheme->strategy = part_key->strategy; + part_scheme->boundinfo = part_desc->boundinfo; + + /* Store partition key information. */ + part_scheme->partnatts = part_key->partnatts; + + part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts); + memcpy(part_scheme->partopfamily, part_key->partopfamily, + sizeof(Oid) * partnatts); + + part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts); + memcpy(part_scheme->partopcintype, part_key->partopcintype, + sizeof(Oid) * partnatts); + + part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts); + memcpy(part_scheme->key_types, part_key->parttypid, + sizeof(Oid) * partnatts); + + part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts); + memcpy(part_scheme->key_typmods, part_key->parttypmod, + sizeof(int32) * partnatts); + + part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts); + memcpy(part_scheme->key_collations, part_key->parttypcoll, + sizeof(Oid) * partnatts); Couldn't we avoid the memcpy() on individual members of part_key? After all, RelationData.rd_partkey is guarded just like rd_partdesc by relcache.c in face of invalidations (see keep_partkey logic in RelationClearRelation). Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Thanks for the review. > > Some very high-level thoughts based on a look through these patches: > > In 0001, you've removed a comment about how GEQO needs special > handling, but it doesn't look as if you've made any compensating > change elsewhere. That seems unlikely to be correct. If GEQO needs > some paths to survive longer than others, how can it be right for this > code to create them all in the same context? Thanks for pointing that out. I have replaced the code and the comments back. There was another issue that the temporary paths created by geqo will not be freed when geqo moves one genetic string to next genetic string (or what it calls as tour: an list of relation to be joined in a given order). To fix this, we need to set the path context to the temporary context of geqo inside geqo_eval() before calling gimme_tree() and reset it later. That way the temporary paths are also created in the temporary memory context of geqo. Fixed in the patch. > Incidentally, > geqo_eval() seems to be an existing precedent for the idea of throwing > away paths and RelOptInfos, so we might want to use similar code for > partitionwise join. There are some differences in what geqo does and what partition-wise needs to do. geqo tries many joining orders each one in a separate temporary context. The way geqo slices the work, every slice produces a full plan. For partition-wise join I do not see a way to slice the work such that the whole path and corresponding RelOptInfos come from the same slice. So, we can't use the same method as GEQO. It's worth noticing that paths are created twice for the cheapest joining order that it finds: once in the trial phase and second time when the final plan is created. The second time, the paths, RelOptInfos, expressions used by the final plan are in the context as the plan. > > 0002 and 0003 look OK. > > Probably 0004 is OK too, although that seems to be adding some > overhead to existing callers for the benefit of new ones. Might be > insignificant, though. Yes, the overhead is to add and extract the appinfo from a list when there is only one appinfo. We may optimize it by passing appinfo directly when there's only one and pass list when there are more, but that is complicating the code unnecessarily. The overhead seems to be worth the cost to keep the code simpler. > > 0005 looks OK, except that add_join_rel's definition is missing a > "static" qualifier. That's not just cosmetic; based on previous > expereince, this will break the BF. Thanks for pointing it out. Done. > > 0006 seems to be unnecessary; the new function isn't used in later patches. It's required by 0011 - reparameterize_path_by_child(). BTW, I need to know whether reparameterize_path_by_child() looks good, so that I can complete it by adding support for all kinds of path in that function. > > Haven't looked at 0007 yet. > > 0008 is, as previously mentioned, more than we probably want to commit. I agree, and I will work on that. > > Haven't looked at 0009 yet. > > 0010 - 0012 seem to be various fixes which would need to be done > before or along with 0009, rather than afterward, so I am confused > about the ordering of those patches in the patch series. They are needed only when we have 0009. But when those are clubbed with 0009, it makes 0009 review difficult as the code for those fixes mixes with the code for partition-wise support. So, I have separated those out into patches categorized by functionality. Reviewer may then apply 0009 and see what failures each of the changes in 0010-0012 fixes, if required. They need to be committed along-with 0009. > > The commit message for 0013 is a bit unclear about what it's doing, > although I can guess, a bit, based on the commit message for 0007. > This is preparatory patch for 0015 which supports partition-wise join for multi-level partitioned tables. We have discussed about partition-wise join support for multi-level partitioned tables in [1]. We may decide to postpone patches 0013-0015 to v11, if this gets too much for v10. [1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Mar 14, 2017 at 5:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Haven't looked at 0007 yet. > > + if (rel->part_scheme) > + { > + int cnt_parts; > + > + for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++) > + { > + if (rel->part_oids[cnt_parts] == > childRTE->relid) > + { > + Assert(!rel->part_rels[cnt_parts]); > + rel->part_rels[cnt_parts] = childrel; > + } > + } > + } > > It's not very appealing to use an O(n^2) algorithm here. I wonder if > we could arrange things so that inheritance expansion expands > partitions in the right order, and then we could just match them up > one-to-one. This would probably require an alternate version of > find_all_inheritors() that expand_inherited_rtentry() would call only > for partitioned tables. That seems a much better solution, but 1. Right now when we expand a multi-level partitioned table, we include indirect partitions as direct children in inheritance hierachy. part_rels array OTOH should correspond to the partitioning scheme and should hold RelOptInfos of direct partitions. 0013 patch fixes that to include only direct partitions as direct children preserving partitioning hierarchy in the inheritance hierarchy. That patch right now uses find_inheritance_children() to get Oids of direct partitions, but instead it could return rd_partdesc->oids in the form of list; OIDs ordered same as the array. Once we do that, we should expect the appinfos to appear in the same order as the rd_partdesc->oids and so RelOptInfo::part_oids. We just need to make sure that the order is preserved and assign part_rels as they appear in that loop. One would argue that we preserve the OIDs only for single-level partitioned tables, but in expand_inheritance_rtentry(), if we want to detect whether a relation is single-level partitioned or multi-level, we need to look up its direct partitions to see if they are further partitioned. That will look a bit ugly and will not be necessary once we have 0013. In case we decide to defer multi-level partitioned table changes to v11 and based on the progress in [1], I will work on fixing the order in which appinfos are created for single-level partitioned tables. > Failing that, another idea would be to use > qsort() or qsort_arg() to put the partitions in the right order. I didn't get this. I could not find documentation for qsort_arg(). Can you please elaborate? I guess, if we fix expand_inheritance_rtentry() we don't need this. It looks like we will change expand_inheritance_rtentry() anyway. > > + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || > + !inhparent || > + !(rel->part_scheme = find_partition_scheme(root, relation))) > > Maybe just don't call this function in the first place in the > !inhparent case, instead of passing down an argument that must always > be true. The function serves a single place to re/set partitioning information. It would set the partitioning information if the above three conditions are met. Otherwise it would nullify that information. If we decide not to call this function when !inhparent, we will need to nullify the partitioning information outside of this function as well as inside this function, duplicating the code. > > + /* Match the partition key types. */ > + for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++) > + { > + /* > + * For types, it suffices to match the type > id, mod and collation; > + * len, byval and align are depedent on the first two. > + */ > + if (part_key->partopfamily[cnt_pks] != > part_scheme->partopfamily[cnt_pks] || > + part_key->partopcintype[cnt_pks] != > part_scheme->partopcintype[cnt_pks] || > + part_key->parttypid[cnt_pks] != > part_scheme->key_types[cnt_pks] || > + part_key->parttypmod[cnt_pks] != > part_scheme->key_typmods[cnt_pks] || > + part_key->parttypcoll[cnt_pks] != > part_scheme->key_collations[cnt_pks]) > + break; > + } > > I think memcmp() might be better than a for-loop. Done. PFA patches. [1] https://www.postgresql.org/message-id/2b0d42f2-3a53-763b-c9c2-47139e4b1c2e@lab.ntt.co.jp -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Mar 14, 2017 at 6:28 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/03/14 9:17, Robert Haas wrote: >> On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Haven't looked at 0007 yet. >> >> Overall this one looks pretty good and straightforward. > > In the following code of find_partition_scheme(): > > + /* Did not find matching partition scheme. Create one. */ > + part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData)); > + > + /* Copy partition bounds/lists. */ > + part_scheme->nparts = part_desc->nparts; > + part_scheme->strategy = part_key->strategy; > + part_scheme->boundinfo = part_desc->boundinfo; > + > + /* Store partition key information. */ > + part_scheme->partnatts = part_key->partnatts; > + > + part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts); > + memcpy(part_scheme->partopfamily, part_key->partopfamily, > + sizeof(Oid) * partnatts); > + > + part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts); > + memcpy(part_scheme->partopcintype, part_key->partopcintype, > + sizeof(Oid) * partnatts); > + > + part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts); > + memcpy(part_scheme->key_types, part_key->parttypid, > + sizeof(Oid) * partnatts); > + > + part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts); > + memcpy(part_scheme->key_typmods, part_key->parttypmod, > + sizeof(int32) * partnatts); > + > + part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts); > + memcpy(part_scheme->key_collations, part_key->parttypcoll, > + sizeof(Oid) * partnatts); > > Couldn't we avoid the memcpy() on individual members of part_key? After > all, RelationData.rd_partkey is guarded just like rd_partdesc by > relcache.c in face of invalidations (see keep_partkey logic in > RelationClearRelation). This suggestion looks good to me. Incorporated in the latest set of patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Tue, Mar 14, 2017 at 8:04 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> In 0001, you've removed a comment about how GEQO needs special >> handling, but it doesn't look as if you've made any compensating >> change elsewhere. That seems unlikely to be correct. If GEQO needs >> some paths to survive longer than others, how can it be right for this >> code to create them all in the same context? > > Thanks for pointing that out. I have replaced the code and the > comments back. There was another issue that the temporary paths > created by geqo will not be freed when geqo moves one genetic string > to next genetic string (or what it calls as tour: an list of relation > to be joined in a given order). To fix this, we need to set the path > context to the temporary context of geqo inside geqo_eval() before > calling gimme_tree() and reset it later. That way the temporary paths > are also created in the temporary memory context of geqo. Fixed in the > patch. Yeah, that looks better. >> Incidentally, >> geqo_eval() seems to be an existing precedent for the idea of throwing >> away paths and RelOptInfos, so we might want to use similar code for >> partitionwise join. > > There are some differences in what geqo does and what partition-wise > needs to do. geqo tries many joining orders each one in a separate > temporary context. The way geqo slices the work, every slice produces > a full plan. For partition-wise join I do not see a way to slice the > work such that the whole path and corresponding RelOptInfos come from > the same slice. So, we can't use the same method as GEQO. What I was thinking about was the use of this technique for getting rid of joinrels: root->join_rel_list = list_truncate(root->join_rel_list, savelength); root->join_rel_hash= savehash; makePathNode() serves to segregate paths into a separate memory context that can then be destroyed, but as you point out, the path lists are still hanging around, and so are the RelOptInfo nodes. It seems to me we could do a lot better using this technique. Suppose we jigger things so that the List objects created by add_path go into path_cxt, and so that RelOptInfo nodes also go into path_cxt. Then when we blow up path_cxt we won't have dangling pointers in the RelOptInfo objects any more because the RelOptInfos themselves will be gone. The only problem is that the join_rel_list (and join_rel_hash if it exists) will be corrupt, but we can fix that using the technique demonstrated above. Of course, that supposes that 0009 can manage to postpone creating non-sampled child joinrels until create_partition_join_plan(), which it currently doesn't. In fact, unless I'm missing something, 0009 hasn't been even slightly adapted to take advantage of the infrastructure in 0001; it doesn't seem to reset the path_cxt or anything. That seems like a fairly major omission. Incidentally, I committed 0002, 0003, and 0005 as a single commit with a few tweaks; I think you may need to do a bit of rebasing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Of course, that supposes that 0009 can manage to postpone creating > non-sampled child joinrels until create_partition_join_plan(), which > it currently doesn't. In fact, unless I'm missing something, 0009 > hasn't been even slightly adapted to take advantage of the > infrastructure in 0001; it doesn't seem to reset the path_cxt or > anything. That seems like a fairly major omission. Some other comments on 0009: Documentation changes for the new GUCs are missing. +between the partition keys of the joining tables. The equi-join between +partition keys implies that for a given row in a given partition of a given +partitioned table, its joining row, if exists, should exist only in the +matching partition of the other partitioned table; no row from non-matching There could be more than one. I'd write: The equi-join between partition keys implies that all join partners for a given row in one partitioned table must be in the corresponding partition of the other partitioned table. +#include "miscadmin.h"#include <limits.h>#include <math.h> Added in wrong place. + * System attributes do not need translation. In such a case, + * the attribute numbers of the parent and the child should + * start from the same minimum attribute. I would delete the second sentence and add an Assert() to that effect instead. + /* Pass top parent's relids down the inheritance hierarchy. */ Why? + for (attno = rel->min_attr; attno <= rel->max_attr; attno++) Add add a comment explaining why we need to do this. - add_paths_to_append_rel(root, rel, live_childrels); + add_paths_to_append_rel(root, rel, live_childrels, false);} - No need to remove blank line. + * When called on partitioned join relation with partition_join_path = true, it + * adds PartitionJoinPath instead of Merge/Append path. This path is costed + * based on the costs of sampled child-join and is expanded later into + * Merge/Append plan. I'm not a big fan of the Merge/Append terminology here. If somebody adds another kind of append-path someday, then all of these comments will have to be updated. I think this can be phrased more generically. /* + * While creating PartitionJoinPath, we sample paths from only a few child + * relations. Even if all of sampled children have partial paths, it's not + * guaranteed that all the unsampled children will have partial paths. + * Hence we do not create partial PartitionJoinPaths. + */ Very sad. I guess if we had parallel append available, we could maybe dodge this problem, but for now I suppose we're stuck with it. + /* + * Partitioning scheme in join relation indicates a possibility that the + * join may be partitioned, but it's not necessary that every pair of + * joining relations can use partition-wise join technique. If one of + * joining relations turns out to be unpartitioned, this pair of joining + * relations can not use partition-wise join technique. + */ + if (!rel1->part_scheme || !rel2->part_scheme) + return; How can this happen? If rel->part_scheme != NULL, doesn't that imply that every rel covered by the joinrel is partitioned that way, and therefore this condition must necessarily hold? In general, I think it's better style to write explicit tests against NULL or NIL than to just write if (blahptr). + partitioned_join->sjinfo = copyObject(parent_sjinfo); Why do we need to copy it? + /* + * Remove the relabel decoration. We can assume that there is at most one + * RelabelType node; eval_const_expressions() simplifies multiple + * RelabelType nodes into one. + */ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; Still, instead of assuming this, you could just s/if/while/, and then you wouldn't need the assumption any more. Also, consider castNode(). partition_wise_plan_weight may be useful for testing, but I don't think it should be present in the final patch. This is not a full review; I ran out of mental energy before I got to the end. (Sorry.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
>> >> There are some differences in what geqo does and what partition-wise >> needs to do. geqo tries many joining orders each one in a separate >> temporary context. The way geqo slices the work, every slice produces >> a full plan. For partition-wise join I do not see a way to slice the >> work such that the whole path and corresponding RelOptInfos come from >> the same slice. So, we can't use the same method as GEQO. > > What I was thinking about was the use of this technique for getting > rid of joinrels: > > root->join_rel_list = list_truncate(root->join_rel_list, > savelength); > root->join_rel_hash = savehash; > > makePathNode() serves to segregate paths into a separate memory > context that can then be destroyed, but as you point out, the path > lists are still hanging around, and so are the RelOptInfo nodes. It > seems to me we could do a lot better using this technique. Suppose we > jigger things so that the List objects created by add_path go into > path_cxt, and so that RelOptInfo nodes also go into path_cxt. Then > when we blow up path_cxt we won't have dangling pointers in the > RelOptInfo objects any more because the RelOptInfos themselves will be > gone. The only problem is that the join_rel_list (and join_rel_hash > if it exists) will be corrupt, but we can fix that using the technique > demonstrated above. > > Of course, that supposes that 0009 can manage to postpone creating > non-sampled child joinrels until create_partition_join_plan(), which > it currently doesn't. Right. We need the child-join's RelOptInfos to estimate sizes, so that we could sample the largest ones. So postponing it looks difficult. > In fact, unless I'm missing something, 0009 > hasn't been even slightly adapted to take advantage of the > infrastructure in 0001; it doesn't seem to reset the path_cxt or > anything. That seems like a fairly major omission. The path_cxt reset introduced by 0001 recycles memory used by all the paths, including paths created for the children. But that happens only after all the planning has completed. I thought that's what we discussed to be done. We could create a separate path context for every top-level child-join. That will require either copying the cheapest path-tree into root->glob->path_cxt memory context OR will require it to be converted to a plan immediately. The first will require spending CPU cycles and memory in copying path-tree. The later requires almost all the create_*_append_plan() code to be duplicated in create_partition_join_plan() which is ugly. In an earlier version of this patch I had that code, which I got rid of in the latest set of patches. Between those two the first looks better. > > Incidentally, I committed 0002, 0003, and 0005 as a single commit with > a few tweaks; I think you may need to do a bit of rebasing. Thanks. I will have fewer patches to rebase now :). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Mar 15, 2017 at 6:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Of course, that supposes that 0009 can manage to postpone creating >> non-sampled child joinrels until create_partition_join_plan(), which >> it currently doesn't. In fact, unless I'm missing something, 0009 >> hasn't been even slightly adapted to take advantage of the >> infrastructure in 0001; it doesn't seem to reset the path_cxt or >> anything. That seems like a fairly major omission. > > Some other comments on 0009: > > Documentation changes for the new GUCs are missing. Done. The description might need more massaging, but I will work on that once we have fixed their names and usage. I think sample_partition_fraction and partition_wise_plan_weight, if retained, will be applicable to other partition-wise planning like partition-wise aggregates. So we will need more generic description there. > > +between the partition keys of the joining tables. The equi-join between > +partition keys implies that for a given row in a given partition of a given > +partitioned table, its joining row, if exists, should exist only in the > +matching partition of the other partitioned table; no row from non-matching > > There could be more than one. I'd write: The equi-join between > partition keys implies that all join partners for a given row in one > partitioned table must be in the corresponding partition of the other > partitioned table. Done. I think it's important to emphasize the the joining partners can not be in other partitions. So, added that sentence after your suggested sentence. > > +#include "miscadmin.h" > #include <limits.h> > #include <math.h> > > Added in wrong place. Done. > > + * System attributes do not need > translation. In such a case, > + * the attribute numbers of the parent > and the child should > + * start from the same minimum attribute. > > I would delete the second sentence and add an Assert() to that effect instead. The assertion is there just few lines down. Please let me know if that suffices. Deleted the second sentence. > > + /* Pass top parent's relids down the inheritance hierarchy. */ > > Why? That is required for a multi-level partitioned table. top_parent_relids are used for translating expressions of the top parent to that of child table. > > + for (attno = rel->min_attr; attno <= > rel->max_attr; attno++) > > Add add a comment explaining why we need to do this. The comment is there just few lines above. I have moved it just above this for loop. > > - add_paths_to_append_rel(root, rel, live_childrels); > + add_paths_to_append_rel(root, rel, live_childrels, false); > } > > - > > No need to remove blank line. Sorry. That was added by my patch to refactor set_append_rel_pathlist(). I have added a patch in the series to remove that line. > > + * When called on partitioned join relation with partition_join_path = true, it > + * adds PartitionJoinPath instead of Merge/Append path. This path is costed > + * based on the costs of sampled child-join and is expanded later into > + * Merge/Append plan. > > I'm not a big fan of the Merge/Append terminology here. If somebody > adds another kind of append-path someday, then all of these comments > will have to be updated. I think this can be phrased more > generically. Reworded as + * When partition_join_path is true, the caller intends to add a + * PartitionJoinPath costed based on the sampled child-joins passed as + * live_childrels. Also added an assertion to make sure the partition_join_path is true only for join relations. > > /* > + * While creating PartitionJoinPath, we sample paths from only > a few child > + * relations. Even if all of sampled children have partial > paths, it's not > + * guaranteed that all the unsampled children will have partial paths. > + * Hence we do not create partial PartitionJoinPaths. > + */ > > Very sad. I guess if we had parallel append available, we could maybe > dodge this problem, but for now I suppose we're stuck with it. Really sad. Is there a way to look at the relation (without any partial paths yet) and see whether the relation will have partial paths or not. Even if we don't have actual partial paths but know that there will be at least one added in the future, we will be able to fix this problem. > > + /* > + * Partitioning scheme in join relation indicates a possibility that the > + * join may be partitioned, but it's not necessary that every pair of > + * joining relations can use partition-wise join technique. If one of > + * joining relations turns out to be unpartitioned, this pair of joining > + * relations can not use partition-wise join technique. > + */ > + if (!rel1->part_scheme || !rel2->part_scheme) > + return; > > How can this happen? If rel->part_scheme != NULL, doesn't that imply > that every rel covered by the joinrel is partitioned that way, and > therefore this condition must necessarily hold? I don't remember exactly, but this was added considering a more generic partition-wise join. But then we would have more changes when we support that. So, turned this into an assertion. > > In general, I think it's better style to write explicit tests against > NULL or NIL than to just write if (blahptr). PG code uses both the styles. Take for example src/backend/rewrite/rewriteManip.c or any file, both styles are being used. I find this style useful, when I want to code, say "if this relation does not have a partitioning scheme" rather than "if this relation have NULL partitioning scheme". Although I don't have objections changing it as per your suggestion. > > + partitioned_join->sjinfo = copyObject(parent_sjinfo); > > Why do we need to copy it? > sjinfo in make_join_rel() may be from root->join_info_list or it could be one made up locally in that function. The one made up in that function would go away with that function, whereas we need it much later to create paths for child-joins. So, I thought it's better to copy it. But now I have changed to code to pass NULL for a made-up sjinfo. In such a case, the child-join's sjinfo is also made up. This required some refactoring to separate out the making-up code. So, there's new refactoring patch. > + /* > + * Remove the relabel decoration. We can assume that there is > at most one > + * RelabelType node; eval_const_expressions() simplifies multiple > + * RelabelType nodes into one. > + */ > + if (IsA(expr, RelabelType)) > + expr = (Expr *) ((RelabelType *) expr)->arg; > > Still, instead of assuming this, you could just s/if/while/, and then > you wouldn't need the assumption any more. Also, consider castNode(). Done. > > partition_wise_plan_weight may be useful for testing, but I don't > think it should be present in the final patch. partition_join test needs it so that it can work with smaller dataset and complete faster. For smaller data sets the partition-wise join paths come out to be costlier than other kinds and are never chosen. By setting partition_wise_plan_weight I can force partition-wise join to be chosen. An alternate solution would be to use sample_partition_fraction = 1.0, but then we will never test delayed planning for unsampled child-joins. I also think that users will find partition_wise_plan_weight useful when estimates based on samples are unrealistic. Obviously, in a longer run we should be able to provide better estimates. Apart from this, I have also removed recursive calls to try_partition_wise_join() and generate_partition_wise_join_paths() from 0009 and places them in the 0014 patch. Those are required for multi-level partitioned tables, which are not supported in 0009. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> Of course, that supposes that 0009 can manage to postpone creating >> non-sampled child joinrels until create_partition_join_plan(), which >> it currently doesn't. > > Right. We need the child-join's RelOptInfos to estimate sizes, so that > we could sample the largest ones. So postponing it looks difficult. You have a point. >> In fact, unless I'm missing something, 0009 >> hasn't been even slightly adapted to take advantage of the >> infrastructure in 0001; it doesn't seem to reset the path_cxt or >> anything. That seems like a fairly major omission. > > The path_cxt reset introduced by 0001 recycles memory used by all the > paths, including paths created for the children. But that happens only > after all the planning has completed. I thought that's what we > discussed to be done. We could create a separate path context for > every top-level child-join. I don't think we need to create a new context for each top-level child-join, but I think we should create a context to be used across all top-level child-joins and then reset it after planning each one. I thought the whole point here was that NOT doing that caused the memory usage for partitionwise join to get out of control. Am I confused? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Sorry. That was added by my patch to refactor > set_append_rel_pathlist(). I have added a patch in the series to > remove that line. It's not worth an extra commit just to change what isn't broken. Let's just leave it alone. >> Very sad. I guess if we had parallel append available, we could maybe >> dodge this problem, but for now I suppose we're stuck with it. > > Really sad. Is there a way to look at the relation (without any > partial paths yet) and see whether the relation will have partial > paths or not. Even if we don't have actual partial paths but know that > there will be at least one added in the future, we will be able to fix > this problem. I don't think so. If we know that rel->consider_parallel will end up true for a plain table, we should always get a parallel sequential scan path at least, but if there are foreign tables involved, then nothing is guaranteed. >> partition_wise_plan_weight may be useful for testing, but I don't >> think it should be present in the final patch. > > partition_join test needs it so that it can work with smaller dataset > and complete faster. For smaller data sets the partition-wise join > paths come out to be costlier than other kinds and are never chosen. > By setting partition_wise_plan_weight I can force partition-wise join > to be chosen. An alternate solution would be to use > sample_partition_fraction = 1.0, but then we will never test delayed > planning for unsampled child-joins. I also think that users will find > partition_wise_plan_weight useful when estimates based on samples are > unrealistic. Obviously, in a longer run we should be able to provide > better estimates. I still don't like it -- we have no other similar knob. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
So I am looking at this part of 0008: + /* + * Do not copy parent_rinfo and child_rinfos because 1. they create a + * circular dependency between child and parent RestrictInfo 2. dropping + * those links just means that we loose some memory optimizations. 3. There + * is a possibility that the child and parent RestrictInfots themselves may + * have got copied and thus the old links may no longer be valid. The + * caller may set up those links itself, if needed. + */ I don't think that it's very clear whether or not this is safe. I experimented with making _copyRestrictInfo PANIC, which, interestingly, does not affect the core regression tests at all, but does trip on this bit from the postgres_fdw tests: -- subquery using stable function (can't be sent to remote) PREPARE st2(int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) = '1970-01-17'::date) ORDER BY c1; EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20); I'm not sure why this particular case is affected when so many others are not, and the comment doesn't help me very much in figuring it out. Why do we need this cache in the RestrictInfo, anyway? Aside from the comment above, I looked at the comment in the RestrictInfo struct, and I looked at the comment in build_child_restrictinfo, and I looked at the comment in build_child_clauses, and I looked at the place where build_child_clauses is called in set_append_rel_size, and none of those places explain why we need this cache. I would assume we'd need a separate translation of the RestrictInfo for every separate child-join, so how does the cache help? Maybe the answer is that build_child_clauses() is also called from try_partition_wise_join() and add_paths_to_child_joinrel(), and those three call sights all end up producing the same set of translated RestrictInfos. But if that's the case, somehow it seems like we ought to be producing these in one place where we can get convenient access to them from each child join, rather than having to search through this cache to find it. It's a pretty inefficient cache: it takes O(n) time to search it, I think, where n is the number of partitions. And you do O(n) searches. So it's an O(n^2) algorithm, which is a little unfortunate. Can't we affix the translated RestrictInfos someplace where they can be found more efficiently? Yet another thing that the comments don't explain is why the existing adjust_appendrel_attrs call needs to be replaced with build_child_clauses. So I feel, overall, that the point of all of this is not explained well at all. ...Robert
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Mar 16, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> Of course, that supposes that 0009 can manage to postpone creating >>> non-sampled child joinrels until create_partition_join_plan(), which >>> it currently doesn't. >> >> Right. We need the child-join's RelOptInfos to estimate sizes, so that >> we could sample the largest ones. So postponing it looks difficult. > > You have a point. > >>> In fact, unless I'm missing something, 0009 >>> hasn't been even slightly adapted to take advantage of the >>> infrastructure in 0001; it doesn't seem to reset the path_cxt or >>> anything. That seems like a fairly major omission. >> >> The path_cxt reset introduced by 0001 recycles memory used by all the >> paths, including paths created for the children. But that happens only >> after all the planning has completed. I thought that's what we >> discussed to be done. We could create a separate path context for >> every top-level child-join. > > I don't think we need to create a new context for each top-level > child-join, but I think we should create a context to be used across > all top-level child-joins and then reset it after planning each one. Sorry, that's what I meant by creating a new context for each top-level child-join. So, we need to copy the required path tree before resetting the context. I am fine doing that but read on. > I thought the whole point here was that NOT doing that caused the > memory usage for partitionwise join to get out of control. Am I > confused? We took a few steps to reduce the memory footprint of partition-wise join in [1] and [2]. According to the numbers reported in [1] and then in [2], if the total memory consumed by a planner is 44MB (memory consumed by paths 150K) for a 5-way non-parition-wise join between tables with 1000 partitions, partition-wise join consumed 192MB which is 4.4 times the non-partitino-wise case. The earlier implementation of blowing away a memory context after each top-level child-join, just got rid of the paths created for that child-join. The total memory consumed by paths created for all the child-joins was about 150MB. Remember that we can not get rid of memory consumed by expressions, RelOptInfos, RestrictInfos etc. since their pointers will be copied into the plan nodes. With changes in 0001, what happens is we accumulate 150MB till the end of the planning and get rid of it after we have created a plan. So, till the plan is created we are consuming approx. 192MB + 150MB = 342MB memory and are getting rid of 150MB memory after we have created the plan. I am not sure whether consuming extra 150MB or for that matter 342MB in a setup with a thousand partitions is "going out of control". (342MB is approx. 7.8 time 44MB; not 1000 times, and not even 10 times). But if you think that we should throw away unused paths after planning every top-level child-join I am fine with it. [1] https://www.postgresql.org/message-id/CAFjFpRcZ_M3-JxoiDkdoPS%2B-9Cok4ux9Si%2B4drcRL-62af%3DjWw@mail.gmail.com [2] https://www.postgresql.org/message-id/CAFjFpRe66z%2Bw9%2BdnAkWGiaB1CU2CUQsLGsqzHzYBoA%3DKJFf%2BPQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Mar 16, 2017 at 12:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Sorry. That was added by my patch to refactor >> set_append_rel_pathlist(). I have added a patch in the series to >> remove that line. > > It's not worth an extra commit just to change what isn't broken. > Let's just leave it alone. Ok. Removed that patch from the set of patches. > >>> Very sad. I guess if we had parallel append available, we could maybe >>> dodge this problem, but for now I suppose we're stuck with it. >> >> Really sad. Is there a way to look at the relation (without any >> partial paths yet) and see whether the relation will have partial >> paths or not. Even if we don't have actual partial paths but know that >> there will be at least one added in the future, we will be able to fix >> this problem. > > I don't think so. If we know that rel->consider_parallel will end up > true for a plain table, we should always get a parallel sequential > scan path at least, but if there are foreign tables involved, then > nothing is guaranteed. Ok. > >>> partition_wise_plan_weight may be useful for testing, but I don't >>> think it should be present in the final patch. >> >> partition_join test needs it so that it can work with smaller dataset >> and complete faster. For smaller data sets the partition-wise join >> paths come out to be costlier than other kinds and are never chosen. >> By setting partition_wise_plan_weight I can force partition-wise join >> to be chosen. An alternate solution would be to use >> sample_partition_fraction = 1.0, but then we will never test delayed >> planning for unsampled child-joins. I also think that users will find >> partition_wise_plan_weight useful when estimates based on samples are >> unrealistic. Obviously, in a longer run we should be able to provide >> better estimates. > > I still don't like it -- we have no other similar knob. We have another cost-skewing GUC, disable_cost, which adds a huge cost to anything that needs to be disabled. This is different in the sense that it multiplies the cost. Well, in that case, we can replace it with force_partition_wise_plan (on/off) for the sake of regression, to test with smaller data. Even then we will need to adjust the costs, so that partition-wise join plan comes out to be the cheapest. Probably we will need set partition-wise join plan costs to very low or even 0 when force_partition_wise_plan is set to on. Does that look good? Any other ideas? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: > So I am looking at this part of 0008: > > + /* > + * Do not copy parent_rinfo and child_rinfos because 1. they create a > + * circular dependency between child and parent RestrictInfo 2. dropping > + * those links just means that we loose some memory > optimizations. 3. There > + * is a possibility that the child and parent RestrictInfots > themselves may > + * have got copied and thus the old links may no longer be valid. The > + * caller may set up those links itself, if needed. > + */ > > I don't think that it's very clear whether or not this is safe. I > experimented with making _copyRestrictInfo PANIC, I am not able to understand how to make _copyRestrictInfo PANIC. Can you please share the patch or compiler flags or settings? I will look at the case below once I have that. > which, > interestingly, does not affect the core regression tests at all, but > does trip on this bit from the postgres_fdw tests: > > -- subquery using stable function (can't be sent to remote) > PREPARE st2(int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN > (SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) = > '1970-01-17'::date) ORDER BY c1; > EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20); > > I'm not sure why this particular case is affected when so many others > are not, and the comment doesn't help me very much in figuring it out. > > Why do we need this cache in the RestrictInfo, anyway? Aside from the > comment above, I looked at the comment in the RestrictInfo struct, and > I looked at the comment in build_child_restrictinfo, and I looked at > the comment in build_child_clauses, and I looked at the place where > build_child_clauses is called in set_append_rel_size, and none of > those places explain why we need this cache. I would assume we'd need > a separate translation of the RestrictInfo for every separate > child-join, so how does the cache help? > > Maybe the answer is that build_child_clauses() is also called from > try_partition_wise_join() and add_paths_to_child_joinrel(), and those > three call sights all end up producing the same set of translated > RestrictInfos. But if that's the case, somehow it seems like we ought > to be producing these in one place where we can get convenient access > to them from each child join, rather than having to search through > this cache to find it. I had explained this briefly in [1]. But forgot to add it as comments. There are multiple means by which a RestrictInfo gets translated multiple times for the same child. 1. Consider a join A J (B J C on B.b = C.c) ON (A.a = B.b) the clause A.a = B.b is part of the restrictlist for orders (AB)C and A(BC) (and (AC)B depending upon the type of join). So, the clause gets translated twice once for each of those join orders. 2. In the above example, A.a = B.b is part of joininfo list (if it happens to be an outer join) of A, B and BC. So, it should be part of joininfo list of children of A, B and BC. But the RestrictInfo which is part of joininfo of B and BC looks exactly same. Similarly param_info->clauses get translated multiple times each time with a different set of required_outer. In order to avoid multiple translations and spend memory in each translation it's better to cache the result and retrieve it. Updated prologue of build_child_restrictinfo with this explanation. > It's a pretty inefficient cache: it takes O(n) > time to search it, I think, where n is the number of partitions. Above explanation shows that it's worse than that. > And > you do O(n) searches. So it's an O(n^2) algorithm, which is a little > unfortunate. Can't we affix the translated RestrictInfos someplace > where they can be found more efficiently? Would a hash similar to root->join_rel_hash help? That will reduce the searches to O(1). I have added a separate patch (0008) for using hashtable to store child restrictinfos. If that patch looks good to you, I will merge it with the main patch supporting partition-wise join. > > Yet another thing that the comments don't explain is why the existing > adjust_appendrel_attrs call needs to be replaced with > build_child_clauses. The call to adjust_appendrel_attrs() used to translate joininfo for child has been replaced by build_child_clauses to take advantage of the RestrictInfo cache. As explained above a clause which is part of joininfo of a child, is also part of joininfo of the child-join in which it participates except the child-joins covering the clause. So, a cached copy of that RestrictInfo helps. I have added a patch (0010) to use build_child_clause() only for partitioned tables and use adjust_appendrel_attrs() for non-partitioned case. If this change looks good, I will merge it with the main patch. > > So I feel, overall, that the point of all of this is not explained well at all. Sorry for that. I should have added the explanation in the comments. Corrected this in this set of patches. [1] https://www.postgresql.org/message-id/CAFjFpRe66z%2Bw9%2BdnAkWGiaB1CU2CUQsLGsqzHzYBoA%3DKJFf%2BPQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> I thought the whole point here was that NOT doing that caused the >> memory usage for partitionwise join to get out of control. Am I >> confused? > > We took a few steps to reduce the memory footprint of partition-wise > join in [1] and [2]. According to the numbers reported in [1] and then > in [2], if the total memory consumed by a planner is 44MB (memory > consumed by paths 150K) for a 5-way non-parition-wise join between > tables with 1000 partitions, partition-wise join consumed 192MB which > is 4.4 times the non-partitino-wise case. The earlier implementation > of blowing away a memory context after each top-level child-join, just > got rid of the paths created for that child-join. The total memory > consumed by paths created for all the child-joins was about 150MB. > Remember that we can not get rid of memory consumed by expressions, > RelOptInfos, RestrictInfos etc. since their pointers will be copied > into the plan nodes. All right, I propose that we revise our plan for attacking this problem. The code in this patch that proposes to reduce memory utilization is very complicated and it's likely to cause us to miss this release altogether if we keep hacking on it. So, I propose that you refactor this patch series so that the first big patch is partition-wise join without any of the optimizations that save memory - essentially the sample_partition_fraction = 1 case with all memory-saving optimizations removed. If it's only there to save memory, rip it out. Also, change the default value of enable_partition_wise_join to false and document that turning it on may cause a large increase in planner memory utilization, and that's why it's not enabled by default. If we get that committed, then we can have follow-on patches that add the incremental path creation stuff and other memory-saving features, and then at the end we can flip the default from "off" to "on". Probably that last part will slip beyond v10 since we're only two weeks from the end of the release cycle, but I think that's still better than having everything slip. Let's also put the multi-level partition-wise join stuff ahead of the memory-saving stuff, because being able to do only a single-level of partition-wise join is a fairly unimpressive feature; I'm not sure this is really even committable without that. I realize in some sense that I'm telling you to go and undo all of the work that you just did based on what I told you before, but I think we've actually made some pretty good progress here: it's now clear that there are viable strategies for getting the memory usage down to an acceptable level, and we've got draft patches for those strategies. So committing the core feature without immediately including that work can't be regarded as breaking everything hopelessly; rather, it now looks (I think, anyway) like a reasonable intermediate step towards the eventual goal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Thu, Mar 16, 2017 at 7:19 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> So I am looking at this part of 0008: >> >> + /* >> + * Do not copy parent_rinfo and child_rinfos because 1. they create a >> + * circular dependency between child and parent RestrictInfo 2. dropping >> + * those links just means that we loose some memory >> optimizations. 3. There >> + * is a possibility that the child and parent RestrictInfots >> themselves may >> + * have got copied and thus the old links may no longer be valid. The >> + * caller may set up those links itself, if needed. >> + */ >> >> I don't think that it's very clear whether or not this is safe. I >> experimented with making _copyRestrictInfo PANIC, > > I am not able to understand how to make _copyRestrictInfo PANIC. Can > you please share the patch or compiler flags or settings? I will look > at the case below once I have that. I just put elog(PANIC, "_copyRestrictInfo") into the function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Mar 16, 2017 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> I thought the whole point here was that NOT doing that caused the >>> memory usage for partitionwise join to get out of control. Am I >>> confused? >> >> We took a few steps to reduce the memory footprint of partition-wise >> join in [1] and [2]. According to the numbers reported in [1] and then >> in [2], if the total memory consumed by a planner is 44MB (memory >> consumed by paths 150K) for a 5-way non-parition-wise join between >> tables with 1000 partitions, partition-wise join consumed 192MB which >> is 4.4 times the non-partitino-wise case. The earlier implementation >> of blowing away a memory context after each top-level child-join, just >> got rid of the paths created for that child-join. The total memory >> consumed by paths created for all the child-joins was about 150MB. >> Remember that we can not get rid of memory consumed by expressions, >> RelOptInfos, RestrictInfos etc. since their pointers will be copied >> into the plan nodes. > > All right, I propose that we revise our plan for attacking this > problem. The code in this patch that proposes to reduce memory > utilization is very complicated and it's likely to cause us to miss > this release altogether if we keep hacking on it. So, I propose that > you refactor this patch series so that the first big patch is > partition-wise join without any of the optimizations that save memory > - essentially the sample_partition_fraction = 1 case with all > memory-saving optimizations removed. If it's only there to save > memory, rip it out. Also, change the default value of > enable_partition_wise_join to false and document that turning it on > may cause a large increase in planner memory utilization, and that's > why it's not enabled by default. > > If we get that committed, then we can have follow-on patches that add > the incremental path creation stuff and other memory-saving features, > and then at the end we can flip the default from "off" to "on". > Probably that last part will slip beyond v10 since we're only two > weeks from the end of the release cycle, but I think that's still > better than having everything slip. Let's also put the multi-level > partition-wise join stuff ahead of the memory-saving stuff, because > being able to do only a single-level of partition-wise join is a > fairly unimpressive feature; I'm not sure this is really even > committable without that. > > I realize in some sense that I'm telling you to go and undo all of the > work that you just did based on what I told you before, but I think > we've actually made some pretty good progress here: it's now clear > that there are viable strategies for getting the memory usage down to > an acceptable level, and we've got draft patches for those strategies. > So committing the core feature without immediately including that work > can't be regarded as breaking everything hopelessly; rather, it now > looks (I think, anyway) like a reasonable intermediate step towards > the eventual goal. Here's the set of patches with all the memory saving stuff removed. It's now bare partition-wise joins. I have tried to eliminate all memory saving stuff carefully, except few bms_free() and list_free() which fit the functions they were part of and mostly were present in my earlier versions of patches. But I might have missed some. Also, I have corrected any indentation/white space mistakes introduced by editing patches with +/-, but I might have missed some. Please let me know. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> > Here's the set of patches with all the memory saving stuff removed. > It's now bare partition-wise joins. I have tried to eliminate all > memory saving stuff carefully, except few bms_free() and list_free() > which fit the functions they were part of and mostly were present in > my earlier versions of patches. But I might have missed some. Also, I > have corrected any indentation/white space mistakes introduced by > editing patches with +/-, but I might have missed some. Please let me > know. > Rajkumar offlist reported two issues with earlier set of patches. 1. 0008 conflicted with latest changes in postgres_fdw/deparse.c. 2. In the earlier set of patches part_scheme of a join relation was being set when joining relations had same part_scheme even if there was no equi-join between partition keys. The idea being that rel->part_scheme and rel->part_rels together tell whether a relation is partitioned or not. At a later stage if none of the joining pairs resulted in partitioned join, part_rels would be NULL and then we would reset part_scheme as well. But this logic not required. For the exact partition scheme matching, that we are using, if one pair of joining relation has an equi-join on partition keys, and both of those have exactly same partitioning scheme, all other pairs of joining relations would have an equi-join on partition keys and also exactly same partitioning scheme. So, we can set part_scheme only by looking at the first pair of joining relation while building the child-join. This set of patches fixes both of those things. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > This set of patches fixes both of those things. 0001 changes the purpose of a function and then 0007 renames it. It would be better to include the renaming in 0001 so that you're not taking multiple whacks at the same function in the same patch series. I believe it would also be best to include 0011's changes to adjust_appendrel_attrs_multilevel in 0001. 0002 should either add find_param_path_info() to the relevant header file as extern from the beginning, or it should declare and define it as static and then 0007 can remove those markings. It makes no sense to declare it as extern but put the prototype in the .c file. 0004 still needs to be pared down. If you want to get something committed this release cycle, you have to get these details taken care of, uh, more or less immediately. Actually, preferably, several weeks ago. You're welcome to maintain your own test suite locally but what you submit should be what you are proposing for commit -- or if not, then you should separate the part proposed for commit and the part included for dev testing into two different patches. In 0005's README, the part about planning partition-wise joins in two phases needs to be removed. This patch also contains a small change to partition_join.sql that belongs in 0004. 0008 removes direct tests against RELOPT_JOINREL almost everywhere, but it overlooks the new ones added to postgres_fdw.c by b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to cover those as well, I suspect. The commit message claims that it will "Similarly replace RELOPT_OTHER_MEMBER_REL test with IS_OTHER_REL() where we want to test for child relations of all kinds, but in fact it makes exactly zero such substitutions. While I was studying what you did with reparameterize_path_by_child(), I started to wonder whether reparameterize_path() doesn't need to start handling join paths. I think it only handles scan paths right now because that's the only thing that can appear under an appendrel created by inheritance expansion, but you're changing that. Maybe it's not critical -- I think the worst consequences of missing some handling there is that we won't consider a parameterized path in some case where it would be advantageous to do so. Still, you might want to investigate a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> This set of patches fixes both of those things. > > 0001 changes the purpose of a function and then 0007 renames it. It > would be better to include the renaming in 0001 so that you're not > taking multiple whacks at the same function in the same patch series. > I believe it would also be best to include 0011's changes to > adjust_appendrel_attrs_multilevel in 0001. > > 0002 should either add find_param_path_info() to the relevant header > file as extern from the beginning, or it should declare and define it > as static and then 0007 can remove those markings. It makes no sense > to declare it as extern but put the prototype in the .c file. > > 0004 still needs to be pared down. If you want to get something > committed this release cycle, you have to get these details taken care > of, uh, more or less immediately. Actually, preferably, several weeks > ago. You're welcome to maintain your own test suite locally but what > you submit should be what you are proposing for commit -- or if not, > then you should separate the part proposed for commit and the part > included for dev testing into two different patches. > > In 0005's README, the part about planning partition-wise joins in two > phases needs to be removed. This patch also contains a small change > to partition_join.sql that belongs in 0004. > > 0008 removes direct tests against RELOPT_JOINREL almost everywhere, > but it overlooks the new ones added to postgres_fdw.c by > b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to > cover those as well, I suspect. The commit message claims that it > will "Similarly replace RELOPT_OTHER_MEMBER_REL test with > IS_OTHER_REL() where we want to test for child relations of all kinds, > but in fact it makes exactly zero such substitutions. > > While I was studying what you did with reparameterize_path_by_child(), > I started to wonder whether reparameterize_path() doesn't need to > start handling join paths. I think it only handles scan paths right > now because that's the only thing that can appear under an appendrel > created by inheritance expansion, but you're changing that. Maybe > it's not critical -- I think the worst consequences of missing some > handling there is that we won't consider a parameterized path in some > case where it would be advantageous to do so. Still, you might want > to investigate a bit. > I was trying to play around with this patch and came across following case when without the patch query completes in 9 secs and with it in 15 secs. Theoretically, I tried to capture the case when each partition is having good amount of rows in output and each has to build their own hash, in that case the cost of building so many hashes comes to be more costly than having an append and then join. Thought it might be helpful to consider this case in better designing of the algorithm. Please feel free to point out if I missed something. Test details: commit: b4ff8609dbad541d287b332846442b076a25a6df Please find the attached .sql file for the complete schema and data and .out file for the result of explain analyse with and without patch. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Sun, Mar 19, 2017 at 12:15 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > I was trying to play around with this patch and came across following > case when without the patch query completes in 9 secs and with it in > 15 secs. Theoretically, I tried to capture the case when each > partition is having good amount of rows in output and each has to > build their own hash, in that case the cost of building so many hashes > comes to be more costly than having an append and then join. Thought > it might be helpful to consider this case in better designing of the > algorithm. Please feel free to point out if I missed something. In the non-partitionwise plan, the query planner correctly chooses to hash the same table (prt2) and probe from the large table (prt). In the partition-wise plan, it generally does the opposite. There is a mix of merge joins and hash joins, but of the 15 children that picked merge joins, 14 of them hashed the larger partition (in each case, from prt) and probed from the smaller one (in each case, from prt2), which seems like an odd strategy. So I think the problem is not that building lots of hash tables is slower than building just one, but rather that for some reason it's choosing the wrong table to hash. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > While I was studying what you did with reparameterize_path_by_child(), > I started to wonder whether reparameterize_path() doesn't need to > start handling join paths. I think it only handles scan paths right > now because that's the only thing that can appear under an appendrel > created by inheritance expansion, but you're changing that. Maybe > it's not critical -- I think the worst consequences of missing some > handling there is that we won't consider a parameterized path in some > case where it would be advantageous to do so. Still, you might want > to investigate a bit. I spent a fair amount of time this weekend musing over reparameterize_path_by_child(). I think a key question for this patch - as you already pointed out - is whether we're happy with that approach. When we discover that we want to perform a partitionwise parameterized nestloop, and therefore that we need the paths for each inner appendrel to get their input values from the corresponding outer appendrel members rather than from the outer parent, we've got two choices. The first is to do what the patch actually does, which is to build a new path tree for the nestloop inner path parameterized by the appropriate childrel. The second is to use the existing paths, which are parameterized by the parent rel, and then somehow allow make that work. For example, you can imagine that create_plan_recurse() could pass down a list of parameterized nestloops above the current point in the path tree, and a parent-child mapping for each, and then we could try to substitute everything while actually generating the plan instead of creating paths sooner. Which is better? It would be nice to hear opinions from anyone else who cares, but after some thought I think the approach you've picked is probably better, because it's more like what we do already. We have existing precedent for reparameterizing a path, but none for allowing a Var for one relation (the parent) to in effect refer to another relation (the child). That having been said, having try_nestloop_path() perform the reparameterization at the very top of the function seems quite undesirable. You're creating a new path there before you know whether it's going to be rejected by the invalid-parameterization test and also before you know whether initial_cost_nestloop is going to reject it. It would be much better if you could find a way to postpone the reparameterization until after those steps, and only do it if you're going to try add_path(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> This set of patches fixes both of those things. > > 0001 changes the purpose of a function and then 0007 renames it. It > would be better to include the renaming in 0001 so that you're not > taking multiple whacks at the same function in the same patch series. adjust_relid_set was renamed as adjust_child_relids() post "extern"alising. I think, this comment is about that function. Done. > I believe it would also be best to include 0011's changes to > adjust_appendrel_attrs_multilevel in 0001. > The function needs to repeat the "adjustment" process for every "other" relation (join or base) that it encounters, by testing using OTHER_BASE_REL or OTHER_JOINREL in short IS_OTHER_REL(). The last macros are added by the partition-wise join implementation patch 0005. It doesn't make sense to add that macro in 0001 OR modify that function twice, once in 0001 and then after 0005. So, I will leave it to be part of 0011, where the changes are actually needed. > 0002 should either add find_param_path_info() to the relevant header > file as extern from the beginning, or it should declare and define it > as static and then 0007 can remove those markings. It makes no sense > to declare it as extern but put the prototype in the .c file. Done, added find_param_path_info() as an extern definition to start with. I have also squashed 0001 and 0002 together, since they are both refactoring patches and from your next mail about reparameterize_path_by_child(), it seems that we are going to accept the approach in that patch. > > 0004 still needs to be pared down. If you want to get something > committed this release cycle, you have to get these details taken care > of, uh, more or less immediately. Actually, preferably, several weeks > ago. You're welcome to maintain your own test suite locally but what > you submit should be what you are proposing for commit -- or if not, > then you should separate the part proposed for commit and the part > included for dev testing into two different patches. > Done. Now SQL file has 325 lines and output has 1697 lines as against 515 and 4085 lines resp. earlier. > In 0005's README, the part about planning partition-wise joins in two > phases needs to be removed. Done. > This patch also contains a small change > to partition_join.sql that belongs in 0004. The reason I added the test patch prior to implementation was 1. for me to make sure the tests that the queries run without the optimization and the results they produce to catch any issues with partitioning implementation. That would help someone looking at those patches as well. 2. Once partitioning implementation patch was applied, once could see the purpose of changes in two follow on patches. Now that that purpose has served, I have reordered the patches so that test patch comes after the implementation and follow on fixes. If you still want to run the test before or after any of those patches, you could apply the patch separately. > > 0008 removes direct tests against RELOPT_JOINREL almost everywhere, > but it overlooks the new ones added to postgres_fdw.c by > b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to > cover those as well, I suspect. Done. deparseSubqueryTargetList() and some other functions are excluding "other" base relation from the assertions. I guess, that's a problem. Will submit a separate patch to fix this. > The commit message claims that it > will "Similarly replace RELOPT_OTHER_MEMBER_REL test with > IS_OTHER_REL() where we want to test for child relations of all kinds, > but in fact it makes exactly zero such substitutions. The relevant changes have been covered by other commits. Removed this line from the commit message. > > While I was studying what you did with reparameterize_path_by_child(), > I started to wonder whether reparameterize_path() doesn't need to > start handling join paths. I think it only handles scan paths right > now because that's the only thing that can appear under an appendrel > created by inheritance expansion, but you're changing that. Maybe > it's not critical -- I think the worst consequences of missing some > handling there is that we won't consider a parameterized path in some > case where it would be advantageous to do so. Still, you might want > to investigate a bit. Yes, we need to update reparameterize_path() for child-joins. A path for child base relation gets reparameterized, if there exists a path with that parameterization in at least one other child. The parameterization bubbles up the join tree from base relations. So, if a child required to be reparameterized, probably all its joins require reparameterization, since that parameterization would bubble up the child-join tree in which some other child participates. But as you said it's an optimization and not a correctness issue. The function get_cheapest_parameterized_child_path() returns NULL, if it can not find or create a path (by reparameterization) with required parameterization. Its caller add_paths_to_append_rel() is capable of handling NULL values by not creating append paths with that paramterization. If the "append" relation requires minimum parameterization, all its children will create that minimum parameterization, hence do not require to reparameterize path. So, there isn't any correctness issue there. There are two ways to fix it, 1. when we create a reparameterized path add it to the list of paths, thus the parameterization bubbles up the join tree. But then we will be changing the path list after set_cheapest() has been called OR may be throwing out paths which other paths refer to. That's not desirable. May be we can save this path in another list and create join paths using this path instead of reparameterizing existing join paths. 2. Add code to reparameterize_path() to handle join paths, and I think all kinds of paths since we might have trickle the parameterization down the joining paths which could be almost anything including sort_paths, unique_paths etc. That looks like a significant effort. I think, we should attack it separately after the stock partition-wise join has been committed. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> While I was studying what you did with reparameterize_path_by_child(), >> I started to wonder whether reparameterize_path() doesn't need to >> start handling join paths. I think it only handles scan paths right >> now because that's the only thing that can appear under an appendrel >> created by inheritance expansion, but you're changing that. Maybe >> it's not critical -- I think the worst consequences of missing some >> handling there is that we won't consider a parameterized path in some >> case where it would be advantageous to do so. Still, you might want >> to investigate a bit. > > I spent a fair amount of time this weekend musing over > reparameterize_path_by_child(). I think a key question for this patch > - as you already pointed out - is whether we're happy with that > approach. When we discover that we want to perform a partitionwise > parameterized nestloop, and therefore that we need the paths for each > inner appendrel to get their input values from the corresponding outer > appendrel members rather than from the outer parent, we've got two > choices. The first is to do what the patch actually does, which is to > build a new path tree for the nestloop inner path parameterized by the > appropriate childrel. The second is to use the existing paths, which > are parameterized by the parent rel, and then somehow allow make that > work. For example, you can imagine that create_plan_recurse() could > pass down a list of parameterized nestloops above the current point in > the path tree, and a parent-child mapping for each, and then we could > try to substitute everything while actually generating the plan > instead of creating paths sooner. Which is better? > > It would be nice to hear opinions from anyone else who cares, but > after some thought I think the approach you've picked is probably > better, because it's more like what we do already. We have existing > precedent for reparameterizing a path, but none for allowing a Var for > one relation (the parent) to in effect refer to another relation (the > child). Right. If we could use parent Vars to indicate parent Var or child Var depending upon the context, a lot of memory issues would be solved; we wouldn't need to translate a single expression. But I think that's not straight forward. I have been thinking about some kind of polymorphic Var node, but it seems a lot more invasive change. Although, if we could get something like that, we would save a huge memory. :) > > That having been said, having try_nestloop_path() perform the > reparameterization at the very top of the function seems quite > undesirable. You're creating a new path there before you know whether > it's going to be rejected by the invalid-parameterization test and > also before you know whether initial_cost_nestloop is going to reject > it. It would be much better if you could find a way to postpone the > reparameterization until after those steps, and only do it if you're > going to try add_path(). Hmm. I think we can do that by refactoring calc_nestloop_required_outer(), allow_star_schema_join() and have_dangerous_phv() to use relids instead of paths. If the checks pass for a join between parents, those should pass for joins between children. Done in the attached set of patches. try_nestloop_path has few new variables. Among those innerrelids and outerrelids indicate the relids to be used by the parameterization checks (see patch for details). They are not relids of inner and outer relations resp. but kind of effective relids to be used. But I couldn't come up with better names which convey proper meaning and still are short enough. effective_innerrelids is mouthful. I am wondering whether we need to change calc_non_nestloop_required_outer() similar to calc_nestloop_required_outer() just to keep their signatures in sync. Should I work on completing reparamterized_path_by_child() to support all kinds of paths? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> While I was studying what you did with reparameterize_path_by_child(), >> I started to wonder whether reparameterize_path() doesn't need to >> start handling join paths. I think it only handles scan paths right >> now because that's the only thing that can appear under an appendrel >> created by inheritance expansion, but you're changing that. Maybe >> it's not critical -- I think the worst consequences of missing some >> handling there is that we won't consider a parameterized path in some >> case where it would be advantageous to do so. Still, you might want >> to investigate a bit. > > I spent a fair amount of time this weekend musing over > reparameterize_path_by_child(). I think a key question for this patch > - as you already pointed out - is whether we're happy with that > approach. When we discover that we want to perform a partitionwise > parameterized nestloop, and therefore that we need the paths for each > inner appendrel to get their input values from the corresponding outer > appendrel members rather than from the outer parent, we've got two > choices. The first is to do what the patch actually does, which is to > build a new path tree for the nestloop inner path parameterized by the > appropriate childrel. The second is to use the existing paths, which > are parameterized by the parent rel, and then somehow allow make that > work. For example, you can imagine that create_plan_recurse() could > pass down a list of parameterized nestloops above the current point in > the path tree, and a parent-child mapping for each, and then we could > try to substitute everything while actually generating the plan > instead of creating paths sooner. Which is better? > > It would be nice to hear opinions from anyone else who cares, but > after some thought I think the approach you've picked is probably > better, because it's more like what we do already. We have existing > precedent for reparameterizing a path, but none for allowing a Var for > one relation (the parent) to in effect refer to another relation (the > child). > > That having been said, having try_nestloop_path() perform the > reparameterization at the very top of the function seems quite > undesirable. You're creating a new path there before you know whether > it's going to be rejected by the invalid-parameterization test and > also before you know whether initial_cost_nestloop is going to reject > it. It would be much better if you could find a way to postpone the > reparameterization until after those steps, and only do it if you're > going to try add_path(). On a further testing of this patch I find another case when it is showing regression, the time taken with patch is around 160 secs and without it is 125 secs. Another minor thing to note that is planning time is almost twice with this patch, though I understand that this is for scenarios with really big 'big data' so this may not be a serious issue in such cases, but it'd be good if we can keep an eye on this that it doesn't exceed the computational bounds for a really large number of tables.. Please find the attached .out file to check the output I witnessed and let me know if anymore information is required Schema and data was similar to the preciously shared schema with the addition of more data for this case, parameter settings used were: work_mem = 1GB random_page_cost = seq_page_cost = 0.1 -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> I believe it would also be best to include 0011's changes to >> adjust_appendrel_attrs_multilevel in 0001. > > The function needs to repeat the "adjustment" process for every > "other" relation (join or base) that it encounters, by testing using > OTHER_BASE_REL or OTHER_JOINREL in short IS_OTHER_REL(). The last > macros are added by the partition-wise join implementation patch 0005. > It doesn't make sense to add that macro in 0001 OR modify that > function twice, once in 0001 and then after 0005. So, I will leave it > to be part of 0011, where the changes are actually needed. Hmm. I would kind of like to move the IS_JOIN_REL() and IS_OTHER_REL() stuff to the front of the series. In other words, I propose that we add those macros first, each testing for only the one kind of RelOptInfo that exists today, and change all the code to use them. Then, when we add child joinrels, we can modify the macros at the same time. The problem with doing it the way you have it is that those changes will have to be squashed into the main partitionwise join commit, because otherwise stuff will be broken. Doing it the other way around lets us commit that bit separately. > Done. Now SQL file has 325 lines and output has 1697 lines as against > 515 and 4085 lines resp. earlier. Sounds reasonable. > Now that that purpose has served, I have reordered the > patches so that test patch comes after the implementation and follow > on fixes. Sounds good. > There are two ways to fix it, > > 1. when we create a reparameterized path add it to the list of paths, > thus the parameterization bubbles up the join tree. But then we will > be changing the path list after set_cheapest() has been called OR may > be throwing out paths which other paths refer to. That's not > desirable. May be we can save this path in another list and create > join paths using this path instead of reparameterizing existing join > paths. > 2. Add code to reparameterize_path() to handle join paths, and I think > all kinds of paths since we might have trickle the parameterization > down the joining paths which could be almost anything including > sort_paths, unique_paths etc. That looks like a significant effort. I > think, we should attack it separately after the stock partition-wise > join has been committed. I don't understand #1. #2 sounds like what I was expecting. I agree it can be postponed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Right. If we could use parent Vars to indicate parent Var or child Var > depending upon the context, a lot of memory issues would be solved; we > wouldn't need to translate a single expression. But I think that's not > straight forward. I have been thinking about some kind of polymorphic > Var node, but it seems a lot more invasive change. Although, if we > could get something like that, we would save a huge memory. :) Yes, that's why I'm interested in exploring that approach once the basic framework is in place here. > I am wondering whether we need to change > calc_non_nestloop_required_outer() similar to > calc_nestloop_required_outer() just to keep their signatures in sync. I haven't looked at the patch, but I don't think you need to worry about that. > Should I work on completing reparamterized_path_by_child() to support > all kinds of paths? Yes, or at the very least all scans, like reparameterize_path() already does. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > On a further testing of this patch I find another case when it is > showing regression, the time taken with patch is around 160 secs and > without it is 125 secs. This is basically the same problem as before; the partitionwise case is doing the hash joins with the sides flipped from the optimal strategy. I bet that's a bug in the code rather than a problem with the concept. > Another minor thing to note that is planning time is almost twice with > this patch, though I understand that this is for scenarios with really > big 'big data' so this may not be a serious issue in such cases, but > it'd be good if we can keep an eye on this that it doesn't exceed the > computational bounds for a really large number of tables.. Yes, this is definitely going to use significant additional planning time and memory. There are several possible strategies for improving that situation, but I think we need to get the basics in place first. That's why the proposal is now to have this turned off by default. People joining really big tables that happen to be equipartitioned are likely to want to turn it on, though, even before those optimizations are done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> > On a further testing of this patch I find another case when it is > showing regression, the time taken with patch is around 160 secs and > without it is 125 secs. > Another minor thing to note that is planning time is almost twice with > this patch, though I understand that this is for scenarios with really > big 'big data' so this may not be a serious issue in such cases, but > it'd be good if we can keep an eye on this that it doesn't exceed the > computational bounds for a really large number of tables. Right, planning time would be proportional to the number of partitions at least in the first version. We may improve upon it later. > Please find the attached .out file to check the output I witnessed and > let me know if anymore information is required > Schema and data was similar to the preciously shared schema with the > addition of more data for this case, parameter settings used were: > work_mem = 1GB > random_page_cost = seq_page_cost = 0.1 The patch does not introduce any new costing model. It costs the partition-wise join as sum of costs of joins between partitions. The method to create the paths for joins between partitions is same as creating the paths for joins between regular tables and then the method to collect paths across partition-wise joins is same as collecting paths across child base relations. So, there is a large chance that the costing for joins between partitions might have a problem which is showing up here. There may be some special handling for regular tables versus child tables that may be the root cause. But I have not seen that kind of code till now. Can you please provide the outputs of individual partition-joins? If the plans for joins between partitions are same as the ones chosen for partition-wise joins, we may need to fix the existing join cost models. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> > Hmm. I would kind of like to move the IS_JOIN_REL() and > IS_OTHER_REL() stuff to the front of the series. In other words, I > propose that we add those macros first, each testing for only the one > kind of RelOptInfo that exists today, and change all the code to use > them. Then, when we add child joinrels, we can modify the macros at > the same time. The problem with doing it the way you have it is that > those changes will have to be squashed into the main partitionwise > join commit, because otherwise stuff will be broken. Doing it the > other way around lets us commit that bit separately. > I can provide a patch with adjust_appendrel_attrs_multilevel() changed to child-joins, which can be applied before multi-level partitioin-wise support patch but after partition-wise implementation patch. You may consider applying that patch separately before multi-level partition-wise support, in case we see that multi-level partition-wise join support can be committed. Does that sound good? That way we save changing those macros twice. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> Hmm. I would kind of like to move the IS_JOIN_REL() and >> IS_OTHER_REL() stuff to the front of the series. In other words, I >> propose that we add those macros first, each testing for only the one >> kind of RelOptInfo that exists today, and change all the code to use >> them. Then, when we add child joinrels, we can modify the macros at >> the same time. The problem with doing it the way you have it is that >> those changes will have to be squashed into the main partitionwise >> join commit, because otherwise stuff will be broken. Doing it the >> other way around lets us commit that bit separately. > > I can provide a patch with adjust_appendrel_attrs_multilevel() changed > to child-joins, which can be applied before multi-level > partitioin-wise support patch but after partition-wise implementation > patch. You may consider applying that patch separately before > multi-level partition-wise support, in case we see that multi-level > partition-wise join support can be committed. Does that sound good? > That way we save changing those macros twice. That seems different than what I suggested and I'm not sure what the reason is for the difference? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Mar 20, 2017 at 10:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> Hmm. I would kind of like to move the IS_JOIN_REL() and >>> IS_OTHER_REL() stuff to the front of the series. In other words, I >>> propose that we add those macros first, each testing for only the one >>> kind of RelOptInfo that exists today, and change all the code to use >>> them. Then, when we add child joinrels, we can modify the macros at >>> the same time. The problem with doing it the way you have it is that >>> those changes will have to be squashed into the main partitionwise >>> join commit, because otherwise stuff will be broken. Doing it the >>> other way around lets us commit that bit separately. >> >> I can provide a patch with adjust_appendrel_attrs_multilevel() changed >> to child-joins, which can be applied before multi-level >> partitioin-wise support patch but after partition-wise implementation >> patch. You may consider applying that patch separately before >> multi-level partition-wise support, in case we see that multi-level >> partition-wise join support can be committed. Does that sound good? >> That way we save changing those macros twice. > > That seems different than what I suggested and I'm not sure what the > reason is for the difference? > The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing the code to use it will look quite odd by itself. We are not changing all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use those. There is code which needs to check those kinds, instead of "all join rels" or "all other rels" resp. So the patch will add those macros, change only few places to use those macros, which are intended to be changed while applying partition-wise join support for single level partitioned table. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> That seems different than what I suggested and I'm not sure what the >> reason is for the difference? > > The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing > the code to use it will look quite odd by itself. We are not changing > all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use > those. There is code which needs to check those kinds, instead of "all > join rels" or "all other rels" resp. So the patch will add those > macros, change only few places to use those macros, which are intended > to be changed while applying partition-wise join support for single > level partitioned table. Hmm. You might be right, but I'm not convinced. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rajkumar Raghuwanshi
Дата:
> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: I have created some test to cover partition wise joins with postgres_fdw, also verified make check. patch attached. Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> >> On a further testing of this patch I find another case when it is >> showing regression, the time taken with patch is around 160 secs and >> without it is 125 secs. >> Another minor thing to note that is planning time is almost twice with >> this patch, though I understand that this is for scenarios with really >> big 'big data' so this may not be a serious issue in such cases, but >> it'd be good if we can keep an eye on this that it doesn't exceed the >> computational bounds for a really large number of tables. > > Right, planning time would be proportional to the number of partitions > at least in the first version. We may improve upon it later. > >> Please find the attached .out file to check the output I witnessed and >> let me know if anymore information is required >> Schema and data was similar to the preciously shared schema with the >> addition of more data for this case, parameter settings used were: >> work_mem = 1GB >> random_page_cost = seq_page_cost = 0.1 this doesn't look good. Why do you set both these costs to the same value? > > The patch does not introduce any new costing model. It costs the > partition-wise join as sum of costs of joins between partitions. The > method to create the paths for joins between partitions is same as > creating the paths for joins between regular tables and then the > method to collect paths across partition-wise joins is same as > collecting paths across child base relations. So, there is a large > chance that the costing for joins between partitions might have a > problem which is showing up here. There may be some special handling > for regular tables versus child tables that may be the root cause. But > I have not seen that kind of code till now. > > Can you please provide the outputs of individual partition-joins? If > the plans for joins between partitions are same as the ones chosen for > partition-wise joins, we may need to fix the existing join cost > models. Offlist, Rafia shared the outputs of joins between partitions and join between partitioned table. The joins between partitions look similar to those pick up by the partition-wise join. So, it looks that some costing error in regular joins is resulting in an costing error in partition-wise join as suspected. Attached the SQL and the output. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> That seems different than what I suggested and I'm not sure what the >>> reason is for the difference? >> >> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing >> the code to use it will look quite odd by itself. We are not changing >> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use >> those. There is code which needs to check those kinds, instead of "all >> join rels" or "all other rels" resp. So the patch will add those >> macros, change only few places to use those macros, which are intended >> to be changed while applying partition-wise join support for single >> level partitioned table. > > Hmm. You might be right, but I'm not convinced. Ok. changed as per your request in the latest set of patches. There are some more changes as follows 1. In the earlier patch set the changes related to calc_nestloop_required_outer() and related functions were spread across multiple patches. That was unintentional. This patch set has all those changes in a single patch. 2. Rajkumar reported a crash offlist. When one of the joining multi-level partitioned relations is empty, an assertion in try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels); failed since it didn't find part_rels for a subpartition. The problem here is set_append_rel_size() does not call set_rel_size() and hence set_append_rel_size() if a child is found to be empty, a scenario described in [1]. It's the later one which sets the part_rels for a partitioned relation and hence the subpartitions do not get part_rels since set_append_rel_size() is never called for those. Generally, if a partitioned relation is found to be empty before we set part_rels, we may not want to spend time in creating/collecting child RelOptInfos, since they will be empty anyway. If part_rels isn't present, part_scheme doesn't make sense. So an empty partitioned table without any partitions can be treated as unpartitioned. So, I have fixed set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a relation empty, to reset partition scheme when those conditions are met. This fix is included as a separate patch. Let me know if this looks good to you. 3. I am in the process of completing reparameterize_paths_by_child() by adding all possible paths. I have restructured the function to look better and have one switch case instead of two. Also added more path types including ForeignPath, for which I have added a FDW hook, with documentation, for handling fdw_private. Please let me know if this looks good to you. I am thinking of similar hook for CustomPath. I will continue to add more path types to reparameterize_path_by_child(). I am wondering whether we should bring 0007 he patche adjusting code to work with child-joins before 0006, partition-wise join. 0006 needs it, but 0007 doesn't depend upon 0006. Will that be any better? [1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com, subject "Asymmetry between parent and child wrt "false" quals" -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Thanks Rajkumar. Added those in the latest set of patches. On Tue, Mar 21, 2017 at 3:52 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote: >> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: > > I have created some test to cover partition wise joins with > postgres_fdw, also verified make check. > patch attached. > > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Tue, Mar 21, 2017 at 7:41 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>> >>> On a further testing of this patch I find another case when it is >>> showing regression, the time taken with patch is around 160 secs and >>> without it is 125 secs. >>> Another minor thing to note that is planning time is almost twice with >>> this patch, though I understand that this is for scenarios with really >>> big 'big data' so this may not be a serious issue in such cases, but >>> it'd be good if we can keep an eye on this that it doesn't exceed the >>> computational bounds for a really large number of tables. >> >> Right, planning time would be proportional to the number of partitions >> at least in the first version. We may improve upon it later. >> >>> Please find the attached .out file to check the output I witnessed and >>> let me know if anymore information is required >>> Schema and data was similar to the preciously shared schema with the >>> addition of more data for this case, parameter settings used were: >>> work_mem = 1GB >>> random_page_cost = seq_page_cost = 0.1 > > this doesn't look good. Why do you set both these costs to the same value? That's a perfectly reasonable configuration if the data is in memory on a medium with fast random access, like an SSD. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Here's the set of patches rebased on latest head, which also has a commit to eliminate scans on partitioned tables. This change has caused problems with multi-level partitioned tables, that I have not fixed in this patch set. Also a couple of partition-wise join plans for single-level partitioned tables have changed to non-partition-wise joins. I haven't fixed those as well. I have added a separate patch to fix add_paths_to_append_rel() to collect partitioned_rels list for join relations. Please let me know if this looks good. I think it needs to be merged into some other patch, but I am not sure which. Probably we should just treat it as another refactoring patch. On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>>> That seems different than what I suggested and I'm not sure what the >>>> reason is for the difference? >>> >>> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing >>> the code to use it will look quite odd by itself. We are not changing >>> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use >>> those. There is code which needs to check those kinds, instead of "all >>> join rels" or "all other rels" resp. So the patch will add those >>> macros, change only few places to use those macros, which are intended >>> to be changed while applying partition-wise join support for single >>> level partitioned table. >> >> Hmm. You might be right, but I'm not convinced. > > Ok. changed as per your request in the latest set of patches. > > There are some more changes as follows > 1. In the earlier patch set the changes related to > calc_nestloop_required_outer() and related functions were spread > across multiple patches. That was unintentional. This patch set has > all those changes in a single patch. > > 2. Rajkumar reported a crash offlist. When one of the joining > multi-level partitioned relations is empty, an assertion in > try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels); > failed since it didn't find part_rels for a subpartition. The problem > here is set_append_rel_size() does not call set_rel_size() and hence > set_append_rel_size() if a child is found to be empty, a scenario > described in [1]. It's the later one which sets the part_rels for a > partitioned relation and hence the subpartitions do not get part_rels > since set_append_rel_size() is never called for those. Generally, if a > partitioned relation is found to be empty before we set part_rels, we > may not want to spend time in creating/collecting child RelOptInfos, > since they will be empty anyway. If part_rels isn't present, > part_scheme doesn't make sense. So an empty partitioned table without > any partitions can be treated as unpartitioned. So, I have fixed > set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a > relation empty, to reset partition scheme when those conditions are > met. This fix is included as a separate patch. Let me know if this > looks good to you. > > 3. I am in the process of completing reparameterize_paths_by_child() > by adding all possible paths. I have restructured the function to look > better and have one switch case instead of two. Also added more path > types including ForeignPath, for which I have added a FDW hook, with > documentation, for handling fdw_private. Please let me know if this > looks good to you. I am thinking of similar hook for CustomPath. I > will continue to add more path types to > reparameterize_path_by_child(). > > I am wondering whether we should bring 0007 he patche adjusting code > to work with child-joins before 0006, partition-wise join. 0006 needs > it, but 0007 doesn't depend upon 0006. Will that be any better? > > [1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com, > subject "Asymmetry between parent and child wrt "false" quals" > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Tue, Mar 21, 2017 at 10:40 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here's the set of patches rebased on latest head, which also has a > commit to eliminate scans on partitioned tables. This change has > caused problems with multi-level partitioned tables, that I have not > fixed in this patch set. Also a couple of partition-wise join plans > for single-level partitioned tables have changed to non-partition-wise > joins. I haven't fixed those as well. > > I have added a separate patch to fix add_paths_to_append_rel() to > collect partitioned_rels list for join relations. Please let me know > if this looks good. I think it needs to be merged into some other > patch, but I am not sure which. Probably we should just treat it as > another refactoring patch. > > On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>>> That seems different than what I suggested and I'm not sure what the >>>>> reason is for the difference? >>>> >>>> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing >>>> the code to use it will look quite odd by itself. We are not changing >>>> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use >>>> those. There is code which needs to check those kinds, instead of "all >>>> join rels" or "all other rels" resp. So the patch will add those >>>> macros, change only few places to use those macros, which are intended >>>> to be changed while applying partition-wise join support for single >>>> level partitioned table. >>> >>> Hmm. You might be right, but I'm not convinced. >> >> Ok. changed as per your request in the latest set of patches. >> >> There are some more changes as follows >> 1. In the earlier patch set the changes related to >> calc_nestloop_required_outer() and related functions were spread >> across multiple patches. That was unintentional. This patch set has >> all those changes in a single patch. >> >> 2. Rajkumar reported a crash offlist. When one of the joining >> multi-level partitioned relations is empty, an assertion in >> try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels); >> failed since it didn't find part_rels for a subpartition. The problem >> here is set_append_rel_size() does not call set_rel_size() and hence >> set_append_rel_size() if a child is found to be empty, a scenario >> described in [1]. It's the later one which sets the part_rels for a >> partitioned relation and hence the subpartitions do not get part_rels >> since set_append_rel_size() is never called for those. Generally, if a >> partitioned relation is found to be empty before we set part_rels, we >> may not want to spend time in creating/collecting child RelOptInfos, >> since they will be empty anyway. If part_rels isn't present, >> part_scheme doesn't make sense. So an empty partitioned table without >> any partitions can be treated as unpartitioned. So, I have fixed >> set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a >> relation empty, to reset partition scheme when those conditions are >> met. This fix is included as a separate patch. Let me know if this >> looks good to you. >> >> 3. I am in the process of completing reparameterize_paths_by_child() >> by adding all possible paths. I have restructured the function to look >> better and have one switch case instead of two. Also added more path >> types including ForeignPath, for which I have added a FDW hook, with >> documentation, for handling fdw_private. Please let me know if this >> looks good to you. I am thinking of similar hook for CustomPath. I >> will continue to add more path types to >> reparameterize_path_by_child(). >> >> I am wondering whether we should bring 0007 he patche adjusting code >> to work with child-joins before 0006, partition-wise join. 0006 needs >> it, but 0007 doesn't depend upon 0006. Will that be any better? >> >> [1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com, >> subject "Asymmetry between parent and child wrt "false" quals" >> In an attempt to test the geqo side of this patch, I reduced geqo_threshold to 6 and set enable_partitionwise_join to to true and tried following query, which crashed, explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a = prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000 order by prt.a desc; Stack-trace for the crash is as follows, Program received signal SIGSEGV, Segmentation fault. 0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30, required_outer=0x2ff6d30) at relnode.c:1534 1534 if (bms_equal(ppi->ppi_req_outer, required_outer)) (gdb) bt #0 0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30, required_outer=0x2ff6d30) at relnode.c:1534 #1 0x000000000079b8bb in reparameterize_path_by_child (root=0x2df7550, path=0x2f6dec0, child_rel=0x2d4a860) at pathnode.c:3455 #2 0x000000000075be30 in try_nestloop_path (root=0x2df7550, joinrel=0x2ff51b0, outer_path=0x2f96540, inner_path=0x2f6dec0, pathkeys=0x0, jointype=JOIN_INNER, extra=0x7fffe6b4e130) at joinpath.c:344 #3 0x000000000075d55b in match_unsorted_outer (root=0x2df7550, joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30, jointype=JOIN_INNER, extra=0x7fffe6b4e130) at joinpath.c:1389 #4 0x000000000075bc5f in add_paths_to_joinrel (root=0x2df7550, joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30, jointype=JOIN_INNER, sjinfo=0x3076bc8, restrictlist=0x3077168) at joinpath.c:234 #5 0x000000000075f1d5 in populate_joinrel_with_paths (root=0x2df7550, rel1=0x2d3fe30, rel2=0x2d4a860, joinrel=0x2ff51b0, sjinfo=0x3076bc8, restrictlist=0x3077168) at joinrels.c:793 #6 0x0000000000760107 in try_partition_wise_join (root=0x2df7550, rel1=0x2d3f6d8, rel2=0x2d4a1a0, joinrel=0x30752f0, parent_sjinfo=0x7fffe6b4e2d0, parent_restrictlist=0x3075768) at joinrels.c:1401 #7 0x000000000075f0e6 in make_join_rel (root=0x2df7550, rel1=0x2d3f6d8, rel2=0x2d4a1a0) at joinrels.c:744 #8 0x0000000000742053 in merge_clump (root=0x2df7550, clumps=0x3075270, new_clump=0x30752a8, force=0 '\000') at geqo_eval.c:260 #9 0x0000000000741f1c in gimme_tree (root=0x2df7550, tour=0x2ff2430, num_gene=6) at geqo_eval.c:199 #10 0x0000000000741df5 in geqo_eval (root=0x2df7550, tour=0x2ff2430, num_gene=6) at geqo_eval.c:102 #11 0x000000000074288a in random_init_pool (root=0x2df7550, pool=0x2ff23d0) at geqo_pool.c:109 #12 0x00000000007422a6 in geqo (root=0x2df7550, number_of_rels=6, initial_rels=0x2ff22d0) at geqo_main.c:114 #13 0x0000000000747f19 in make_rel_from_joinlist (root=0x2df7550, joinlist=0x2dce940) at allpaths.c:2333 #14 0x0000000000744e7e in make_one_rel (root=0x2df7550, joinlist=0x2dce940) at allpaths.c:182 #15 0x0000000000772df9 in query_planner (root=0x2df7550, tlist=0x2dec2c0, qp_callback=0x777ce1 <standard_qp_callback>, qp_extra=0x7fffe6b4e700) at planmain.c:254 Please let me know if any more information is required on this. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
>>> > In an attempt to test the geqo side of this patch, I reduced > geqo_threshold to 6 and set enable_partitionwise_join to to true and > tried following query, which crashed, > > explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a > = prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000 > order by prt.a desc; > > Stack-trace for the crash is as follows, > Nice catch. When reparameterize_path_by_child() may be running in a temporary memory context while running in GEQO mode. It may add a new PPI to base relation all in the temporary context. In the next GEQO cycle, the ppilist will be clobbered since the temporary context is reset for each geqo cycle. The fix is to allocate PPI in the same memory context as the RelOptInfo similar to mark_dummy_rel(). I also found another problem. In geqo, we never call generate_partition_wise_join_paths() which set cheapest paths for each child-join. Because of this cheapest_*_paths are never set for those rels, thus segfaulting in functions like sort_inner_and_outer() which use those. Here's patch fixing both the issues. Please let me know if it fixes the issues you are seeing. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Wed, Mar 22, 2017 at 3:19 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >>>> >> In an attempt to test the geqo side of this patch, I reduced >> geqo_threshold to 6 and set enable_partitionwise_join to to true and >> tried following query, which crashed, >> >> explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a >> = prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000 >> order by prt.a desc; >> >> Stack-trace for the crash is as follows, >> > Nice catch. When reparameterize_path_by_child() may be running in a > temporary memory context while running in GEQO mode. It may add a new > PPI to base relation all in the temporary context. In the next GEQO > cycle, the ppilist will be clobbered since the temporary context is > reset for each geqo cycle. The fix is to allocate PPI in the same > memory context as the RelOptInfo similar to mark_dummy_rel(). > > I also found another problem. In geqo, we never call > generate_partition_wise_join_paths() which set cheapest paths for each > child-join. Because of this cheapest_*_paths are never set for those > rels, thus segfaulting in functions like sort_inner_and_outer() which > use those. > > Here's patch fixing both the issues. Please let me know if it fixes > the issues you are seeing. I tested the applied patch, it is fixing the reported issue. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Here's set of updated patches rebased on 1148e22a82edc96172fc78855da392b6f0015c88. I have fixed all the issues reported till now. I have also completed reparameterize_path_by_child() for all the required paths. There's no TODO there now. :) The function has grown quite long now and might take some time to review. Given the size, I am wondering whether we should separate that fix from the main partition-wise join fix. That will make reviewing that function easier, allowing a careful review. Here's the idea how that can be done. As explained in the commit of 0009, the function is required in case of lateral joins between partitioned relations. For a A LATERAL JOIN B, B is the minimum required parameterization by A. Hence children of A i.e. A1, A2 ... all require their paths to be parameterized by B. When that comes to partition-wise joins, A1 requires its paths to be parameterized by B1 (matching partition from B). Otherwise we can not create paths for A1B1. This means that we require to reparameterize all A1's paths to be reparameterized by B1 using function reparameterize_paths_by_child(). So the function needs to support reparameterization of all the paths; we do not know which of those have survived add_path(). But if we disable partition-wise join for lateral joins i.e. when direct_lateral_relids of one relation contains the any subset of the relids in the other relation, we do not need reparameterize_path_by_child(). Please let me know if this strategy will help to make review and commit easier. After the commit, commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50 Author: Robert Haas <rhaas@postgresql.org> Date: Tue Mar 21 09:48:04 2017 -0400 Don't scan partitioned tables. We do not create any AppendRelInfos and hence RelOptInfos for the partitioned tables. My approach to attach multi-level partitioned join was to store RelOptInfos of immediate partitions in part_rels of RelOptInfo of a partitioned table, thus maintaining a tree of RelOptInfos reflecting partitioning tree. This allows to add append paths to intermediate RelOptInfos, flattening them as we go up the partitioning hierarchy. With no RelOptInfos for intermediate partitions, we can support multi-level partition-wise join only in limited cases when the partitioning hierarchy of the joining table exactly matches. Please refer [1] for some more discussion. I think we need the RelOptInfos for the partitions, which are partitioned to hold the "append" paths containing paths from their children and to match the partitions in partition-wise join. Similar hierarchy will be created for partitioned joins, with partitioned join partitions. So, I have not changed the multi-level partition-wise join support patches. After applying 0011-0013 the multi-level partitioning tests fail with error "could not find the RelOptInfo of a partition with oid", since it does not find the RelOptInfos of partitions which are partitioned. [1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> > I tested the applied patch, it is fixing the reported issue. Thanks for the confirmation Rafia. I have included the fix in the latest set of patches. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I have also completed reparameterize_path_by_child() for all the > required paths. There's no TODO there now. :) The function has grown > quite long now and might take some time to review. Given the size, I > am wondering whether we should separate that fix from the main > partition-wise join fix. That will make reviewing that function > easier, allowing a careful review. Here's the idea how that can be > done. As explained in the commit of 0009, the function is required in > case of lateral joins between partitioned relations. For a A LATERAL > JOIN B, B is the minimum required parameterization by A. Hence > children of A i.e. A1, A2 ... all require their paths to be > parameterized by B. When that comes to partition-wise joins, A1 > requires its paths to be parameterized by B1 (matching partition from > B). Otherwise we can not create paths for A1B1. This means that we > require to reparameterize all A1's paths to be reparameterized by B1 > using function reparameterize_paths_by_child(). So the function needs > to support reparameterization of all the paths; we do not know which > of those have survived add_path(). But if we disable partition-wise > join for lateral joins i.e. when direct_lateral_relids of one relation > contains the any subset of the relids in the other relation, we do not > need reparameterize_path_by_child(). Please let me know if this > strategy will help to make review and commit easier. In my testing last week, reparameterize_path_by_child() was essential for nested loops to work properly, even without LATERAL. Without it, the parameterized path ends up containing vars that reference the parent varno instead of the child varno. That confused later planner stages so that those Vars did not get replaced with Param during replace_nestloop_params(), eventually resulting in a crash at execution time. Based on that experiment, I think we could consider having reparameterize_path_by_child() handle only scan paths as reparameterize_path() does, and just give up on plans like this: Append -> Left Join -> Scan on a -> Inner Join -> Index Scan on b -> Index Scan on c [repeat for each partition] But I doubt we can get by without it altogether. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Mar 22, 2017 at 6:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I have also completed reparameterize_path_by_child() for all the >> required paths. There's no TODO there now. :) The function has grown >> quite long now and might take some time to review. Given the size, I >> am wondering whether we should separate that fix from the main >> partition-wise join fix. That will make reviewing that function >> easier, allowing a careful review. Here's the idea how that can be >> done. As explained in the commit of 0009, the function is required in >> case of lateral joins between partitioned relations. For a A LATERAL >> JOIN B, B is the minimum required parameterization by A. Hence >> children of A i.e. A1, A2 ... all require their paths to be >> parameterized by B. When that comes to partition-wise joins, A1 >> requires its paths to be parameterized by B1 (matching partition from >> B). Otherwise we can not create paths for A1B1. This means that we >> require to reparameterize all A1's paths to be reparameterized by B1 >> using function reparameterize_paths_by_child(). So the function needs >> to support reparameterization of all the paths; we do not know which >> of those have survived add_path(). But if we disable partition-wise >> join for lateral joins i.e. when direct_lateral_relids of one relation >> contains the any subset of the relids in the other relation, we do not >> need reparameterize_path_by_child(). Please let me know if this >> strategy will help to make review and commit easier. > > In my testing last week, reparameterize_path_by_child() was essential > for nested loops to work properly, even without LATERAL. Without it, > the parameterized path ends up containing vars that reference the > parent varno instead of the child varno. That confused later planner > stages so that those Vars did not get replaced with Param during > replace_nestloop_params(), eventually resulting in a crash at > execution time. I half-described the solution. Sorry. Along-with disabling partition-wise lateral joins, we have to disable nested loop child-joins where inner child is parameterized by the parent of the outer one. We will still have nestloop join between parents where inner relation is parameterized by the outer and every child of inner is parameterized by the outer. But we won't create nest loop joins where inner child is parameterized by the outer child, where we require reparameterize_path_by_child. We will loose this optimization only till we get reparameterize_path_by_child() committed. Basically, in try_nestloop_path() (in the patch 0009), if (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up creating nest loop path. That shouldn't create any problems. Did you experiment with this change in try_nestloop_path()? Can you please share the testcase? I will take a look at it. > Based on that experiment, I think we could consider > having reparameterize_path_by_child() handle only scan paths as > reparameterize_path() does, and just give up on plans like this: > > Append > -> Left Join > -> Scan on a > -> Inner Join > -> Index Scan on b > -> Index Scan on c > [repeat for each partition] > I am assuming that a, b and c are partitions of A, B and C resp. which are being joined and both or one of the scans on b and c are parameteried by a or scan of c is parameterized by b. I don't think we will get away by supporting just scan paths, since the inner side of lateral join can be any paths not just scan path. Or you are suggesting that we disable partition-wise lateral join and support reparameterization of only scan paths? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 22, 2017 at 9:59 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >> In my testing last week, reparameterize_path_by_child() was essential >> for nested loops to work properly, even without LATERAL. Without it, >> the parameterized path ends up containing vars that reference the >> parent varno instead of the child varno. That confused later planner >> stages so that those Vars did not get replaced with Param during >> replace_nestloop_params(), eventually resulting in a crash at >> execution time. > > I half-described the solution. Sorry. Along-with disabling > partition-wise lateral joins, we have to disable nested loop > child-joins where inner child is parameterized by the parent of the > outer one. We will still have nestloop join between parents where > inner relation is parameterized by the outer and every child of inner > is parameterized by the outer. But we won't create nest loop joins > where inner child is parameterized by the outer child, where we > require reparameterize_path_by_child. We will loose this optimization > only till we get reparameterize_path_by_child() committed. Basically, > in try_nestloop_path() (in the patch 0009), if > (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up > creating nest loop path. That shouldn't create any problems. > > Did you experiment with this change in try_nestloop_path()? Can you > please share the testcase? I will take a look at it. I didn't save the test case. It was basically just forcing a partitionwise nestloop join between two equipartitioned tables, with the calls to adjust_appendrel_attrs() ripped out of reparameterize_path_by_child(), just to see what would break. >> Based on that experiment, I think we could consider >> having reparameterize_path_by_child() handle only scan paths as >> reparameterize_path() does, and just give up on plans like this: >> >> Append >> -> Left Join >> -> Scan on a >> -> Inner Join >> -> Index Scan on b >> -> Index Scan on c >> [repeat for each partition] >> > > I am assuming that a, b and c are partitions of A, B and C resp. which > are being joined and both or one of the scans on b and c are > parameteried by a or scan of c is parameterized by b. Yes. > I don't think we will get away by supporting just scan paths, since > the inner side of lateral join can be any paths not just scan path. Or > you are suggesting that we disable partition-wise lateral join and > support reparameterization of only scan paths? I think if you can do a straight-up partitionwise nested loop between two tables A and B, that's pretty bad. But if there are more complex cases that involve parameterizing entire join trees which aren't covered, that's less bad. Parallel query almost entirely punts on LATERAL right now, and nobody's complained yet. I'm sure that'll need to get fixed someday, but not today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here's set of updated patches rebased on > 1148e22a82edc96172fc78855da392b6f0015c88. > > I have fixed all the issues reported till now. I don't understand why patch 0001 ends up changing every existing test for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(), yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up getting changed to use IS_OTHER_REL(). That's very surprising. Some of those tests are essentially checking for something that is going to have a scan plan rather than a join or upper plan, and those tests probably don't need to be modified; for example, the test in set_rel_consider_parallel() is obviously of this type. But others are testing whether we've got some kind of child rel, and those seem like they might need work. Going through a few specific examples: - generate_join_implied_equalities_for_ecs() assumes that any child rel is an other member rel. - generate_join_implied_equalities_broken() assumes that any child rel is an other member rel. - generate_implied_equalities_for_column() set is_child_rel on the assumption that only an other member rel can be a child rel. - eclass_useful_for_merging() assumes that the only kind of child rel is an other member rel. - find_childrel_appendrelinfo() assumes that any child rel is an other member rel. - find_childrel_top_parent() and find_childrel_parents() assume that children must be other member rels and their parents must be baserels. - adjust_appendrel_attrs_multilevel() assumes that children must be other member rels and their parents must be baserels. It's possible that, for various reasons, none of these code paths would ever be reachable by a child join, but it doesn't look likely to me. And even if that's true, some comment updates are probably needed, and maybe some renaming of functions too. In postgres_fdw, get_useful_ecs_for_relation() assumes that any child rel is an other member rel. I'm not sure if we're hoping that partitionwise join will work with postgres_fdw's join pushdown out of the chute, but clearly this would need to be adjusted to have any chance of being right. Some that seem OK: - set_rel_consider_parallel() is fine. - set_append_rel_size() is only going to be called for baserels or their children, so it's fine. - relation_excluded_by_constraints() is only intended to be called on baserels or their children, so it's fine. - check_index_predicates() is only intended to be called on baserels or their children, so it's fine. - query_planner() loops over baserels and their children, so it's fine. Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be used in some of these places, just for symmetry. The point is that there are really three questions here: (1) is it some kind of baserel (parent or child)? (2) is it some kind of joinrel (parent or child)? and (3) is it some kind of child (baserel or join)? Right now, both #2 and #3 are tested by just comparing against RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we add child joinrels. The goal of 0001, IMV, ought to be to try to figure out which of #1, #2, and #3 is being checked in each case and make that clear via use of an appropriate macro. (If is-other-baserel is the real test, then fine, but I bet that's a rare case.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
> >> I don't think we will get away by supporting just scan paths, since >> the inner side of lateral join can be any paths not just scan path. Or >> you are suggesting that we disable partition-wise lateral join and >> support reparameterization of only scan paths? > > I think if you can do a straight-up partitionwise nested loop between > two tables A and B, that's pretty bad. Ok. > But if there are more complex > cases that involve parameterizing entire join trees which aren't > covered, that's less bad. Parallel query almost entirely punts on > LATERAL right now, and nobody's complained yet. I'm sure that'll need > to get fixed someday, but not today. > Ok. I am suggesting this possibility in case we run of time to review and commit reparameterize_path_by_child() entirely. If we can do that, I will be happy. In case, we have to include a stripped down version of reparameterize_path_by_child(), with which I am fine too, we will need to disable LATERAL joins, so that we don't end up with an error "could not devise a query plan for the given query". -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Mar 23, 2017 at 1:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Here's set of updated patches rebased on >> 1148e22a82edc96172fc78855da392b6f0015c88. >> >> I have fixed all the issues reported till now. > > I don't understand why patch 0001 ends up changing every existing test > for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(), > yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up > getting changed to use IS_OTHER_REL(). That's very surprising. Some > of those tests are essentially checking for something that is going to > have a scan plan rather than a join or upper plan, and those tests > probably don't need to be modified; for example, the test in > set_rel_consider_parallel() is obviously of this type. But others are > testing whether we've got some kind of child rel, and those seem like > they might need work. Going through a few specific examples: > > - generate_join_implied_equalities_for_ecs() assumes that any child > rel is an other member rel. > - generate_join_implied_equalities_broken() assumes that any child rel > is an other member rel. Fixed those. > - generate_implied_equalities_for_column() set is_child_rel on the > assumption that only an other member rel can be a child rel. This function is called for indexes, which are not defined on the join relations. So, we shouldn't worry about child-joins here. I have added an assertion in there to make sure that that function gets called only for base and "other" member rels. > - eclass_useful_for_merging() assumes that the only kind of child rel > is an other member rel. This was being fixed in a later patch which had many small fixes for handling child-joins. But now I have moved that fix into 0001. > - find_childrel_appendrelinfo() assumes that any child rel is an other > member rel. The function is called for "other" member relation only. For joins we use find_appinfos_by_relids() We could replace find_childrel_appendrelinfo() with find_appinfos_by_relids(), which does same thing as find_childrel_appendrelinfo() for a relids set. But find_appinfos_by_relids() returns a list of AppendRelInfos, hence using it instead of find_childrel_appendrelinfo() will spend some memory and CPU cycles in creating a one element list and then extracting that element out of the list. So, I have not replaced usages of find_childrel_appendrelinfo() with find_appinfos_by_relids(). This also simplifies changes to get_useful_ecs_for_relation(). > - find_childrel_top_parent() and find_childrel_parents() assume that > children must be other member rels and their parents must be baserels. For partition-wise join implementation we save relids of topmost parent in RelOptInfo of child. We can directly use that instead of calling find_childrel_top_parent(). So, in 0001 I am adding top_parent_relids in RelOptInfo and getting rid of find_childrel_top_parent(). This also fixes get_useful_ecs_for_relation() in a better way. find_childrel_parents() is called only for simple relations not joins, since it's callers are called only for simple relations. I have added an assertion to that effect. > - adjust_appendrel_attrs_multilevel() assumes that children must be > other member rels and their parents must be baserels. It was being fixed in a later patch. In the attached patch set 0001 changes it to use IS_OTHER_REL(). > > It's possible that, for various reasons, none of these code paths > would ever be reachable by a child join, but it doesn't look likely to > me. And even if that's true, some comment updates are probably > needed, and maybe some renaming of functions too. Now commit messages of 0001 explains which instances of RELOPT_OTHER_MEMBER_REL and RELOPT_BASEREL have been changed, and which have been retained and why. Also, added assertions wherever necessary. > > In postgres_fdw, get_useful_ecs_for_relation() assumes that any child > rel is an other member rel. I'm not sure if we're hoping that > partitionwise join will work with postgres_fdw's join pushdown out of > the chute, but clearly this would need to be adjusted to have any > chance of being right. Fixed this as explained above. > > Some that seem OK: > > - set_rel_consider_parallel() is fine. > - set_append_rel_size() is only going to be called for baserels or > their children, so it's fine. > - relation_excluded_by_constraints() is only intended to be called on > baserels or their children, so it's fine. > - check_index_predicates() is only intended to be called on baserels > or their children, so it's fine. > - query_planner() loops over baserels and their children, so it's fine. > Right. > Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be > used in some of these places, just for symmetry. I was wondering about this as well. Although, I though it better not to touch base relations in partition-wise join. But now, I have added that macro and adjusted corresponding tests in the code. See 0001. You may actually want to squash 0001 and 0002 into a single patch. But for now, I have left those as separate. > The point is that > there are really three questions here: (1) is it some kind of baserel > (parent or child)? (2) is it some kind of joinrel (parent or child)? > and (3) is it some kind of child (baserel or join)? Right now, both > #2 and #3 are tested by just comparing against > RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we > add child joinrels. The goal of 0001, IMV, ought to be to try to > figure out which of #1, #2, and #3 is being checked in each case and > make that clear via use of an appropriate macro. (If is-other-baserel > is the real test, then fine, but I bet that's a rare case.) Agreed. I have gone through all the cases, and fixed the necessary ones as explained above and in the commit messages of 0001. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Amit Langote
Дата:
Hi Ashutosh, On 2017/03/23 21:48, Ashutosh Bapat wrote: >>> I have fixed all the issues reported till now. In patch 0007, the following code in have_partkey_equi_join() looks potentially unsafe: /* * If the clause refers to different partition keys from * both relations, it can not be used forpartition-wise join. */ if (ipk1 != ipk2) continue; /* * The clause allows partition-wise join if only it uses the same * operator family as that specifiedby the partition key. */ if (!list_member_oid(rinfo->mergeopfamilies, part_scheme->partopfamily[ipk1])) continue; What if ipk1 and ipk2 both turn out to be -1? Accessing part_schem->partopfamily[ipk1] would be incorrect, no? Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Amit Langote
Дата:
Hi Ashutosh, On 2017/03/23 21:48, Ashutosh Bapat wrote: >>> I have fixed all the issues reported till now. I've tried to fix your 0012 patch (Multi-level partitioned table expansion) considering your message earlier on this thread [1]. Especially the fact that no AppendRelInfo and RelOptInfo are allocated for partitioned child tables as of commit d3cc37f1d [2]. I've fixed expand_inherited_rtentry() such that AppendRelInfo *is* allocated for a partitioned child RTEs whose rte->inh is set to true. Such an RTE is recursively expanded with that RTE the parent. Also as I mentioned elsewhere [3], the multi-level inheritance expansion of partitioned table will break update/delete for partitioned table, which is because inheritance_planner() is not ready to handle inheritance sets structured that way. I tried to refactor inheritance_planner() such that its core logic can be recursively invoked for partitioned child RTEs. The resulting child paths and other auxiliary information related to planning across the hierarchy are maintained in one place using a struct to hold the same in a few flat lists. The refactoring didn't break any existing tests and a couple of new tests are added to check that it indeed works for multi-level partitioned tables expanded using new multi-level structure. There is some test failure in 0014 (Multi-level partition-wise join tests), probably because of the changes I made to 0012, which I didn't get time to check why, although I've checked using an example that multi-level join planning still works, so it's not completely broken either. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT%3DuQ%40mail.gmail.com [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d [3] https://www.postgresql.org/message-id/744d20fe-fc7b-f89e-8d06-6496ec537b86%40lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Fri, Mar 24, 2017 at 1:57 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Ashutosh, > > On 2017/03/23 21:48, Ashutosh Bapat wrote: >>>> I have fixed all the issues reported till now. > > In patch 0007, the following code in have_partkey_equi_join() looks > potentially unsafe: > > /* > * If the clause refers to different partition keys from > * both relations, it can not be used for partition-wise join. > */ > if (ipk1 != ipk2) > continue; > > /* > * The clause allows partition-wise join if only it uses the same > * operator family as that specified by the partition key. > */ > if (!list_member_oid(rinfo->mergeopfamilies, > part_scheme->partopfamily[ipk1])) > continue; > > What if ipk1 and ipk2 both turn out to be -1? Accessing > part_schem->partopfamily[ipk1] would be incorrect, no? Thanks for the report. Surprising this should have crashed sometime, but didn't ever. Neither it showed wrong output for queries where partition keys were not part of equi-joins. The reason being partopfamily[-1] had 0 in it, which when tested again list_member_oid(rinfo->mergeopfamilies, ..) returned false. Attached patches fix this code. Also, I have fixed few grammar mistakes, typos, renamed variables in PartitionSchemeData to match those in PartitionKey. I have squashed the patches introducing IS_JOIN_REL, IS_OTHER_REL, IS_SIMPLE_REL into one. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Fri, Mar 24, 2017 at 4:18 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hi Ashutosh, > > On 2017/03/23 21:48, Ashutosh Bapat wrote: >>>> I have fixed all the issues reported till now. > > I've tried to fix your 0012 patch (Multi-level partitioned table > expansion) considering your message earlier on this thread [1]. > Especially the fact that no AppendRelInfo and RelOptInfo are allocated for > partitioned child tables as of commit d3cc37f1d [2]. I've fixed > expand_inherited_rtentry() such that AppendRelInfo *is* allocated for a > partitioned child RTEs whose rte->inh is set to true. Such an RTE is > recursively expanded with that RTE the parent. > > Also as I mentioned elsewhere [3], the multi-level inheritance expansion > of partitioned table will break update/delete for partitioned table, which > is because inheritance_planner() is not ready to handle inheritance sets > structured that way. I tried to refactor inheritance_planner() such that > its core logic can be recursively invoked for partitioned child RTEs. The > resulting child paths and other auxiliary information related to planning > across the hierarchy are maintained in one place using a struct to hold > the same in a few flat lists. The refactoring didn't break any existing > tests and a couple of new tests are added to check that it indeed works > for multi-level partitioned tables expanded using new multi-level structure. > > There is some test failure in 0014 (Multi-level partition-wise join > tests), probably because of the changes I made to 0012, which I didn't get > time to check why, although I've checked using an example that multi-level > join planning still works, so it's not completely broken either. > I have gone through the patch, and it looks good to me. Here's the set of patches with this patch included. Fixed the testcase failures. Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I have gone through the patch, and it looks good to me. Here's the set > of patches with this patch included. Fixed the testcase failures. > Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9. This version of 0001 looks much better to me, but I still have some concerns. I think we should also introduce IS_UPPER_REL() at the same time, for symmetry and because partitionwise aggregate will need it, and use it in place of direct tests against RELOPT_UPPER_REL. I think it would make sense to change the test in deparseFromExpr() to check for IS_JOIN_REL() || IS_SIMPLE_REL(). There's no obvious reason why that shouldn't be OK, and it would remove the last direct test against RELOPT_JOINREL in the tree, and it will probably need to be changed for partitionwise aggregate anyway. Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))? I notice that you did this in some other places such as generate_implied_equalities_for_column(), and I like that. If for some reason that's not going to work, then it's doubtful whether Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to survive either. Similarly, I think relation_excluded_by_constraints() would also benefit from Assert(IS_SIMPLE_REL(rel)). Why not set top_parent_relids earlier, when actually creating the RelOptInfo? I think you could just change build_simple_rel() so that instead of passing RelOptKind reloptkind, you instead pass RelOptInfo *parent. I think postponing that work until set_append_rel_size() just introduces possible bugs resulting from it not being set early enough. Apart from the above, I think 0001 is in good shape. Regarding 0002, I think the parts that involve factoring out find_param_path_info() are uncontroversial. Regarding the changes to adjust_appendrel_attrs(), my main question is whether we wouldn't be better off using an array representation rather than a List representation. In other words, this function could take PlannerInfo *root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing callers doing adjust_appendrel_attrs(root, whatever, appinfo) could just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not needing to allocate. To make this work, adjust_child_relids() and find_appinfos_by_relids() would need to be adjusted to use a similar argument-passing convention. I suspect this makes iterating over the AppendRelInfos mildly faster, too, apart from the memory savings. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Regarding 0002, I think the parts that involve factoring out > find_param_path_info() are uncontroversial. Regarding the changes to > adjust_appendrel_attrs(), my main question is whether we wouldn't be > better off using an array representation rather than a List > representation. In other words, this function could take PlannerInfo > *root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing > callers doing adjust_appendrel_attrs(root, whatever, appinfo) could > just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not > needing to allocate. To make this work, adjust_child_relids() and > find_appinfos_by_relids() would need to be adjusted to use a similar > argument-passing convention. I suspect this makes iterating over the > AppendRelInfos mildly faster, too, apart from the memory savings. Still regarding 0002, looking at adjust_appendrel_attrs_multilevel, could we have a common code path for the baserel and joinrel cases? It seems like perhaps you could just loop over root->append_rel_list. For each appinfo, if (bms_is_member(appinfo->child_relid, child_rel->relids)) bms_add_member(parent_relids, appinfo->parent_relid). This implementation would have some loss of efficiency in the single-rel case because we'd scan all of the AppendRelInfos in the list even if there's only one relid. But you could fix that by writing it like this: foreach (lc, root->append_rel_list) { if (bms_is_member(appinfo->child_relid, child_rel->relids)) { bms_add_member(parent_relids, appinfo->parent_relid); if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL) break; /* only one relidto find, and we've found it */ } } Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids)); That seems pretty slick. It is just as fast as the current implementation for the single-rel case. It allocates no memory (unlike what you've got now). And it handles the joinrel case using essentially the same code as the simple rel case. In 0003, it seems that it would be more consistent with what you did elsewhere if the last argument to allow_star_schema_join were named inner_paramrels rather than innerparams. Other than that, I don't see anything to complain about. In 0004: + Assert(!rel->part_rels[cnt_parts]); + rel->part_rels[cnt_parts] = childrel; break here? +static void +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, + Relation relation, bool inhparent) +{ + /* No partitioning information for an unpartitioned relation. */ + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || + !inhparent || I still think the inhparent check should be moved to the caller. In 0005: + * Returns a list of the RT indexes of the partitioned child relations + * with any of joining relations' rti as the root parent RT index. I found this wording confusing. Maybe: Build and return a list containing the RTI of every partitioned relation which is a child of some rel included in the join. + * Note: Only call this function on joins between partitioned tables. Or what, the boogeyman will come and get you? (In other words, I don't think that's a very informative comment.) I don't think 0011 is likely to be acceptable in current form. I can't imagine that we just went to the trouble of getting rid of AppendRelInfos for child partitioned rels only to turn around and put them back again. If you just need the parent-child mappings, you can get that from the PartitionedChildRelInfo list. Unfortunately, I don't think we're likely to be able to get this whole patch series into a committable form in the next few days, but I'd like to keep reviewing it and working with you on it; there's always next cycle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 28, 2017 at 10:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I have gone through the patch, and it looks good to me. Here's the set >> of patches with this patch included. Fixed the testcase failures. >> Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9. > > This version of 0001 looks much better to me, but I still have some concerns. > > I think we should also introduce IS_UPPER_REL() at the same time, for > symmetry and because partitionwise aggregate will need it, and use it > in place of direct tests against RELOPT_UPPER_REL. Ok. Done. I introduced IS_JOIN_REL and IS_OTHER_REL only to simplify the tests for child-joins. But now we have grown this patch with IS_SIMPLE_REL() and IS_UPPER_REL(). That has introduced changes unrelated to partition-wise join. But I am happy with the way the code looks now with all IS_*_REL() macros. If we delay this commit, some more usages of bare RELOPT_* would creep in the code. To avoid that, we may want to commit these changes in v10. > > I think it would make sense to change the test in deparseFromExpr() to > check for IS_JOIN_REL() || IS_SIMPLE_REL(). There's no obvious reason > why that shouldn't be OK, and it would remove the last direct test > against RELOPT_JOINREL in the tree, and it will probably need to be > changed for partitionwise aggregate anyway. Done. However, we need another assertion to make sure than an "other" upper rel has an "other" rel as scanrel. That can be added when partition-wise aggregate, which would introduce "other" upper rels, is implemented. > > Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))? I notice that > you did this in some other places such as > generate_implied_equalities_for_column(), and I like that. If for > some reason that's not going to work, then it's doubtful whether > Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to > survive either. Done. Also modified prologue of that function to explicitly say simple "append relation" since we can have join "append relations" and upper "append relations" with partition-wise operations. > > Similarly, I think relation_excluded_by_constraints() would also > benefit from Assert(IS_SIMPLE_REL(rel)). Done. > > Why not set top_parent_relids earlier, when actually creating the > RelOptInfo? I think you could just change build_simple_rel() so that > instead of passing RelOptKind reloptkind, you instead pass RelOptInfo > *parent. I think postponing that work until set_append_rel_size() > just introduces possible bugs resulting from it not being set early > enough. Done. > > Apart from the above, I think 0001 is in good shape. > > Regarding 0002, I think the parts that involve factoring out > find_param_path_info() are uncontroversial. Regarding the changes to > adjust_appendrel_attrs(), my main question is whether we wouldn't be > better off using an array representation rather than a List > representation. In other words, this function could take PlannerInfo > *root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing > callers doing adjust_appendrel_attrs(root, whatever, appinfo) could > just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not > needing to allocate. To make this work, adjust_child_relids() and > find_appinfos_by_relids() would need to be adjusted to use a similar > argument-passing convention. I suspect this makes iterating over the > AppendRelInfos mildly faster, too, apart from the memory savings. Done. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Regarding 0002, I think the parts that involve factoring out >> find_param_path_info() are uncontroversial. Regarding the changes to >> adjust_appendrel_attrs(), my main question is whether we wouldn't be >> better off using an array representation rather than a List >> representation. In other words, this function could take PlannerInfo >> *root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing >> callers doing adjust_appendrel_attrs(root, whatever, appinfo) could >> just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not >> needing to allocate. To make this work, adjust_child_relids() and >> find_appinfos_by_relids() would need to be adjusted to use a similar >> argument-passing convention. I suspect this makes iterating over the >> AppendRelInfos mildly faster, too, apart from the memory savings. > > Still regarding 0002, looking at adjust_appendrel_attrs_multilevel, > could we have a common code path for the baserel and joinrel cases? > It seems like perhaps you could just loop over root->append_rel_list. > For each appinfo, if (bms_is_member(appinfo->child_relid, > child_rel->relids)) bms_add_member(parent_relids, > appinfo->parent_relid). > > This implementation would have some loss of efficiency in the > single-rel case because we'd scan all of the AppendRelInfos in the > list even if there's only one relid. But you could fix that by > writing it like this: > > foreach (lc, root->append_rel_list) > { > if (bms_is_member(appinfo->child_relid, child_rel->relids)) > { > bms_add_member(parent_relids, appinfo->parent_relid); > if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL) > break; /* only one relid to find, and we've found it */ > } > } > Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids)); > > That seems pretty slick. It is just as fast as the current > implementation for the single-rel case. It allocates no memory > (unlike what you've got now). And it handles the joinrel case using > essentially the same code as the simple rel case. I got rid of those differences completely by using trick similar to adjust_child_relids_multilevel(), which uses top_parent_relids instead of rel->reloptkind to decide whether we have reached the top parent or not. Those can trickle down from the topmost caller to any depth in recursion. This also avoids any call to find_*_rel(), which was the main reason why we had different code paths for base and join relation. > > In 0003, it seems that it would be more consistent with what you did > elsewhere if the last argument to allow_star_schema_join were named > inner_paramrels rather than innerparams. Other than that, I don't see > anything to complain about. I had used the same name as the local variable declared with the same purpose. But this change looks good. Done. > > In 0004: > > + Assert(!rel->part_rels[cnt_parts]); > + rel->part_rels[cnt_parts] = childrel; > > break here? Right, done. > > +static void > +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, > + Relation > relation, bool inhparent) > +{ > + /* No partitioning information for an unpartitioned relation. */ > + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || > + !inhparent || > > I still think the inhparent check should be moved to the caller. Done. > > In 0005: > > + * Returns a list of the RT indexes of the partitioned > child relations > + * with any of joining relations' rti as the root parent RT index. > > I found this wording confusing. Maybe: Build and return a list > containing the RTI of every partitioned relation which is a child of > some rel included in the join. This is better. Thanks. Done. > > + * Note: Only call this function on joins between partitioned tables. > > Or what, the boogeyman will come and get you? > > (In other words, I don't think that's a very informative comment.) I mimicked the prologue of earlier function. I guess, similar comment in the prologue of earlier function is written because if we use something other than a partitioned table there, the assertion at the end of that function would trip. Similarly, for this function, the assertion at the end of the function will trip, if we use it for something other than a join relation. PFA patches rebased on f90d23d0c51895e0d7db7910538e85d3d38691f0. > > I don't think 0011 is likely to be acceptable in current form. I > can't imagine that we just went to the trouble of getting rid of > AppendRelInfos for child partitioned rels only to turn around and put > them back again. If you just need the parent-child mappings, you can > get that from the PartitionedChildRelInfo list. I will reply to this separately. > > Unfortunately, I don't think we're likely to be able to get this whole > patch series into a committable form in the next few days, but I'd > like to keep reviewing it and working with you on it; there's always > next cycle. Thanks for all your efforts in reviewing the patches and for your excellent suggestions to improve the patches. As I have stated earlier, it will help if we can get 0001 committed, may be 0002. 0004 introduces the concept of partitioning scheme which seems to be vital for partititon-wise aggregation, partition pruning and may be sorting optimization discussed in [1]. If we are able to commit it in this commitfest, the patches for those optimizations can take advantage of partitioning scheme. I understand it's really close to the end of this commitfest and we may not be able to commit even those patches. https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg308742.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
> On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't think 0011 is likely to be acceptable in current form. I > can't imagine that we just went to the trouble of getting rid of > AppendRelInfos for child partitioned rels only to turn around and put > them back again. If you just need the parent-child mappings, you can > get that from the PartitionedChildRelInfo list. > Please refer to my earlier mails on this subject [1], [2]. For multi-level partition-wise join, we need RelOptInfo of a partitioned table to contain RelOptInfo of its immediate partitions. I have not seen any counter arguments not to create RelOptInfos for intermediate partitioned tables. We create child RelOptInfos only for entries in root->append_rel_list i.e. only for those relations which have an AppendRelInfo. Since we are not creating AppendRelInfos for partitioned partitions, we do not create RelOptInfos for those. So, to me it looks like we have to either have AppendRelInfos for partitioned partitions or create RelOptInfos by traversing some other list like PartitionedChildRelInfo list. It looks odd to walk root->append_rel_list as well as this new list for creating RelOptInfos. But for a moment, we assume that we have to walk this other list. But then that other list is also lossy. It stores only the topmost parent of any of the partitioned partitions and not the immediate parent as required to add RelOptInfos of immediate children to the RelOptInfo of a parent. Coming back to the point of PartitionedChildRelInfo list as a way to maintain parent - child relationship. All the code assumes that the parent-child relationship is stored in AppendRelInfo linked as root->append_rel_list and walks that list to find children of a given parent of parent/s of a given child. We will have to modify all those places to traverse two lists instead of one. Some of those even return AppendRelInfo structure, and now they some times return an AppendRelInfo and sometimes PartitionedChildRelInfo. That looks ugly. Consider a case where P has partitions p1 and p2, which in turn have partitions p11, p12 and p21, p22 resp. Another partitioned table Q has partitions q1, q2. q1 is further partitioned into q11, q12 but q2 is not partitioned. The partitioning scheme of P and Q matches. Also, partitioning scheme of p1 and q1 matches. So, a partition-wise join between P and Q would look like P J Q = append (p11 J q11, p12 J q12, p2 J q2), p2 J q2 being append(p21, p22) J q2. When constructing the restrictlist (and other clauses) for p2 J q2 we need to translate the restrictlist applicable for P J Q. This translation requires AppendRelInfo of p2 which does not exist today. We can not use PartitionedChildRelInfo because it doesn't have information about how to translate Vars of P to those of p2. I don't see a way to avoid creating AppendRelInfos for partitioned partitions. [1] https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT%3DuQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/03/30 18:35, Ashutosh Bapat wrote: >> On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think 0011 is likely to be acceptable in current form. I >> can't imagine that we just went to the trouble of getting rid of >> AppendRelInfos for child partitioned rels only to turn around and put >> them back again. If you just need the parent-child mappings, you can >> get that from the PartitionedChildRelInfo list. >> > > Please refer to my earlier mails on this subject [1], [2]. For > multi-level partition-wise join, we need RelOptInfo of a partitioned > table to contain RelOptInfo of its immediate partitions. I have not > seen any counter arguments not to create RelOptInfos for intermediate > partitioned tables. We create child RelOptInfos only for entries in > root->append_rel_list i.e. only for those relations which have an > AppendRelInfo. Since we are not creating AppendRelInfos for > partitioned partitions, we do not create RelOptInfos for those. So, to > me it looks like we have to either have AppendRelInfos for partitioned > partitions or create RelOptInfos by traversing some other list like > PartitionedChildRelInfo list. It looks odd to walk > root->append_rel_list as well as this new list for creating > RelOptInfos. But for a moment, we assume that we have to walk this > other list. But then that other list is also lossy. It stores only the > topmost parent of any of the partitioned partitions and not the > immediate parent as required to add RelOptInfos of immediate children > to the RelOptInfo of a parent. So, because we want to create an Append path for each partitioned table in a tree separately, we'll need RelOptInfo for each one, which in turn requires an AppendRelInfo. Note that we do that only for those partitioned child RTEs that have inh set to true, so that all the later stages will treat it as the parent rel to create an Append path for. There would still be partitioned child RTEs with inh set to false for which, just like before, no AppendRelInfos and RelOptInfos are created; they get added as the only member of partitioned_rels in the PartitionedChildRelInfo of each partitioned table. Finally, when the Append path for the root parent is created, its subpaths list will contain paths of leaf partitions of all levels and its partitioned_rels list should contain the RT indexes of partitioned tables of all levels. If we have the following partition tree: A / | \ B C D / \ E F The following RTEs will be created, in that order. RTEs with inh=true are shown with suffix _i. RTEs that get an AppendRelInfo (& a RelOptInfo) are shown with suffix _a. A_i_a A B_a C_i_a C E_a F_a D_a Thanks, Amit
On Thu, Mar 30, 2017 at 6:32 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > So, because we want to create an Append path for each partitioned table in > a tree separately, we'll need RelOptInfo for each one, which in turn > requires an AppendRelInfo. Hmm. It's no more desirable to have an Append inside of another Append with partitionwise join than it is in general. If we've got A partitioned into A1, A2, A3 and similarly B partitioned into B1, B2, and B3, and then A1 and B1 are further partitioned into A1a, A1b, B1a, B1b, then a partitionwise join between the tables ought to end up looking like this: Append -> Join (A1a, B1a) -> Join (A1b, B1b) -> Join (A2, B2) -> Join (A3, B3) So here we don't actually end up with an append-path for A1-B1 here anywhere. But you might need that in more complex cases, I guess, because suppose you now join this to C with partitions C1, C2, C3; but C1 is not sub-partitioned. Then you might end up with a plan like: Append -> Join -> Append -> Join (A1a, B1a) -> Join (A1b, B1b) -> Scan C1 -> Join ((A2, B2), C2) -> Join ((A3, B3), C3) So maybe you're right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Done. Ashutosh and I spent several hours discussing this patch set today. I'm starting to become concerned about the fact that 0004 makes the partition bounds part of the PartitionScheme, because that means you can't do a partition-wise join between two tables that have any difference at all in the partition bounds. It might be possible in the future to introduce a notion of a compatible partition scheme, so that you could say, OK, well, these two partition schemes are not quite the same, but they are compatible, and we'll make a new partition scheme for whatever results from reconciling them. What I think *may* be better is to consider the partition bound information as a property of the RelOptInfo rather than the PartitionScheme. For example, suppose we're joining A with partitions A1, A2, and A4 against B with partitions B1, B2, and B3 and C with partitions C1, C2, and C5. With the current approach, we end up with a PartitionScheme for each baserel and, not in this patch set but maybe eventually, a separate PartitionScheme for each of (A B), (A C), (B C), and (A B C). That seems pretty unsatisfying. If we consider the PartitionScheme to only include the question of whether we're doing a join on the partition keys, then if the join includes WHERE a.key = b.key AND b.key = c.key, we can say that they all have the same PartitionScheme up front. Then, each RelOptInfo can have a separate list of bounds, like this: A: 1, 2, 4 B: 1, 2, 3 C: 1, 2, 5 A B: 1, 2, 3, 4 A C: 1, 2, 4, 5 B C: 1, 2, 3, 5 A B C: 1, 2, 3, 4, 5 Or if it's an inner join, then instead of taking the union at each level, we can take the intersection, because any partition without a match on the other side of the join, then that partition can't produce any rows and doesn't need to be scanned. In that case, the RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up with a bound list of 1, 2. A related question (that I did not discuss with Ashutosh, but occurred to me later) is whether the PartitionScheme ought to worry about cross-type joins. For instance, if A is partitioned on an int4 column and B is partitioned on an int8 column, and they are joined on their respective partitioning columns, can't we still do a partition-wise join? We do need to require that the operator family of the operator actually used in the query, the operator family used to partition the inner table, and the operator family used to partition the other table all match; and the collation used for the comparison in the query, the collation used to partition the outer table, and the collation used to partition the inner table must all match. But it doesn't seem necessary to require an exact type or typmod match. In many ways this seems a whole lot like the what we test when building equivalence classes (cf. process_equivalence) although I'm not sure that we can leverage that in any useful way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think, we should rename partition scheme as PartitionKeyOptInfo and club partition bounds, nparts and part_rels as PartitionDescOptInfo. But I haven't done that in this patch yet.
On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Done.
Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds. It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.
What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme. For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5. With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C). That seems pretty unsatisfying. If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front. Then, each RelOptInfo can have a
separate list of bounds, like this:
A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5
Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join, then that partition can't produce
any rows and doesn't need to be scanned. In that case, the
RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
with a bound list of 1, 2.
I have separated partition bounds from partition scheme. The patch adds build_joinrel_partition_bounds() to calculate the bounds of the join relation and the pairs of matching partitions from the joining relation. For now the function just check whether both the relations have same bounds and returns the bounds of the first one. But in future, we will expand this function to merge partition bounds from the joining relation and return pairs of matching partitions which when joined form the partitions of the join according to the merged partition bounds.
Also, moved the code to collect partition RelOptInfos from set_append_rel_size() to build_simple_rel(), so everything related to partitioning gets set in that function for a base relation.
I think, we should rename partition scheme as PartitionKeyOptInfo and club partition bounds, nparts and part_rels as PartitionDescOptInfo. But I haven't done that in this patch yet.
A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins. For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join? We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match. But it doesn't seem
necessary to require an exact type or typmod match. In many ways this
seems a whole lot like the what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.
Yes, I agree. For an inner join, the partition key types need to "shrink" and for outer join they need to be "widened". I don't know if there is a way to know "wider" or "shorter" of two given types. We might have to implement a method to merge partition keys to produce partition key of the join, which may be different from either of the partition keys. So, after-all we may have to abandon the idea of canonical partition scheme. I haven't included this change in the attached set of patches.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Вложения
Somehow I sent the old patch set again. Here's the real v17. On Tue, Apr 4, 2017 at 7:52 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > > On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >> > Done. >> >> Ashutosh and I spent several hours discussing this patch set today. >> I'm starting to become concerned about the fact that 0004 makes the >> partition bounds part of the PartitionScheme, because that means you >> can't do a partition-wise join between two tables that have any >> difference at all in the partition bounds. It might be possible in >> the future to introduce a notion of a compatible partition scheme, so >> that you could say, OK, well, these two partition schemes are not >> quite the same, but they are compatible, and we'll make a new >> partition scheme for whatever results from reconciling them. >> >> What I think *may* be better is to consider the partition bound >> information as a property of the RelOptInfo rather than the >> PartitionScheme. For example, suppose we're joining A with partitions >> A1, A2, and A4 against B with partitions B1, B2, and B3 and C with >> partitions C1, C2, and C5. With the current approach, we end up with >> a PartitionScheme for each baserel and, not in this patch set but >> maybe eventually, a separate PartitionScheme for each of (A B), (A C), >> (B C), and (A B C). That seems pretty unsatisfying. If we consider >> the PartitionScheme to only include the question of whether we're >> doing a join on the partition keys, then if the join includes WHERE >> a.key = b.key AND b.key = c.key, we can say that they all have the >> same PartitionScheme up front. Then, each RelOptInfo can have a >> separate list of bounds, like this: >> >> A: 1, 2, 4 >> B: 1, 2, 3 >> C: 1, 2, 5 >> A B: 1, 2, 3, 4 >> A C: 1, 2, 4, 5 >> B C: 1, 2, 3, 5 >> A B C: 1, 2, 3, 4, 5 >> >> Or if it's an inner join, then instead of taking the union at each >> level, we can take the intersection, because any partition without a >> match on the other side of the join, then that partition can't produce >> any rows and doesn't need to be scanned. In that case, the >> RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up >> with a bound list of 1, 2. > > > I have separated partition bounds from partition scheme. The patch adds > build_joinrel_partition_bounds() to calculate the bounds of the join > relation and the pairs of matching partitions from the joining relation. For > now the function just check whether both the relations have same bounds and > returns the bounds of the first one. But in future, we will expand this > function to merge partition bounds from the joining relation and return > pairs of matching partitions which when joined form the partitions of the > join according to the merged partition bounds. > > Also, moved the code to collect partition RelOptInfos from > set_append_rel_size() to build_simple_rel(), so everything related to > partitioning gets set in that function for a base relation. > > I think, we should rename partition scheme as PartitionKeyOptInfo and club > partition bounds, nparts and part_rels as PartitionDescOptInfo. But I > haven't done that in this patch yet. > >> >> >> A related question (that I did not discuss with Ashutosh, but occurred >> to me later) is whether the PartitionScheme ought to worry about >> cross-type joins. For instance, if A is partitioned on an int4 column >> and B is partitioned on an int8 column, and they are joined on their >> respective partitioning columns, can't we still do a partition-wise >> join? We do need to require that the operator family of the operator >> actually used in the query, the operator family used to partition the >> inner table, and the operator family used to partition the other table >> all match; and the collation used for the comparison in the query, the >> collation used to partition the outer table, and the collation used to >> partition the inner table must all match. But it doesn't seem >> necessary to require an exact type or typmod match. In many ways this >> seems a whole lot like the what we test when building equivalence >> classes (cf. process_equivalence) although I'm not sure that we can >> leverage that in any useful way. >> > > Yes, I agree. For an inner join, the partition key types need to "shrink" > and for outer join they need to be "widened". I don't know if there is a way > to know "wider" or "shorter" of two given types. We might have to implement > a method to merge partition keys to produce partition key of the join, which > may be different from either of the partition keys. So, after-all we may > have to abandon the idea of canonical partition scheme. I haven't included > this change in the attached set of patches. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Yes, I agree. For an inner join, the partition key types need to "shrink" > and for outer join they need to be "widened". I don't know if there is a way > to know "wider" or "shorter" of two given types. We might have to implement > a method to merge partition keys to produce partition key of the join, which > may be different from either of the partition keys. So, after-all we may > have to abandon the idea of canonical partition scheme. I haven't included > this change in the attached set of patches. I think this is why you need to regard the partitioning scheme as something more like an equivalence class - possibly the partitioning scheme should actually contain (or be?) an equivalence class. Suppose this is the query: SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x; ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8 partitioning key. It's meaningless to ask whether the result of the join is partitioned by int4 or int8. It's partitioned by the equivalence class that contains both i4.x and i8.x. If the result of this join where joined to another table on either of those two columns, a second partition-wise join would be theoretically possible. If you insist on knowing the type of the partitioning scheme, rather than just the opfamily, you've boxed yourself into a corner from which there's no good escape. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Yes, I agree. For an inner join, the partition key types need to "shrink" >> and for outer join they need to be "widened". I don't know if there is a way >> to know "wider" or "shorter" of two given types. We might have to implement >> a method to merge partition keys to produce partition key of the join, which >> may be different from either of the partition keys. So, after-all we may >> have to abandon the idea of canonical partition scheme. I haven't included >> this change in the attached set of patches. > > I think this is why you need to regard the partitioning scheme as > something more like an equivalence class - possibly the partitioning > scheme should actually contain (or be?) an equivalence class. Suppose > this is the query: > > SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x; > > ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8 > partitioning key. It's meaningless to ask whether the result of the > join is partitioned by int4 or int8. It's partitioned by the > equivalence class that contains both i4.x and i8.x. If the result of > this join where joined to another table on either of those two > columns, a second partition-wise join would be theoretically possible. > If you insist on knowing the type of the partitioning scheme, rather > than just the opfamily, you've boxed yourself into a corner from which > there's no good escape. Only inner join conditions have equivalence classes associated with those. Outer join conditions create single element equivalence classes. So, we can not associate equivalence classes as they are with partition scheme. If we could do that, it makes life much easier since checking whether equi-join between all partition keys exist, is simply looking up equivalence classes that cover joining relations and find em_member corresponding to partition keys. It looks like we should only keep strategy, partnatts, partopfamily and parttypcoll in PartitionScheme. A partition-wise join between two relations would be possible if all those match. When matching partition bounds of joining relations, we should rely on partopfamily to give us comparison function based on the types of partition keys being joined. In that context it looks like all the partition bound comparision functions which accept partition key were not written keeping this use case in mind. They will need to be rewritten to accept strategy, partnatts, partopfamily and parttypcoll. There's a relevant comment in 0006, build_joinrel_partition_info() (probably that name needs to change, but I will do that once we have settled on design) + /* + * Construct partition keys for the join. + * + * An INNER join between two partitioned relations is partition by key + * expressions from both the relations. For tables A and B partitioned by a and b + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a + * and B.b. + * + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with + * B.b NULL. These rows may not fit the partitioning conditions imposed on + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. + * Strictly speaking, partition keys of an OUTER join should include + * partition key expressions from the OUTER side only. Consider a join like + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not + * include B.b as partition key expression for (AB), it prohibits us from + * using partition-wise join when joining (AB) with C as there is no + * equi-join between partition keys of joining relations. But two NULL + * values are never equal and no two rows from mis-matching partitions can + * join. Hence it's safe to include B.b as partition key expression for + * (AB), even though rows in (AB) are not strictly partitioned by B.b. + */ I think that also needs to be reviewed carefully. Partition-wise joins may be happy including partition keys from all sides, but partition-wise aggregates may not be, esp. when pushing complete aggregation down to partitions. In that case, rows with NULL partition key, which falls on nullable side of join, will be spread across multiple partitions. Proabably, we should separate nullable and non-nullable partition key expressions. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Only inner join conditions have equivalence classes associated with > those. Outer join conditions create single element equivalence > classes. So, we can not associate equivalence classes as they are with > partition scheme. If we could do that, it makes life much easier since > checking whether equi-join between all partition keys exist, is simply > looking up equivalence classes that cover joining relations and find > em_member corresponding to partition keys. OK. > It looks like we should only keep strategy, partnatts, partopfamily > and parttypcoll in PartitionScheme. A partition-wise join between two > relations would be possible if all those match. Yes, I think so. Conceivably you could even exclude partnatts and strategy, since there's nothing preventing a partitionwise join between a list-partitioned table and a range-partitioned table, or between a table range-partitioned on (a) and another range-partitioned on (a, b), but there is probably not much benefit in trying to cover such cases. I think it's reasonable to tell users that this is only going to work when the partitioning strategy is the same and the join conditions include all of the partitioning columns on both sides. > There's a relevant comment in 0006, build_joinrel_partition_info() > (probably that name needs to change, but I will do that once we have > settled on design) > + /* > + * Construct partition keys for the join. > + * > + * An INNER join between two partitioned relations is partition by key > + * expressions from both the relations. For tables A and B > partitioned by a and b > + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a > + * and B.b. > + * > + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with > + * B.b NULL. These rows may not fit the partitioning conditions imposed on > + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. > + * Strictly speaking, partition keys of an OUTER join should include > + * partition key expressions from the OUTER side only. Consider a join like > + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not > + * include B.b as partition key expression for (AB), it prohibits us from > + * using partition-wise join when joining (AB) with C as there is no > + * equi-join between partition keys of joining relations. But two NULL > + * values are never equal and no two rows from mis-matching partitions can > + * join. Hence it's safe to include B.b as partition key expression for > + * (AB), even though rows in (AB) are not strictly partitioned by B.b. > + */ > > I think that also needs to be reviewed carefully. The following passage from src/backend/optimizer/README seems highly relevant: === The planner's treatment of outer join reordering is based on the following identities: 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) leftjoin B on (Pab) where Pac is a predicate referencing A and C, etc (in this case, clearly Pac cannot reference B, or the transformation is nonsensical). 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) = (A leftjoin C on (Pac)) leftjoin B on (Pab) 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) = A leftjoin (B leftjoin C on (Pbc)) on (Pab) Identity 3 only holds if predicate Pbc must fail for all-null B rows (that is, Pbc is strict for at least one column of B). If Pbc is not strict, the first form might produce some rows with nonnull C columns where the second form would make those entries null. === In other words, I think your statement that null is never equal to null is a bit imprecise. Somebody could certainly create an operator that is named "=" which returns true in that case, and then they could say, hey, two nulls are equal (when you use that operator). The argument needs to be made in terms of the formal properties of the operator. The relevant logic is in have_partkey_equi_join: + /* Skip clauses which are not equality conditions. */ + if (rinfo->hashjoinoperator == InvalidOid && !rinfo->mergeopfamilies) + continue; Actually, I think the hashjoinoperator test is formally and practically unnecessary here; lower down there is a test to see whether the partitioning scheme's operator family is a member of rinfo->mergeopfamilies, which will certainly fail if we got through this test with rinfo->mergeopfamilies == NIL just on the strength of rinfo->hashjoinoperator != InvalidOid. So you can just bail out if rinfo->mergeopfamilies == NIL. But the underlying point here is that the only thing you really know about the function is that it's got to be a strategy-3 operator in some btree opclass; if that guarantees strictness, then so be it -- but I wasn't able to find anything in the code or documentation off-hand that supports that contention, so we might need to think a bit more about why (or if) this is guaranteed to be true. > Partition-wise joins > may be happy including partition keys from all sides, but > partition-wise aggregates may not be, esp. when pushing complete > aggregation down to partitions. In that case, rows with NULL partition > key, which falls on nullable side of join, will be spread across > multiple partitions. Proabably, we should separate nullable and > non-nullable partition key expressions. I don't think I understand quite what you're getting at here. Can you spell this out in more detail? To push an aggregate down to partitions, you need the grouping key to match the applicable partition key, and the partition key shouldn't allow nulls in more than one place. Now I think your point may be that outer join semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a good test case for partitionwise aggregate. I'd be inclined to think that we should just give up on partitionwise aggregate in such cases; it's not worth trying to optimize such a weird query, at least IMHO. (Does this sort of case ever happen with joins? I think not, as long as the join operator is strict.) I spent some time thinking about this patch set today and I don't see that there's much point in committing any more of this to v10. I think that 0001 and 0002 are probably committable or very close at this point. However, 0001 is adding more complexity than I think is warranted until we're actually ready to commit the feature that uses it, and 0002 is so small that committing isn't really going to smooth future development much. 0003-0009 are essentially all one big patch that will have to be committed together. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... But the underlying point here is that > the only thing you really know about the function is that it's got to > be a strategy-3 operator in some btree opclass; if that guarantees > strictness, then so be it -- but I wasn't able to find anything in the > code or documentation off-hand that supports that contention, so we > might need to think a bit more about why (or if) this is guaranteed to > be true. FWIW, I do not think that follows. If you want to check that the function is strict, check that explicitly. It's very likely that in practice, all such functions are indeed strict, but we don't have an assumption about that wired into the planner. regards, tom lane
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Only inner join conditions have equivalence classes associated with >> those. Outer join conditions create single element equivalence >> classes. So, we can not associate equivalence classes as they are with >> partition scheme. If we could do that, it makes life much easier since >> checking whether equi-join between all partition keys exist, is simply >> looking up equivalence classes that cover joining relations and find >> em_member corresponding to partition keys. > > OK. > >> It looks like we should only keep strategy, partnatts, partopfamily >> and parttypcoll in PartitionScheme. A partition-wise join between two >> relations would be possible if all those match. > > Yes, I think so. Conceivably you could even exclude partnatts and > strategy, since there's nothing preventing a partitionwise join > between a list-partitioned table and a range-partitioned table, or > between a table range-partitioned on (a) and another range-partitioned > on (a, b), but there is probably not much benefit in trying to cover > such cases. I think it's reasonable to tell users that this is only > going to work when the partitioning strategy is the same and the join > conditions include all of the partitioning columns on both sides. > >> There's a relevant comment in 0006, build_joinrel_partition_info() >> (probably that name needs to change, but I will do that once we have >> settled on design) >> + /* >> + * Construct partition keys for the join. >> + * >> + * An INNER join between two partitioned relations is partition by key >> + * expressions from both the relations. For tables A and B >> partitioned by a and b >> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a >> + * and B.b. >> + * >> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with >> + * B.b NULL. These rows may not fit the partitioning conditions imposed on >> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. >> + * Strictly speaking, partition keys of an OUTER join should include >> + * partition key expressions from the OUTER side only. Consider a join like >> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not >> + * include B.b as partition key expression for (AB), it prohibits us from >> + * using partition-wise join when joining (AB) with C as there is no >> + * equi-join between partition keys of joining relations. But two NULL >> + * values are never equal and no two rows from mis-matching partitions can >> + * join. Hence it's safe to include B.b as partition key expression for >> + * (AB), even though rows in (AB) are not strictly partitioned by B.b. >> + */ >> >> I think that also needs to be reviewed carefully. > > The following passage from src/backend/optimizer/README seems highly relevant: > > === > The planner's treatment of outer join reordering is based on the following > identities: > > 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) > = (A innerjoin C on (Pac)) leftjoin B on (Pab) > > where Pac is a predicate referencing A and C, etc (in this case, clearly > Pac cannot reference B, or the transformation is nonsensical). > > 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) > = (A leftjoin C on (Pac)) leftjoin B on (Pab) > > 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > > Identity 3 only holds if predicate Pbc must fail for all-null B rows > (that is, Pbc is strict for at least one column of B). If Pbc is not > strict, the first form might produce some rows with nonnull C columns > where the second form would make those entries null. > === > > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. The relevant logic is in have_partkey_equi_join: > > + /* Skip clauses which are not equality conditions. */ > + if (rinfo->hashjoinoperator == InvalidOid && > !rinfo->mergeopfamilies) > + continue; > > Actually, I think the hashjoinoperator test is formally and > practically unnecessary here; lower down there is a test to see > whether the partitioning scheme's operator family is a member of > rinfo->mergeopfamilies, which will certainly fail if we got through > this test with rinfo->mergeopfamilies == NIL just on the strength of > rinfo->hashjoinoperator != InvalidOid. So you can just bail out if > rinfo->mergeopfamilies == NIL. But the underlying point here is that > the only thing you really know about the function is that it's got to > be a strategy-3 operator in some btree opclass; if that guarantees > strictness, then so be it -- but I wasn't able to find anything in the > code or documentation off-hand that supports that contention, so we > might need to think a bit more about why (or if) this is guaranteed to > be true. I need more time to think about this. Will get back to this soon. > >> Partition-wise joins >> may be happy including partition keys from all sides, but >> partition-wise aggregates may not be, esp. when pushing complete >> aggregation down to partitions. In that case, rows with NULL partition >> key, which falls on nullable side of join, will be spread across >> multiple partitions. Proabably, we should separate nullable and >> non-nullable partition key expressions. > > I don't think I understand quite what you're getting at here. Can you > spell this out in more detail? To push an aggregate down to > partitions, you need the grouping key to match the applicable > partition key, and the partition key shouldn't allow nulls in more > than one place. Now I think your point may be that outer join > semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) > FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a > good test case for partitionwise aggregate. I'd be inclined to think > that we should just give up on partitionwise aggregate in such cases; > it's not worth trying to optimize such a weird query, at least IMHO. > (Does this sort of case ever happen with joins? I think not, as long > as the join operator is strict.) Yes, this is the case, I am thinking about. No, it doesn't happen with join. > > I spent some time thinking about this patch set today and I don't see > that there's much point in committing any more of this to v10. I > think that 0001 and 0002 are probably committable or very close at > this point. However, 0001 is adding more complexity than I think is > warranted until we're actually ready to commit the feature that uses > it, and 0002 is so small that committing isn't really going to smooth > future development much. 0003-0009 are essentially all one big patch > that will have to be committed together. Ok. Thanks. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Yes, I agree. For an inner join, the partition key types need to "shrink" >> and for outer join they need to be "widened". I don't know if there is a way >> to know "wider" or "shorter" of two given types. We might have to implement >> a method to merge partition keys to produce partition key of the join, which >> may be different from either of the partition keys. So, after-all we may >> have to abandon the idea of canonical partition scheme. I haven't included >> this change in the attached set of patches. > > I think this is why you need to regard the partitioning scheme as > something more like an equivalence class - possibly the partitioning > scheme should actually contain (or be?) an equivalence class. Suppose > this is the query: > > SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x; > > ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8 > partitioning key. It's meaningless to ask whether the result of the > join is partitioned by int4 or int8. It's partitioned by the > equivalence class that contains both i4.x and i8.x. If the result of > this join where joined to another table on either of those two > columns, a second partition-wise join would be theoretically possible. > If you insist on knowing the type of the partitioning scheme, rather > than just the opfamily, you've boxed yourself into a corner from which > there's no good escape. When we merge partition bounds from two relations with different partition key types, the merged partition bounds need to have some information abound the way those constants look like e.g. their length, structure etc. That's the reason we need to store partition key types of merged partitioning scheme. Consider a three way join (i4 JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare partition bounds of i4 and i8, we use operators for int4 and int8. The join i4 JOIN i8 will get partition bounds by merging those of i4 and i8. When we come to join with i2, we need to know which operators to use for comparing the partition bounds of the join with those of i2. So, if the partition key types of the joining relations differ (but they have matching partitioning schemes per strategy, natts and operator family) the partition bounds of the join are converted to the wider type among the partition key types of the joining tree. Actually, as I am explained earlier we could choose a wider outer type for an OUTER join and shorter type for inner join. This type is used as partition key type of the join. In the above case join between i4 and i8 have its partition bounds converted to i8 (or i4) and then when it is joined with i2 the partition bounds of the join are converted to i8 (or i2). -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > When we merge partition bounds from two relations with different > partition key types, the merged partition bounds need to have some > information abound the way those constants look like e.g. their > length, structure etc. That's the reason we need to store partition > key types of merged partitioning scheme. Consider a three way join (i4 > JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare > partition bounds of i4 and i8, we use operators for int4 and int8. The > join i4 JOIN i8 will get partition bounds by merging those of i4 and > i8. When we come to join with i2, we need to know which operators to > use for comparing the partition bounds of the join with those of i2. > > So, if the partition key types of the joining relations differ (but > they have matching partitioning schemes per strategy, natts and > operator family) the partition bounds of the join are converted to the > wider type among the partition key types of the joining tree. > Actually, as I am explained earlier we could choose a wider outer type > for an OUTER join and shorter type for inner join. This type is used > as partition key type of the join. In the above case join between i4 > and i8 have its partition bounds converted to i8 (or i4) and then when > it is joined with i2 the partition bounds of the join are converted to > i8 (or i2). I don't understand why you think that partition-wise join needs any new logic here; if this were a non-partitionwise join, we'd similarly need to use the correct operator, but the existing code handles that just fine. If the join is performed partition-wise, it should use the same operators that would have been used by a non-partitionwise join between the same tables. I think the choice of operator depends only on the column types, and that the "width" of those types has nothing to do with it. For example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the operator for an A/B join or a B/C join will be the one that appears in the query; parse analysis will have identified which specific operator is meant based on the types of the columns. If the optimizer subsequently decides to reorder the joins and perform the A/C join first, it will go hunt down the operator with the same strategy number in the same operator family that takes the type of A.x on one side and the type of C.x on the other side. No problem. A partition-wise join between A and C will use that same operator; again, no problem. Your example involves joining the output of a join between i4 and i8 against i2, so it seems there is some ambiguity about what the input type should be. But, again, the planner already copes with this problem. In fact, the join is performed either using i4.x or i8.x -- I don't know what happens, or whether it depends on other details of the query or the plan -- and the operator which can accept that value on one side and i2.x on the other side is the one that gets used. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I don't understand why you think that partition-wise join needs any > new logic here; if this were a non-partitionwise join, we'd similarly > need to use the correct operator, but the existing code handles that > just fine. If the join is performed partition-wise, it should use the > same operators that would have been used by a non-partitionwise join > between the same tables. More to the point, the appropriate operator was chosen by parse analysis. The planner has *zero* flexibility as to which operator is involved. BTW, I remain totally mystified as to what people think the semantics of partitioning ought to be. Child columns can have a different type from parent columns? Really? Why is this even under discussion? We don't allow that in old-school inheritance, and I cannot imagine a rational argument why partitioning should allow it. regards, tom lane
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Apr 20, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > BTW, I remain totally mystified as to what people think the semantics of > partitioning ought to be. Child columns can have a different type from > parent columns? Really? Why is this even under discussion? We don't > allow that in old-school inheritance, and I cannot imagine a rational > argument why partitioning should allow it. > No, we aren't doing that. We are discussing here how to represent partition bounds of top level join and all the intermediate joins between A, B and C which are partitioned tables with different partition key types. We are not discussing the column types of children, join or simple. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> When we merge partition bounds from two relations with different >> partition key types, the merged partition bounds need to have some >> information abound the way those constants look like e.g. their >> length, structure etc. That's the reason we need to store partition >> key types of merged partitioning scheme. Consider a three way join (i4 >> JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare >> partition bounds of i4 and i8, we use operators for int4 and int8. The >> join i4 JOIN i8 will get partition bounds by merging those of i4 and >> i8. When we come to join with i2, we need to know which operators to >> use for comparing the partition bounds of the join with those of i2. >> >> So, if the partition key types of the joining relations differ (but >> they have matching partitioning schemes per strategy, natts and >> operator family) the partition bounds of the join are converted to the >> wider type among the partition key types of the joining tree. >> Actually, as I am explained earlier we could choose a wider outer type >> for an OUTER join and shorter type for inner join. This type is used >> as partition key type of the join. In the above case join between i4 >> and i8 have its partition bounds converted to i8 (or i4) and then when >> it is joined with i2 the partition bounds of the join are converted to >> i8 (or i2). > > I don't understand why you think that partition-wise join needs any > new logic here; if this were a non-partitionwise join, we'd similarly > need to use the correct operator, but the existing code handles that > just fine. If the join is performed partition-wise, it should use the > same operators that would have been used by a non-partitionwise join > between the same tables. > > I think the choice of operator depends only on the column types, and > that the "width" of those types has nothing to do with it. For > example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the > operator for an A/B join or a B/C join will be the one that appears in > the query; parse analysis will have identified which specific operator > is meant based on the types of the columns. If the optimizer > subsequently decides to reorder the joins and perform the A/C join > first, it will go hunt down the operator with the same strategy number > in the same operator family that takes the type of A.x on one side and > the type of C.x on the other side. No problem. A partition-wise join > between A and C will use that same operator; again, no problem. > > Your example involves joining the output of a join between i4 and i8 > against i2, so it seems there is some ambiguity about what the input > type should be. But, again, the planner already copes with this > problem. In fact, the join is performed either using i4.x or i8.x -- > I don't know what happens, or whether it depends on other details of > the query or the plan -- and the operator which can accept that value > on one side and i2.x on the other side is the one that gets used. I think you are confusing join condition application and partition bounds of a join relation. What you have described above is how operators are chosen to apply join conditions - it picks up the correct operator from the operator family based on the column types being used in join condition. That it can do because the columns being joined are both present the relations being joined, irrespective of which pair of relations is being joined. In your example, A.x, B.x and C.x are all present on one of the sides of join irrespective of whether the join is executed as (AB)C, A(BC) or (AC)B. But the problem we are trying to solve here about partition bounds of the join relation: what should be the partition bounds of AB, BC or AC? When we compare partition bounds of and intermediate join with other intermediate join (e.g. AB with those of C) what operator should be used? You seem to be suggesting that we keep as many sets of partition bounds as there are base relations participating in the join and then use appropriate partition bounds based on the columns in the join conditions, so that we can use the same operator as used in the join condition. That doesn't seem to be a good option since the partition bounds will all have same values, only differing in their binary representation because of differences in data types. I am of the opinion that we save a single set of partition bounds. We have to then associate a data type with bounds to know binary representation of partition bound datums. That datatype would be one of the partition key types of joining relations. I may be wrong in using term "wider" since its associated with the length of binary reprentation. But we need some logic to coalesce the two data types based on the type of join and key type on the outer side. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Amit Langote
Дата:
On 2017/04/20 15:45, Ashutosh Bapat wrote: > On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't understand why you think that partition-wise join needs any >> new logic here; if this were a non-partitionwise join, we'd similarly >> need to use the correct operator, but the existing code handles that >> just fine. If the join is performed partition-wise, it should use the >> same operators that would have been used by a non-partitionwise join >> between the same tables. >> >> I think the choice of operator depends only on the column types, and >> that the "width" of those types has nothing to do with it. For >> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the >> operator for an A/B join or a B/C join will be the one that appears in >> the query; parse analysis will have identified which specific operator >> is meant based on the types of the columns. If the optimizer >> subsequently decides to reorder the joins and perform the A/C join >> first, it will go hunt down the operator with the same strategy number >> in the same operator family that takes the type of A.x on one side and >> the type of C.x on the other side. No problem. A partition-wise join >> between A and C will use that same operator; again, no problem. >> >> Your example involves joining the output of a join between i4 and i8 >> against i2, so it seems there is some ambiguity about what the input >> type should be. But, again, the planner already copes with this >> problem. In fact, the join is performed either using i4.x or i8.x -- >> I don't know what happens, or whether it depends on other details of >> the query or the plan -- and the operator which can accept that value >> on one side and i2.x on the other side is the one that gets used. > > I think you are confusing join condition application and partition > bounds of a join relation. What you have described above is how > operators are chosen to apply join conditions - it picks up the > correct operator from the operator family based on the column types > being used in join condition. That it can do because the columns being > joined are both present the relations being joined, irrespective of > which pair of relations is being joined. In your example, A.x, B.x and > C.x are all present on one of the sides of join irrespective of > whether the join is executed as (AB)C, A(BC) or (AC)B. > > But the problem we are trying to solve here about partition bounds of > the join relation: what should be the partition bounds of AB, BC or > AC? When we compare partition bounds of and intermediate join with > other intermediate join (e.g. AB with those of C) what operator should > be used? You seem to be suggesting that we keep as many sets of > partition bounds as there are base relations participating in the join > and then use appropriate partition bounds based on the columns in the > join conditions, so that we can use the same operator as used in the > join condition. That doesn't seem to be a good option since the > partition bounds will all have same values, only differing in their > binary representation because of differences in data types. I am of > the opinion that we save a single set of partition bounds. We have to > then associate a data type with bounds to know binary representation > of partition bound datums. That datatype would be one of the partition > key types of joining relations. I may be wrong in using term "wider" > since its associated with the length of binary reprentation. But we > need some logic to coalesce the two data types based on the type of > join and key type on the outer side. FWIW, I think that using any one of the partition bounds of the baserels being partitionwise-joined should suffice as the partition bound of any combination of joins involving two or more of those baserels, as long as the partitioning operator of each of the baserels is in the same operator family (I guess that *is* checked somewhere in the partitionwise join consideration flow). IOW, partopfamily[] of all of the baserels should match and then the join clause operators involved should belong to the same respective operator families. ISTM, the question here is about how to derive the partitioning properties of joinrels from those of the baserels involved. Even if the join conditions refer to columns of different types on two sides, as long as the partitioning and joining is known to occur using operators of compatible semantics, I don't understand what more needs to be considered or done. Although, I haven't studied things in enough detail to say anything confidently about whether join being INNER or OUTER has any bearing on the semantics of the partitioning of the joinrels in question. IIUC, using partitioning properties to apply partitionwise join technique at successive join levels will be affected by the OUTER considerations similar to how they affect what levels a give EquivalenceClass clause could be applied without causing any semantics violations. As already mentioned upthread, it would be a good idea to have some integration of the partitioning considerations with the equivalence class mechanism (how ForeignKeyOptInfo contains links to ECs comes to mind). By the way, looking at match_expr_to_partition_keys() in your latest patch, I wonder why not use an approach similar to calling is_indexable_operator() that is used in match_clause_to_indexcol()? Note that is_indexable_operator() simply checks if clause->opno is in the index key's operator family, as returned by op_in_opfamily(). Instead I see the following: /* * The clause allows partition-wise join if only it uses the same * operator family as that specifiedby the partition key. */ if (!list_member_oid(rinfo->mergeopfamilies, part_scheme->partopfamily[ipk1])) continue; But maybe I am missing something. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Apr 20, 2017 at 3:35 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/04/20 15:45, Ashutosh Bapat wrote: >> On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I don't understand why you think that partition-wise join needs any >>> new logic here; if this were a non-partitionwise join, we'd similarly >>> need to use the correct operator, but the existing code handles that >>> just fine. If the join is performed partition-wise, it should use the >>> same operators that would have been used by a non-partitionwise join >>> between the same tables. >>> >>> I think the choice of operator depends only on the column types, and >>> that the "width" of those types has nothing to do with it. For >>> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the >>> operator for an A/B join or a B/C join will be the one that appears in >>> the query; parse analysis will have identified which specific operator >>> is meant based on the types of the columns. If the optimizer >>> subsequently decides to reorder the joins and perform the A/C join >>> first, it will go hunt down the operator with the same strategy number >>> in the same operator family that takes the type of A.x on one side and >>> the type of C.x on the other side. No problem. A partition-wise join >>> between A and C will use that same operator; again, no problem. >>> >>> Your example involves joining the output of a join between i4 and i8 >>> against i2, so it seems there is some ambiguity about what the input >>> type should be. But, again, the planner already copes with this >>> problem. In fact, the join is performed either using i4.x or i8.x -- >>> I don't know what happens, or whether it depends on other details of >>> the query or the plan -- and the operator which can accept that value >>> on one side and i2.x on the other side is the one that gets used. >> >> I think you are confusing join condition application and partition >> bounds of a join relation. What you have described above is how >> operators are chosen to apply join conditions - it picks up the >> correct operator from the operator family based on the column types >> being used in join condition. That it can do because the columns being >> joined are both present the relations being joined, irrespective of >> which pair of relations is being joined. In your example, A.x, B.x and >> C.x are all present on one of the sides of join irrespective of >> whether the join is executed as (AB)C, A(BC) or (AC)B. >> >> But the problem we are trying to solve here about partition bounds of >> the join relation: what should be the partition bounds of AB, BC or >> AC? When we compare partition bounds of and intermediate join with >> other intermediate join (e.g. AB with those of C) what operator should >> be used? You seem to be suggesting that we keep as many sets of >> partition bounds as there are base relations participating in the join >> and then use appropriate partition bounds based on the columns in the >> join conditions, so that we can use the same operator as used in the >> join condition. That doesn't seem to be a good option since the >> partition bounds will all have same values, only differing in their >> binary representation because of differences in data types. I am of >> the opinion that we save a single set of partition bounds. We have to >> then associate a data type with bounds to know binary representation >> of partition bound datums. That datatype would be one of the partition >> key types of joining relations. I may be wrong in using term "wider" >> since its associated with the length of binary reprentation. But we >> need some logic to coalesce the two data types based on the type of >> join and key type on the outer side. > > FWIW, I think that using any one of the partition bounds of the baserels > being partitionwise-joined should suffice as the partition bound of any > combination of joins involving two or more of those baserels, as long as > the partitioning operator of each of the baserels is in the same operator > family (I guess that *is* checked somewhere in the partitionwise join > consideration flow). IOW, partopfamily[] of all of the baserels should > match and then the join clause operators involved should belong to the > same respective operator families. The partition bounds of different base rels may be different and we have to compare them. Even we say that we join two tables with same partition bounds using partitio-wise join, we need to make sure that those partition bounds are indeed same, thus requiring to compare. And to compare any datum we need to know its type. > > ISTM, the question here is about how to derive the partitioning properties > of joinrels from those of the baserels involved. Even if the join > conditions refer to columns of different types on two sides, as long as > the partitioning and joining is known to occur using operators of > compatible semantics, I don't understand what more needs to be considered > or done. Although, I haven't studied things in enough detail to say > anything confidently about whether join being INNER or OUTER has any > bearing on the semantics of the partitioning of the joinrels in question. > IIUC, using partitioning properties to apply partitionwise join technique > at successive join levels will be affected by the OUTER considerations > similar to how they affect what levels a give EquivalenceClass clause > could be applied without causing any semantics violations. As already > mentioned upthread, it would be a good idea to have some integration of > the partitioning considerations with the equivalence class mechanism (how > ForeignKeyOptInfo contains links to ECs comes to mind). This has been already discussed. I have showed earlier why equivalence classes are not useful in this case. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Thu, Apr 20, 2017 at 8:45 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I think you are confusing join condition application and partition > bounds of a join relation. You're right, I misunderstood what you were talking about. > But the problem we are trying to solve here about partition bounds of > the join relation: what should be the partition bounds of AB, BC or > AC? When we compare partition bounds of and intermediate join with > other intermediate join (e.g. AB with those of C) what operator should > be used? You seem to be suggesting that we keep as many sets of > partition bounds as there are base relations participating in the join > and then use appropriate partition bounds based on the columns in the > join conditions, so that we can use the same operator as used in the > join condition. That doesn't seem to be a good option since the > partition bounds will all have same values, only differing in their > binary representation because of differences in data types. Well, actually, I think it is a good option, as I wrote in http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com In that email, my principal concern was allowing partition-wise join to succeed even with slightly different sets of partition boundaries on the two sides of the join; in particular, if we've got A with A1 .. A10 and B with B1 .. B10 and the DBA adds A11, I don't want performance to tank until the DBA gets around to adding B11. Removing the partition bounds from the PartitionScheme and storing them per-RelOptInfo fixes that problem; the fact that it also solves this problem of what happens when we have different data types on the two sides looks to me like a second reason to go that way. And there's a third reason, too, which is that the opfamily mechanism doesn't currently provide any mechanism for reasoning about which data types are "wider" or "narrower" in the way that you want. In general, there's not even a reason why such a relationship has to exist; consider two data types t1 and t2 with opclasses t1_ops and t2_ops that are part of the same opfamily t_ops, and suppose that t1 can represent any positive integer and t2 can represent any even integer, or in general that each data type can represent some but not all of the values that can be represented by the other data type. In such a case, neither would be "wider" than the other in the sense that you need; you essentially want to find a data type within the opfamily to which all values of any of the types involved in the query can be cast without error, but there is nothing today which requires such a data type to exist, and no way to identify which one it is. In practice, for all of the built-in opfamilies that have more than one opclass, such a data type always exists but is not always unique -- in particular, datetime_ops contains date_ops, timestamptz_ops, and timestamp_ops, and either of the latter two is a plausible choice for the "widest" data type of the three. But there's no way to figure that out from the opfamily or opclass information we have today. In theory, it would be possible to modify the opfamily machinery so that every opfamily designates an optional ordering of types from "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a guarantee that every value of type t1 can be cast without error to a value of type t2. But I think that's a bad plan. It means that every opfamily created by either the core code or some extension now needs to worry about annotating the opclass with this new information, and we have to add to core the SQL syntax and supporting code to make that work. If it were implementing a valuable feature which could not practically be implemented without extending the opfamily machinery, then I guess that's what we'd have to suck it up and incur that complexity, but in this case it does not appear necessary. Storing the partition bounds per-RelOptInfo makes this problem -- and a few others -- go away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> You seem to be suggesting that we keep as many sets of >> partition bounds as there are base relations participating in the join >> and then use appropriate partition bounds based on the columns in the >> join conditions, so that we can use the same operator as used in the >> join condition. That doesn't seem to be a good option since the >> partition bounds will all have same values, only differing in their >> binary representation because of differences in data types. > > Well, actually, I think it is a good option, as I wrote in > http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com I guess, you are now confusing between partition bounds for a join relation and partition bounds of base relation. Above paragraph is about partition bounds of a join relation. I have already agreed that we need to store partition bounds in RelOptInfo. For base relation this is trivial; its RelOptInfo has to store partition bounds as stored in the partition descriptor of corresponding partitioned table. I am talking about partition bounds of a join relation. See below for more explanation. > > In that email, my principal concern was allowing partition-wise join > to succeed even with slightly different sets of partition boundaries > on the two sides of the join; in particular, if we've got A with A1 .. > A10 and B with B1 .. B10 and the DBA adds A11, I don't want > performance to tank until the DBA gets around to adding B11. Removing > the partition bounds from the PartitionScheme and storing them > per-RelOptInfo fixes that problem; We have an agreement on this. > the fact that it also solves this > problem of what happens when we have different data types on the two > sides looks to me like a second reason to go that way. I don't see how is that fixed. For a join relation we need to come up with one set of partition bounds by merging partition bounds of the joining relation and in order to understand how to interpret the datums in the partition bounds, we need to associate data types. The question is which data type we should use if the relations being joined have different data types associated with their respective partition bounds. Or are you saying that we don't need to associate data type with merged partition bounds? In that case, I don't know how do we compare the partition bounds of two relations? In your example, A has partition key of type int8, has bound datums X1.. X10. B has partition key of type int4 and has bounds datums X1 .. X11. C has partition key type int2 and bound datums X1 .. X12. The binary representation of X's is going to differ between A, B and C although each Xk for A, B and C is equal, wherever exists. Join between A and B will have merged bound datums X1 .. X10 (and X11 depending upon the join type). In order to match bounds of AB with C, we need to know the data type of bounds of AB, so that we can choose appropriate equality operator. The question is what should we choose as data type of partition bounds of AB, int8 or int4. This is different from applying join conditions between AB and C, which can choose the right opfamily operator based on the join conditions. > > And there's a third reason, too, which is that the opfamily mechanism > doesn't currently provide any mechanism for reasoning about which data > types are "wider" or "narrower" in the way that you want. In general, > there's not even a reason why such a relationship has to exist; > consider two data types t1 and t2 with opclasses t1_ops and t2_ops > that are part of the same opfamily t_ops, and suppose that t1 can > represent any positive integer and t2 can represent any even integer, > or in general that each data type can represent some but not all of > the values that can be represented by the other data type. In such a > case, neither would be "wider" than the other in the sense that you > need; you essentially want to find a data type within the opfamily to > which all values of any of the types involved in the query can be cast > without error, but there is nothing today which requires such a data > type to exist, and no way to identify which one it is. In practice, > for all of the built-in opfamilies that have more than one opclass, > such a data type always exists but is not always unique -- in > particular, datetime_ops contains date_ops, timestamptz_ops, and > timestamp_ops, and either of the latter two is a plausible choice for > the "widest" data type of the three. But there's no way to figure > that out from the opfamily or opclass information we have today. > > In theory, it would be possible to modify the opfamily machinery so > that every opfamily designates an optional ordering of types from > "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a > guarantee that every value of type t1 can be cast without error to a > value of type t2. But I think that's a bad plan. It means that every > opfamily created by either the core code or some extension now needs > to worry about annotating the opclass with this new information, and > we have to add to core the SQL syntax and supporting code to make that > work. If it were implementing a valuable feature which could not > practically be implemented without extending the opfamily machinery, > then I guess that's what we'd have to suck it up and incur that > complexity, but in this case it does not appear necessary. Storing > the partition bounds per-RelOptInfo makes this problem -- and a few > others -- go away. This seems to suggest that we can not come up with merged bounds for join if the partition key types of joining relations differ. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Here's an updated patch set 0001-Refactor-adjust_appendrel_attrs-adjust_appendrel_att.patch 0002-Refactor-calc_nestloop_required_outer-and-allow_star.patch These are same as earlier patch set. 0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch 0004-Modify-bound-comparision-functions-to-accept-members.patch These are new patches to refactor partition bound comparison functions without passing partition key directly. Changes in the first patch are being used in this set but the second patch will be useful for more generic bound matching. 0005-Multi-level-partitioned-table-expansion.patch This is same as old set with minor changes. I have moved it ahead of the other patches as we discussed offline. 0006-Canonical-partition-scheme.patch Partition bounds are no more part of partition scheme. They appear in RelOptInfo. We are discussing the data type handling of partition bounds for join relation. So, I still have partopcintype in partition scheme. There is one change though. From ComputePartitionAttrs()->GetDefaultOpClass()/ResolveOpClass(), I gather that partopcintype is the type used for comparison of partition bounds instead of parttypid. When they are different they are binary compatible. So, I have saved partopcintype in PartitionScheme instead of parttypid and parttypmod. 0007-Canonical-partitioning-scheme-for-multi-level-partit.patch What was earlier mult-level partition-wise join support patch is now broken into set of patches and goes with corresponding patch for single-level partition-wise join patch. The idea is if we agree on changes for multi-level partitioning support and want to commit it before partition-wise join support, we can squash those pairs into one. This also associates the multi-level support changes with corresponding changes for single-level support. That might be easier to review. 0008-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch No changes in this patch. 0009-Partition-wise-join-implementation.patch The patch adds build_joinrel_partition_bounds() to match the partition bounds of the relations being joined. This function is called from try_partition_wise_join(). This function is also responsible for creating the pairs of matching partitions. When we come to support partition-wise joins for unequal number of partitions, this function would change without changing rest of the code. 0010-Multi-level-partition-wise-join-implementation.patch multi-level support 0011-Adjust-join-related-to-code-to-accept-child-relation.patch No changes to this patch. 0012-Fix-ConvertRowtypeExpr-refs-in-join-targetlist-and-q.patch Fixes a crash with mult-level partitioning reported by Rajkumar. Fixes set_plan_refs code for nested ConvertRowtypeExprs corresponding to multiple levels of partitions. 0013-Parameterized-path-fixes.patch No changes to this patch. 0014-Reparameterize-path-across-multiple-levels-of-partit.patch Multi-level support changes for 0013 0015-Partition-wise-join-tests.patch 0016-Multi-level-partition-wise-join-tests.patch Added the testcases reported by Rajkumar. On Fri, Apr 21, 2017 at 12:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> You seem to be suggesting that we keep as many sets of >>> partition bounds as there are base relations participating in the join >>> and then use appropriate partition bounds based on the columns in the >>> join conditions, so that we can use the same operator as used in the >>> join condition. That doesn't seem to be a good option since the >>> partition bounds will all have same values, only differing in their >>> binary representation because of differences in data types. >> >> Well, actually, I think it is a good option, as I wrote in >> http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com > > I guess, you are now confusing between partition bounds for a join > relation and partition bounds of base relation. Above paragraph is > about partition bounds of a join relation. I have already agreed that > we need to store partition bounds in RelOptInfo. For base relation > this is trivial; its RelOptInfo has to store partition bounds as > stored in the partition descriptor of corresponding partitioned table. > I am talking about partition bounds of a join relation. See below for > more explanation. > >> >> In that email, my principal concern was allowing partition-wise join >> to succeed even with slightly different sets of partition boundaries >> on the two sides of the join; in particular, if we've got A with A1 .. >> A10 and B with B1 .. B10 and the DBA adds A11, I don't want >> performance to tank until the DBA gets around to adding B11. Removing >> the partition bounds from the PartitionScheme and storing them >> per-RelOptInfo fixes that problem; > > We have an agreement on this. > >> the fact that it also solves this >> problem of what happens when we have different data types on the two >> sides looks to me like a second reason to go that way. > > I don't see how is that fixed. For a join relation we need to come up > with one set of partition bounds by merging partition bounds of the > joining relation and in order to understand how to interpret the > datums in the partition bounds, we need to associate data types. The > question is which data type we should use if the relations being > joined have different data types associated with their respective > partition bounds. > > Or are you saying that we don't need to associate data type with > merged partition bounds? In that case, I don't know how do we compare > the partition bounds of two relations? > > In your example, A has partition key of type int8, has bound datums > X1.. X10. B has partition key of type int4 and has bounds datums X1 .. > X11. C has partition key type int2 and bound datums X1 .. X12. The > binary representation of X's is going to differ between A, B and C > although each Xk for A, B and C is equal, wherever exists. Join > between A and B will have merged bound datums X1 .. X10 (and X11 > depending upon the join type). In order to match bounds of AB with C, > we need to know the data type of bounds of AB, so that we can choose > appropriate equality operator. The question is what should we choose > as data type of partition bounds of AB, int8 or int4. This is > different from applying join conditions between AB and C, which can > choose the right opfamily operator based on the join conditions. > >> >> And there's a third reason, too, which is that the opfamily mechanism >> doesn't currently provide any mechanism for reasoning about which data >> types are "wider" or "narrower" in the way that you want. In general, >> there's not even a reason why such a relationship has to exist; >> consider two data types t1 and t2 with opclasses t1_ops and t2_ops >> that are part of the same opfamily t_ops, and suppose that t1 can >> represent any positive integer and t2 can represent any even integer, >> or in general that each data type can represent some but not all of >> the values that can be represented by the other data type. In such a >> case, neither would be "wider" than the other in the sense that you >> need; you essentially want to find a data type within the opfamily to >> which all values of any of the types involved in the query can be cast >> without error, but there is nothing today which requires such a data >> type to exist, and no way to identify which one it is. In practice, >> for all of the built-in opfamilies that have more than one opclass, >> such a data type always exists but is not always unique -- in >> particular, datetime_ops contains date_ops, timestamptz_ops, and >> timestamp_ops, and either of the latter two is a plausible choice for >> the "widest" data type of the three. But there's no way to figure >> that out from the opfamily or opclass information we have today. >> >> In theory, it would be possible to modify the opfamily machinery so >> that every opfamily designates an optional ordering of types from >> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a >> guarantee that every value of type t1 can be cast without error to a >> value of type t2. But I think that's a bad plan. It means that every >> opfamily created by either the core code or some extension now needs >> to worry about annotating the opclass with this new information, and >> we have to add to core the SQL syntax and supporting code to make that >> work. If it were implementing a valuable feature which could not >> practically be implemented without extending the opfamily machinery, >> then I guess that's what we'd have to suck it up and incur that >> complexity, but in this case it does not appear necessary. Storing >> the partition bounds per-RelOptInfo makes this problem -- and a few >> others -- go away. > > This seems to suggest that we can not come up with merged bounds for > join if the partition key types of joining relations differ. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I don't see how is that fixed. For a join relation we need to come up > with one set of partition bounds by merging partition bounds of the > joining relation and in order to understand how to interpret the > datums in the partition bounds, we need to associate data types. The > question is which data type we should use if the relations being > joined have different data types associated with their respective > partition bounds. > > Or are you saying that we don't need to associate data type with > merged partition bounds? In that case, I don't know how do we compare > the partition bounds of two relations? Well, since there is no guarantee that a datatype exists which can be used to "merge" the partition bounds in the sense that you are describing, and even if there is one we have no opfamily infrastructure to find out which one it is, I think it would be smart to try to set things up so that we don't need to do that. I believe that's probably possible. > In your example, A has partition key of type int8, has bound datums > X1.. X10. B has partition key of type int4 and has bounds datums X1 .. > X11. C has partition key type int2 and bound datums X1 .. X12. OK, sure. > The binary representation of X's is going to differ between A, B and C > although each Xk for A, B and C is equal, wherever exists. Agreed. > Join > between A and B will have merged bound datums X1 .. X10 (and X11 > depending upon the join type). In order to match bounds of AB with C, > we need to know the data type of bounds of AB, so that we can choose > appropriate equality operator. The question is what should we choose > as data type of partition bounds of AB, int8 or int4. This is > different from applying join conditions between AB and C, which can > choose the right opfamily operator based on the join conditions. Well, the join is actually being performed either on A.keycol = C.keycol or on B.keycol = C.keycol, right? It has to be one or the other; there's no "merged" join column in any relation's targetlist, but only columns derived from the various baserels. So let's use that set of bounds for the matching. It makes sense to use the set of bounds for the matching that corresponds to the column actually being joined, I think. It's late here and I'm tired, but it seems like it should be possible to relate the child joinrels of the AB join back to the child joinrels of either A or B. (AB)1 .. (AB)10 related back to A1 .. A10 and B1 .. B10. (AB)11 relates back to B11 but, of course not to A11, which doesn't exist. If the join is INNER, (AB)11 is a dummy rel anyway and actually we should probably see whether we can omit it altogether. If the join is an outer join of some kind, there's an interesting case where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not on the nullable side of the join; in that case, too, (AB)11 is dummy or nonexistent. Otherwise, assuming A is nullable, (AB)11 maps only to B11 and not to A11. But that's absolutely right: if the join to C uses A.keycol, either the join operator is strict and (AB)11 won't match anything anyway, or it's not and partition-wise join is illegal because A.keycol in (AB)11 can include not only values from X11 but also nulls. So, it seems to me that what you can do is loop over the childrels on the outer side of the join. For each one, you've got a join clause that relates the outer rel to the inner rel, and that join clause mentions some baserel which is contained in the joinrel. So drill down through the childrel to the corresponding partition of the baserel and get those bounds. Then if you do the same thing for the inner childrels, you've now got two lists of bounds, and the type on the left matches the outer side of the join and the type on the right matches the inner side of the join and the opfamily of the operator in the join clause gives you a comparison operator that relates those two types, and now you can match them up. (We should also keep in mind the case where there are multiple columns in the partition key.) > This seems to suggest that we can not come up with merged bounds for > join if the partition key types of joining relations differ. Yes, I think that would be difficult. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Sat, Apr 22, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I don't see how is that fixed. For a join relation we need to come up >> with one set of partition bounds by merging partition bounds of the >> joining relation and in order to understand how to interpret the >> datums in the partition bounds, we need to associate data types. The >> question is which data type we should use if the relations being >> joined have different data types associated with their respective >> partition bounds. >> >> Or are you saying that we don't need to associate data type with >> merged partition bounds? In that case, I don't know how do we compare >> the partition bounds of two relations? > > Well, since there is no guarantee that a datatype exists which can be > used to "merge" the partition bounds in the sense that you are > describing, and even if there is one we have no opfamily > infrastructure to find out which one it is, I think it would be smart > to try to set things up so that we don't need to do that. I believe > that's probably possible. > >> In your example, A has partition key of type int8, has bound datums >> X1.. X10. B has partition key of type int4 and has bounds datums X1 .. >> X11. C has partition key type int2 and bound datums X1 .. X12. > > OK, sure. > >> The binary representation of X's is going to differ between A, B and C >> although each Xk for A, B and C is equal, wherever exists. > > Agreed. > >> Join >> between A and B will have merged bound datums X1 .. X10 (and X11 >> depending upon the join type). In order to match bounds of AB with C, >> we need to know the data type of bounds of AB, so that we can choose >> appropriate equality operator. The question is what should we choose >> as data type of partition bounds of AB, int8 or int4. This is >> different from applying join conditions between AB and C, which can >> choose the right opfamily operator based on the join conditions. > > Well, the join is actually being performed either on A.keycol = > C.keycol or on B.keycol = C.keycol, right? It has to be one or the > other; there's no "merged" join column in any relation's targetlist, > but only columns derived from the various baserels. So let's use that > set of bounds for the matching. It makes sense to use the set of > bounds for the matching that corresponds to the column actually being > joined, I think. > > It's late here and I'm tired, but it seems like it should be possible > to relate the child joinrels of the AB join back to the child joinrels > of either A or B. (AB)1 .. (AB)10 related back to A1 .. A10 and B1 .. > B10. (AB)11 relates back to B11 but, of course not to A11, which > doesn't exist. If the join is INNER, (AB)11 is a dummy rel anyway and > actually we should probably see whether we can omit it altogether. If > the join is an outer join of some kind, there's an interesting case > where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not > on the nullable side of the join; in that case, too, (AB)11 is dummy > or nonexistent. Otherwise, assuming A is nullable, (AB)11 maps only > to B11 and not to A11. But that's absolutely right: if the join to C > uses A.keycol, either the join operator is strict and (AB)11 won't > match anything anyway, or it's not and partition-wise join is illegal > because A.keycol in (AB)11 can include not only values from X11 but > also nulls. > > So, it seems to me that what you can do is loop over the childrels on > the outer side of the join. For each one, you've got a join clause > that relates the outer rel to the inner rel, and that join clause > mentions some baserel which is contained in the joinrel. So drill > down through the childrel to the corresponding partition of the > baserel and get those bounds. Then if you do the same thing for the > inner childrels, you've now got two lists of bounds, and the type on > the left matches the outer side of the join and the type on the right > matches the inner side of the join and the opfamily of the operator in > the join clause gives you a comparison operator that relates those two > types, and now you can match them up. This assumes that datums in partition bounds have one to one mapping with the partitions, which isn't true for list partitions. For list partitions we have multiple datums corresponding to the items listed associated with a given partition. So, simply looping over the partitions of outer relations doesn't work; in fact there are two outer relations for a full outer join, so we have to loop over both of them together in a merge join fashion. Consider A join B where A has partitions A1 (a, b, c), A2(e, f), A3(g, h) and B has partitions B1 (a, b), B2 (c, d, e), B3(f, g, h). If we just look at the partitions, we won't recognize that list item c is repeated in A1B1 and A2B2. That can be recognized only when we loop over the datums of A and B trying to match the partitions. We will see that for a, b A1 and B1 match but for c A1 and B1 do not match, instead A1 and B2 match. In one to one partition matching we will bail out here. I think, we have to find the base relations whose partition bounds should be used for comparison looking at the equi-join conditions and then compare those partition bounds to come up with the partition bounds of join relation. That won't work straight forward either when their are partitions missing on either sides of the join, I guess. Needs a careful thought. > > (We should also keep in mind the case where there are multiple columns > in the partition key.) Yes. This is tricky. Consider A partitioned by (a1, a2) B partitioned by (b1, b2) and C partitioned by (c1, c2). If the query is A join B on (A.a1 = B.a1 and A.a2 = B.b2) join C on (C.c1 = A.a1 and C.c2 = B.b2), we need to fetch partition bound values for a1 from A's partition bounds and those for b1 from B's partition bounds. Create combined partition bounds from those and then compare the combined bounds with those of C. After saying all that, I think we have a precedence of merged join columns with merged data types. Consider create table t1(a int2, b int);create table t2 (a int4, b int); explain verbose select * from t1 join t2 using(a); QUERY PLAN -------------------------------------------------------------------------Merge Join (cost=327.25..745.35 rows=27120 width=12) Output: t2.a, t1.b, t2.b Merge Cond: (t2.a = t1.a) -> Sort (cost=158.51..164.16 rows=2260 width=8) Output:t2.a, t2.b Sort Key: t2.a -> Seq Scan on public.t2 (cost=0.00..32.60 rows=2260 width=8) Output: t2.a, t2.b -> Sort (cost=168.75..174.75 rows=2400 width=6) Output: t1.b, t1.a Sort Key: t1.a -> Seq Scan on public.t1 (cost=0.00..34.00 rows=2400 width=6) Output: t1.b, t1.a (13 rows) When using clause is used the columns specified by using clause from the joining relations are merged into a single column. Here it has used a "wider" type column t2.a as the merged column for t1.a and t2.a. The logic is in buildMergedJoinVar(). Probably we want to build merged partition bounds for a join relation where partition keys of the joining relations are different using a single data type provided by the same logic as buildMergedJoinVar() and attach those to the join relation. [1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312629.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > This assumes that datums in partition bounds have one to one mapping > with the partitions, which isn't true for list partitions. For list > partitions we have multiple datums corresponding to the items listed > associated with a given partition. So, simply looping over the > partitions of outer relations doesn't work; in fact there are two > outer relations for a full outer join, so we have to loop over both of > them together in a merge join fashion. Maybe so, but my point is that it can be done with the original types, without converting anything to a different type. > When using clause is used the columns specified by using clause from > the joining relations are merged into a single column. Here it has > used a "wider" type column t2.a as the merged column for t1.a and > t2.a. The logic is in buildMergedJoinVar(). That relies on select_common_type(), which can error out if it can't find a common type. That's OK for the current uses of that function, because if it fails it means that the query is invalid. But it's not OK for what you want here, because it's not OK to error out due to inability to do a partition-wise join when a non-partition-wise join would have worked. Also, note that all select_common_type() is really doing is looking for the type within the type category that is marked typispreferred, or else checking which direction has an implicit cast. Neither of those things guarantee the property you want here, namely that the "common" type is in the same opfamily and can store every value of any of the input types without loss of precision. So I don't think you can rely on that. I'm going to say this one more time: I really, really, really think you need to avoid trying to convert the partition bounds to a common type. I said before that the infrastructure to do that is not present in our type system, and I'm pretty sure that statement is 100% correct. The fact that you can find other cases where we do something sorta like that but in a different case with different requirements doesn't make that false. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm going to say this one more time: I really, really, really think > you need to avoid trying to convert the partition bounds to a common > type. I said before that the infrastructure to do that is not present > in our type system, and I'm pretty sure that statement is 100% > correct. The fact that you can find other cases where we do something > sorta like that but in a different case with different requirements > doesn't make that false. It's not just a matter of lack of infrastructure: the very attempt is flawed, because in some cases there simply isn't a supertype that can hold all values of both types. An easy counterexample is float8 vs numeric: you can't convert float8 'Infinity' to numeric, but also there are values of numeric that can't be converted to float8 without overflow and/or loss of precision. The whole business of precision loss makes things very touchy for almost anything involving float and a non-float type, actually. What I'm going to ask one more time, though, is why we are even discussing this. Surely the partition bounds of a partitioned table must all be of the same type already. If there is a case where they are not, that is a bug we had better close off before v10 ships, not a feature that we need to write a lot of code to accommodate. regards, tom lane
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Apr 26, 2017 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > What I'm going to ask one more time, though, is why we are even discussing > this. Surely the partition bounds of a partitioned table must all be of > the same type already. If there is a case where they are not, that is > a bug we had better close off before v10 ships, not a feature that we > need to write a lot of code to accommodate. This question was answered before, by Ashutosh. http://postgr.es/m/CAFjFpRfaKSO4YZjVv7jkcMEMVgDcnqc4yhqVWhO5gczB5mW8eQ@mail.gmail.com Since you either didn't read his answer, or else didn't understand it and didn't bother asking for clarification, I'll try to be more blunt: of course all of the partition bounds of a single partitioned table have to be of the same type. We're not talking about that, because no kidding. This thread is about the possibility -- in a future release -- of implementing a join between two different partitioned tables by joining each pair of matching partitions. To do that, you need the tables to be compatibly partitioned, which requires that the partitioning columns use the same opfamily for each partitioning column but not necessarily that the types be the same. Making partition-wise join work in the case where the partitioning columns are of different types within an opfamily (like int4 vs. int8) is giving Ashutosh a bit of trouble. So this is about a cross-type join, not multiple types within a single partitioning hierarchy, as you might also gather from the subject line of this thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > So this is about a cross-type join, > not multiple types within a single partitioning hierarchy, as you > might also gather from the subject line of this thread. OK, but I still don't understand why any type conversion is needed in such a case. The existing join estimators don't try to do that, for the good and sufficient reasons you and I have already mentioned. They just apply the given cross-type join operator, and whatever cross-type selectivity estimator might be associated with it, and possibly other cross-type operators obtained from the same btree opfamily. The minute you get into trying to do any type conversion that is not mandated by the semantics of the query as written, you're going to have problems. regards, tom lane
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Apr 26, 2017 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So this is about a cross-type join, >> not multiple types within a single partitioning hierarchy, as you >> might also gather from the subject line of this thread. > > OK, but I still don't understand why any type conversion is needed > in such a case. The existing join estimators don't try to do that, > for the good and sufficient reasons you and I have already mentioned. > They just apply the given cross-type join operator, and whatever > cross-type selectivity estimator might be associated with it, and > possibly other cross-type operators obtained from the same btree > opfamily. > > The minute you get into trying to do any type conversion that is not > mandated by the semantics of the query as written, you're going to > have problems. There is no daylight whatsoever between us on this issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Apr 26, 2017 at 9:30 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> This assumes that datums in partition bounds have one to one mapping >> with the partitions, which isn't true for list partitions. For list >> partitions we have multiple datums corresponding to the items listed >> associated with a given partition. So, simply looping over the >> partitions of outer relations doesn't work; in fact there are two >> outer relations for a full outer join, so we have to loop over both of >> them together in a merge join fashion. > > Maybe so, but my point is that it can be done with the original types, > without converting anything to a different type. > Theoretically, I agree with this. But practically the implementation is lot more complex than what you have described in the earlier mails. I am afraid, that the patch with those changes will be a lot harder to review and commit. Later in this mail, I will try to explain some of the complexities. > > I'm going to say this one more time: I really, really, really think > you need to avoid trying to convert the partition bounds to a common > type. I said before that the infrastructure to do that is not present > in our type system, and I'm pretty sure that statement is 100% > correct. The fact that you can find other cases where we do something > sorta like that but in a different case with different requirements > doesn't make that false. Ok. Thanks for the explanation. The current design and implementation is for a restricted case where the partition bounds, partition key types and numbers match exactly. We want to commit an implementation which is reasonably extensible and doesn't require a lot of changes when we add more capabilities. Some of the extensions we discussed are as follows: 1. Partition-wise join when the existing partitions have matching bounds/lists but there can be extra partitions on either side of the join (between base relations or join relations) without a matching partition on the other side.\ 2. Partition-wise join when the partition bounds/lists do not match exactly but there is 1:1 or 1:0 or 0:1 mapping between the partitions which can contribute to the final result. E.g. A (0-100, 100 - 150, 200-300), B (0-50, 125-200, 300-400) 3. Partition-wise join when the partition key types do not match, but there's a single opfamily being used for partitioning. 4. Partition-wise join where 1:m or m:n mapping exists between partitions of the joining relations. First one is clearly something that we will need. We may add it in the first commit or next commit, but it will be needed pretty soon (v11?). To me 2nd is more important than the 3rd one. You may have a different view. We will expect 3rd optimization to work with all the prior optimizations. I am restricting myself from thinking about 4th one since that requires ganging together multiple RelOptInfos as a single RelOptInfo while joining, something we don't have infrastruture for. In case of first goal, supporting INNER joins and OUTER joins where the partitions are missing on the OUTER side but not inner side are easier. In those cases we just drop those partitions and corresponding bounds/lists from the join. For a FULL OUTER join, where both sides act as OUTER as well as INNER, we will need exact mapping between the partitions. For supporting OUTER joins where partitions on the INNER sides can be missing, we need to create some "dummy" relations representing the missing partitions so that we have OUTER rows with NULL inner side. This requires giving those dummy relations some relids and thus in case of base relations we may need to inject some dummy children. This may mean that we have to expand simple_rel_array as part of outer join, may or may not require adding new AppendRelInfos and so on. We are basically breaking an assumption that base relations can not be introduced while planning joins and that might require some rework in the existing infrastructure. There might be other ways to introduce dummy relations during join planning, but I haven't really thought through the problem. The third goal requires that the partition bounds be compared based on the partition keys present in the equi-join. While matching the partitions to be joined, the partition bounds corresponding the base relation whose partition keys appear in the equi-join are used for comparison using support function corresponding to the data types of partition keys. This requires us to associate the partitions of a join with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3) (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c assuming strict operators this can be executed as (AB)C or A(BC). AB will have partitions A1B1, A2B3 since there is no matching bound of A for B2 and A is outer relation. A1B1 is associated with bound X1 of A and C both. A2B3 is associated with bound of X3, which happens to be 2nd bound of A but third of B. When we join (AB) with C, we should notice that C1 goes with A1B1, C2 doesn't have any matching partition in AB and C3 goes with A2B3. If we compare bounds of B with C without any transformation we will know C2 matches B2, but we need to look at the children of AB to realize that B2 isn't present in any of the children and thus C2 should not be joined with any partition of AB. That usually looks a quadratic order operation on the number of partitions. The complexity can be reduced by maintaining as many partition bounds as the number of base relations participating in the join (an idea, I have floated earlier [1]) I don't elaborate it here to avoid digression. There's also the complexity of an N-way join with multiple partition keys and joins on partition keys from different relations as discussed in [1]. There may be more involved cases, that I haven't thought about. In short, implementation for 1st and 3rd optimization together looks fairly complex. Add to this the 2nd optimization and it becomes still more complex. In order to keep the patches manageable to implement review and commit, I am proposing following approach. 1. Implement first optimization on top of the current patches, which enforces that the partition key datatypes of the joining relations match. I am right now working on that patch. Do this for INNER join and OUTER join where partitions are missing on the OUTER side and not INNER side. As a side note, the existing partition bound comparison functions are tied to PartitionKey structure and require complete set of bounds from partitioned relation. Both of those are not applicable anymore, PartitionKey structure is not available for join and we have to compare individual bounds in case of join as against one probe with a complete set. This refactoring did eat some time. 2. Implement support for OUTER join where partitions can be missing from either side. 3. Implement support for partition-wise join with different partition key types. All those implementation will be different patches on top of v18 patches. Given the complexities involved in 2 and 3, I am not sure which order I should attack them. I don't have any estimates as to how much time each of those are going to require. May be a couple of months, but I am not sure. Obviously we have to wait till the first commitfest to commit the first version of the patch. So, based on the status at time, we can decide what goes in the first commit of this feature and adjust the patch set accordingly. Thoughts/comments? [1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > The third goal requires that the partition bounds be compared based on > the partition keys present in the equi-join. While matching the > partitions to be joined, the partition bounds corresponding the base > relation whose partition keys appear in the equi-join are used for > comparison using support function corresponding to the data types of > partition keys. This requires us to associate the partitions of a join > with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3) > (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds > (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c > assuming strict operators this can be executed as (AB)C or A(BC). AB > will have partitions A1B1, A2B3 since there is no matching bound of A > for B2 and A is outer relation. A1B1 is associated with bound X1 of A > and C both. A2B3 is associated with bound of X3, which happens to be > 2nd bound of A but third of B. When we join (AB) with C, we should > notice that C1 goes with A1B1, C2 doesn't have any matching partition > in AB and C3 goes with A2B3. If we compare bounds of B with C without > any transformation we will know C2 matches B2, but we need to look at > the children of AB to realize that B2 isn't present in any of the > children and thus C2 should not be joined with any partition of AB. Sure. > That usually looks a quadratic order operation on the number of > partitions. Now that I don't buy. Certainly, for range partitions, given a list of ranges of length M and another of length N, this can be done in O(M+N) time by merge-joining the lists of bounds. You pointed out upthread that for list partitions, things are a bit complicated because a single list partition can contain multiple values which are not necessarily contiguous, but I think that this can still be done in O(M+N) time. Sort all of the bounds, associating each one to a partition, and do a merge pass; whenever two bounds match, match the two corresponding partitions, but if one of those partitions is already matched to some other partition, then fail. For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN (2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4). The sorted bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6. The first value in both lists is a 1, so the corresponding partitions A1 and B1 are matched. The second value in both lists is a 2, so the corresponding partitions A2 and B2 are matched. Then we hit a 3 on the A side that has no match on the B side, but that's fine; we don't need to do anything. If the partition on the A side never got a mapping at any point during this merge pass, we'd eventually need to match it to a dummy partition (unless this is an inner join) but it's already mapped to B1 so no problem. Then we hit a 4 which says that A2 must match B2, which is consistent with what we already determine; no problem. Then we hit another value that only exists on the A side, which is fine just as before. Finally we hit a 6 on each side, which means that A2 must match B1, which is inconsistent with the existing mappings so we give up; no partitionwise join is possible here. > The complexity can be reduced by maintaining as many > partition bounds as the number of base relations participating in the > join (an idea, I have floated earlier [1]) I don't elaborate it here > to avoid digression. There's also the complexity of an N-way join with > multiple partition keys and joins on partition keys from different > relations as discussed in [1]. There may be more involved cases, that > I haven't thought about. In short, implementation for 1st and 3rd > optimization together looks fairly complex. I spent some time thinking about this today and I think I see how we could make it work: keep a single set of bounds for each join relation, but record the type of each bound. For example, suppose we are full joining relation i2, with an int2 partition column, which has partitions i2a from 0 to 10000 and i2b from 20000 to 30000, to relation i4, with an int4 partition column, which has partitions i4a from 5000 to 15000 and i4b from 25000 to 35000. We end up with a joinrel with 2 partitions. The first goes from 0 (stored as an int2) to 15000 (stored as an int4) and the second goes from 20000 (stored as an int2) to 35000 (stored as an int4). If we subsequently need to merge these bounds with yet another relation at a higher join level, we can use the opfamily (which is common) to dig out the right cross-type operator for each comparison we may need to perform, based on the precise types of the datums being compared. Of course, we might not find an appropriate cross-type operator in some cases, because an opfamily isn't required to provide that, so then we'd have to fail gracefully somehow, but that could be done. Having said that I think we could make this work, I'm starting to agree with you that it will add more complexity than it's worth. Needing to keep track of the type of every partition bound individually seems like a real nuisance, and it's not likely to win very often because, realistically, people should and generally will use the same type for the partitioning column in all of the relevant tables. So I'm going to revise my position and say it's fine to just give up on partitionwise join unless the types match exactly, but I still think we should try to cover the cases where the bounds don't match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow, optimizations 1 and 2 from your list of 4). I agree that ganging partitions (optimization 4 from your list) is not something to tackle right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Fri, Apr 28, 2017 at 1:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> The third goal requires that the partition bounds be compared based on >> the partition keys present in the equi-join. While matching the >> partitions to be joined, the partition bounds corresponding the base >> relation whose partition keys appear in the equi-join are used for >> comparison using support function corresponding to the data types of >> partition keys. This requires us to associate the partitions of a join >> with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3) >> (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds >> (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c >> assuming strict operators this can be executed as (AB)C or A(BC). AB >> will have partitions A1B1, A2B3 since there is no matching bound of A >> for B2 and A is outer relation. A1B1 is associated with bound X1 of A >> and C both. A2B3 is associated with bound of X3, which happens to be >> 2nd bound of A but third of B. When we join (AB) with C, we should >> notice that C1 goes with A1B1, C2 doesn't have any matching partition >> in AB and C3 goes with A2B3. If we compare bounds of B with C without >> any transformation we will know C2 matches B2, but we need to look at >> the children of AB to realize that B2 isn't present in any of the >> children and thus C2 should not be joined with any partition of AB. > > Sure. > >> That usually looks a quadratic order operation on the number of >> partitions. > > Now that I don't buy. Certainly, for range partitions, given a list > of ranges of length M and another of length N, this can be done in > O(M+N) time by merge-joining the lists of bounds. You pointed out > upthread that for list partitions, things are a bit complicated > because a single list partition can contain multiple values which are > not necessarily contiguous, but I think that this can still be done in > O(M+N) time. Sort all of the bounds, associating each one to a > partition, and do a merge pass; whenever two bounds match, match the > two corresponding partitions, but if one of those partitions is > already matched to some other partition, then fail. > > For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN > (2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4). The sorted > bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6. The first value in both > lists is a 1, so the corresponding partitions A1 and B1 are matched. > The second value in both lists is a 2, so the corresponding partitions > A2 and B2 are matched. Then we hit a 3 on the A side that has no > match on the B side, but that's fine; we don't need to do anything. > If the partition on the A side never got a mapping at any point during > this merge pass, we'd eventually need to match it to a dummy partition > (unless this is an inner join) but it's already mapped to B1 so no > problem. Then we hit a 4 which says that A2 must match B2, which is > consistent with what we already determine; no problem. Then we hit > another value that only exists on the A side, which is fine just as > before. Finally we hit a 6 on each side, which means that A2 must > match B1, which is inconsistent with the existing mappings so we give > up; no partitionwise join is possible here. For two-way join this works and is fairly straight-forward. I am assuming that A an B are base relations and not joins. But making it work for N-way join is the challenge. I don't see your example describing that. But I think, given your revised position below, we don't need to get this right at this point. Remember, that the paragraph was about 3rd goal, which according to your revised position is now deferred. > > Having said that I think we could make this work, I'm starting to > agree with you that it will add more complexity than it's worth. > Needing to keep track of the type of every partition bound > individually seems like a real nuisance, and it's not likely to win > very often because, realistically, people should and generally will > use the same type for the partitioning column in all of the relevant > tables. So I'm going to revise my position and say it's fine to just > give up on partitionwise join unless the types match exactly, but I > still think we should try to cover the cases where the bounds don't > match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow, > optimizations 1 and 2 from your list of 4). I agree that ganging > partitions (optimization 4 from your list) is not something to tackle > right now. Good. I will have a more enjoyable vacation now. Do you still want the patition key type to be out of partition scheme? Keeping it there means we match it only once and save it only at a single place. Otherwise, it will have to be stored in RelOptInfo of the partitioned table and match it for every pair of joining relations. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > For two-way join this works and is fairly straight-forward. I am > assuming that A an B are base relations and not joins. But making it > work for N-way join is the challenge. I don't think it's much different, is it? Anyway, I'm going to protest if your algorithm for merging bounds takes any more than linear time, regardless of what else we decide. >> Having said that I think we could make this work, I'm starting to >> agree with you that it will add more complexity than it's worth. >> Needing to keep track of the type of every partition bound >> individually seems like a real nuisance, and it's not likely to win >> very often because, realistically, people should and generally will >> use the same type for the partitioning column in all of the relevant >> tables. So I'm going to revise my position and say it's fine to just >> give up on partitionwise join unless the types match exactly, but I >> still think we should try to cover the cases where the bounds don't >> match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow, >> optimizations 1 and 2 from your list of 4). I agree that ganging >> partitions (optimization 4 from your list) is not something to tackle >> right now. > > Good. I will have a more enjoyable vacation now. Phew, what a relief. :-) > Do you still want the patition key type to be out of partition scheme? > Keeping it there means we match it only once and save it only at a > single place. Otherwise, it will have to be stored in RelOptInfo of > the partitioned table and match it for every pair of joining > relations. The only reason for removing things from the PartitionScheme was if they didn't need to be consistent across all tables. Deciding that the type is one of the things that has to match means deciding it should be in the PartitionScheme, not the RelOptInfo. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> There's a relevant comment in 0006, build_joinrel_partition_info() >> (probably that name needs to change, but I will do that once we have >> settled on design) >> + /* >> + * Construct partition keys for the join. >> + * >> + * An INNER join between two partitioned relations is partition by key >> + * expressions from both the relations. For tables A and B >> partitioned by a and b >> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a >> + * and B.b. >> + * >> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with >> + * B.b NULL. These rows may not fit the partitioning conditions imposed on >> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. >> + * Strictly speaking, partition keys of an OUTER join should include >> + * partition key expressions from the OUTER side only. Consider a join like >> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not >> + * include B.b as partition key expression for (AB), it prohibits us from >> + * using partition-wise join when joining (AB) with C as there is no >> + * equi-join between partition keys of joining relations. But two NULL >> + * values are never equal and no two rows from mis-matching partitions can >> + * join. Hence it's safe to include B.b as partition key expression for >> + * (AB), even though rows in (AB) are not strictly partitioned by B.b. >> + */ >> >> I think that also needs to be reviewed carefully. > > The following passage from src/backend/optimizer/README seems highly relevant: > > === > The planner's treatment of outer join reordering is based on the following > identities: > > 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) > = (A innerjoin C on (Pac)) leftjoin B on (Pab) > > where Pac is a predicate referencing A and C, etc (in this case, clearly > Pac cannot reference B, or the transformation is nonsensical). > > 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) > = (A leftjoin C on (Pac)) leftjoin B on (Pab) > > 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > > Identity 3 only holds if predicate Pbc must fail for all-null B rows > (that is, Pbc is strict for at least one column of B). If Pbc is not > strict, the first form might produce some rows with nonnull C columns > where the second form would make those entries null. > === > > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. [.. some portion clipped .. ] > The relevant logic is in have_partkey_equi_join: > > + /* Skip clauses which are not equality conditions. */ > + if (rinfo->hashjoinoperator == InvalidOid && > !rinfo->mergeopfamilies) > + continue; > > Actually, I think the hashjoinoperator test is formally and > practically unnecessary here; lower down there is a test to see > whether the partitioning scheme's operator family is a member of > rinfo->mergeopfamilies, which will certainly fail if we got through > this test with rinfo->mergeopfamilies == NIL just on the strength of > rinfo->hashjoinoperator != InvalidOid. So you can just bail out if > rinfo->mergeopfamilies == NIL. But the underlying point here is that > the only thing you really know about the function is that it's got to > be a strategy-3 operator in some btree opclass; if that guarantees > strictness, then so be it -- but I wasn't able to find anything in the > code or documentation off-hand that supports that contention, so we > might need to think a bit more about why (or if) this is guaranteed to > be true. > >> Partition-wise joins >> may be happy including partition keys from all sides, but >> partition-wise aggregates may not be, esp. when pushing complete >> aggregation down to partitions. In that case, rows with NULL partition >> key, which falls on nullable side of join, will be spread across >> multiple partitions. Proabably, we should separate nullable and >> non-nullable partition key expressions. > > I don't think I understand quite what you're getting at here. Can you > spell this out in more detail? To push an aggregate down to > partitions, you need the grouping key to match the applicable > partition key, and the partition key shouldn't allow nulls in more > than one place. Now I think your point may be that outer join > semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) > FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a > good test case for partitionwise aggregate. I'd be inclined to think > that we should just give up on partitionwise aggregate in such cases; > it's not worth trying to optimize such a weird query, at least IMHO. > (Does this sort of case ever happen with joins? I think not, as long > as the join operator is strict.) > I am revisiting NULL equality in the context of merging partition bounds. In [1] paragraphs following -- Do not write expression = NULL because NULL is not “equal to” NULL. (The null value represents an unknown value, and it is not known whether two unknown values are equal.) -- seem to indicate that an equality operator should never return true for two NULL values since it would never know whether two NULL (unknown) values are same or not. In a paragraph above, Robert stated that > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. But in case a user has written an = operator which returns true for two NULL values, per description in [1], that comparison operator is flawed and using that operator is going to result in SQL-standard-incompliant behaviour. I have tried to preserve all the relevant portions of discussion in this mail. Am I missing something? [1] https://www.postgresql.org/docs/devel/static/functions-comparison.html -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Thu, May 18, 2017 at 4:38 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > But in case a user has written an = operator which returns true for > two NULL values, per description in [1], that comparison operator is > flawed and > using that operator is going to result in SQL-standard-incompliant > behaviour. I have tried to preserve all the relevant portions of > discussion in this mail. Am I missing something? Yes. You're confusing friendly advice about how to write good SQL with internals documentation about how the system actually works. The documentation we have about how operator classes and index methods and so forth actually work under the hood is in https://www.postgresql.org/docs/devel/static/xindex.html -- as a developer, that's what you should be looking at. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Sat, Apr 29, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> For two-way join this works and is fairly straight-forward. I am >> assuming that A an B are base relations and not joins. But making it >> work for N-way join is the challenge. > > I don't think it's much different, is it? Anyway, I'm going to > protest if your algorithm for merging bounds takes any more than > linear time, regardless of what else we decide. > >>> Having said that I think we could make this work, I'm starting to >>> agree with you that it will add more complexity than it's worth. >>> Needing to keep track of the type of every partition bound >>> individually seems like a real nuisance, and it's not likely to win >>> very often because, realistically, people should and generally will >>> use the same type for the partitioning column in all of the relevant >>> tables. So I'm going to revise my position and say it's fine to just >>> give up on partitionwise join unless the types match exactly, but I >>> still think we should try to cover the cases where the bounds don't >>> match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow, >>> optimizations 1 and 2 from your list of 4). I agree that ganging >>> partitions (optimization 4 from your list) is not something to tackle >>> right now. >> >> Good. I will have a more enjoyable vacation now. > > Phew, what a relief. :-) > >> Do you still want the patition key type to be out of partition scheme? >> Keeping it there means we match it only once and save it only at a >> single place. Otherwise, it will have to be stored in RelOptInfo of >> the partitioned table and match it for every pair of joining >> relations. > > The only reason for removing things from the PartitionScheme was if > they didn't need to be consistent across all tables. Deciding that > the type is one of the things that has to match means deciding it > should be in the PartitionScheme, not the RelOptInfo. > Here's set of patches rebased on latest head. I spent some time trying to implement partition-wise join when partition bounds do not match exactly but there's 1:1, 1:0 or 0:1 mapping between partitions. A WIP patch 0017 is included in the set for the same. The patch is not complete, it doesn't support range partitions and needs some bugs to be fixed for list partitions. Also because of the way it crafts partition bounds for a join, it leaks memory consumed by partition bounds for every pair of joining relations. I will work on fixing those issues. That patch is pretty large now. So, I think we will have to commit it separately on top of basic partition-wise join implementation. But you will see that it has minimal changes to the basic partition-wise join code. I rewrote code handling partition keys on the nullable side of the join. Now we store partition keys from nullable and non-nullable relations separately. The partition keys from nullable relations are matched only when the equality operator is strict. This is explained in details the comments in match_expr_to_partition_keys() and build_joinrel_partition_info(). Also please note that since last patch set I have paired the multi-level partition-wise join support patches with single-level partition-wise join patches providing corresponding functionality. [1] https://www.postgresql.org/message-id/CAFjFpRd9ebX225KhuvYXQRBuk9NrVJfPzHqGPGqpze%2BqvH0xmw%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Here's set of patches rebased on latest head.
In an attempt to test this set of patches, I found that not all of the patches could be applied on latest head-- commit 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
Might be in need of rebasing.
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > > > On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> >> Here's set of patches rebased on latest head. > > > In an attempt to test this set of patches, I found that not all of the > patches could be applied on latest head-- commit > 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7 > Might be in need of rebasing. Thanks Rafia for your interest. I have started rebasing the patches on the latest head. I am expecting it to take some time. Will update the thread with the patches once I am done rebasing them. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih > <rafia.sabih@enterprisedb.com> wrote: >> >> >> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> >>> >>> Here's set of patches rebased on latest head. >> >> >> In an attempt to test this set of patches, I found that not all of the >> patches could be applied on latest head-- commit >> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7 >> Might be in need of rebasing. > > Thanks Rafia for your interest. I have started rebasing the patches on > the latest head. I am expecting it to take some time. Will update the > thread with the patches once I am done rebasing them. > Here are patches rebased. As mentioned in my previous mail [1], the last two patches are not complete but are included, so that the reviewer can see the changes we will have to make when we go towards more general partition-wise join. Please use patches upto 0015, which implement 1:1 partition mapping for benchmarking and testing. [1] https://www.postgresql.org/message-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Jul 10, 2017 at 3:57 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih >> <rafia.sabih@enterprisedb.com> wrote: >>> >>> >>> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> >>>> >>>> Here's set of patches rebased on latest head. >>> >>> >>> In an attempt to test this set of patches, I found that not all of the >>> patches could be applied on latest head-- commit >>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7 >>> Might be in need of rebasing. >> >> Thanks Rafia for your interest. I have started rebasing the patches on >> the latest head. I am expecting it to take some time. Will update the >> thread with the patches once I am done rebasing them. >> > > Here are patches rebased. > > As mentioned in my previous mail [1], the last two patches are not > complete but are included, so that the reviewer can see the changes we > will have to make when we go towards more general partition-wise join. > Please use patches upto 0015, which implement 1:1 partition mapping > for benchmarking and testing. > > [1] https://www.postgresql.org/message-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com > Here's revised patch set with only 0004 revised. That patch deals with creating multi-level inheritance hierarchy from multi-level partition hierarchy. The original logic of recursively calling inheritance_planner()'s guts over the inheritance hierarchy required that for every such recursion we flatten many lists created by that code. Recursion also meant that root->append_rel_list is traversed as many times as the number of partitioned partitions in the hierarchy. Instead the revised version keep the iterative shape of inheritance_planner() intact, thus naturally creating flat lists, iterates over root->append_rel_list only once and is still easy to read and maintain. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Fri, Jul 14, 2017 at 12:32 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
>
> Here's revised patch set with only 0004 revised. That patch deals with
> creating multi-level inheritance hierarchy from multi-level partition
> hierarchy. The original logic of recursively calling
> inheritance_planner()'s guts over the inheritance hierarchy required
> that for every such recursion we flatten many lists created by that
> code. Recursion also meant that root->append_rel_list is traversed as
> many times as the number of partitioned partitions in the hierarchy.
> Instead the revised version keep the iterative shape of
> inheritance_planner() intact, thus naturally creating flat lists,
> iterates over root->append_rel_list only once and is still easy to
> read and maintain.
>
On testing this patch for TPC-H (for scale factor 20) benchmark I found a regression for Q21, on head it was taking some 600 seconds and with this patch it is taking 3200 seconds. This comparison is on the same partitioned database, one using the partition wise join patch and other is without it. The execution time of Q21 on unpartitioned head is some 300 seconds. The explain analyse output for each of these cases is attached.
This suggests that partitioning is not a suitable strategy for this query, but then may be partition wise should not be picked for such a case to aggravate the performance issue.
The details of the setup is as follows,
Server parameter settings,
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
Server parameter settings,
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on
Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders
Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders
Lineitem -
Partition key = l_orderkey
No of partitions = 18
Orders -
Partition key = o_orderkey
No of partitions = 11
Commit id - 42171e2cd23c8307bbe0ec64e901f58e297db1c3
I chose orderkey as the partition key since it is the primary key of orders and along with l_linenumber it forms the primary key for lineitem.
For the above mentioned settings, there was no other query that used partitioned wise join.
Please let me know if any more information is required regarding this experimentation.
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Jul 19, 2017 at 12:24 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > On testing this patch for TPC-H (for scale factor 20) benchmark I found a > regression for Q21, on head it was taking some 600 seconds and with this > patch it is taking 3200 seconds. This comparison is on the same partitioned > database, one using the partition wise join patch and other is without it. > The execution time of Q21 on unpartitioned head is some 300 seconds. The > explain analyse output for each of these cases is attached. Interesting. > This suggests that partitioning is not a suitable strategy for this query, > but then may be partition wise should not be picked for such a case to > aggravate the performance issue. In the unpartitioned case, and in the partitioned case on head, the join order is l1-(nation-supplier)-l2-orders-l3. In the patched case, the join order changes to l1-l2-supplier-orders-nation-l3. If the planner used the former join order, it wouldn't be able to do a partition-wise join at all, so it must think that the l1-l2 join gets much cheaper when done partitionwise, thus justifying a change in the overall join order to be able to use partion-wise join. But it doesn't work out. I think the problem is that the row count estimates for the child joins seem to be totally bogus: -> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12) (actual time=10484.422..15945.851 rows=1523493 loops=3) Hash Cond: (l1.l_orderkey = l2.l_orderkey) Join Filter: (l2.l_suppkey<> l1.l_suppkey) Rows Removed by Join Filter: 395116 That's clearly wrong. In the un-partitioned plan, the join to l2 produces about as many rows of output as the number of rows that were input (998433 vs. 962909); but here, a child join with a million rows as input is estimated to produce only 1 row of output. I bet the problem is that the child-join's row count estimate isn't getting initialized at all, but then something is clamping it to 1 row instead of 0. So this looks like a bug in Ashutosh's patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Thomas Munro
Дата:
On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the problem is that the row count estimates for the child > joins seem to be totally bogus: > > -> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12) > (actual time=10484.422..15945.851 rows=1523493 loops=3) > Hash Cond: (l1.l_orderkey = l2.l_orderkey) > Join Filter: (l2.l_suppkey <> l1.l_suppkey) > Rows Removed by Join Filter: 395116 > > That's clearly wrong. In the un-partitioned plan, the join to l2 > produces about as many rows of output as the number of rows that were > input (998433 vs. 962909); but here, a child join with a million rows > as input is estimated to produce only 1 row of output. I bet the > problem is that the child-join's row count estimate isn't getting > initialized at all, but then something is clamping it to 1 row instead > of 0. > > So this looks like a bug in Ashutosh's patch. Isn't this the same as the issue reported here? https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think the problem is that the row count estimates for the child >> joins seem to be totally bogus: >> >> -> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12) >> (actual time=10484.422..15945.851 rows=1523493 loops=3) >> Hash Cond: (l1.l_orderkey = l2.l_orderkey) >> Join Filter: (l2.l_suppkey <> l1.l_suppkey) >> Rows Removed by Join Filter: 395116 >> >> That's clearly wrong. In the un-partitioned plan, the join to l2 >> produces about as many rows of output as the number of rows that were >> input (998433 vs. 962909); but here, a child join with a million rows >> as input is estimated to produce only 1 row of output. I bet the >> problem is that the child-join's row count estimate isn't getting >> initialized at all, but then something is clamping it to 1 row instead >> of 0. >> >> So this looks like a bug in Ashutosh's patch. > > Isn't this the same as the issue reported here? > > https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com Hmm, possibly. But why would that affect the partition-wise join case only? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Thomas Munro
Дата:
On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Isn't this the same as the issue reported here? >> >> https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com > > Hmm, possibly. But why would that affect the partition-wise join case only? It doesn't. From Rafia's part_reg.zip we see a bunch of rows=1 that turn out to be wrong by several orders of magnitude: 21_nopart_head.out: Hash Semi Join (cost=5720107.25..9442574.55 rows=1 width=50) 21_part_head.out: Hash Semi Join (cost=5423094.06..8847638.36 rows=1 width=38) 21_part_patched.out: Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12) My guess is that the consequences of that bad estimate are sensitive to arbitrary other parameters moving around, as you can see from the big jump in execution time I showed in the that message, measured on unpatched master of the day: 4 workers = 9.5s 3 workers = 39.7s That's why why both parallel hash join and partition-wise join are showing regressions on Q21: it's just flip-flopping between various badly costed plans. Note that even without parallelism, the fix that Tom Lane suggested gives a much better plan: https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > > Partition information: > Type of partitioning - single column range partition > Tables partitioned - Lineitem and orders > > Lineitem - > Partition key = l_orderkey > No of partitions = 18 > > Orders - > Partition key = o_orderkey > No of partitions = 11 > The patch set upto 0015 would refuse to join two partitioned relations using a partition-wise join if they have different number of partitions. Next patches implement a more advanced partition matching algorithm only for list partitions. Those next patches would refuse to apply partition-wise join for range partitioned tables. So, I am confused as to how come partition-wise join is being chosen even when the number of partitions differ. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Amit Langote
Дата:
On 2017/07/20 15:05, Ashutosh Bapat wrote: > On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih > <rafia.sabih@enterprisedb.com> wrote: >> >> Partition information: >> Type of partitioning - single column range partition >> Tables partitioned - Lineitem and orders >> >> Lineitem - >> Partition key = l_orderkey >> No of partitions = 18 >> >> Orders - >> Partition key = o_orderkey >> No of partitions = 11 >> > > The patch set upto 0015 would refuse to join two partitioned relations > using a partition-wise join if they have different number of > partitions. Next patches implement a more advanced partition matching > algorithm only for list partitions. Those next patches would refuse to > apply partition-wise join for range partitioned tables. So, I am > confused as to how come partition-wise join is being chosen even when > the number of partitions differ. In 21_part_patched.out, I see that lineitem is partitionwise-joined with itself. > Append -> Hash Semi Join Hash Cond: (l1.l_orderkey = l2.l_orderkey) Join Filter: (l2.l_suppkey <> l1.l_suppkey) Rows Removed by Join Filter: 395116 -> Parallel Seq Scan on lineitem_001 l1 Filter: (l_receiptdate > l_commitdate) Rows Removed by Filter:919654 -> Hash Buckets: 8388608 Batches: 1 Memory Usage: 358464kB -> Seq Scan on lineitem_001 l2 Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Jul 20, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> This suggests that partitioning is not a suitable strategy for this query, >> but then may be partition wise should not be picked for such a case to >> aggravate the performance issue. > > In the unpartitioned case, and in the partitioned case on head, the > join order is l1-(nation-supplier)-l2-orders-l3. In the patched case, > the join order changes to l1-l2-supplier-orders-nation-l3. If the > planner used the former join order, it wouldn't be able to do a > partition-wise join at all, so it must think that the l1-l2 join gets > much cheaper when done partitionwise, thus justifying a change in the > overall join order to be able to use partion-wise join. But it > doesn't work out. > > I think the problem is that the row count estimates for the child > joins seem to be totally bogus: > > -> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12) > (actual time=10484.422..15945.851 rows=1523493 loops=3) > Hash Cond: (l1.l_orderkey = l2.l_orderkey) > Join Filter: (l2.l_suppkey <> l1.l_suppkey) > Rows Removed by Join Filter: 395116 > > That's clearly wrong. In the un-partitioned plan, the join to l2 > produces about as many rows of output as the number of rows that were > input (998433 vs. 962909); but here, a child join with a million rows > as input is estimated to produce only 1 row of output. I bet the > problem is that the child-join's row count estimate isn't getting > initialized at all, but then something is clamping it to 1 row instead > of 0. > > So this looks like a bug in Ashutosh's patch. The patch does not have any changes to the selectivity estimation. It might happen that some correction in selectivity estimation for child-joins is required, but I have not spotted any code in selectivity estimation that differentiates explicitly between child and parent Vars and estimates. So, I am more inclined to believe Thomas's theory. I will try Tom's suggested approach. I am investigating this case with the setup that Rafia provided. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/07/20 15:05, Ashutosh Bapat wrote: >> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih >> <rafia.sabih@enterprisedb.com> wrote: >>> >>> Partition information: >>> Type of partitioning - single column range partition >>> Tables partitioned - Lineitem and orders >>> >>> Lineitem - >>> Partition key = l_orderkey >>> No of partitions = 18 >>> >>> Orders - >>> Partition key = o_orderkey >>> No of partitions = 11 >>> >> >> The patch set upto 0015 would refuse to join two partitioned relations >> using a partition-wise join if they have different number of >> partitions. Next patches implement a more advanced partition matching >> algorithm only for list partitions. Those next patches would refuse to >> apply partition-wise join for range partitioned tables. So, I am >> confused as to how come partition-wise join is being chosen even when >> the number of partitions differ. > > In 21_part_patched.out, I see that lineitem is partitionwise-joined with > itself. > > > Append > > -> Hash Semi Join > Hash Cond: (l1.l_orderkey = l2.l_orderkey) > Join Filter: (l2.l_suppkey <> l1.l_suppkey) > Rows Removed by Join Filter: 395116 > > -> Parallel Seq Scan on lineitem_001 l1 > Filter: (l_receiptdate > l_commitdate) > Rows Removed by Filter: 919654 > > -> Hash > Buckets: 8388608 Batches: 1 Memory Usage: 358464kB > -> Seq Scan on lineitem_001 l2 > Ah, I see now. We need the same number of partitions in all partitioned tables, for joins to pick up partition-wise join. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Thu, Jul 20, 2017 at 8:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Isn't this the same as the issue reported here?
>>
>> https://www.postgresql.org/message-id/flat/CAEepm% 3D270ze2hVxWkJw- 5eKzc3AB4C9KpH3L2kih75R5pdSogg %40mail.gmail.com
>
> Hmm, possibly. But why would that affect the partition-wise join case only?
It doesn't. From Rafia's part_reg.zip we see a bunch of rows=1 that
turn out to be wrong by several orders of magnitude:
21_nopart_head.out: Hash Semi Join (cost=5720107.25..9442574.55
rows=1 width=50)
21_part_head.out: Hash Semi Join (cost=5423094.06..8847638.36
rows=1 width=38)
21_part_patched.out: Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)
My guess is that the consequences of that bad estimate are sensitive
to arbitrary other parameters moving around, as you can see from the
big jump in execution time I showed in the that message, measured on
unpatched master of the day:
4 workers = 9.5s
3 workers = 39.7s
That's why why both parallel hash join and partition-wise join are
showing regressions on Q21: it's just flip-flopping between various
badly costed plans. Note that even without parallelism, the fix that
Tom Lane suggested gives a much better plan:
https://www.postgresql.org/message-id/CAEepm% 3D11BiYUkgXZNzMtYhXh4S3a9DwUP8 O%2BF2_ZPeGzzJFPbw%40mail. gmail.com
Following the discussion at [1], with the patch Thomas posted there, now Q21 completes in some 160 seconds. The plan is changed for the good but does not use partition-wise join. The output of explain analyse is attached.
Not just the join orders but the join strategy itself changed, with the patch no hash semi join is picked which was consuming most time there, rather nested loop semi join is in picture now, though the estimates are still way-off, but the change in join-order made them terrible from horrible. It appears like this query is performing efficient now particularly because of worse under-estimated hash-join as compared to under-estimated nested loop join.
For the hash-semi-join:
-> Hash (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual time=180858.448..180858.448 rows=119994608 loops=3)
Buckets: 33554432 Batches: 8 Memory Usage: 847911kB
Overall, this doesn't look like a problem of partition-wise join patch itself.
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Thu, Jul 20, 2017 at 2:44 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > On 2017/07/20 15:05, Ashutosh Bapat wrote: > >> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih > >> <rafia.sabih@enterprisedb.com> wrote: > >>> > >>> Partition information: > >>> Type of partitioning - single column range partition > >>> Tables partitioned - Lineitem and orders > >>> > >>> Lineitem - > >>> Partition key = l_orderkey > >>> No of partitions = 18 > >>> > >>> Orders - > >>> Partition key = o_orderkey > >>> No of partitions = 11 > >>> > >> > >> The patch set upto 0015 would refuse to join two partitioned relations > >> using a partition-wise join if they have different number of > >> partitions. Next patches implement a more advanced partition matching > >> algorithm only for list partitions. Those next patches would refuse to > >> apply partition-wise join for range partitioned tables. So, I am > >> confused as to how come partition-wise join is being chosen even when > >> the number of partitions differ. > > > > In 21_part_patched.out, I see that lineitem is partitionwise-joined with > > itself. > > > > > Append > > > > -> Hash Semi Join > > Hash Cond: (l1.l_orderkey = l2.l_orderkey) > > Join Filter: (l2.l_suppkey <> l1.l_suppkey) > > Rows Removed by Join Filter: 395116 > > > > -> Parallel Seq Scan on lineitem_001 l1 > > Filter: (l_receiptdate > l_commitdate) > > Rows Removed by Filter: 919654 > > > > -> Hash > > Buckets: 8388608 Batches: 1 Memory Usage: 358464kB > > -> Seq Scan on lineitem_001 l2 > > > Ah, I see now. > > We need the same number of partitions in all partitioned tables, for > joins to pick up partition-wise join. > Oh, I missed this limitation, will modify my setup to have same number of partitions in the partitioned table with same ranges. So, does this also mean that a partitioned table will not join with an unpartitioned table without append of partitions? -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Fri, Jul 21, 2017 at 11:42 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > > Following the discussion at [1], with the patch Thomas posted there, now Q21 > completes in some 160 seconds. Your earlier reports mentioned unpartitioned case taking 300 seconds, partitioned case without partition-wise join taking 600 seconds and with partition-wise join it taking 3200 seconds. My experiements showed that those have changed to 70s, 160s and 160s resp. This is with Thomas's patch. Can you please confirm? > The plan is changed for the good but does not > use partition-wise join. As explained earlier, this is because the tables are not partitioned similarly. Please try with lineitem and orders partitioned similarly i.e. same number of partitions and exactly same ranges. > Not just the join orders but the join strategy itself changed, with the > patch no hash semi join is picked which was consuming most time there, > rather nested loop semi join is in picture now, though the estimates are > still way-off, but the change in join-order made them terrible from > horrible. It appears like this query is performing efficient now > particularly because of worse under-estimated hash-join as compared to > under-estimated nested loop join. Earlier it was using partition-wise join between lineitems (l1, l2, l3) since it's the same table. Now for some reason the planner doesn't find joining them to each other a better strategy, instead they are joined indirectly so we don't see partition-wise join being picked. We should experiment with orders and lineitems being partitioned similarly. Can you please provide that result? > > For the hash-semi-join: > -> Hash (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual > time=180858.448..180858.448 rows=119994608 loops=3) > Buckets: 33554432 > Batches: 8 Memory Usage: 847911kB > > Overall, this doesn't look like a problem of partition-wise join patch > itself. > Thanks for confirming it. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > So, does this > also mean that a partitioned table will not join with an unpartitioned > table without append of partitions? > Yes. When you join an unpartitioned table with a partitioned table, the planner will choose to append all the partitions of the partitioned table and then join with the unpartitioned table. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Fri, Jul 21, 2017 at 12:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> So, does this
>> also mean that a partitioned table will not join with an unpartitioned
>> table without append of partitions?
>>
>
> Yes. When you join an unpartitioned table with a partitioned table,
> the planner will choose to append all the partitions of the
> partitioned table and then join with the unpartitioned table.
>
I tested this set of patches for TPC-H benchmark and came across following results,
- total 7 queries were using partition-wise join,
- Q4 attains a speedup of around 80% compared to the partitioned setup without partition-wise join, the main reason being the poor plan choice at head for partitioned database.
When I tried this query with forced nested-loop join then it completes in some 45 seconds at head. So, basically when no partition-wise join is present because of terrible selectivity estimation optimiser picks up a hash join plan, which results poorly as the estimated number of rows are two orders of magnitude lesser than actual.
Note that this is not the effect of [1], I tried this without that patch as well.
- other queries show a good 20-30% improvement in performance. Performance numbers are as follows,
Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |
The experimental settings used were,
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |
The experimental settings used were,
Partitioning: Range partitioning on lineitem and orders on l_orderkey and o_orderkey respectively. The number and range of partitions were kept same for both the tables.
Server parameters:
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on
TPC-H setup:
scale-factor - 20
Commit id - 42171e2cd23c8307bbe0ec64e901f5 8e297db1c3, also, the patch at [1] was applied in all the cases.
Query plans for the above mentioned queries is attached.
[1] https://www.postgresql. org/message-id/CAEepm%3D3% 3DNHHko3oOzpik%2BggLy17AO% 2Bpx3rGYrg3x_x05%2BBr9-A% 40mail.gmail.com
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > - other queries show a good 20-30% improvement in performance. Performance > numbers are as follows, > > Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) | > 3 | 76 |127 | 88 | > 4 |17 | 244 | 41 | > 5 | 52 | 123 | 84 | > 7 | 73 | 134 | 103 | > 10 | 67 | 111 | 89 | > 12 | 53 | 114 | 99 | > 18 | 447 | 709 | 551 | Hmm. This certainly shows that benefit of the patch, although it's rather sad that we're still slower than if we hadn't partitioned the data in the first place. Why does partitioning hurt performance so much? Maybe things would be better at a higher scale factor. When reporting results of this sort, it would be good to make a habit of reporting the number of partitions along with the other details you included. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Dilip Kumar
Дата:
On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih > <rafia.sabih@enterprisedb.com> wrote: >> - other queries show a good 20-30% improvement in performance. Performance >> numbers are as follows, >> >> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) | >> 3 | 76 |127 | 88 | >> 4 |17 | 244 | 41 | >> 5 | 52 | 123 | 84 | >> 7 | 73 | 134 | 103 | >> 10 | 67 | 111 | 89 | >> 12 | 53 | 114 | 99 | >> 18 | 447 | 709 | 551 | > > Hmm. This certainly shows that benefit of the patch, although it's > rather sad that we're still slower than if we hadn't partitioned the > data in the first place. Why does partitioning hurt performance so > much? I was analysing some of the plans (without partition and with partition), Seems like one of the reasons of performing worse with the partitioned table is that we can not use an index on the partitioned table. Q4 is taking 17s without partition whereas it's taking 244s with partition. Now if we analyze the plan Without partition, it can use parameterize index scan on lineitem table which is really big in size. But with partition, it has to scan this table completely. -> Nested Loop Semi Join -> Parallel Bitmap Heap Scan on orders -> Bitmap Index Scan on idx_orders_orderdate (cost=0.00..24378.88 r -> Index Scan using idx_lineitem_orderkey on lineitem (cost=0.57..29.29 rows=105 width=8) (actual time=0.031..0.031 rows=1 loops=1122364) Index Cond: (l_orderkey = orders.o_orderkey) Filter: (l_commitdate < l_receiptdate) Rows Removed by Filter: 1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih >> <rafia.sabih@enterprisedb.com> wrote: >>> - other queries show a good 20-30% improvement in performance. Performance >>> numbers are as follows, >>> >>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) | >>> 3 | 76 |127 | 88 | >>> 4 |17 | 244 | 41 | >>> 5 | 52 | 123 | 84 | >>> 7 | 73 | 134 | 103 | >>> 10 | 67 | 111 | 89 | >>> 12 | 53 | 114 | 99 | >>> 18 | 447 | 709 | 551 | >> >> Hmm. This certainly shows that benefit of the patch, although it's >> rather sad that we're still slower than if we hadn't partitioned the >> data in the first place. Why does partitioning hurt performance so >> much? > > I was analysing some of the plans (without partition and with > partition), Seems like one of the reasons of performing worse with the > partitioned table is that we can not use an index on the partitioned > table. > > Q4 is taking 17s without partition whereas it's taking 244s with partition. > > Now if we analyze the plan > > Without partition, it can use parameterize index scan on lineitem > table which is really big in size. But with partition, it has to scan > this table completely. > > -> Nested Loop Semi Join > -> Parallel Bitmap Heap Scan on orders > -> Bitmap Index Scan on > idx_orders_orderdate (cost=0.00..24378.88 r > -> Index Scan using idx_lineitem_orderkey on > lineitem (cost=0.57..29.29 rows=105 width=8) (actual > time=0.031..0.031 rows=1 loops=1122364) > Index Cond: (l_orderkey = > orders.o_orderkey) > Filter: (l_commitdate < l_receiptdate) > Rows Removed by Filter: 1 > If the partitions have the same indexes as the unpartitioned table, planner manages to create parameterized plans for each partition and thus parameterized plan for the whole partitioned table. Do we have same indexes on unpartitioned table and each of the partitions? The difference between the two cases is the parameterized path on an unpartitioned table scans only one index whereas that on the partitioned table scans the indexes on all the partitions. My guess is the planner thinks those many scans are costlier than hash/merge join and chooses those strategies over parameterized nest loop join. In case of partition-wise join, only one index on the inner partition is involved and thus partition-wise join picks up parameterized nest loop join. Notice, that this index is much smaller than the index on the partitioned table, so the index scan will be a bit faster. But only a bit, since the depth of the index doesn't increase linearly with the size of index. Rrun-time partition pruning will improve performance even without partition-wise join since partition pruning will be able to eliminate all but one partition and only one index needs to be scanned. If planner is smart enough to cost that effectively, it will choose parameterized nest loop join for partitioned table thus improving the performance similar to unpartitioned case. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > Query plans for the above mentioned queries is attached. > Can you please share plans for all the queries, even if they haven't chosen partition-wise join when run on partitioned tables with enable_partition_wise_join ON? Also, please include the query in explain analyze output using -a or -e flats to psql. That way we will have query and its plan in the same file for ready reference. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> Query plans for the above mentioned queries is attached.
>
Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.
I didn't run the query not using partition-wise join, for now.
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > > > On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih >> <rafia.sabih@enterprisedb.com> wrote: >> >> > Query plans for the above mentioned queries is attached. >> > >> >> Can you please share plans for all the queries, even if they haven't >> chosen partition-wise join when run on partitioned tables with >> enable_partition_wise_join ON? Also, please include the query in >> explain analyze output using -a or -e flats to psql. That way we will >> have query and its plan in the same file for ready reference. >> > I didn't run the query not using partition-wise join, for now. parse-parse error, sorry. Do you mean, you haven't run the queries which do not use partition-wise join? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
-- On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
>
> On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>
>> > Query plans for the above mentioned queries is attached.
>> >
>>
>> Can you please share plans for all the queries, even if they haven't
>> chosen partition-wise join when run on partitioned tables with
>> enable_partition_wise_join ON? Also, please include the query in
>> explain analyze output using -a or -e flats to psql. That way we will
>> have query and its plan in the same file for ready reference.
>>
> I didn't run the query not using partition-wise join, for now.
parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?
Yes, that's what I mean.
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Wed, Jul 26, 2017 at 11:08 AM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote: > > > On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> >> On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih >> <rafia.sabih@enterprisedb.com> wrote: >> > >> > >> > On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat >> > <ashutosh.bapat@enterprisedb.com> wrote: >> >> >> >> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih >> >> <rafia.sabih@enterprisedb.com> wrote: >> >> >> >> > Query plans for the above mentioned queries is attached. >> >> > >> >> >> >> Can you please share plans for all the queries, even if they haven't >> >> chosen partition-wise join when run on partitioned tables with >> >> enable_partition_wise_join ON? Also, please include the query in >> >> explain analyze output using -a or -e flats to psql. That way we will >> >> have query and its plan in the same file for ready reference. >> >> >> > I didn't run the query not using partition-wise join, for now. >> >> parse-parse error, sorry. Do you mean, you haven't run the queries >> which do not use partition-wise join? >> > Yes, that's what I mean. Ok. If those queries have equi-join between partitioned tables and are not picking up partition-wise join, that case needs to be investigated. Q21 for example has join between three lineitem instances. Those joins can be executed by partition-wise join. But it may so happen that optimal join order doesn't join partitioned tables with each other, thus interleaving partitioned tables with unpartitioned or differently partitioned tables in join order. Partition-wise join is not possible then. A different partitioning scheme may be required there. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Wed, Jul 26, 2017 at 10:38 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih > >> <rafia.sabih@enterprisedb.com> wrote: > >>> - other queries show a good 20-30% improvement in performance. Performance > >>> numbers are as follows, > >>> > >>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) | > >>> 3 | 76 |127 | 88 | > >>> 4 |17 | 244 | 41 | > >>> 5 | 52 | 123 | 84 | > >>> 7 | 73 | 134 | 103 | > >>> 10 | 67 | 111 | 89 | > >>> 12 | 53 | 114 | 99 | > >>> 18 | 447 | 709 | 551 | > >> > >> Hmm. This certainly shows that benefit of the patch, although it's > >> rather sad that we're still slower than if we hadn't partitioned the > >> data in the first place. Why does partitioning hurt performance so > >> much? > > > > I was analysing some of the plans (without partition and with > > partition), Seems like one of the reasons of performing worse with the > > partitioned table is that we can not use an index on the partitioned > > table. > > > > Q4 is taking 17s without partition whereas it's taking 244s with partition. > > > > Now if we analyze the plan > > > > Without partition, it can use parameterize index scan on lineitem > > table which is really big in size. But with partition, it has to scan > > this table completely. > > > > -> Nested Loop Semi Join > > -> Parallel Bitmap Heap Scan on orders > > -> Bitmap Index Scan on > > idx_orders_orderdate (cost=0.00..24378.88 r > > -> Index Scan using idx_lineitem_orderkey on > > lineitem (cost=0.57..29.29 rows=105 width=8) (actual > > time=0.031..0.031 rows=1 loops=1122364) > > Index Cond: (l_orderkey = > > orders.o_orderkey) > > Filter: (l_commitdate < l_receiptdate) > > Rows Removed by Filter: 1 > > > > If the partitions have the same indexes as the unpartitioned table, > planner manages to create parameterized plans for each partition and > thus parameterized plan for the whole partitioned table. Do we have > same indexes on unpartitioned table and each of the partitions? The Yes both lineitem and orders have same number of partitions viz 17 and on the same partitioning key (*_orderkey) and same ranges for each partition. However, I missed creating the index on o_orderdate for the partitions. But on creating it as well, the plan with bitmap heap scan is used and it still completes in some 200 seconds, check the attached file for the query plan. > difference between the two cases is the parameterized path on an > unpartitioned table scans only one index whereas that on the > partitioned table scans the indexes on all the partitions. My guess is > the planner thinks those many scans are costlier than hash/merge join > and chooses those strategies over parameterized nest loop join. In > case of partition-wise join, only one index on the inner partition is > involved and thus partition-wise join picks up parameterized nest loop > join. Notice, that this index is much smaller than the index on the > partitioned table, so the index scan will be a bit faster. But only a > bit, since the depth of the index doesn't increase linearly with the > size of index. > As I have observed, the thing with this query is that selectivity estimation is too high than actual, now when index scan is chosen for lineitem being in the inner side of NLJ, the query completes quickly since the number of actual returned rows is too low. However, in case we pick seq scan, or lineitem is on the outer side, the query is going to take a really long time. Now, when Hash-Join is picked in the case of partitioned database and no partition-wise join is available, seq scan is preferred instead of index scan and hence the elongated query execution time. I tried this query with random_page_cost = 0 and forcing NLJ and the chosen plan completes the query in 45 seconds, check the attached file for explain analyse output. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Rafia Sabih
Дата:
On Wed, Jul 26, 2017 at 12:02 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > > Ok. If those queries have equi-join between partitioned tables and are > not picking up partition-wise join, that case needs to be > investigated. Q21 for example has join between three lineitem > instances. Those joins can be executed by partition-wise join. But it > may so happen that optimal join order doesn't join partitioned tables > with each other, thus interleaving partitioned tables with > unpartitioned or differently partitioned tables in join order. > Partition-wise join is not possible then. A different partitioning > scheme may be required there. > Good point, will look into this direction as well. -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Here's revised patch set with only 0004 revised. That patch deals with > creating multi-level inheritance hierarchy from multi-level partition > hierarchy. The original logic of recursively calling > inheritance_planner()'s guts over the inheritance hierarchy required > that for every such recursion we flatten many lists created by that > code. Recursion also meant that root->append_rel_list is traversed as > many times as the number of partitioned partitions in the hierarchy. > Instead the revised version keep the iterative shape of > inheritance_planner() intact, thus naturally creating flat lists, > iterates over root->append_rel_list only once and is still easy to > read and maintain. 0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004: + if (brel->reloptkind != RELOPT_BASEREL && + brte->relkind != RELKIND_PARTITIONED_TABLE) I spent a lot of time staring at this code before I figured out what was going on here. We're iterating over simple_rel_array, so the reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL. But does that guarantee that rtekind is RTE_RELATION such that brte->relkind will be initialized to a value? I'm not 100% sure. I think it would be clearer to write this test like this: Assert(IS_SIMPLE_REL(brel)); if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL && (brte->rtekind != RELOPT_BASEREL || brte->relkind != RELKIND_PARTITIONED_TABLE)) continue; Note that the way you wrote the comment is says if it *is* another REL, not if it's *not* a baserel; it's good if those kinds of little details match between the code and the comments. It is not clear to me what the motivation is for the API changes in expanded_inherited_rtentry. They don't appear to be necessary. If they are necessary, you need to do a more thorough job updating the comments. This one, in particular: * If so, add entries for all the child tables to the query's * rangetable, and build AppendRelInfo nodes for allthe child tables * and add them to root->append_rel_list. If not, clear the entry's And the comments could maybe say something like "We return the list of appinfos rather than directly appending it to append_rel_list because $REASON." - * is a partitioned table. + * RTE simply duplicates the parent *partitioned* table. */ - if (childrte->relkind != RELKIND_PARTITIONED_TABLE) + if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh) This is another case where it's hard to understand the test from the comments. + * In case of multi-level inheritance hierarchy, for every child we require + * PlannerInfo of its immediate parent. Hence we save those in a an array Say why. Also, need to fix "a an". I'm a little bit surprised that this patch doesn't make any changes to allpaths.c or relnode.c. It looks to me like we'll generate paths for the new RTEs that are being added. Are we going to end up with multiple levels of Append nodes, then? Does the consider the way consider_parallel is propagated up and down in set_append_rel_size() and set_append_rel_pathlist() really work with multiple levels? Maybe this is all fine; I haven't tried to verify it in depth. Overall I think this is a reasonable direction to go but I'm worried that there may be bugs lurking -- other code that needs adjusting that hasn't been found, really. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Here's revised patch set with only 0004 revised. That patch deals with >> creating multi-level inheritance hierarchy from multi-level partition >> hierarchy. The original logic of recursively calling >> inheritance_planner()'s guts over the inheritance hierarchy required >> that for every such recursion we flatten many lists created by that >> code. Recursion also meant that root->append_rel_list is traversed as >> many times as the number of partitioned partitions in the hierarchy. >> Instead the revised version keep the iterative shape of >> inheritance_planner() intact, thus naturally creating flat lists, >> iterates over root->append_rel_list only once and is still easy to >> read and maintain. > > 0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004: > > + if (brel->reloptkind != RELOPT_BASEREL && > + brte->relkind != RELKIND_PARTITIONED_TABLE) > > I spent a lot of time staring at this code before I figured out what > was going on here. We're iterating over simple_rel_array, so the > reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL. > But does that guarantee that rtekind is RTE_RELATION such that > brte->relkind will be initialized to a value? I'm not 100% sure. Comment in RangeTblEntry says952 /*953 * Fields valid for a plain relation RTE (else zero):954 * ... clipped portion for RTE_NAMEDTUPLESTORE related comment 960 Oid relid; /* OID of the relation */961 char relkind; /* relation kind (see pg_class.relkind)*/ This means that relkind will be 0 when rtekind != RTE_RELATION. So, the condition holds. But code creating an RTE somewhere which is not in sync with this comment would create a problem. So your suggestion makes sense. > I > think it would be clearer to write this test like this: > > Assert(IS_SIMPLE_REL(brel)); > if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL && > (brte->rtekind != RELOPT_BASEREL || Do you mean (brte_>rtekind != RTE_RELATION)? > brte->relkind != RELKIND_PARTITIONED_TABLE)) > continue; > > Note that the way you wrote the comment is says if it *is* another > REL, not if it's *not* a baserel; it's good if those kinds of little > details match between the code and the comments. I find the existing comment and code in this part of the function differ. The comment just above the loop on simple_rel_array[], talks about changing something in the child, but the very next line skips child relations and later a loop on append_rel_list makes changes to appropriate children. I guess, it's done that way to keep the code working even after we introduce some RELOPTKIND other than BASEREL or OTHER_MEMBER_REL for a simple rel. But your suggestion makes more sense. Changed it according to your suggestion. > > It is not clear to me what the motivation is for the API changes in > expanded_inherited_rtentry. They don't appear to be necessary. expand_inherited_rtentry() creates AppendRelInfos for all the children of a given parent and collects them in a list. The list is appended to root->append_rel_list at the end of the function. Now that function needs to do this recursively. This means that for a partitioned partition table its children's AppendRelInfos will be added to root->append_rel_list before AppendRelInfo of that partitioned partition table. inheritance_planner() assumes that the parent's AppendRelInfo comes before its children in append_rel_list.This assumption allows it to be simplified greately, retaining its iterative form. My earlier patches had recursive version of inheritance_planner(), which is complex. I have comments in this patch explaining this. Adding AppendRelInfos to root->append_rel_list as and when they are created would keep parent AppendRelInfos before those of children. But that function throws away the AppendRelInfo it created when their are no real children i.e. in partitioned table's case when has no leaf partitions. So, we can't do that. Hence, I chose to change the API to return the list of AppendRelInfos when the given RTE has real children. > If > they are necessary, you need to do a more thorough job updating the > comments. This one, in particular: > > * If so, add entries for all the child tables to the query's > * rangetable, and build AppendRelInfo nodes for all the child tables > * and add them to root->append_rel_list. If not, clear the entry's Done. > > And the comments could maybe say something like "We return the list of > appinfos rather than directly appending it to append_rel_list because > $REASON." Done. Please check the attached version. > > - * is a partitioned table. > + * RTE simply duplicates the parent *partitioned* table. > */ > - if (childrte->relkind != RELKIND_PARTITIONED_TABLE) > + if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh) > > This is another case where it's hard to understand the test from the comments. The current comment says it all, but it very cryptic manner. 1526 /* 1527 * Build an AppendRelInfo for this parent and child, unless the child 1528 * RTE simply duplicates the parent *partitioned* table. 1529 */ The comment makes sense in the context of this paragraph in the prologue 1364 * Note that the original RTE is considered to represent the whole 1365 * inheritance set. The first of the generated RTEs is an RTE for the same 1366 * table, but with inh = false, to represent the parent table in its role 1367 * as a simple member of the inheritance set. 1368 * The code avoids creating AppendRelInfos for a child which represents the parent in its role as a simple member of inheritance set. I have reworded it as 1526 /* 1527 * Build an AppendRelInfo for this parent and child, unless the child 1528 * RTE represents the parent as a simple member of inheritance set. 1529 */ > > + * In case of multi-level inheritance hierarchy, for every child we require > + * PlannerInfo of its immediate parent. Hence we save those in a an array > > Say why. Also, need to fix "a an". Done. > > I'm a little bit surprised that this patch doesn't make any changes to > allpaths.c or relnode.c. > It looks to me like we'll generate paths for > the new RTEs that are being added. Are we going to end up with > multiple levels of Append nodes, then? Does the consider the way > consider_parallel is propagated up and down in set_append_rel_size() > and set_append_rel_pathlist() really work with multiple levels? Maybe > this is all fine; I haven't tried to verify it in depth. This has been discussed before, but I can not locate the mail answering these questions. accumulate_append_subpath() called from add_paths_to_append_rel() takes care of flattening Merge/Append paths. The planner code deals with the multi-level inheritance hierarchy created for subqueries with set operations. There is code in relnode.c to build the RelOptInfos for such subqueries recursively through using RangeTblEntry::inh flag. So there are no changes in allpaths.c and relnode.c. Are you looking for some other changes? > > Overall I think this is a reasonable direction to go but I'm worried > that there may be bugs lurking -- other code that needs adjusting that > hasn't been found, really. > Planner code is already aware of such hierarchies except DMLs, which this patch adjusts. We have fixed issues revealed by mine and Rajkumar's testing. What kinds of things you suspect? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
Forgot the patch set. Here it is. On Mon, Jul 31, 2017 at 5:29 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> Here's revised patch set with only 0004 revised. That patch deals with >>> creating multi-level inheritance hierarchy from multi-level partition >>> hierarchy. The original logic of recursively calling >>> inheritance_planner()'s guts over the inheritance hierarchy required >>> that for every such recursion we flatten many lists created by that >>> code. Recursion also meant that root->append_rel_list is traversed as >>> many times as the number of partitioned partitions in the hierarchy. >>> Instead the revised version keep the iterative shape of >>> inheritance_planner() intact, thus naturally creating flat lists, >>> iterates over root->append_rel_list only once and is still easy to >>> read and maintain. >> >> 0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004: >> >> + if (brel->reloptkind != RELOPT_BASEREL && >> + brte->relkind != RELKIND_PARTITIONED_TABLE) >> >> I spent a lot of time staring at this code before I figured out what >> was going on here. We're iterating over simple_rel_array, so the >> reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL. >> But does that guarantee that rtekind is RTE_RELATION such that >> brte->relkind will be initialized to a value? I'm not 100% sure. > > Comment in RangeTblEntry says > 952 /* > 953 * Fields valid for a plain relation RTE (else zero): > 954 * > ... clipped portion for RTE_NAMEDTUPLESTORE related comment > > 960 Oid relid; /* OID of the relation */ > 961 char relkind; /* relation kind (see pg_class.relkind) */ > > This means that relkind will be 0 when rtekind != RTE_RELATION. So, > the condition holds. But code creating an RTE somewhere which is not > in sync with this comment would create a problem. So your suggestion > makes sense. > >> I >> think it would be clearer to write this test like this: >> >> Assert(IS_SIMPLE_REL(brel)); >> if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL && >> (brte->rtekind != RELOPT_BASEREL || > > Do you mean (brte_>rtekind != RTE_RELATION)? > >> brte->relkind != RELKIND_PARTITIONED_TABLE)) >> continue; >> >> Note that the way you wrote the comment is says if it *is* another >> REL, not if it's *not* a baserel; it's good if those kinds of little >> details match between the code and the comments. > > I find the existing comment and code in this part of the function > differ. The comment just above the loop on simple_rel_array[], talks > about changing something in the child, but the very next line skips > child relations and later a loop on append_rel_list makes changes to > appropriate children. I guess, it's done that way to keep the code > working even after we introduce some RELOPTKIND other than BASEREL or > OTHER_MEMBER_REL for a simple rel. But your suggestion makes more > sense. Changed it according to your suggestion. > >> >> It is not clear to me what the motivation is for the API changes in >> expanded_inherited_rtentry. They don't appear to be necessary. > > expand_inherited_rtentry() creates AppendRelInfos for all the children > of a given parent and collects them in a list. The list is appended to > root->append_rel_list at the end of the function. Now that function > needs to do this recursively. This means that for a partitioned > partition table its children's AppendRelInfos will be added to > root->append_rel_list before AppendRelInfo of that partitioned > partition table. inheritance_planner() assumes that the parent's > AppendRelInfo comes before its children in append_rel_list.This > assumption allows it to be simplified greately, retaining its > iterative form. My earlier patches had recursive version of > inheritance_planner(), which is complex. I have comments in this patch > explaining this. > > Adding AppendRelInfos to root->append_rel_list as and when they are > created would keep parent AppendRelInfos before those of children. But > that function throws away the AppendRelInfo it created when their are > no real children i.e. in partitioned table's case when has no leaf > partitions. So, we can't do that. Hence, I chose to change the API to > return the list of AppendRelInfos when the given RTE has real > children. > >> If >> they are necessary, you need to do a more thorough job updating the >> comments. This one, in particular: >> >> * If so, add entries for all the child tables to the query's >> * rangetable, and build AppendRelInfo nodes for all the child tables >> * and add them to root->append_rel_list. If not, clear the entry's > > Done. > >> >> And the comments could maybe say something like "We return the list of >> appinfos rather than directly appending it to append_rel_list because >> $REASON." > > Done. Please check the attached version. > >> >> - * is a partitioned table. >> + * RTE simply duplicates the parent *partitioned* table. >> */ >> - if (childrte->relkind != RELKIND_PARTITIONED_TABLE) >> + if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh) >> >> This is another case where it's hard to understand the test from the comments. > > The current comment says it all, but it very cryptic manner. > 1526 /* > 1527 * Build an AppendRelInfo for this parent and child, > unless the child > 1528 * RTE simply duplicates the parent *partitioned* table. > 1529 */ > > The comment makes sense in the context of this paragraph in the prologue > 1364 * Note that the original RTE is considered to represent the whole > 1365 * inheritance set. The first of the generated RTEs is an RTE for the same > 1366 * table, but with inh = false, to represent the parent table in its role > 1367 * as a simple member of the inheritance set. > 1368 * > > The code avoids creating AppendRelInfos for a child which represents > the parent in its role as a simple member of inheritance set. > > I have reworded it as > 1526 /* > 1527 * Build an AppendRelInfo for this parent and child, > unless the child > 1528 * RTE represents the parent as a simple member of inheritance set. > 1529 */ > >> >> + * In case of multi-level inheritance hierarchy, for every child we require >> + * PlannerInfo of its immediate parent. Hence we save those in a an array >> >> Say why. Also, need to fix "a an". > > Done. > >> >> I'm a little bit surprised that this patch doesn't make any changes to >> allpaths.c or relnode.c. > >> It looks to me like we'll generate paths for >> the new RTEs that are being added. Are we going to end up with >> multiple levels of Append nodes, then? Does the consider the way >> consider_parallel is propagated up and down in set_append_rel_size() >> and set_append_rel_pathlist() really work with multiple levels? Maybe >> this is all fine; I haven't tried to verify it in depth. > > This has been discussed before, but I can not locate the mail > answering these questions. accumulate_append_subpath() called from > add_paths_to_append_rel() takes care of flattening Merge/Append paths. > The planner code deals with the multi-level inheritance hierarchy > created for subqueries with set operations. There is code in relnode.c > to build the RelOptInfos for such subqueries recursively through using > RangeTblEntry::inh flag. So there are no changes in allpaths.c and > relnode.c. Are you looking for some other changes? > >> >> Overall I think this is a reasonable direction to go but I'm worried >> that there may be bugs lurking -- other code that needs adjusting that >> hasn't been found, really. >> > > Planner code is already aware of such hierarchies except DMLs, which > this patch adjusts. We have fixed issues revealed by mine and > Rajkumar's testing. > What kinds of things you suspect? > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Adding AppendRelInfos to root->append_rel_list as and when they are > created would keep parent AppendRelInfos before those of children. But > that function throws away the AppendRelInfo it created when their are > no real children i.e. in partitioned table's case when has no leaf > partitions. So, we can't do that. Hence, I chose to change the API to > return the list of AppendRelInfos when the given RTE has real > children. So, IIUC, the case you're concerned about is when you have a hierarchy of only partitioned tables, with no plain tables. For example, if B is a partitioned table and a partition of A, and that's all there is, A will recurse to B and B will return NIL. Is it necessary to get rid of the extra AppendRelInfos, or are they harmless like the duplicate RTE and PlanRowMark nodes? /* * If all the children were temp tables or a partitioned parent did not * have any leaf partitions, pretend it'sa non-inheritance situation; we * don't need Append node in that case. The duplicate RTE we added for * the parenttable is harmless, so we don't bother to get rid of it; * ditto for the useless PlanRowMark node. */ if (!need_append) { /* Clear flag before returning */ rte->inh = false; return; } If we do need to get rid of the extra AppendRelInfos, maybe a less invasive solution would be to change the if (!need_append) case to do root->append_rel_list = list_truncate(root->append_rel_list, original_append_rel_length). > The code avoids creating AppendRelInfos for a child which represents > the parent in its role as a simple member of inheritance set. OK, I suggest we rewrite the whole comment like this: "We need an AppendRelInfo if paths will be built for the child RTE. If childrte->inh is true, then we'll always need to generate append paths for it. If childrte->inh is false, we must scan it if it's not a partitioned table; but if it is a partitioned table, then it never has any data of its own and need not be scanned. It does, however, need to be locked, so note the OID for inclusion in the PartitionedChildRelInfo we're going to build." It looks like you also need to update the header comment for AppendRelInfo itself, in nodes/relation.h. + * PlannerInfo for every child is obtained by translating relevant members Insert "The" at the start of the sentence. - subroot->parse = (Query *) - adjust_appendrel_attrs(root, - (Node *) parse, - appinfo); + subroot->parse = (Query *) adjust_appendrel_attrs(parent_root, + (Node *) parent_parse, + 1, &appinfo); I suggest that you don't remove the line break after the cast. + * If the child is further partitioned, remember it as a parent. Since + * partitioned tables do not have any data, we don't need to create + * plan for it. We just need its PlannerInfo set up to be used while + * planning its children. Most of this comment is in the singular, but the first half of the second sentence is plural. Should be "Since a partitioned table does not have any data...". I might replace the last sentence by "We do, however, need to remember the PlannerInfo for use when planning its children." +-- Check UPDATE with *multi-level partitioned* inherited target Asterisks seem like overkill. Since expand_inherited_rtentry() and set_append_rel_size() can now recurse down to as many levels as there are levels in the inheritance hierarchy, they should probably have a check_stack_depth() check. >> Overall I think this is a reasonable direction to go but I'm worried >> that there may be bugs lurking -- other code that needs adjusting that >> hasn't been found, really. >> > Planner code is already aware of such hierarchies except DMLs, which > this patch adjusts. We have fixed issues revealed by mine and > Rajkumar's testing. > What kinds of things you suspect? I'm not sure exactly. It's just hard with this kind of patch to make sure you've caught everything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Robert Haas
Дата:
On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Forgot the patch set. Here it is. The commit message for 0005 isn't really accurate given that it follows 0004. I think you could just flatten 0005 and 0006 into one patch. Reviewing those together: - Existing code does partdesc = RelationGetPartitionDesc(relation) but this has got it as part_desc. Seems better to be consistent. Likewise existing variables for PartitionKey are key or partkey, not part_key. - get_relation_partition_info has a useless trailing return. - Instead of adding nparts, boundinfo, and part_oids to RelOptInfo, how about just adding partdesc? Seems cleaner. - pkexprs seems like a potentially confusing name, since PK is widely used to mean "primary key" but here you mean "partition key". Maybe partkeyexprs. - build_simple_rel's matching algorithm is O(n^2). We may have talked about this problem before... - This patch introduces some bits that are not yet used, like nullable_pkexprs, or even the code to set the partition scheme for joinrels. I think perhaps some of that logic should be moved from 0008 to here - e.g. the initial portion of build_joinrel_partition_info. There may be more, but I've run out of energy for tonight. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Adding AppendRelInfos to root->append_rel_list as and when they are >> created would keep parent AppendRelInfos before those of children. But >> that function throws away the AppendRelInfo it created when their are >> no real children i.e. in partitioned table's case when has no leaf >> partitions. So, we can't do that. Hence, I chose to change the API to >> return the list of AppendRelInfos when the given RTE has real >> children. > > So, IIUC, the case you're concerned about is when you have a hierarchy > of only partitioned tables, with no plain tables. For example, if B > is a partitioned table and a partition of A, and that's all there is, > A will recurse to B and B will return NIL. > > Is it necessary to get rid of the extra AppendRelInfos, or are they > harmless like the duplicate RTE and PlanRowMark nodes? > Actually there are two sides to this: If there are no leaf partitions, without the patch two things happen 1. rte->inh is cleared and 2 no appinfo is added to the root->append_rel_list, even though harmless RTE and PlanRowMark nodes are created. The first avoids treating the relation as the inheritance parent and thus avoids creating any child relations and paths, saving a lot of work. Ultimately set_rel_size() marks such a relation as dummy352 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)353 {354 /*355 * A partitioned table without leaf partitions is marked356 * as a dummy rel.357 */358 set_dummy_rel_pathlist(rel);359 } Since root->append_rel_list is traversed for every inheritance parent, not adding needless AppendRelInfos improves performance and saves memory, (FWIW or consider a case where there are thousands of partitioned partitions without any leaf partition.). My initial thought was to keep both these properties intact. But then removing such AppendRelInfos would have a problem when such a table is on the inner side of the join as described in [1]. So I wrote the patch not to do either of those things when there are partitioned partitions without leaf partitions. So, it looks like you are correct, we could just go ahead and add those AppendRelInfos directly to root->append_rel_list. > /* > * If all the children were temp tables or a partitioned parent did not > * have any leaf partitions, pretend it's a non-inheritance situation; we > * don't need Append node in that case. The duplicate RTE we added for > * the parent table is harmless, so we don't bother to get rid of it; > * ditto for the useless PlanRowMark node. > */ > if (!need_append) > { > /* Clear flag before returning */ > rte->inh = false; > return; > } > > If we do need to get rid of the extra AppendRelInfos, maybe a less > invasive solution would be to change the if (!need_append) case to do > root->append_rel_list = list_truncate(root->append_rel_list, > original_append_rel_length). We might require this for non-partitioned tables. I will try to implement it this way in the next set of patches. > >> The code avoids creating AppendRelInfos for a child which represents >> the parent in its role as a simple member of inheritance set. > > OK, I suggest we rewrite the whole comment like this: "We need an > AppendRelInfo if paths will be built for the child RTE. If > childrte->inh is true, then we'll always need to generate append paths > for it. If childrte->inh is false, we must scan it if it's not a > partitioned table; but if it is a partitioned table, then it never has > any data of its own and need not be scanned. It does, however, need > to be locked, so note the OID for inclusion in the > PartitionedChildRelInfo we're going to build." Done. > > It looks like you also need to update the header comment for > AppendRelInfo itself, in nodes/relation.h. Done. Thanks for pointing it out. > > + * PlannerInfo for every child is obtained by translating relevant members > > Insert "The" at the start of the sentence. Done. > > - subroot->parse = (Query *) > - adjust_appendrel_attrs(root, > - (Node *) parse, > - appinfo); > + subroot->parse = (Query *) adjust_appendrel_attrs(parent_root, > + (Node *) > parent_parse, > + 1, &appinfo); > > I suggest that you don't remove the line break after the cast. This is part of 0001 patch, fixed there. > > + * If the child is further partitioned, remember it as a parent. Since > + * partitioned tables do not have any data, we don't need to create > + * plan for it. We just need its PlannerInfo set up to be used while > + * planning its children. > > Most of this comment is in the singular, but the first half of the > second sentence is plural. Should be "Since a partitioned table does > not have any data...". I might replace the last sentence by "We do, > however, need to remember the PlannerInfo for use when planning its > children." Done. > > +-- Check UPDATE with *multi-level partitioned* inherited target > > Asterisks seem like overkill. Done. This style was copied from an existing comment in that file.-- Check UPDATE with *partitioned* inherited target > > Since expand_inherited_rtentry() and set_append_rel_size() can now > recurse down to as many levels as there are levels in the inheritance > hierarchy, they should probably have a check_stack_depth() check. Done. Even without this patch set_append_rel_size() could recurse down many levels of inheritance hierarchy (created by set operation queries) through set_append_rel_size()->set_rel_size()->set_append_rel_size(). And so would set_rel_size(). But now it's more prone to that problem. I will provide updated patches after taking care of your comments about 0005 and 0006. [1] https://www.postgresql.org/message-id/CAFjFpRd5+zroxY7UMGTR2M=rjBV4aBOCxQg3+1rBmTPLK5mpDg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Aug 3, 2017 at 9:38 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> Adding AppendRelInfos to root->append_rel_list as and when they are >>> created would keep parent AppendRelInfos before those of children. But >>> that function throws away the AppendRelInfo it created when their are >>> no real children i.e. in partitioned table's case when has no leaf >>> partitions. So, we can't do that. Hence, I chose to change the API to >>> return the list of AppendRelInfos when the given RTE has real >>> children. >> >> So, IIUC, the case you're concerned about is when you have a hierarchy >> of only partitioned tables, with no plain tables. For example, if B >> is a partitioned table and a partition of A, and that's all there is, >> A will recurse to B and B will return NIL. >> >> Is it necessary to get rid of the extra AppendRelInfos, or are they >> harmless like the duplicate RTE and PlanRowMark nodes? >> > > Actually there are two sides to this: > > If there are no leaf partitions, without the patch two things happen > 1. rte->inh is cleared and 2 no appinfo is added to the > root->append_rel_list, even though harmless RTE and PlanRowMark nodes > are created. The first avoids treating the relation as the inheritance > parent and thus avoids creating any child relations and paths, saving > a lot of work. Ultimately set_rel_size() marks such a relation as > dummy > 352 else if (rte->relkind == RELKIND_PARTITIONED_TABLE) > 353 { > 354 /* > 355 * A partitioned table without leaf > partitions is marked > 356 * as a dummy rel. > 357 */ > 358 set_dummy_rel_pathlist(rel); > 359 } > > Since root->append_rel_list is traversed for every inheritance parent, > not adding needless AppendRelInfos improves performance and saves > memory, (FWIW or consider a case where there are thousands of > partitioned partitions without any leaf partition.). With some testing, I found that this was true once, but not after declarative partition support. Please check [1]. > > My initial thought was to keep both these properties intact. But then > removing such AppendRelInfos would have a problem when such a table is > on the inner side of the join as described in [1]. So I wrote the > patch not to do either of those things when there are partitioned > partitions without leaf partitions. So, it looks like you are correct, > we could just go ahead and add those AppendRelInfos directly to > root->append_rel_list. Irrespective of [1], I have implemented your idea of not changing signature of expand_inherited_rtentry() with following idea. >> >> If we do need to get rid of the extra AppendRelInfos, maybe a less >> invasive solution would be to change the if (!need_append) case to do >> root->append_rel_list = list_truncate(root->append_rel_list, >> original_append_rel_length). [1] https://www.postgresql.org/message-id/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Ashutosh Bapat
Дата:
On Thu, Aug 3, 2017 at 7:01 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> Forgot the patch set. Here it is. > > The commit message for 0005 isn't really accurate given that it > follows 0004. I think you could just flatten 0005 and 0006 into one > patch. Earlier, there was some doubt about the approach for expanding multi-level partitioned table's inheritance hierarchy. So, I had separated all multi-level partition related changes into patches by themselves, collocating them with their respective single level partition peers. I thought that would make the reviews easier while leaving the possibility of committing single-level partition-wise support before multi-level partition-wise join support. From your previous replies, it seems that you are fine with the multi-level partitioned hierarchy expansion, so it may be committed along-with other patches. So, I have squashed those two patches together. Similarly I have squashed pairs 0008-0009 and 0012-0013. Those dealt with similar issues for single-level partitioned and multi-level partitioned tables. > > Reviewing those together: > > - Existing code does partdesc = RelationGetPartitionDesc(relation) but > this has got it as part_desc. Seems better to be consistent. > Likewise existing variables for PartitionKey are key or partkey, not > part_key. Done. > > - get_relation_partition_info has a useless trailing return. > Done. > - Instead of adding nparts, boundinfo, and part_oids to RelOptInfo, > how about just adding partdesc? Seems cleaner. nparts and boundinfo apply to any kind of relation simple, join or upper but part_oids applies only to simple relations. So, I have split those members and added them in respective sections. Do you still think that we should add PartitionDesc as a single member? Similar to your suggestion of changing name of part_key to partkey, should we rename part_scheme as partscheme, part_rels as partrels and part_oids as partoids? > > - pkexprs seems like a potentially confusing name, since PK is widely > used to mean "primary key" but here you mean "partition key". Maybe > partkeyexprs. agreed. Done. PartitionKey structure has member partexprs for partition keys which are expressions. I have used the same name instread of pkexprs. > > - build_simple_rel's matching algorithm is O(n^2). We may have talked > about this problem before... If root->append_rel_list has AppendRelInfos in the same order as the partition bounds, we could reduce this to O(n). That expansion option is being discussed in [1]. Once we commit it, I will change the code to make it O(n). Right now, we can not rely on the order of AppendRelInfos in root->append_rel_list. > > - This patch introduces some bits that are not yet used, like > nullable_pkexprs, We could fix that by adding that member in 0008. IIRC, earlier you had complained about declaring a structure in one patch and adding members to it in the subsequent patches, so I just added all members in the same patch. BTW, I have renamed that member to nullable_partexprs to be consistent with change to pkexpers. > or even the code to set the partition scheme for > joinrels. I think perhaps some of that logic should be moved from > 0008 to here - e.g. the initial portion of > build_joinrel_partition_info. Setting part_scheme for joinrel should really be part of the patch which actually implements partition-wise join. That will keep all the partition-wise join implementation together. 0005 and 0006 really just introduce PartitionScheme for base relation. I think PartitionScheme and other partitioning properties for base relation are useful for something else like partition-wise aggregation on base relation. So, we may want to commit those two patches separately. If you want, we could squash the partition scheme and partition-wise join implementation together. [1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp Updated patches attached. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Thomas Munro
Дата:
On Tue, Aug 8, 2017 at 8:51 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Updated patches attached. Hi, I started reviewing this. It's nicely commented, but it's also very complicated, and it's going to take me several rounds to understand what all the parts do, but here's some assorted feedback after reading some parts of the patches, some tinkering and quite a bit of time spent trying to break it (so far unsuccessfully). On my computer it took ~1.5 seconds to plan a 1000 partition join, ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a 4000 partition join. I poked around in a profiler a bit and saw that for the 2000 partition case I spent almost half the time in create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle, and about half of that was in bms_is_subset. The other half the time was in query_planner->make_one_rel which spent 2/3 of its time in set_rel_size->add_child_rel_equivalences->bms_overlap and the other 1/3 in standard_join_search. One micro-optimisation opportunity I noticed was in those bms_is_subset and bms_overlap calls. The Bitmapsets don't tend to have trailing words but often have hundreds of empty leading words. If I hack bitmapset.{c,h} so that it tracks first_non_empty_wordnum and then adjust bms_is_subset and bms_overlap so that they start their searches at Min(a->first_non_empty_wordnum, b->first_non_empty_wordnum) then the planning time improves measurably: 1000 partitions: ~1.5s -> 1.3s 2000 partitions: ~7.1s -> 5.8s 4000 partitions: ~50s -> ~44s When using list-based partitions, it must be possible to omit the part of a join key that is implied by the partition because the partition has only one list value. For example, if I create a two level hierarchy with one partition per US state and then time-based range partitions under that, the state part of this merge condition is redundant: Merge Cond: ((sales_wy_2017_10.state = purchases_wy_2017_10.state) AND (sales_wy_2017_10.created = purchases_wy_2017_10.created)) 0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch -partition_bounds_equal(PartitionKey key, +partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval, PartitionBoundInfob1, PartitionBoundInfo b2) I wonder is there any value in creating a struct to represent the common part of PartitionKey and PartitionScheme that functions like this and others need? Maybe not. Perhaps you didn't want to make PartitionKey contain a PartitionScheme because then you'd lose the property that every pointer to PartitionScheme in the system must be a pointer to an interned (canonical) PartitionScheme, so it's always safe to compare pointers to test for equality? 0005-Canonical-partition-scheme.patch: +/* + * get_relation_partition_info + * + * Retrieves partitioning information for a given relation. + * + * For a partitioned table it sets partitioning scheme, partition key + * expressions, number of partitions and OIDs of partitions in the given + * RelOptInfo. + */ +static void +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, + Relation relation) Would this be better called "set_relation_partition_info"? It doesn't really "retrieve" the information, it "installs" it. +{ + PartitionDesc part_desc; + + /* No partitioning information for an unpartitioned relation. */ + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE || + !(rel->part_scheme = find_partition_scheme(root, relation))) + return; Here and elsewhere you use the idiom !(foo = bar), which is perfectly good C in my book but I understand the project convention is to avoid implicit pointer->boolean conversion and to prefer expressions like (foo = bar) != NULL and there is certainly a lot more code like that. 0007-Partition-wise-join-implementation.patch + {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD, This GUC should appear in postgresql.conf.sample. I'm chewing on 0007. More soon. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
От
Thomas Munro
Дата:
On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On my computer it took ~1.5 seconds to plan a 1000 partition join, > ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a > 4000 partition join. I poked around in a profiler a bit and saw that > for the 2000 partition case I spent almost half the time in > create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle, > and about half of that was in bms_is_subset. The other half the time > was in query_planner->make_one_rel which spent 2/3 of its time in > set_rel_size->add_child_rel_equivalences->bms_overlap and the other > 1/3 in standard_join_search. Ashutosh asked me how I did that. Please see attached. I was explaining simple joins like SELECT * FROM foofoo JOIN barbar USING (a, b). Here also is the experimental hack I tried when I saw bitmapset.c eating my CPU. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers