Discussion: Erroneous sort used in query plan
I am putting together searches on the catalog info and came up with a select that was rather slow, and I noticed that in the EXPLAIN ANALYZE output there is a sort step on one of the left joins which I don't think belongs there. I found the small error in my query (using tl.oid instead of tr.oid and tres.oid) that caused the query to slow down and generate the sort in the plan, but I am not sure that the given condition should even generate a sort step, and if it does then I believe it should be a (more?) stable decision. Removing one of the erroneous left joins (tr or tres) changes the column that is sorted, yet neither column is related to the join(s) that appear to generate the step. With tl, tr and tres in place the sort is performed on pjoin.oid. Removing or correcting either tr or tres changes the sort to olsort.oid. Removing or correcting both tr and tres removes the sort from the plan. Also, removing all the pg_operator joins leaves the sort in place (on pjoin.oid), but if I remove one of the erroneous joins as well the sort goes.
(Correcting one of the joins leaves the sort there, but removing it removes the sort.)

Using PostgreSQL 8.2.0 on Mac OS X 10.4.8. The full query is:

explain analyze
SELECT o.oid as "OID"
     , n.nspname as "Schema"
     , o.oprname as "Name"
     , r.rolname as "Owner"
     , CASE WHEN o.oprkind='b' THEN 'infix(left and right)'
            WHEN o.oprkind='l' THEN 'prefix (left)'
            WHEN o.oprkind='r' THEN 'postfix (right)' END as "Kind"
     , CASE WHEN o.oprcanhash='t' THEN 'Yes'
            WHEN o.oprcanhash='f' THEN 'No' END as "Supports Hash Joins"
     , tl.typname as "Left Operand"
     , tr.typname as "Right Operand"
     , tres.typname as "Result Type"
     , ocom.oprname as "Commutator Operator"
     , onegate.oprname as "Negator Operator"
     , olsort.oprname as "Left Sort Operator"
     , orsort.oprname as "Right Sort Operator"
     , oltcm.oprname as "Less Than Operator"
     , ogtcm.oprname as "Greater Than Operator"
     , pcode.proname as "Operator Function"
     , prest.proname as "Restriction Selectivity Function"
     , pjoin.proname as "Join Selectivity Function"
FROM pg_catalog.pg_operator o
  left join pg_catalog.pg_namespace n ON n.oid = o.oprnamespace
  left join pg_catalog.pg_roles r on r.oid=o.oprowner
  left join pg_catalog.pg_type tl on tl.oid=o.oprleft
  left join pg_catalog.pg_type tr on tl.oid=o.oprright
  left join pg_catalog.pg_type tres on tl.oid=o.oprresult
  left join pg_catalog.pg_operator ocom on ocom.oid=o.oprcom
  left join pg_catalog.pg_operator onegate on onegate.oid=o.oprnegate
  left join pg_catalog.pg_operator oneg on oneg.oid=o.oprnegate
  left join pg_catalog.pg_operator olsort on olsort.oid=o.oprlsortop
  left join pg_catalog.pg_operator orsort on orsort.oid=o.oprrsortop
  left join pg_catalog.pg_operator oltcm on oltcm.oid=o.oprltcmpop
  left join pg_catalog.pg_operator ogtcm on ogtcm.oid=o.oprgtcmpop
  left join pg_catalog.pg_proc pcode on pcode.oid=o.oprcode
  left join pg_catalog.pg_proc prest on prest.oid=o.oprrest
  left join pg_catalog.pg_proc pjoin on pjoin.oid=o.oprjoin
WHERE n.nspname like 'public'

I have attached a copy of the query and plan.
--
Shane Ambler
pgSQL@007Marketing.com

Get Sheeky @ http://Sheeky.Biz


explain analyze
SELECT o.oid as "OID"
     , n.nspname as "Schema"
     , o.oprname as "Name"
     , r.rolname as "Owner"
     , CASE WHEN o.oprkind='b' THEN 'infix(left and right)'
            WHEN o.oprkind='l' THEN 'prefix (left)'
            WHEN o.oprkind='r' THEN 'postfix (right)' END as "Kind"
     , CASE WHEN o.oprcanhash='t' THEN 'Yes'
            WHEN o.oprcanhash='f' THEN 'No' END as "Supports Hash Joins"
     , tl.typname as "Left Operand"
     , tr.typname as "Right Operand"
     , tres.typname as "Result Type"
     , ocom.oprname as "Commutator Operator"
     , onegate.oprname as "Negator Operator"
     , olsort.oprname as "Left Sort Operator"
     , orsort.oprname as "Right Sort Operator"
     , oltcm.oprname as "Less Than Operator"
     , ogtcm.oprname as "Greater Than Operator"
     , pcode.proname as "Operator Function"
     , prest.proname as "Restriction Selectivity Function"
     , pjoin.proname as "Join Selectivity Function"
FROM pg_catalog.pg_operator o
  left join pg_catalog.pg_namespace n ON n.oid = o.oprnamespace
  left join pg_catalog.pg_roles r on r.oid=o.oprowner
  left join pg_catalog.pg_type tl on tl.oid=o.oprleft
  left join pg_catalog.pg_type tr on tl.oid=o.oprright
  left join pg_catalog.pg_type tres on tl.oid=o.oprresult
  left join pg_catalog.pg_operator ocom on ocom.oid=o.oprcom
  left join pg_catalog.pg_operator onegate on onegate.oid=o.oprnegate
  left join pg_catalog.pg_operator oneg on oneg.oid=o.oprnegate
  left join pg_catalog.pg_operator olsort on olsort.oid=o.oprlsortop
  left join pg_catalog.pg_operator orsort on orsort.oid=o.oprrsortop
  left join pg_catalog.pg_operator oltcm on oltcm.oid=o.oprltcmpop
  left join pg_catalog.pg_operator ogtcm on ogtcm.oid=o.oprgtcmpop
  left join pg_catalog.pg_proc pcode on pcode.oid=o.oprcode
  left join pg_catalog.pg_proc prest on prest.oid=o.oprrest
  left join pg_catalog.pg_proc pjoin on pjoin.oid=o.oprjoin
WHERE n.nspname like 'public'
ORDER BY lower(n.nspname), lower(o.oprname)

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2707.30..2707.66 rows=143 width=966) (actual time=22085.090..23284.736 rows=314064 loops=1)
   Sort Key: lower((n.nspname)::text), lower((o.oprname)::text)
   ->  Hash Left Join  (cost=2652.63..2702.18 rows=143 width=966) (actual time=3668.065..9163.877 rows=314064 loops=1)
         Hash Cond: ((o.oprcode)::oid = pcode.oid)
         ->  Hash Left Join  (cost=2567.13..2600.79 rows=143 width=906) (actual time=3661.366..7305.265 rows=314064 loops=1)
               Hash Cond: ((o.oprrest)::oid = prest.oid)
               ->  Hash Left Join  (cost=2481.63..2504.16 rows=143 width=846) (actual time=3654.311..6704.433 rows=314064 loops=1)
                     Hash Cond: (o.oprlsortop = olsort.oid)
                     ->  Hash Left Join  (cost=2453.91..2474.29 rows=143 width=786) (actual time=3651.827..6250.536 rows=314064 loops=1)
                           Hash Cond: (o.oprrsortop = orsort.oid)
                           ->  Hash Left Join  (cost=2426.18..2444.42 rows=143 width=726) (actual time=3649.287..5795.792 rows=314064 loops=1)
                                 Hash Cond: (o.oprltcmpop = oltcm.oid)
                                 ->  Hash Left Join  (cost=2398.46..2414.55 rows=143 width=666) (actual time=3646.749..5331.642 rows=314064 loops=1)
                                       Hash Cond: (o.oprgtcmpop = ogtcm.oid)
                                       ->  Merge Left Join  (cost=2370.73..2384.68 rows=143 width=606) (actual time=3643.994..4867.855 rows=314064 loops=1)
                                             Merge Cond: ("outer"."?column19?" = pjoin.oid)
                                             ->  Sort  (cost=2158.92..2159.27 rows=143 width=546) (actual time=3634.598..3800.837 rows=314064 loops=1)
                                                   Sort Key: (o.oprjoin)::oid
                                                   ->  Hash Left Join  (cost=49.70..2153.80 rows=143 width=546) (actual time=5.883..2061.807 rows=314064 loops=1)
                                                         Hash Cond: (o.oprnegate = oneg.oid)
                                                         ->  Hash Left Join  (cost=21.98..2123.93 rows=143 width=550) (actual time=4.022..1348.066 rows=314064 loops=1)
                                                               Hash Cond: (o.oprowner = r.oid)
                                                               ->  Nested Loop Left Join  (cost=20.89..2120.69 rows=143 width=490) (actual time=3.878..709.734 rows=314064 loops=1)
                                                                     Join Filter: (tl.oid = o.oprright)
                                                                     ->  Nested Loop Left Join  (cost=10.44..1103.17 rows=143 width=434) (actual time=3.843..73.766 rows=1128 loops=1)
                                                                           Join Filter: (tl.oid = o.oprresult)
                                                                           ->  Nested Loop Left Join  (cost=0.00..85.65 rows=143 width=374) (actual time=2.654..13.504 rows=192 loops=1)
                                                                                 ->  Nested Loop Left Join  (cost=0.00..75.35 rows=143 width=310) (actual time=2.637..10.730 rows=192 loops=1)
                                                                                       ->  Nested Loop Left Join  (cost=0.00..56.37 rows=143 width=250) (actual time=2.625..8.149 rows=192 loops=1)
                                                                                             ->  Nested Loop  (cost=0.00..37.38 rows=143 width=186) (actual time=2.588..4.261 rows=192 loops=1)
                                                                                                   Join Filter: (n.oid = o.oprnamespace)
                                                                                                   ->  Seq Scan on pg_namespace n  (cost=0.00..1.07 rows=1 width=68) (actual time=0.052..0.054 rows=1 loops=1)
                                                                                                         Filter: (nspname ~~ 'public'::text)
                                                                                                   ->  Seq Scan on pg_operator o  (cost=0.00..25.58 rows=858 width=126) (actual time=0.022..3.408 rows=858 loops=1)
                                                                                             ->  Index Scan using pg_operator_oid_index on pg_operator onegate  (cost=0.00..0.12 rows=1 width=68) (actual time=0.014..0.016 rows=1 loops=192)
                                                                                                   Index Cond: (onegate.oid = o.oprnegate)
                                                                                       ->  Index Scan using pg_operator_oid_index on pg_operator ocom  (cost=0.00..0.12 rows=1 width=68) (actual time=0.008..0.010 rows=1 loops=192)
                                                                                             Index Cond: (ocom.oid = o.oprcom)
                                                                                 ->  Index Scan using pg_type_oid_index on pg_type tl  (cost=0.00..0.06 rows=1 width=68) (actual time=0.007..0.011 rows=1 loops=192)
                                                                                       Index Cond: (tl.oid = o.oprleft)
                                                                           ->  Materialize  (cost=10.44..13.57 rows=313 width=64) (actual time=0.001..0.125 rows=313 loops=192)
                                                                                 ->  Seq Scan on pg_type tres  (cost=0.00..10.13 rows=313 width=64) (actual time=0.013..0.328 rows=313 loops=1)
                                                                     ->  Materialize  (cost=10.44..13.57 rows=313 width=64) (actual time=0.000..0.133 rows=313 loops=1128)
                                                                           ->  Seq Scan on pg_type tr  (cost=0.00..10.13 rows=313 width=64) (actual time=0.007..0.320 rows=313 loops=1)
                                                               ->  Hash  (cost=1.08..1.08 rows=4 width=68) (actual time=0.065..0.065 rows=5 loops=1)
                                                                     ->  Subquery Scan r  (cost=0.00..1.08 rows=4 width=68) (actual time=0.021..0.037 rows=5 loops=1)
                                                                           ->  Seq Scan on pg_authid  (cost=0.00..1.04 rows=4 width=118) (actual time=0.018..0.027 rows=5 loops=1)
                                                         ->  Hash  (cost=25.58..25.58 rows=858 width=4) (actual time=1.797..1.797 rows=858 loops=1)
                                                               ->  Seq Scan on pg_operator oneg  (cost=0.00..25.58 rows=858 width=4) (actual time=0.012..0.781 rows=858 loops=1)
                                             ->  Sort  (cost=211.81..217.71 rows=2360 width=68) (actual time=9.366..104.265 rows=216851 loops=1)
                                                   Sort Key: pjoin.oid
                                                   ->  Seq Scan on pg_proc pjoin  (cost=0.00..79.60 rows=2360 width=68) (actual time=0.039..2.700 rows=2360 loops=1)
                                       ->  Hash  (cost=25.58..25.58 rows=858 width=68) (actual time=2.589..2.589 rows=858 loops=1)
                                             ->  Seq Scan on pg_operator ogtcm  (cost=0.00..25.58 rows=858 width=68) (actual time=0.040..1.024 rows=858 loops=1)
                                 ->  Hash  (cost=25.58..25.58 rows=858 width=68) (actual time=2.448..2.448 rows=858 loops=1)
                                       ->  Seq Scan on pg_operator oltcm  (cost=0.00..25.58 rows=858 width=68) (actual time=0.016..0.909 rows=858 loops=1)
                           ->  Hash  (cost=25.58..25.58 rows=858 width=68) (actual time=2.471..2.471 rows=858 loops=1)
                                 ->  Seq Scan on pg_operator orsort  (cost=0.00..25.58 rows=858 width=68) (actual time=0.015..0.923 rows=858 loops=1)
                     ->  Hash  (cost=25.58..25.58 rows=858 width=68) (actual time=2.409..2.409 rows=858 loops=1)
                           ->  Seq Scan on pg_operator olsort  (cost=0.00..25.58 rows=858 width=68) (actual time=0.022..0.928 rows=858 loops=1)
               ->  Hash  (cost=79.60..79.60 rows=2360 width=68) (actual time=6.983..6.983 rows=2360 loops=1)
                     ->  Seq Scan on pg_proc prest  (cost=0.00..79.60 rows=2360 width=68) (actual time=0.020..2.793 rows=2360 loops=1)
         ->  Hash  (cost=79.60..79.60 rows=2360 width=68) (actual time=6.571..6.571 rows=2360 loops=1)
               ->  Seq Scan on pg_proc pcode  (cost=0.00..79.60 rows=2360 width=68) (actual time=0.017..2.593 rows=2360 loops=1)
 Total runtime: 24195.085 ms
(65 rows)
Shane Ambler <pgsql@007Marketing.com> writes:
> I am putting together searches on the catalog info and came up with a
> select that was rather slow and I noticed that in the explain analyze
> there is a sort step on one of the left joins which I don't think
> belongs there.

Well, it's certainly necessary in context because it's preparing the data for the merge join immediately above it. The correct question is why is the thing using a merge join here, when a hash join would be cheaper?

I dug through this and found out that the hash join is estimated as cheaper, right up till the last step of cost_hashjoin:

    /*
     * Bias against putting larger relation on inside.  We don't want an
     * absolute prohibition, though, since larger relation might have better
     * bucketsize --- and we can't trust the size estimates unreservedly,
     * anyway.  Instead, inflate the run cost by the square root of the size
     * ratio.  (Why square root?  No real good reason, but it seems
     * reasonable...)
     *
     * Note: before 7.4 we implemented this by inflating startup cost; but if
     * there's a disable_cost component in the input paths' startup cost, that
     * unfairly penalizes the hash.  Probably it'd be better to keep track of
     * disable penalty separately from cost.
     */
    if (innerbytes > outerbytes && outerbytes > 0)
        run_cost *= sqrt(innerbytes / outerbytes);

In this example, the data volume from the join of everything else is estimated as less than what needs to be fetched from pg_proc, and so this bias kicks in, and the cost estimate roughly doubles. Unfortunately, because it's a LEFT JOIN, we'll never consider hashjoin in the other direction and so the hash loses out to the mergejoin.

It seems clear to me that we ought not impose a bias unless the join type is such that both directions of hashing are feasible. I wonder also if the bias is too large ... but there's not really evidence for or against that in this example. The point is that this code implicitly assumes both directions will be tried, and they won't.
regards, tom lane
Tom Lane wrote:
> Shane Ambler <pgsql@007Marketing.com> writes:
>> I am putting together searches on the catalog info and came up with a
>> select that was rather slow and I noticed that in the explain analyze
>> there is a sort step on one of the left joins which I don't think
>> belongs there.
>
> Well, it's certainly necessary in context because it's preparing the
> data for the merge join immediately above it. The correct question
> is why is the thing using a merge join here, when a hash join would be
> cheaper?
>
> I dug through this and found out that the hash join is estimated as
> cheaper, right up till the last step of cost_hashjoin:
>
>     /*
>      * Bias against putting larger relation on inside.  We don't want an
>      * absolute prohibition, though, since larger relation might have better
>      * bucketsize --- and we can't trust the size estimates unreservedly,
>      * anyway.  Instead, inflate the run cost by the square root of the size
>      * ratio.  (Why square root?  No real good reason, but it seems
>      * reasonable...)
>      *
>      * Note: before 7.4 we implemented this by inflating startup cost; but if
>      * there's a disable_cost component in the input paths' startup cost, that
>      * unfairly penalizes the hash.  Probably it'd be better to keep track of
>      * disable penalty separately from cost.
>      */
>     if (innerbytes > outerbytes && outerbytes > 0)
>         run_cost *= sqrt(innerbytes / outerbytes);
>
> In this example, the data volume from the join of everything else is
> estimated as less than what needs to be fetched from pg_proc, and so
> this bias kicks in, and the cost estimate roughly doubles.
> Unfortunately, because it's a LEFT JOIN, we'll never consider hashjoin
> in the other direction and so the hash loses out to the mergejoin.
>
> It seems clear to me that we ought not impose a bias unless the join
> type is such that both directions of hashing are feasible. I wonder
> also if the bias is too large ... but there's not really evidence for
> or against that in this example.
> The point is that this code implicitly assumes both directions will be
> tried, and they won't.

I think that the selected sort (or at least the merge join) is incorrect. The column sorted (or both, actually) links the current record in pg_operator with the oid in pg_proc, so it will only return one row. If one of the pg_type joins is changed, it then sorts on the oid of pg_operator as the foreign table; again this will only return one row. I would think that the foreign oid would indicate to the planner that it will only find one foreign row to link with.

I can see that the error I made created a funny (probably useless, actually) link that would throw things out, but I would expect it to create bad planning for the two joins that are in error, not for an unrelated one-to-one join. If a sort/merge join was created from this and used to satisfy the erroneous join I would accept that as part of what I unintentionally requested, but instead it generates a sort/merge join on a join that links one current record to one foreign record and has nothing in common with the joins in error.

--
Shane Ambler
pgSQL@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
Shane Ambler <pgsql@007Marketing.com> writes:
> Tom Lane wrote:
>> It seems clear to me that we ought not impose a bias unless the join
>> type is such that both directions of hashing are feasible.

> I think that the selected sort (or at least the merge join) is incorrect
> - the column sorted (or both actually) is linking the current record in
> pg_operator with the oid in the pg_proc - it will only return one row.

That hasn't got anything to do with it AFAICS. The plan is not wrong, it's just inefficient.

I dug around in old files and found out that the notion of discriminating against hash joins with the larger rel on the inside is ancient: in Postgres v4r2, the oldest tarball I have, cost_hashjoin contains

    int outerpages = page_size(outersize, outerwidth);
    int innerpages = page_size(innersize, innerwidth);

    if (outerpages < innerpages)
        return _disable_cost_;

and while the code's been rewritten several times, the lineage is clear. It seems to me now that I've thought about it that we should just get rid of this bias entirely: it is demonstrably wrong if the join is not an inner join, or if the larger relation has significantly better dispersion across hash buckets. To the extent that it's accomplishing anything good at all, the behavior ought to be emergent from the rest of the cost model.

I experimented a bit with the idea and found out that without the bias, it was willing to flip over to larger-relation-on-the-inside in cases where that actually made it a bit slower. I interpret this as meaning that we're undercharging for loading the hash table in comparison to using it. The cost model as it stands (but minus bias) effectively says that for two relations that both fit into work_mem and have equally well-dispersed hash keys, it's actually better to have the larger rel on the inside!
The number of hash-function calculations is the same either way (one for each tuple in the two rels), and the executor tries to hold the number of tuples per hash bucket constant at about 10 regardless of relation size, so when you trace through the calculations you find the only differential between the total-cost estimates for the two mirror-image hash joins is the number of outer tuples for which hashtable probes must be made; hence the fewer of them the better. Experimentation shows this isn't so, however, which suggests that the cost of inserting a tuple into the hashtable is a non-ignorable cost. I got results that seemed to track reality better after I added cpu_tuple_cost per hashtable entry and discounted the per-outer-tuple cost a bit to compensate.

I also noticed that the cost estimate for having to do I/O in a batched hashjoin had missed getting updated for the 8.2 seq_page_cost changes; this is a plain ol' oversight on my part.

In short, then, I'm proposing the attached patch for HEAD and 8.2; I'm not sure whether to change things further back. Comments?

			regards, tom lane

Index: costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.172
diff -c -r1.172 costsize.c
*** costsize.c	5 Jan 2007 22:19:31 -0000	1.172
--- costsize.c	8 Jan 2007 01:00:56 -0000
***************
*** 1498,1507 ****
  	double		hashjointuples;
  	double		outer_path_rows = PATH_ROWS(outer_path);
  	double		inner_path_rows = PATH_ROWS(inner_path);
- 	double		outerbytes = relation_byte_size(outer_path_rows,
- 												outer_path->parent->width);
- 	double		innerbytes = relation_byte_size(inner_path_rows,
- 												inner_path->parent->width);
  	int			num_hashclauses = list_length(hashclauses);
  	int			numbuckets;
  	int			numbatches;
--- 1498,1503 ----
***************
*** 1538,1550 ****
  	/*
  	 * Cost of computing hash function: must do it once per input tuple. We
! 	 * charge one cpu_operator_cost for each column's hash function.
  	 *
  	 * XXX when a hashclause is more complex than a single operator, we really
  	 * should charge the extra eval costs of the left or right side, as
  	 * appropriate, here.  This seems more work than it's worth at the moment.
  	 */
! 	startup_cost += cpu_operator_cost * num_hashclauses * inner_path_rows;
  	run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;
  
  	/* Get hash table size that executor would use for inner relation */
--- 1534,1549 ----
  	/*
  	 * Cost of computing hash function: must do it once per input tuple. We
! 	 * charge one cpu_operator_cost for each column's hash function.  Also,
! 	 * tack on one cpu_tuple_cost per inner row, to model the costs of
! 	 * inserting the row into the hashtable.
  	 *
  	 * XXX when a hashclause is more complex than a single operator, we really
  	 * should charge the extra eval costs of the left or right side, as
  	 * appropriate, here.  This seems more work than it's worth at the moment.
  	 */
! 	startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
! 		* inner_path_rows;
  	run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;
  
  	/* Get hash table size that executor would use for inner relation */
***************
*** 1624,1631 ****
  	/*
  	 * If inner relation is too big then we will need to "batch" the join,
  	 * which implies writing and reading most of the tuples to disk an extra
! 	 * time.  Charge one cost unit per page of I/O (correct since it should be
! 	 * nice and sequential...).  Writing the inner rel counts as startup cost,
  	 * all the rest as run cost.
  	 */
  	if (numbatches > 1)
--- 1623,1630 ----
  	/*
  	 * If inner relation is too big then we will need to "batch" the join,
  	 * which implies writing and reading most of the tuples to disk an extra
! 	 * time.  Charge seq_page_cost per page, since the I/O should be nice and
! 	 * sequential.  Writing the inner rel counts as startup cost,
  	 * all the rest as run cost.
  	 */
  	if (numbatches > 1)
***************
*** 1635,1642 ****
  		double		innerpages = page_size(inner_path_rows,
  										   inner_path->parent->width);
  
! 		startup_cost += innerpages;
! 		run_cost += innerpages + 2 * outerpages;
  	}
  
  	/* CPU costs */
--- 1634,1641 ----
  		double		innerpages = page_size(inner_path_rows,
  										   inner_path->parent->width);
  
! 		startup_cost += seq_page_cost * innerpages;
! 		run_cost += seq_page_cost * (innerpages + 2 * outerpages);
  	}
  
  	/* CPU costs */
***************
*** 1654,1667 ****
  	 * The number of tuple comparisons needed is the number of outer tuples
  	 * times the typical number of tuples in a hash bucket, which is the inner
  	 * relation size times its bucketsize fraction.  At each one, we need to
! 	 * evaluate the hashjoin quals.  (Note: charging the full qual eval cost
! 	 * at each tuple is pessimistic, since we don't evaluate the quals unless
! 	 * the hash values match exactly.)
  	 */
  	startup_cost += hash_qual_cost.startup;
  	run_cost += hash_qual_cost.per_tuple *
  		outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
! 		joininfactor;
  
  	/*
  	 * For each tuple that gets through the hashjoin proper, we charge
--- 1653,1667 ----
  	 * The number of tuple comparisons needed is the number of outer tuples
  	 * times the typical number of tuples in a hash bucket, which is the inner
  	 * relation size times its bucketsize fraction.  At each one, we need to
! 	 * evaluate the hashjoin quals.  But actually, charging the full qual eval
! 	 * cost at each tuple is pessimistic, since we don't evaluate the quals
! 	 * unless the hash values match exactly.  For lack of a better idea, halve
! 	 * the cost estimate to allow for that.
  	 */
  	startup_cost += hash_qual_cost.startup;
  	run_cost += hash_qual_cost.per_tuple *
  		outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
! 		joininfactor * 0.5;
  
  	/*
  	 * For each tuple that gets through the hashjoin proper, we charge
***************
*** 1673,1694 ****
  	cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
  	run_cost += cpu_per_tuple * hashjointuples * joininfactor;
  
- 	/*
- 	 * Bias against putting larger relation on inside.  We don't want an
- 	 * absolute prohibition, though, since larger relation might have better
- 	 * bucketsize --- and we can't trust the size estimates unreservedly,
- 	 * anyway.  Instead, inflate the run cost by the square root of the size
- 	 * ratio.  (Why square root?  No real good reason, but it seems
- 	 * reasonable...)
- 	 *
- 	 * Note: before 7.4 we implemented this by inflating startup cost; but if
- 	 * there's a disable_cost component in the input paths' startup cost, that
- 	 * unfairly penalizes the hash.  Probably it'd be better to keep track of
- 	 * disable penalty separately from cost.
- 	 */
- 	if (innerbytes > outerbytes && outerbytes > 0)
- 		run_cost *= sqrt(innerbytes / outerbytes);
- 
  	path->jpath.path.startup_cost = startup_cost;
  	path->jpath.path.total_cost = startup_cost + run_cost;
  }
--- 1673,1678 ----