Discussion: star schema and the optimizer

star schema and the optimizer

From:
Marc Cousin
Date:
Hi all,

I've been facing an issue with star schemas for a while. PostgreSQL's performance on them is not that good without rewriting
the query (or at least I couldn't find a way to make it do what I want).

Here is a very basic mockup schema I used for the rest of this mail. It's of course too small to see very long runtimes
(I see minutes, not tenths of seconds, with the schema that triggered this work):

create table dim1 (a int, b int);
create table dim2 (a int, b int);
insert into dim1 select i,(random()*1000)::int from generate_series(1,10000) g(i);
insert into dim2 select i,(random()*1000)::int from generate_series(1,10000) g(i);
create index idxdim11 on dim1(b);
create index idxdim12 on dim1(a);
create index idxdim21 on dim2(b);
create index idxdim22 on dim2(a);

create table facts (dim1 bigint, dim2 bigint, data1 text, data2 text, data3 text, data4 text, data5 text, data6 text);
insert into facts select (random()*1000)::int, (random()*1000)::int,i,i,i,i,i from generate_series(1,10000000) g(i);
create index idxfacts on facts(dim1,dim2);
analyze;


Here is the query that I want to make faster:

SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND dim2.b=17;

PostgreSQL creates this plan:

 Nested Loop  (cost=233.98..207630.07 rows=8 width=99) (actual time=131.891..131.891 rows=0 loops=1)
   Join Filter: (facts.dim2 = dim2.a)
   Rows Removed by Join Filter: 239796
   ->  Index Scan using idxdim22 on dim2  (cost=0.29..343.29 rows=9 width=8) (actual time=0.672..4.436 rows=12 loops=1)
         Filter: (b = 17)
         Rows Removed by Filter: 9988
   ->  Materialize  (cost=233.70..206094.28 rows=9000 width=91) (actual time=0.471..6.673 rows=19983 loops=12)
         ->  Nested Loop  (cost=233.70..206049.28 rows=9000 width=91) (actual time=5.633..53.121 rows=19983 loops=1)
                ->  Index Scan using idxdim12 on dim1  (cost=0.29..343.29 rows=9 width=8) (actual time=0.039..4.359 rows=10 loops=1)
                     Filter: (b = 12)
                     Rows Removed by Filter: 9990
                ->  Bitmap Heap Scan on facts  (cost=233.41..22756.32 rows=9990 width=83) (actual time=1.113..4.146 rows=1998 loops=10)
                     Recheck Cond: (dim1 = dim1.a)
                     Heap Blocks: exact=19085
                      ->  Bitmap Index Scan on idxfacts  (cost=0.00..230.92 rows=9990 width=0) (actual time=0.602..0.602 rows=1998 loops=10)
                           Index Cond: (dim1 = dim1.a)
 Planning time: 1.896 ms
 Execution time: 132.588 ms


What I used to do was to set join_collapse_limit to 1 and rewrite the query like this:
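For completeness, a minimal sketch of the session setting referred to above (join_collapse_limit is the standard planner GUC; with a value of 1 the planner keeps the explicit JOIN order exactly as written in the FROM clause, so the cross join below is performed first):

SET join_collapse_limit = 1;   -- do not flatten/reorder explicit JOIN syntax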

EXPLAIN ANALYZE SELECT * FROM dim1 cross join dim2 join facts on (facts.dim1=dim1.a and facts.dim2=dim2.a) where dim1.b=12 AND dim2.b=17;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=13.25..3661.66 rows=8 width=99) (actual time=0.758..0.758 rows=0 loops=1)
   ->  Nested Loop  (cost=8.71..57.82 rows=81 width=16) (actual time=0.065..0.151 rows=120 loops=1)
         ->  Bitmap Heap Scan on dim1  (cost=4.35..28.39 rows=9 width=8) (actual time=0.040..0.057 rows=10 loops=1)
               Recheck Cond: (b = 12)
               Heap Blocks: exact=10
               ->  Bitmap Index Scan on idxdim11  (cost=0.00..4.35 rows=9 width=0) (actual time=0.033..0.033 rows=10
loops=1)
                     Index Cond: (b = 12)
         ->  Materialize  (cost=4.35..28.44 rows=9 width=8) (actual time=0.002..0.006 rows=12 loops=10)
               ->  Bitmap Heap Scan on dim2  (cost=4.35..28.39 rows=9 width=8) (actual time=0.021..0.040 rows=12
loops=1)
                     Recheck Cond: (b = 17)
                     Heap Blocks: exact=11
                      ->  Bitmap Index Scan on idxdim21  (cost=0.00..4.35 rows=9 width=0) (actual time=0.017..0.017 rows=12 loops=1)
                           Index Cond: (b = 17)
   ->  Bitmap Heap Scan on facts  (cost=4.54..44.39 rows=10 width=83) (actual time=0.004..0.004 rows=0 loops=120)
         Recheck Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
         ->  Bitmap Index Scan on idxfacts  (cost=0.00..4.53 rows=10 width=0) (actual time=0.004..0.004 rows=0
loops=120)
               Index Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
 Planning time: 0.520 ms
 Execution time: 0.812 ms


The estimated cost is about 50 times lower, and the actual runtime more than 100 times lower. So this plan seems to be a good candidate. Or at least it keeps my users happy.



That rewriting worked for me, but today, I'm in a context where I cannot rewrite the query... it's generated.


So I took a look at the optimizer's code to try to understand why I got this problem. If I understand correctly, the
optimizer won't do cross joins, except if it has no choice.


A funny side note before continuing: filtering both dimensions on the same value (so that dim1.b = dim2.b is implied) gives the right plan too:

EXPLAIN ANALYZE SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND
dim2.b=12;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=9.43..954.86 rows=1 width=184) (actual time=1.409..1.409 rows=0 loops=1)
   ->  Nested Loop  (cost=8.74..82.11 rows=100 width=32) (actual time=0.155..0.370 rows=70 loops=1)
         ->  Bitmap Heap Scan on dim1  (cost=4.37..40.42 rows=10 width=16) (actual time=0.093..0.144 rows=7 loops=1)
               Recheck Cond: (b = 12)
               Heap Blocks: exact=7
               ->  Bitmap Index Scan on idxdim11  (cost=0.00..4.37 rows=10 width=0) (actual time=0.074..0.074 rows=7
loops=1)
                     Index Cond: (b = 12)
         ->  Materialize  (cost=4.37..40.47 rows=10 width=16) (actual time=0.009..0.020 rows=10 loops=7)
               ->  Bitmap Heap Scan on dim2  (cost=4.37..40.42 rows=10 width=16) (actual time=0.051..0.096 rows=10
loops=1)
                     Recheck Cond: (b = 12)
                     Heap Blocks: exact=10
                      ->  Bitmap Index Scan on idxdim21  (cost=0.00..4.37 rows=10 width=0) (actual time=0.038..0.038 rows=10 loops=1)
                           Index Cond: (b = 12)
   ->  Index Scan using idxfacts on facts  (cost=0.69..8.72 rows=1 width=152) (actual time=0.013..0.013 rows=0
loops=70)
         Index Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
 Planning time: 2.616 ms
 Execution time: 1.625 ms
(17 rows)


It seems logical: the two constant filters put dim1.b and dim2.b into one EquivalenceClass, so the two dimension tables
can be joined directly; the optimizer considers that join and chooses this plan.
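To make that concrete: because both filters equate the b columns to the same constant, the planner treats dim1.b, dim2.b and 12 as one equivalence class, which is enough for it to consider joining the two dimension tables directly. A hedged way to visualize it is the equivalent query with the implied (redundant) condition spelled out:

SELECT * FROM dim1, dim2, facts
WHERE facts.dim1 = dim1.a AND facts.dim2 = dim2.a
  AND dim1.b = 12 AND dim2.b = 12
  AND dim1.b = dim2.b;   -- already implied by the equivalence class; shown only for illustration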




Just commenting out those tests in the planner (joinrels.c) solves my problem, but of course this makes the number of
possible plans much higher. I attached a patch to this mail (as it is very small), not as a patch proposal of course,
just to show what I have tested.

EXPLAIN ANALYZE SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND
dim2.b=17;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=13.32..7678.71 rows=17 width=99) (actual time=0.168..3.485 rows=20 loops=1)
   ->  Nested Loop  (cost=8.79..70.60 rows=171 width=16) (actual time=0.107..0.452 rows=152 loops=1)
         ->  Bitmap Heap Scan on dim2  (cost=4.43..40.05 rows=19 width=8) (actual time=0.064..0.153 rows=19 loops=1)
               Recheck Cond: (b = 17)
               Heap Blocks: exact=18
               ->  Bitmap Index Scan on idxdim21  (cost=0.00..4.43 rows=19 width=0) (actual time=0.042..0.042 rows=19
loops=1)
                     Index Cond: (b = 17)
         ->  Materialize  (cost=4.35..28.44 rows=9 width=8) (actual time=0.002..0.008 rows=8 loops=19)
               ->  Bitmap Heap Scan on dim1  (cost=4.35..28.39 rows=9 width=8) (actual time=0.034..0.062 rows=8
loops=1)
                     Recheck Cond: (b = 12)
                     Heap Blocks: exact=6
                      ->  Bitmap Index Scan on idxdim11  (cost=0.00..4.35 rows=9 width=0) (actual time=0.023..0.023 rows=8 loops=1)
                           Index Cond: (b = 12)
   ->  Bitmap Heap Scan on facts  (cost=4.54..44.39 rows=10 width=83) (actual time=0.016..0.017 rows=0 loops=152)
         Recheck Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
         Heap Blocks: exact=20
         ->  Bitmap Index Scan on idxfacts  (cost=0.00..4.53 rows=10 width=0) (actual time=0.014..0.014 rows=0
loops=152)
               Index Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
 Planning time: 2.434 ms
 Execution time: 3.666 ms


So I get the plan I want, now.

I'm not advocating applying this patch, of course. It would lead to bigger planning times for all other uses of
PostgreSQL. That's the obvious reason why we don't examine all those cross joins.

So my questions are:

* Is this patch "correct", meaning that I could, if I had no other solution, run a patched version for this specific
usage? It passes all regression tests, and seems to return only valid plans.
* Could this be implemented, maybe with a GUC, similar to 'constraint_exclusion'? Some GUC telling "this user or
database is used for data warehousing, examine more plans, it's worth it" (sketched below)? Is there something smarter
that could be done?
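For illustration only, a sketch of how such a knob could be applied per database or per role; "examine_more_plans" is a hypothetical GUC name (it does not exist), but the ALTER ... SET mechanism is the standard way per-database or per-user planner settings are applied today:

ALTER DATABASE warehouse SET examine_more_plans = on;   -- hypothetical GUC, illustrative only
ALTER ROLE reporting SET examine_more_plans = on;       -- applied at session start for that role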

The real-life difference here between the two plans is that the query runtime goes from 4 minutes to under a
second.

Of course, I'm willing to work on this, if what I wrote in the rest of this email isn't completely wrong.

Regards,

Marc

Attachments

Re: star schema and the optimizer

From:
Tom Lane
Date:
Marc Cousin <cousinmarc@gmail.com> writes:
> So I took a look at the optimizer's code to try to understand why I got this problem. If I understand correctly,
> the optimizer won't do cross joins, except if it has no choice.

That's right, and as you say, the planning-speed consequences of doing
otherwise would be disastrous.  However, all you need to help it find the
right plan is some dummy join condition between the dimension tables,
which will allow the join path you want to be considered.  Perhaps you
could do something like

SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND dim2.b=17 and (dim1.a+dim2.a) is not null;

The details of the extra condition aren't too important as long as it
mentions all the dimension tables and (a) is always true but (b) is
not so obviously always true that the planner can reduce it to constant
true.  (Thus, for example, you might think you could do this with zero
runtime cost by writing "dummy(dim1.a,dim2.a)" where dummy is an
inlineable SQL function that just returns constant TRUE ... but that's
too cute, it won't fix your problem.)
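For illustration, a sketch of that "too cute" variant (hypothetical function name): the SQL function is inlineable, so the planner reduces the condition to constant TRUE and it no longer helps.

CREATE FUNCTION dummy(int, int) RETURNS boolean
    AS 'SELECT TRUE' LANGUAGE sql IMMUTABLE;

SELECT * FROM dim1, dim2, facts
WHERE facts.dim1 = dim1.a AND facts.dim2 = dim2.a
  AND dim1.b = 12 AND dim2.b = 17
  AND dummy(dim1.a, dim2.a);   -- inlined to TRUE, so it does not constrain join order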
        regards, tom lane



Re: star schema and the optimizer

From:
Marc Cousin
Date:
On 27/02/2015 15:08, Tom Lane wrote:
> Marc Cousin <cousinmarc@gmail.com> writes:
>> So I took a look at the optimizer's code to try to understand why I got this problem. If I understand correctly,
>> the optimizer won't do cross joins, except if it has no choice.
> 
> That's right, and as you say, the planning-speed consequences of doing
> otherwise would be disastrous.  However, all you need to help it find the
> right plan is some dummy join condition between the dimension tables,
> which will allow the join path you want to be considered.  Perhaps you
> could do something like
> 
> SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND dim2.b=17 and (dim1.a+dim2.a) is not null;

No I can't. I cannot rewrite the query at all, in my context.


What do you mean by disastrous?

I've given it a few tries here, and with 8 joins (same model, 7
dimensions), planning time is around 100ms. At least in my context, it's
well worth the planning time, to save minutes of execution.

I perfectly understand that it's not something that should be on by
default; that would be crazy. But in a data warehouse, it seems to me
that accepting one, or even a few, seconds of planning time to save
minutes of execution is perfectly legitimate.



Re: star schema and the optimizer

From:
Marc Cousin
Date:
On 27/02/2015 15:27, Marc Cousin wrote:
> On 27/02/2015 15:08, Tom Lane wrote:
>> Marc Cousin <cousinmarc@gmail.com> writes:
>>> So I took a look at the optimizer's code to try to understand why I got this problem. If I understand correctly,
>>> the optimizer won't do cross joins, except if it has no choice.
>>
>> That's right, and as you say, the planning-speed consequences of doing
>> otherwise would be disastrous.  However, all you need to help it find the
>> right plan is some dummy join condition between the dimension tables,
>> which will allow the join path you want to be considered.  Perhaps you
>> could do something like
>>
>> SELECT * FROM dim1,dim2,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and dim1.b=12 AND dim2.b=17 and (dim1.a+dim2.a) is not null;
> 
> No I can't. I cannot rewrite the query at all, in my context.
> 
> 
> What do you mean by disastrous?
> 
> I've given it a few tries here, and with 8 joins (same model, 7
> dimensions), planning time is around 100ms. At least in my context, it's
> well worth the planning time, to save minutes of execution.
> 
> I perfectly understand that it's not something that should be on by
> default; that would be crazy. But in a data warehouse, it seems to me
> that accepting one, or even a few, seconds of planning time to save
> minutes of execution is perfectly legitimate.
> 

I have given it a bit more thought. To mitigate this, could we permit
only a few cross joins (with "few" to be defined)? Still optional, of
course, since it still has a significant cost. Allowing cross joins only
for the first 3 levels, and keeping them to left/right-sided joins, I can
plan up to 11 joins on my small test machine in 500ms (instead of 150ms
with the unpatched version), and get a "good" plan, good meaning 100
times faster.

Regards



Re: star schema and the optimizer

From:
Tom Lane
Date:
Marc Cousin <cousinmarc@gmail.com> writes:
> On 27/02/2015 15:08, Tom Lane wrote:
>> That's right, and as you say, the planning-speed consequences of doing
>> otherwise would be disastrous.

> What do you mean by disastrous ?

Let's assume you have ten dimension tables, so ten join conditions (11 tables
altogether).  As-is, the planner considers ten 2-way joins (all between
the fact table and one or another dimension table).  At the level of 3-way
joins, there are 45 possible join relations each of which has just 2
possible sets of input relations.  At the level of 4-way joins, there are
120 join relations to consider each of which can be made in only 3 ways.
And so on.  It's combinatorial but the growth rate is manageable.

On the other hand, if we lobotomize the must-have-a-join-condition filter,
there are 11 choose 2 possible 2-way joins, or 55.  At the level of 3-way
joins, there are 165 possible joins, but each of these can be made in 3
different ways from a 2-way join and a singleton.  At the level of 4-way
joins, there are 330 possible joins, but each of these can be made in
4 ways from a 3-way join and a singleton plus 6 different ways from two
2-way joins.  So at this level we're considering something like 3300
different join paths versus 360 in the existing planner.  It gets rapidly
worse.
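As a quick check of the arithmetic above (counting, for each join relation, the input decompositions as in the text):

    \binom{11}{2} = 55, \qquad \binom{11}{3} = 165, \qquad \binom{11}{4} = 330

    330 \times (4 + 6) = 3300 \qquad \text{vs.} \qquad 120 \times 3 = 360 \text{ in the existing planner}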

The real problem is that in most cases, all those extra paths are utterly
useless.  They would not be less useless just because you're a "data
warehouse" or whatever.  So I'm uninterested in simply lobotomizing that
filter.

What would make more sense is to notice that the fact table has an index
that's amenable to this type of optimization, and then use that to loosen
up the join restrictions, rather than just blindly consider useless joins.

I had actually thought that we'd fixed this type of problem in recent
versions, and that you should be able to get a plan that would look like

    Nestloop
      -> scan dim1
      -> Nestloop
           -> scan dim2
           -> indexscan fact table using dim1.a and dim2.b

which would arise from developing an indexscan on fact that's
parameterized by both other tables, resolving one of those
parameterizations via a join to dim2, and then the other one
via a join to dim1.  I'm not sure offhand why that isn't working
in this example.
        regards, tom lane



Re: star schema and the optimizer

From:
Tom Lane
Date:
I wrote:
> I had actually thought that we'd fixed this type of problem in recent
> versions, and that you should be able to get a plan that would look like

>     Nestloop
>       -> scan dim1
>       -> Nestloop
>            -> scan dim2
>            -> indexscan fact table using dim1.a and dim2.b

> which would arise from developing an indexscan on fact that's
> parameterized by both other tables, resolving one of those
> parameterizations via a join to dim2, and then the other one
> via a join to dim1.  I'm not sure offhand why that isn't working
> in this example.

It looks like the issue is that the computation of param_source_rels
in add_paths_to_joinrel() is overly restrictive: it thinks there is
no reason to generate a parameterized-by-dim2 path for the join
relation {fact, dim1}, or likewise a parameterized-by-dim1 path for
the join relation {fact, dim2}.  So what we need is to understand
when it's appropriate to do that.  Maybe the mere existence of a
multiply-parameterized path among fact's paths is sufficient.
        regards, tom lane



Re: star schema and the optimizer

From:
Tom Lane
Date:
> I wrote:
>> I had actually thought that we'd fixed this type of problem in recent
>> versions, and that you should be able to get a plan that would look like

>> Nestloop
>>   -> scan dim1
>>   -> Nestloop
>>        -> scan dim2
>>        -> indexscan fact table using dim1.a and dim2.b

After closer study, I think this is an oversight in commit
e2fa76d80ba571d4de8992de6386536867250474, which quoth

+It can be useful for the parameter value to be passed down through
+intermediate layers of joins, for example:
+
+    NestLoop
+        -> Seq Scan on A
+        Hash Join
+            Join Condition: B.Y = C.W
+            -> Seq Scan on B
+            -> Index Scan using C_Z_IDX on C
+                Index Condition: C.Z = A.X
+
+If all joins are plain inner joins then this is unnecessary, because
+it's always possible to reorder the joins so that a parameter is used
+immediately below the nestloop node that provides it.  But in the
+presence of outer joins, join reordering may not be possible, and then
+this option can be critical.  Before version 9.2, Postgres used ad-hoc

This reasoning overlooked the fact that if we need parameters from
more than one relation, and there's no way to join those relations
to each other directly, then we have to allow passing the dim1 parameter
down through the join to dim2.

The attached patch seems to fix it (modulo the need for some updates
in the README, and maybe a regression test).  Could you see if this
produces satisfactory plans for you?

            regards, tom lane

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index e6aa21c..ce812b0 100644
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** try_nestloop_path(PlannerInfo *root,
*** 291,299 ****
      if (required_outer &&
          !bms_overlap(required_outer, param_source_rels))
      {
!         /* Waste no memory when we reject a path here */
!         bms_free(required_outer);
!         return;
      }

      /*
--- 291,320 ----
      if (required_outer &&
          !bms_overlap(required_outer, param_source_rels))
      {
!         /*
!          * We override the param_source_rels heuristic to accept nestloop
!          * paths in which the outer rel satisfies some but not all of the
!          * inner path's parameterization.  This is necessary to get good plans
!          * for star-schema scenarios, in which a parameterized path for a
!          * "fact" table may require parameters from multiple "dimension"
!          * tables that will not get joined directly to each other.  We can
!          * handle that by stacking nestloops that have the dimension tables on
!          * the outside; but this breaks the rule the param_source_rels
!          * heuristic is based on, that parameters should not be passed down
!          * across joins unless there's a join-order-constraint-based reason to
!          * do so.  So, we should consider partial satisfaction of
!          * parameterization as another reason to allow such paths.
!          */
!         Relids        outerrelids = outer_path->parent->relids;
!         Relids        innerparams = PATH_REQ_OUTER(inner_path);
!
!         if (!(bms_overlap(innerparams, outerrelids) &&
!               bms_nonempty_difference(innerparams, outerrelids)))
!         {
!             /* Waste no memory when we reject a path here */
!             bms_free(required_outer);
!             return;
!         }
      }

      /*

Re: star schema and the optimizer

From:
Marc Cousin
Date:
On 27/02/2015 19:45, Tom Lane wrote:
>> I wrote:
>>> I had actually thought that we'd fixed this type of problem in recent
>>> versions, and that you should be able to get a plan that would look like
>
>>> Nestloop
>>>    -> scan dim1
>>>    -> Nestloop
>>>         -> scan dim2
>>>         -> indexscan fact table using dim1.a and dim2.b
>
> After closer study, I think this is an oversight in commit
> e2fa76d80ba571d4de8992de6386536867250474, which quoth
>
> +It can be useful for the parameter value to be passed down through
> +intermediate layers of joins, for example:
> +
> +    NestLoop
> +        -> Seq Scan on A
> +        Hash Join
> +            Join Condition: B.Y = C.W
> +            -> Seq Scan on B
> +            -> Index Scan using C_Z_IDX on C
> +                Index Condition: C.Z = A.X
> +
> +If all joins are plain inner joins then this is unnecessary, because
> +it's always possible to reorder the joins so that a parameter is used
> +immediately below the nestloop node that provides it.  But in the
> +presence of outer joins, join reordering may not be possible, and then
> +this option can be critical.  Before version 9.2, Postgres used ad-hoc
>
> This reasoning overlooked the fact that if we need parameters from
> more than one relation, and there's no way to join those relations
> to each other directly, then we have to allow passing the dim1 parameter
> down through the join to dim2.
>
> The attached patch seems to fix it (modulo the need for some updates
> in the README, and maybe a regression test).  Could you see if this
> produces satisfactory plans for you?

From what I see, it's just perfect. I'll give it a more thorough look a 
bit later, but it seems to be exactly what I was waiting for.

Thanks a lot.

Regards



Re: star schema and the optimizer

From:
Marc Cousin
Date:
On 27/02/2015 20:01, Marc Cousin wrote:
> On 27/02/2015 19:45, Tom Lane wrote:
>>> I wrote:
>>>> I had actually thought that we'd fixed this type of problem in recent
>>>> versions, and that you should be able to get a plan that would look
>>>> like
>>
>>>> Nestloop
>>>>    -> scan dim1
>>>>    -> Nestloop
>>>>         -> scan dim2
>>>>         -> indexscan fact table using dim1.a and dim2.b
>>
>> After closer study, I think this is an oversight in commit
>> e2fa76d80ba571d4de8992de6386536867250474, which quoth
>>
>> +It can be useful for the parameter value to be passed down through
>> +intermediate layers of joins, for example:
>> +
>> +    NestLoop
>> +        -> Seq Scan on A
>> +        Hash Join
>> +            Join Condition: B.Y = C.W
>> +            -> Seq Scan on B
>> +            -> Index Scan using C_Z_IDX on C
>> +                Index Condition: C.Z = A.X
>> +
>> +If all joins are plain inner joins then this is unnecessary, because
>> +it's always possible to reorder the joins so that a parameter is used
>> +immediately below the nestloop node that provides it.  But in the
>> +presence of outer joins, join reordering may not be possible, and then
>> +this option can be critical.  Before version 9.2, Postgres used ad-hoc
>>
>> This reasoning overlooked the fact that if we need parameters from
>> more than one relation, and there's no way to join those relations
>> to each other directly, then we have to allow passing the dim1 parameter
>> down through the join to dim2.
>>
>> The attached patch seems to fix it (modulo the need for some updates
>> in the README, and maybe a regression test).  Could you see if this
>> produces satisfactory plans for you?
> 
> 
> From what I see, it's just perfect. I'll give it a more thorough look a
> bit later, but it seems to be exactly what I was waiting for.
> 
> Thanks a lot.
> 
> Regards

I gave it another look this morning. It works very well with the initial test schema. The situation is much improved
for me.

I still have one issue: I've extended the test to more than 2 dimensions.

marc=# explain analyze SELECT * FROM dim1,dim2,dim3,dim4,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and
facts.dim3=dim3.a and facts.dim4=dim4.a and dim1.b between 10 and 60 AND dim2.b between 30 and 50 and dim3.b=8526 and
dim4.b between 70 and 90;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=15.00..957710.99 rows=1 width=200) (actual time=793.247..793.247 rows=0 loops=1)
   ->  Nested Loop  (cost=14.71..957704.75 rows=1 width=192) (actual time=793.245..793.245 rows=0 loops=1)
         Join Filter: (facts.dim3 = dim3.a)
         Rows Removed by Join Filter: 942
         ->  Index Scan using idxdim32 on dim3  (cost=0.29..3300.29 rows=10 width=8) (actual time=49.498..67.923 rows=1 loops=1)
               Filter: (b = 8526)
               Rows Removed by Filter: 99999
         ->  Materialize  (cost=14.41..954262.56 rows=962 width=184) (actual time=0.552..724.988 rows=942 loops=1)
               ->  Nested Loop  (cost=14.41..954257.75 rows=962 width=184) (actual time=0.542..723.380 rows=942 loops=1)
                     ->  Index Scan using idxdim22 on dim2  (cost=0.29..3648.29 rows=192 width=16) (actual time=0.074..47.445 rows=229 loops=1)
                           Filter: ((b >= 30) AND (b <= 50))
                           Rows Removed by Filter: 99771
                     ->  Nested Loop  (cost=14.12..4946.08 rows=501 width=168) (actual time=2.575..2.946 rows=4 loops=229)
                           ->  Bitmap Heap Scan on dim1  (cost=13.43..573.60 rows=501 width=16) (actual time=0.170..0.647 rows=509 loops=229)
                                 Recheck Cond: ((b >= 10) AND (b <= 60))
                                 Heap Blocks: exact=78547
                                 ->  Bitmap Index Scan on idxdim11  (cost=0.00..13.30 rows=501 width=0) (actual time=0.112..0.112 rows=509 loops=229)
                                       Index Cond: ((b >= 10) AND (b <= 60))
                           ->  Index Scan using idxfacts on facts  (cost=0.69..8.72 rows=1 width=152) (actual time=0.004..0.004 rows=0 loops=116561)
                                 Index Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
   ->  Index Scan using idxdim42 on dim4  (cost=0.29..6.24 rows=1 width=8) (never executed)
         Index Cond: (a = facts.dim4)
         Filter: ((b >= 70) AND (b <= 90))
 Planning time: 8.092 ms
 Execution time: 793.621 ms
(25 rows)

marc=# \d facts
      Table "public.facts"
 Column |  Type  | Modifiers
--------+--------+-----------
 dim1   | bigint |
 dim2   | bigint |
 dim3   | bigint |
 dim4   | bigint |
 dim5   | bigint |
 dim6   | bigint |
 dim7   | bigint |
 dim8   | bigint |
 dim9   | bigint |
 dim10  | bigint |
 data1  | text   |
 data2  | text   |
 data3  | text   |
 data4  | text   |
 data5  | text   |
 data6  | text   |
Indexes:
    "idxfacts" btree (dim1, dim2, dim3, dim4, dim5, dim6, dim7, dim8, dim9, dim10)

I hoped that the dim3=dim3.a condition could be used in the index scan on facts (I chose the 8526 value because it
only has one matching row). Maybe that's not possible, or not efficient, but I thought I'd rather ask.


Regards.



Re: star schema and the optimizer

From:
Tom Lane
Date:
Marc Cousin <cousinmarc@gmail.com> writes:
> I gave it another look this morning. It works very well with the initial test schema. The situation is much improved
> for me.

> I still have one issue: I've extended the test to more than 2 dimensions.

I tried your original test script with 3 dimension tables, and it gave me
a three-deep nestloop plan once I made the fact table big enough.  I think
your test case has simply not got statistics that encourage doing it that
way.  Notice that the planner thinks (correctly) that it's already down
to fetching only one row from the facts table with dim1 and dim2 as
inputs; so there's no cost advantage to stacking up more nestloops, and
it might as well just materialize the result from that and then join
against dim3 and dim4.  Another factor is that it predicts (less
correctly) that it will get 10 rows from dim3, which would make a straight
nestloop plan about 10x more expensive than what it did here.

You could experiment with turning off enable_material, but I think
the planner may actually be making the right choice here.
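For what it's worth, a minimal sketch of that experiment, session-local and reusing the 4-dimension query from the previous message (enable_material is a standard planner GUC):

SET enable_material = off;
EXPLAIN ANALYZE
SELECT * FROM dim1, dim2, dim3, dim4, facts
WHERE facts.dim1 = dim1.a AND facts.dim2 = dim2.a
  AND facts.dim3 = dim3.a AND facts.dim4 = dim4.a
  AND dim1.b BETWEEN 10 AND 60 AND dim2.b BETWEEN 30 AND 50
  AND dim3.b = 8526 AND dim4.b BETWEEN 70 AND 90;
RESET enable_material;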
        regards, tom lane