Re: join ordering via Simulated Annealing

Поиск
Список
Период
Сортировка
От bin wang
Тема Re: join ordering via Simulated Annealing
Дата
Msg-id 323338680912221757p3028180co317b923e71950f98@mail.gmail.com
обсуждение исходный текст
Ответ на join ordering via Simulated Annealing  (Jan Urbański <wulczer@wulczer.org>)
Список pgsql-hackers
I will follow it.<br />Thank you.<br /><br /><div class="gmail_quote">2009/12/23 Jan Urba��ski <span dir="ltr"><<a
href="mailto:wulczer@wulczer.org">wulczer@wulczer.org</a>></span><br/><blockquote class="gmail_quote"
style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi,<br /><br /> I've
beenplaying with using a Simulated Annealing-type algorithm for<br /> determinig join ordering for relations. To get
intocontext see<br /><a href="http://archives.postgresql.org/pgsql-hackers/2009-05/msg00098.php"
target="_blank">http://archives.postgresql.org/pgsql-hackers/2009-05/msg00098.php</a><br/> (there's also a TODO in the
wiki).There's a nice paper on that  in<br /><a
href="http://reference.kfupm.edu.sa/content/h/e/heuristic_and_randomized_optimization_fo_87585.pdf"
target="_blank">http://reference.kfupm.edu.sa/content/h/e/heuristic_and_randomized_optimization_fo_87585.pdf</a><br/>
(alsolinked from that thread) and someone even wrote a patch:<br /><a
href="http://archives.postgresql.org/pgsql-hackers/2009-05/msg00736.php"
target="_blank">http://archives.postgresql.org/pgsql-hackers/2009-05/msg00736.php</a><br/><br /> This generally aims at
beingable to replace GEQO.<br /><br /> I have some rough prototype code, but I'm not even asking people to look<br />
atit. It's stuffed with silly things and writes three Graphviz-style<br /> .dot files in /tmp for each algorithm step.
ButI'd like to at least<br /> present the idea.<br /><br /> There are three main problems that have to be solved to get
agood<br /> Simulated Annealing implementation:<br />  o) choosing the starting point for the algorithm<br />  o)
generatingthe next point<br />  o) computing the cost in the current point<br /><br /> The code I have now creates the
initialplan by doing something similar<br /> to what gimme_tree does in GEQO, but without any kind of heuristic to<br
/>try to favour joins with join restrictions (the idea is that it doesn't<br /> matter, since we only want to get *any*
planand we only do it once),<br /> but ideally it would be somehow chosen randonly from the space of all<br /> possible
joinorderings.<br /><br /> I keep a binary tree of relations in memory, where leaves are<br /> RelOptInfos with 1 relid
andthe root is a RelOptInfo with all relids.<br /> Each iteration of the algorithm picks two nodes at random (with
some<br/> restrictions, but that's not important) and tries to swap them around,<br /> meaning that a tree like (use a
monospacefont for best results):<br /><br />           (1 2 3 4)<br />      *(1 2)        (3 4)<br />      (1) (2)    
*(3)(4)<br /><br /> where the parenthesised things are the two chosen nodes would get<br /> transformed into<br /><br
/>          (1 2 3 4)<br />         (3)      (1 2 4)<br />               (1 2)    (4)<br />              (1) (2)<br
/><br/> that is, the (1 2) node and its subtree gets swapped with the (3) node<br /> and the upper-level nodes get
changedaccordingly. Sometimes a swap like<br /> that will produce an invalid join ordering - then swap is then
reverted.<br/> If the join order given by the tree after the swap is legal, the paths<br /> are recomputed, much like
ingeqo_eval, and if the root node's cheapest<br /> path is not cheaper that before the swap, the swap gets reverted.<br
/>Simulated Annealing algorithms permit uphill moves in terms of cost,<br /> with some probability that's decreasing as
timepasses, but that's not<br /> implemented yet. After a fixed amount of moves, the algorithm stops and<br /> returns
theroot node of the tree as the single RelOptInfo.<br /><br /> The issues with the approach are:<br /><br />  o) the
initialstate is not really a random plan, it's usualy a<br /> left-deep tree (and is very inefficient) and this might
skewresults.<br /><br />  o) is swapping random nodes like that a good way of exploring the<br /> solution space? The
SApaper suggests something much simpler, but some<br /> of the moves proposed there don't really make sense when using
the<br/> make_join_rel machinery:<br />   *) changing the inner and outer relation of a join doesn't make<br /> sense,
becausemake_join_rel is symmetrical<br />   *) changing the join executing strategy (nested loop, merge join,<br />
etc.)doesn't make sense, because make_join_rel considers all possible<br /> paths for a join<br /><br />  o) each swap
needsa full recalcualtion of the whole solution<br /> (geqo_eval does that, for instance), maybe it's possible to leave
the<br/> subtrees of swapped nodes intact and only recalculate the nodes above<br /> the two swapped nodes?<br /><br />
 o)because make_join_rel scribbles on the PlannerInfo, the same hack as<br /> in geqo_eval is necessary: truncating
join_rel_listafter building all<br /> joinrels and nulling out the join_rel_hash<br /><br />  o) I use make_join_rel to
determinewhether a join is legal, which does<br /> lots of other things. That looks like wasted effort, although in the
end<br/> each time I need to build the final rel to assess the resulting path's<br /> cost. To follow the SA algorithm
spiritmore closely it would be<br /> necessary to only consider one, random path for each relation and make<br /> using
differentpaths a move that the algorithm can choose while<br /> exploring the solution space. A cheaper way of
determiningthe current<br /> solution's cost would be nice, too - fully rebuilding the final<br /> RelOptInfo after
eachmove is annoying.<br /><br /> Lastly, I'm lacking good testcases or even a testing approach: I'm<br /> generating
sillyqueries and looking at how they get optimised, but if<br /> someone has a real dataset and examples of joins that
cannotbe planned<br /> with the standard planner, I would be interested to compare the results<br /> my prototype gets
withthose produced by GEQO.<br /><br /> The code, a module that you can LOAD into a backend, is here (if you<br />
reallywant to see it):<br /><a href="http://git.wulczer.org/?p=saio.git"
target="_blank">http://git.wulczer.org/?p=saio.git</a><br/> Set the saio GUC to true and saio_cutoff to the number of
iterationsyou<br /> want.<br /><br /> Cheers,<br /> Jan<br /><font color="#888888"><br /> --<br /> Sent via
pgsql-hackersmailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> To
makechanges to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/></font></blockquote></div><br /> 

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jan Urbański
Дата:
Сообщение: join ordering via Simulated Annealing
Следующее
От: Tom Lane
Дата:
Сообщение: Re: creating index names automatically?