Re: PoC: Partial sort

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: PoC: Partial sort
Дата
Msg-id CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PoC: Partial sort  (Peter Geoghegan <pg@heroku.com>)
Ответы Re: PoC: Partial sort  (Peter Geoghegan <pg@heroku.com>)
Список pgsql-hackers
Hi, Peter!

Thank you for review!

On Thu, Mar 24, 2016 at 3:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> Sort Method
>> ----------------
>>
>> Even thought the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2 and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+       if (node->tuplesortstate != NULL)
+       {
+               tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+               node->groupsCount++;
+       }
+       else
+       {
+               /* Support structures for cmpSortSkipCols - already
sorted columns */
+               if (skipCols)
+                       prepareSkipCols(plannode, node);

+               /*
+                * Only pass on remaining columns that are unsorted.
Skip abbreviated
+                * keys usage for partial sort.  We unlikely will have
huge groups
+                * with partial sort.  Therefore usage of abbreviated
keys would be
+                * likely a waste of time.
+                */
                tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.
 
Done.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields.

Done.
 
SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

Fixed.
 
The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

Comment is added.
 
> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+       /*
+        * Adjust bound_Done with number of tuples we've actually sorted.
+        */
+       if (node->bounded)
+       {
+               if (node->finished)
+                       node->bound_Done = node->bound;
+               else
+                       node->bound_Done = Min(node->bound,
node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall).  It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).

Hmm... I'm not completely agree with that. In typical usage partial sort should definitely use quicksort.  However, fallback to other sort methods is very useful.  Decision of partial sort usage is made by planner.  But planner makes mistakes.  For example, our HashAggregate is purely in-memory.  In the case of planner mistake it causes OOM.  I met such situation in production and not once.  This is why I'd like partial sort to have graceful degradation for such cases.

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I've adjusted EXPLAIN ANALYZE to show maximum resources consumption.
 

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       int i;
+
+       if (state->tapeset)
+               LogicalTapeSetClose(state->tapeset);
+
+       for (i = 0; i < state->memtupcount; i++)
+               free_sort_tuple(state, state->memtuples + i);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

I tried to manage this by introducing another memory context which is persistent between partial sort batches.  Other memory contexts are reset.
 
>> I don't like that you've added a Plan node argument to
>> ExecMaterializesOutput() in this function, too.
>
>
> I don't like this too. But I didn't find better solution without significant
> rework of planner.
> However, "Upper planner pathification" by Tom Lane seems to have such
> rework. It's likely sort becomes separate path node there.
> Then ExecMaterializesOutput could read parameters of path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

I've added ExecPlanMaterializesOutput() function.
 
>> Optimizer
>> -------------
>>

>> * cost_sort() needs way way more comments.  Doesn't even mention
>> indexes. Not worth commenting further on until I know what it's
>> *supposed* to do, though.
>
>
> I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.

I'm not sure that in partial sort we should estimate sorting of one group in other way than simple sort does.  I see following reasons:
1) I'd like partial sort to be able to use other sorting methods to provide graceful degradation in the case of planner mistakes as I pointed above.
2) Planner should don't choose partial sort plans in some cases.  That should happen because higher cost of these plans including high cost of partial sort.  Estimation of other sort methods looks like good way for reporting such high costs. 
 
This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

I think this is rather a typo fix.  Because now comment doesn't meet function name.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: [postgresSQL] [bug] Two or more different types of constraints with same name creates ambiguity while drooping.
Следующее
От: Fabien
Дата:
Сообщение: Desirable pgbench features?