Обсуждение: Third thoughts about the DISTINCT MAX() problem

Поиск
Список
Период
Сортировка

Third thoughts about the DISTINCT MAX() problem

От
Tom Lane
Дата:
I just realized that the patch I applied here
http://archives.postgresql.org/pgsql-committers/2008-03/msg00531.php
for Taiki Yamaguchi's bug report here
http://archives.postgresql.org/pgsql-bugs/2008-03/msg00275.php
really doesn't work.  It assumes that an ungrouped aggregate
query can't return more than one row, which is true in straight
SQL ... but it's not true if you consider SRFs in the target list.
CVS HEAD gives the wrong answer for this example in the regression
database:

regression=# select max(unique1), generate_series(1,3) as g from tenk1 order by g desc;max  | g 
------+---9999 | 19999 | 29999 | 3
(3 rows)

because it wrongly supposes it can discard the ORDER BY.

So, back to the drawing board.  I had thought of two approaches to
fixing the problem instead of just dodging it.  Plan A was to
apply planagg.c's Aggref->Param substitution inside EquivalenceClasses,
as in the draft patch here:
http://archives.postgresql.org/pgsql-patches/2008-03/msg00388.php
which I didn't entirely like for reasons mentioned in that post.
Plan B was to try to revert to the way sort clause matching was
done pre-8.3, that is have make_sort_from_pathkeys check first
for a matching ressortgroupref tag before it goes looking for equal()
expressions.  I had actually tried to do that first but got hung
up on the problem of identifying the correct sort operator ---
just taking the exposed type of the targetlist entry doesn't always
work, because of binary-compatible cases (eg, tlist entry may say
it yields varchar but we need to find the text opclass).  Perhaps
thinking a bit harder would yield a solution though.

Does anyone have comments for or against either of these approaches,
or perhaps a Plan C to consider?
        regards, tom lane


Re: Third thoughts about the DISTINCT MAX() problem

От
Tom Lane
Дата:
I wrote:
> Plan B was to try to revert to the way sort clause matching was
> done pre-8.3, that is have make_sort_from_pathkeys check first
> for a matching ressortgroupref tag before it goes looking for equal()
> expressions.  I had actually tried to do that first but got hung
> up on the problem of identifying the correct sort operator ---
> just taking the exposed type of the targetlist entry doesn't always
> work, because of binary-compatible cases (eg, tlist entry may say
> it yields varchar but we need to find the text opclass).  Perhaps
> thinking a bit harder would yield a solution though.

Ah, got it.  I had previously attached a sortref to EquivalenceClasses
(ec_sortref) to deal properly with multiple textually-identical but
volatile expressions.  But that's really the wrong thing: sortrefs
should be attached to individual EquivalenceMembers, instead.  If
we do that then the existing logic in make_sort_from_pathkeys can be
made to use the sortref as a preferred (and faster) way of identifying
which tlist member matches a pathkey.

This implies a change in the EquivalenceClass data structures,
but those are never dumped to disk so it's not a problem to back-patch.

Note: we still need to be able to match on equal() since we may have
to deal with sort keys that are not any of the explicit ORDER BY
expressions and hence don't have a sortref.  But that's not a problem
for the MIN/MAX optimization since the only things left to do after
planagg.c are apply explicit ORDER BY or DISTINCT operations.
        regards, tom lane


Re: Third thoughts about the DISTINCT MAX() problem

От
"Gurjeet Singh"
Дата:
On Sat, Mar 29, 2008 at 3:21 AM, Tom Lane <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br
/><divclass="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt
0pt0pt 0.8ex; padding-left: 1ex;"> I just realized that the patch I applied here<br /><a
href="http://archives.postgresql.org/pgsql-committers/2008-03/msg00531.php"
target="_blank">http://archives.postgresql.org/pgsql-committers/2008-03/msg00531.php</a><br/> for Taiki Yamaguchi's bug
reporthere<br /><a href="http://archives.postgresql.org/pgsql-bugs/2008-03/msg00275.php"
target="_blank">http://archives.postgresql.org/pgsql-bugs/2008-03/msg00275.php</a><br/> really doesn't work.  It
assumesthat an ungrouped aggregate<br /> query can't return more than one row, which is true in straight<br /> SQL ...
butit's not true if you consider SRFs in the target list.<br /> CVS HEAD gives the wrong answer for this example in the
regression<br/> database:<br /><br /> regression=# select max(unique1), generate_series(1,3) as g from tenk1 order by g
desc;<br/>  max  | g<br /> ------+---<br />  9999 | 1<br />  9999 | 2<br />  9999 | 3<br /> (3 rows)<br /><br />
becauseit wrongly supposes it can discard the ORDER BY.<br /><br /> So, back to the drawing board.  I had thought of
twoapproaches to<br /> fixing the problem instead of just dodging it.  Plan A was to<br /> apply planagg.c's
Aggref->Paramsubstitution inside EquivalenceClasses,<br /> as in the draft patch here:<br /><a
href="http://archives.postgresql.org/pgsql-patches/2008-03/msg00388.php"
target="_blank">http://archives.postgresql.org/pgsql-patches/2008-03/msg00388.php</a><br/> which I didn't entirely like
forreasons mentioned in that post.<br /> Plan B was to try to revert to the way sort clause matching was<br /> done
pre-8.3,that is have make_sort_from_pathkeys check first<br /> for a matching ressortgroupref tag before it goes
lookingfor equal()<br /> expressions.  I had actually tried to do that first but got hung<br /> up on the problem of
identifyingthe correct sort operator ---<br /> just taking the exposed type of the targetlist entry doesn't always<br
/>work, because of binary-compatible cases (eg, tlist entry may say<br /> it yields varchar but we need to find the
textopclass).  Perhaps<br /> thinking a bit harder would yield a solution though.<br /><br /> Does anyone have comments
foror against either of these approaches,<br /> or perhaps a Plan C to consider?<br /><br />  <br
/></blockquote></div><br/>In the past I had seen suggestions (perhaps from you) that we should disallow SRFs in target
list...Although not for 8.3, but would this be a good time for 8.4 to deprecate the usage of SRFs in targetlist?<br
/><br/>Best regards,<br />-- <br />gurjeet[.singh]@EnterpriseDB.com<br />singh.gurjeet@{ gmail | hotmail | indiatimes |
yahoo}.com<br /><br />EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br /><br
/>Mailsent from my BlackLaptop device  

Re: Third thoughts about the DISTINCT MAX() problem

От
Simon Riggs
Дата:
On Sat, 2008-03-29 at 07:07 +0530, Gurjeet Singh wrote:
> On Sat, Mar 29, 2008 at 3:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>         regression=# select max(unique1), generate_series(1,3) as g
>         from tenk1 order by g desc;
>          max  | g
>         ------+---
>          9999 | 1
>          9999 | 2
>          9999 | 3
>         (3 rows)
>           
> 
> In the past I had seen suggestions (perhaps from you) that we should
> disallow SRFs in target list... Although not for 8.3, but would this
> be a good time for 8.4 to deprecate the usage of SRFs in targetlist?

I think it makes sense to throw an error if we have both ungrouped
aggregates and SRFs in the targetlist. The two concepts seem at odds
with each other anyhow.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com 
 PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk



Re: Third thoughts about the DISTINCT MAX() problem

От
Tom Lane
Дата:
I wrote:
> Plan B was to try to revert to the way sort clause matching was
> done pre-8.3, that is have make_sort_from_pathkeys check first
> for a matching ressortgroupref tag before it goes looking for equal()
> expressions.

So I tried that, and after a whole bunch of regression test failures
I realize it won't work.  ressortgroupref tags are only guaranteed
unique within a given targetlist --- the values applicable in the
final result tlist might mean something else within a plan node further
down in the tree.  Since EquivalenceClasses are global to the whole
plan, trying to match them to tlist entries by ressortgroupref isn't
safe.

This actually calls into question the existing fix for making sure
we pick the right tlist entry when dealing with volatile ORDER BY
expressions.  I think it's all right because sorting by such expressions
could only occur at top tree level --- we'd never consider such an
expression as a mergejoin key, for instance.  But it seems fragile
and something we'll need to keep an eye on.

Anyway, it seems that the only remaining acceptable fix is the one
involving modifying the EquivalenceClasses when planagg.c makes
its substitutions.  Unless someone has another idea.
        regards, tom lane