Thread: Is it possible and worthy to optimize scanRTEForColumn()?


Is it possible and worthy to optimize scanRTEForColumn()?

From: Rui Hai Jiang
Date:

Hello,

When I run a select query, e.g. "select id from t", all columns in table "t" are checked to see whether a column named "id" exists, and a Var is created for "id" if it does.

Function scanRTEForColumn() does this job.

But I see that in scanRTEForColumn() the loop does not stop when a match is found; it continues to compare all the remaining columns, which wastes a lot of computation.

I guess there may be reasons for this, but I don't know them yet. I just wonder whether it is possible and worthwhile to optimize this. Please note that "select *" does not call this function.

Node *
scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
                 int location, int fuzzy_rte_penalty,
                 FuzzyAttrMatchState *fuzzystate)
{
    Node       *result = NULL;
    int         attnum = 0;
    Var        *var;
    ListCell   *c;

    foreach(c, rte->eref->colnames)
    {
        const char *attcolname = strVal(lfirst(c));

        attnum++;
        if (strcmp(attcolname, colname) == 0)
        {
            if (result)
                ereport(ERROR,
                        (errcode(ERRCODE_AMBIGUOUS_COLUMN),
                         errmsg("column reference \"%s\" is ambiguous",
                                colname),
                         parser_errposition(pstate, location)));
            var = make_var(pstate, rte, attnum, location);
            /* Require read access to the column */
            markVarForSelectPriv(pstate, var, rte);
            result = (Node *) var;
        }

        /* Updating fuzzy match state, if provided. */
        if (fuzzystate != NULL)
            updateFuzzyAttrMatchState(fuzzy_rte_penalty, fuzzystate,
                                      rte, attcolname, colname, attnum);
    }

    /*
     * If we have a unique match, return it.  Note that this allows a user
     * alias to override a system column name (such as OID) without error.
     */
    if (result)
        return result;

    ...
}

Re: Is it possible and worthy to optimize scanRTEForColumn()?

From: Tom Lane
Date:
Rui Hai Jiang <ruihaijiang@msn.com> writes:
> When I run a select query, e.g. select id from t,  all columns in table "t" are checked to see if a column named "id" exists or not, and a Var is created for "id" if the column does exist.

> Function scanRTEForColumn() does this job.

> But I see in scanRTEForColumn(), the loop does not stop when a match is found, it continues to compare all other columns. And this will waste lots of computing.

> I guess there may be some reasons for this. But I don't know yet.

It's necessary because we have to check whether the column name is
ambiguous.  Although in the case of a table, the names are constrained
to all be different, this is not the case for all RTE types ... nor
even for table RTEs, if aliases have been applied.

I'm not particularly concerned about it --- I've not seen profiles
suggesting that that function is a big time sink.  Tables with very
many columns tend to be inefficient for lots of reasons, and I rather
doubt that this particular place is the first thing to hit if you
want to make that better.

            regards, tom lane


Re: Is it possible and worthy to optimize scanRTEForColumn()?

From: Andres Freund
Date:
On 2017-12-08 10:05:17 -0500, Tom Lane wrote:
> I'm not particularly concerned about it --- I've not seen profiles
> suggesting that that function is a big time sink.  Tables with very
> many columns tend to be inefficient for lots of reasons, and I rather
> doubt that this particular place is the first thing to hit if you
> want to make that better.

FWIW, I've definitely seen scanRTEForColumn() show up in profiles.  In
some of those cases the issue could be worked around by mostly using
prepared statements.  But it definitely can be painful, which is not
that surprising given that the complexity is essentially
O(#available-columns * #looked-up-columns).

However, I don't think a microoptimization such as breaking out of the
loop earlier, even if it were correct, would help meaningfully.  We'd
need a different datastructure.

Greetings,

Andres Freund


Re: Is it possible and worthy to optimize scanRTEForColumn()?

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-12-08 10:05:17 -0500, Tom Lane wrote:
>> I'm not particularly concerned about it --- I've not seen profiles
>> suggesting that that function is a big time sink.  Tables with very
>> many columns tend to be inefficient for lots of reasons, and I rather
>> doubt that this particular place is the first thing to hit if you
>> want to make that better.

> FWIW, I've definitely seen scanRTEForColumn() show up in profiles.  In
> some of those cases the issue could be worked around by mostly using
> prepared statements.  But it definitely can be painful, which is not
> that surprising given that the complexity is essentially
> O(#available-columns * #looked-up-columns).

> However, I don't think a microoptimization such as breaking out of the
> loop earlier, even if it were correct, would help meaningfully.  We'd
> need a different datastructure.

Yeah, if someone were holding a gun on me and saying "make that particular
function faster", I'd think about a hash table rather than scanning a
list.  Perhaps a hash table with all the column names exposed by FROM,
not one hash per RTE.  However, if you have a FROM that exposes a lot of
column names, and then the query only looks up a few of them, you might
come out behind by building a hash table :-(

I'm still unconvinced that this is the first place to improve for
wide tables, anyway.

            regards, tom lane


Re: Is it possible and worthy to optimize scanRTEForColumn()?

From: Andres Freund
Date:
Hi,

On 2017-12-08 14:41:14 -0500, Tom Lane wrote:
> Yeah, if someone were holding a gun on me and saying "make that particular
> function faster", I'd think about a hash table rather than scanning a
> list.  Perhaps a hash table with all the column names exposed by FROM,
> not one hash per RTE.

That sounds right.


> However, if you have a FROM that exposes a lot of column names, and
> then the query only looks up a few of them, you might come out behind
> by building a hash table :-(

Hm, I don't think that's that big of a deal - you don't need many
lookups to make a hashtable worthwhile if the alternative is exhaustive
scans through linked lists.  I'd be more concerned about the pretty
common case we're hitting most of the time now, where there's just a
handful of columns selected from about as many available columns; the
additional allocations and such might show up there.


> I'm still unconvinced that this is the first place to improve for
> wide tables, anyway.

I've run a few profiles with wide columns lately, and on the read-mostly
side without prepared statements this is very commonly the biggest entry
by a lot. Over 90% of the time in one of them.

If the queries select a large number of the underlying rows, tupledesc
computations, and their syscache lookups, become the bottleneck.  That's
partially what led me to microoptimize syscache ~two months back.  The
real solution there is going to be to move tupledesc computations to the
planner, but that's a bigger piece of work than I can take on right now.

Greetings,

Andres Freund