Re: multivariate statistics v14

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: multivariate statistics v14
Дата
Msg-id 66a87077-0ad5-8a93-296d-f74ad0df5538@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: multivariate statistics v14  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Список pgsql-hackers
On 03/21/2016 04:34 AM, Alvaro Herrera wrote:
> Another skim on 0002:
>
> reference.sgml is missing a call to &alterStatistic.
>
> ObjectProperty[] contains a comment that the ACL is "same as relation",
> but is that still correct, given that now stats may be related to more
> than one relation?  Do we even know what the rules for ACLs on
> cross-relation stats are?  One very simple way to get around this is to
> dictate that all the rels must have the same owner.  Perhaps we're not
> considering the multi-relation case yet?

As I wrote in response to Robert's message, I don't think we need ACLs 
for statistics - the user should be able to use them when they can 
access all the underlying relations (in a query). For ALTER STATISTICS 
the (owner || superuser) check should be enough, right?

>
> We have this FIXME comment in do_analyze_rel:
>
> +     * FIXME This sample sizing is mostly OK when computing stats for
> +     *       individual columns, but when computing multi-variate stats
> +     *       for multivariate stats (histograms, mcv, ...) it's rather
> +     *       insufficient. For stats on multiple columns / complex stats
> +     *       we need larger sample sizes, because we need to build more
> +     *       detailed stats (more MCV items / histogram buckets) to get
> +     *       good accuracy. Maybe it'd be appropriate to use samples
> +     *       proportional to the table (say, 0.5% - 1%) instead of a
> +     *       fixed size might be more appropriate. Also, this should be
> +     *       bound to the requested statistics size - e.g. number of MCV
> +     *       items or histogram buckets should require several sample
> +     *       rows per item/bucket (so the sample should be k*size).
>
> Maybe this merits more discussion.  Right now we have an upper bound on
> how much to scan for analyze; if we introduce the idea of scanning a
> percentage of the relation, the time to analyze very large relations
> could increase significantly.  Do we have an idea of what to do for
> this?  For instance, a rule that would make me comfortable would say to
> scan a sample 3x the current size when you have a mvstats on 3 columns;
> then the size of fraction to scan is still bounded.  But does that
> actually work?  From the wording of this comment, I assume you don't
> actually know.

Yeah. I think more discussion is needed, because I myself am not sure 
the FIXME is actually correct. For now I think we're OK with using the 
same logic as statistics on a single column (300 * target).

>
> In this block (CreateStatistics)
> +    /* look for duplicities */
> +    for (i = 0; i < numcols; i++)
> +        for (j = 0; j < numcols; j++)
> +            if ((i != j) && (attnums[i] == attnums[j]))
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_UNDEFINED_COLUMN),
> +                         errmsg("duplicate column name in statistics definition")));
>
> isn't it easier to have the inner loop go from i+1 to numcols?

It probably is.

>
> I wonder if this is sensible with multi-relation statistics:
> +    /*
> +     * Store a dependency too, so that statistics are dropped on DROP TABLE
> +     */
> +    parentobject.classId = RelationRelationId;
> +    parentobject.objectId = ObjectIdGetDatum(RelationGetRelid(rel));
> +    parentobject.objectSubId = 0;
> +    childobject.classId = MvStatisticRelationId;
> +    childobject.objectId = statoid;
> +    childobject.objectSubId = 0;
>
> I suppose the idea is to drop the stats if any of the rels they are for
> is dropped.

What do you mean by sensible? I mean, we don't support multiple tables 
at this point (except for choosing a syntax that should allow that), but 
the code assumes a single relation on a few places (like this one).

>
> Right after that you create a dependency on the schema.  Is that
> necessary?  Since you have the dependency on the relation, the stats
> would be dropped by recursion.

Hmmmm, that's probably right. Also, now that I think about it, it 
probably gets broken after ALTER STATISTICS ... SET SCHEMA, because the 
code does not remove the old dependency (and does not create a new one).

>
> Why are you #include'ing builtins.h everywhere?

Stupidity.

>
> RelationGetMVStatList() needs a comment.

OK.

>
> Please get rid of common.h.  It's totally unlike the way we structure
> our header files.  We don't keep headers in src/backend; they're all in
> src/include.  One reason is that the latter gets installed as a whole in
> include/server, which this file will not be.  This file may be necessary
> to build some extensions in the future, for example.

OK, I'll rework that and move it to src/include/.

>
> In mvstats.h, please mark function prototypes as "extern".
>
> Many files need a pgindent pass.

OK.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Performance degradation in commit ac1d794
Следующее
От: Artur Zakirov
Дата:
Сообщение: Re: [PATCH] Phrase search ported to 9.6