Re: planner support functions: handle GROUP BY estimates ?
From: Justin Pryzby
Subject: Re: planner support functions: handle GROUP BY estimates ?
Date:
Msg-id: 20191223001648.GG30414@telsasoft.com
In reply to: planner support functions: handle GROUP BY estimates ? (Justin Pryzby <pryzby@telsasoft.com>)
Responses: Re: planner support functions: handle GROUP BY estimates ?
List: pgsql-hackers
On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
>
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by".  That currently has to be done
> manually, by creating an expression index, and only if the function is
> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).

I didn't hear back, so I tried implementing this for date_trunc().  Currently,
the planner assumes that a function outputs as many distinct groups as its
input variable has.  Most invocations of our reports use date_trunc (or
similar), so my earlier attempt to alert on rowcount misestimates was very
short-lived.

I currently assume that the input data has 1-second granularity:

|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a;ANALYZE t;
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group  (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate  (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group  (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps actually have (say) hourly granularity, the rowcount
will be *underestimated* by 3600x, which is worse than the behavior in master
of overestimating by (for "day") 24x.  I'm trying to think of ways to address
that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes if it would make
the rowcount less than (say) 2 or 100.  That would still divide by 24 but
might then avoid the last /60 or /60/60.  Ultimately, that's more "fudge" than
anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc., each of which
bakes in the given corrective factor; that should allow a more accurate
rowcount for input data whose granularity is the given number of seconds.
Ideally the factor would be user-specified, but I don't think that's possible
to specify in SQL; the constant has to be built into the C function.  At
Telsasoft, our data mostly has 15-minute granularity (900 sec), so we'd maybe
make a "date_trunc" function in the user schema which calls
pg_catalog.date_trunc with support function timestamp_support_10pow3.  There
could be a "base" support function that accepts a multiplier argument, and
then any user-provided C extension would be a one-liner specifying an
arbitrary value (a rough sketch of that idea is at the end of this mail);

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from.

Justin
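To make option 2 a bit more concrete, here is a minimal sketch of what one of
those "template" support functions could look like.  Caveat: there is no
GROUP BY support request in nodes/supportnodes.h today, so
SupportRequestGroupBy and its fields (input_ndistinct, unit_seconds, ngroups)
are hypothetical stand-ins for whatever new request type the patch would add,
and the function name and the 900-second granularity are likewise only
illustrative.

    /*
     * Sketch only: a "template" planner support function with a built-in
     * corrective factor, per option 2 above.  SupportRequestGroupBy and its
     * fields are hypothetical; they do not exist in nodes/supportnodes.h.
     */
    #include "postgres.h"

    #include "fmgr.h"
    #include "nodes/supportnodes.h"     /* existing support request types */
    #include "optimizer/optimizer.h"    /* clamp_row_est() */

    PG_MODULE_MAGIC;

    /* Assumed granularity of the input data, in seconds (15 minutes here). */
    #define INPUT_GRANULARITY   900.0

    PG_FUNCTION_INFO_V1(timestamp_support_900);

    Datum
    timestamp_support_900(PG_FUNCTION_ARGS)
    {
        Node       *rawreq = (Node *) PG_GETARG_POINTER(0);
        Node       *ret = NULL;

        /* Hypothetical request type for GROUP BY / ndistinct estimation. */
        if (IsA(rawreq, SupportRequestGroupBy))
        {
            SupportRequestGroupBy *req = (SupportRequestGroupBy *) rawreq;

            /*
             * Assume the planner passes in the number of distinct input
             * values (input_ndistinct) and the width of the truncation unit
             * in seconds (unit_seconds: 3600 for date_trunc('hour'), 86400
             * for 'day', ...).  Scale ndistinct by the ratio of
             * granularities, never exceeding the input ndistinct and never
             * reporting fewer than one group.
             */
            req->ngroups = clamp_row_est(Min(req->input_ndistinct,
                                             req->input_ndistinct *
                                             INPUT_GRANULARITY /
                                             req->unit_seconds));
            ret = (Node *) req;
        }

        PG_RETURN_POINTER(ret);
    }

Attaching something like this would then just use the existing SUPPORT clause
of CREATE FUNCTION on a date_trunc wrapper in the user schema, the same way
the v12 support functions are attached.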