Discussion: Super PathKeys (Allowing sort order through precision loss functions)
Dear Hackers,

I've started working on something I've ended up calling "Super PathKeys". The idea here is to increase the likelihood of a Path with PathKeys being used for a purpose that requires a less strict sort order, due to the ordering being required from the return value of some precision loss function. This is best explained by example, so here's the actual and expected output of some tests that the patch adds:

create table tstbl (ts timestamp, a int);
create index on tstbl (ts, a);
set enable_sort = 0;

explain (costs off) select date_trunc('year', ts), a, count(*)
from tstbl group by 1,2 order by 1,2;
                     QUERY PLAN
-----------------------------------------------------
 GroupAggregate
   Group Key: date_trunc('year'::text, ts), a
   ->  Index Only Scan using tstbl_ts_a_idx on tstbl
(3 rows)

-- Test a more complex case where the superkey can be matched to
-- multiple pathkeys
explain (costs off) select date_trunc('year', ts), date_trunc('month', ts),
a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 GroupAggregate
   Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
   ->  Index Only Scan using tstbl_ts_a_idx on tstbl
(3 rows)

You can see here that the planner managed to make use of the index on (ts, a) to provide sorted input to the GROUP BY on date_trunc('year', ts). This avoids a possibly slow HashAggregate plan, which would be especially bad if there is a large number of groups. It'll also be very useful if only a subset of all groups is required and a LIMIT clause is specified. It should also be useful for reports run on large timeseries data tables, especially partitioned ones when teamed up with the "Ordered Partitioned Table Scans" work in [1].

Benchmarks don't seem so relevant here, as it would be simple enough to craft them to show anything between a 0% improvement and some number that's difficult to pronounce.
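To make the idea concrete outside the planner: date_trunc() with a constant precision argument is monotone (non-decreasing) in its timestamp input, so any row stream already ordered on ts is also ordered on the truncated value. A minimal sketch in plain Python (not PostgreSQL code; trunc_year is a hypothetical stand-in for date_trunc('year', ...)):

```python
from datetime import datetime

def trunc_year(ts):
    # hypothetical stand-in for date_trunc('year', ts)
    return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)

# input already ordered on ts, as an index scan on (ts, a) would deliver it
rows = sorted([
    datetime(2017, 6, 1), datetime(2017, 11, 5), datetime(2018, 1, 2),
    datetime(2018, 3, 14, 9), datetime(2019, 12, 31),
])

truncated = [trunc_year(ts) for ts in rows]
# the ordering survives the precision loss, so no extra Sort is needed
assert truncated == sorted(truncated)
```

This is exactly the property that lets the GroupAggregate above consume the index scan's output directly.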
I think it's better to discuss how I think this is possible.

Implementation:

I've modified pg_proc to add a new column which can optionally be set to the argument number of the input parameter that the return value will be calculated from. In the patch, I've only set this so far for the date_trunc() function and a handful of casting functions.

When building a PathKey, the expression used in the key is passed into a function which tries to extract the superkey expr. If a non-NULL value is returned, then an additional PathKey is generated with that expr, and that's set in a new field named pk_superkey in the original PathKey. If the superkey expr itself contains another superkey expr, then that superkey PathKey itself links to a further superkey, and so on.

pathkeys_contained_in() has been modified so that when a non-matching PathKey is found, it tries again by looking at the pk_superkey (if set) and keeps walking up the chain of PathKeys until it either reaches a PathKey without a pk_superkey or finds a match. The function must also check whether the next key1 matches the same key2. This allows the 2nd example above to work, since both date_trunc() calls match the "ts" key in the index scan. If the 2nd attempt does not match, then we just back up one and move to the next key2 and repeat.

I think this functionality could also be used to allow superkeys to be derived through casts. We'd need to be careful to only tag casting functions where we're certain the sort order of the input and output types match. In the patch, I've just tagged a couple of casting functions between timestamp and date, but I imagine it should also be possible for casts between all the INT types, probably in either direction. More careful thought would be needed for other numerical types and text types, I imagine, though I've not thought much about that.

Syntax:

I've not really gone very far in thinking about exposing this to userland yet.
I've only thought as far as:

CREATE FUNCTION name (ORDER KEY <p1name> <type>) RETURNS ...

with some checks done in CreateFunction() to ensure there's not more than 1 ORDER KEY, and that one is only specified for non-void functions. "KEY" is unreserved, but so is "BY", so I don't foresee any grammar issues with the above, though I'm no expert.

Status of the patch:

The patch is very much still a proof of concept. I'm abusing some parser-level functions to find the correct sortop for the super PathKey. There also might be some giant holes in the entire concept, as I only dreamed the whole thing up during a weekend trip while about 1.5 mountains out of range of the internet. I only started experimenting with the idea yesterday.

The reasons I'm posting now are:

1. Transparency. I'm working on this.
2. Someone might point out a good reason why this can't be done.
3. Some testing or fuzz testing might reveal some bugs in the patch.

Please find attached the proof of concept patch. Comments (good or bad) welcome. Reviews and testing are also welcome. I'm considering adding it to the November 'fest; likely I'll decide that based on any feedback received before then.

The "superkey" name is borrowed from OOP. Think "superclass". If it turns out not to be a dud, we can call it something else.

[1] https://commitfest.postgresql.org/20/1850/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On Tue, 30 Oct 2018 at 07:58, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've started working on something I've ended up calling "Super
> PathKeys". The idea here is to increase the likelihood of a Path with
> PathKeys being used for a purpose that requires a less strict sort
> order due to ordering being required from the return value of some
> precision loss function.
Anything left-anchored would benefit, so SUBSTR(), TRIM(), etc.
Main use for this would be where the partition condition is a function, so we can still order by partitions easily.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've started working on something I've ended up calling "Super
> PathKeys". The idea here is to increase the likelihood of a Path with
> PathKeys being used for a purpose that requires a less strict sort
> order due to ordering being required from the return value of some
> precision loss function.

I'm a little confused by the terminology here, or why this has anything at all to do with a new sort of pathkey. It seems like the idea you're driving at is to be able to mark a function as being order-preserving, in the sense that if one input is known sorted then the output will also be sorted (by the same or a related opclass). You probably need some additional constraints, like any other inputs being constants, before that really works. But given that, I don't see why you need a new kind of pathkey: you just have a new way to conclude that some path is sorted by the pathkey you already want.

Maybe if I read the patch it'd be clearer why you want to describe it that way, but I'm too lazy to do that right now. One thing I would say though is that a pg_proc column identifying the interesting input parameter is insufficient; you'd need to *explicitly* say which opclass(es) the sorting behavior guarantee applies for.

> -- Test a more complex case where the superkey can be matched to
> multiple pathkeys
> explain (costs off) select date_trunc('year', ts), date_trunc('month',
> ts), a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
>                                  QUERY PLAN
> -----------------------------------------------------------------------------
>  GroupAggregate
>    Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
>    ->  Index Only Scan using tstbl_ts_a_idx on tstbl
> (3 rows)

[ squint... ] Does that really work? If so, how? It would need a whole lot more knowledge about the behavior of date_trunc than I think could possibly be reasonable to encode in a general mechanism.

regards, tom lane
On 31 October 2018 at 08:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> I've started working on something I've ended up calling "Super
>> PathKeys". The idea here is to increase the likelihood of a Path with
>> PathKeys being used for a purpose that requires a less strict sort
>> order due to ordering being required from the return value of some
>> precision loss function.
>
> I'm a little confused by the terminology here, or why this has anything
> at all to do with a new sort of pathkey. It seems like the idea you're
> driving at is to be able to mark a function as being order-preserving,
> in the sense that if one input is known sorted then the output will
> also be sorted (by the same or a related opclass). You probably need
> some additional constraints, like any other inputs being constants,
> before that really works. But given that, I don't see why you need a
> new kind of pathkey: you just have a new way to conclude that some
> path is sorted by the pathkey you already want.

Thanks for chipping in on this.

The additional pathkeys are not required to make it work; they're just required to make it work efficiently. The fact that we went to the trouble of making pathkeys canonical so we can perform pointer comparison rather than using equals() says to me that I'd better not do anything to slow this down too much. Doing it without the superkey idea seems to require quite a bit of analysis during pathkeys_contained_in(), as we'd need to check for superkeys then, instead of when we're building the pathkey in the first place. As for the code that I did add to pathkeys_contained_in(), I doubt it's measurably slower for the normal case.

The other arguments needing to be Const is a part I did miss. That will also be a requirement. I just failed to consider that date_trunc() could be used with a variable 1st arg.

> Maybe if I read the patch it'd be clearer why you want to describe it
> that way, but I'm too lazy to do that right now. One thing I would
> say though is that a pg_proc column identifying the interesting input
> parameter is insufficient; you'd need to *explicitly* say which opclass(es)
> the sorting behavior guarantee applies for.
>
>> -- Test a more complex case where the superkey can be matched to
>> multiple pathkeys
>> explain (costs off) select date_trunc('year', ts), date_trunc('month',
>> ts), a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
>>                                  QUERY PLAN
>> -----------------------------------------------------------------------------
>>  GroupAggregate
>>    Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
>>    ->  Index Only Scan using tstbl_ts_a_idx on tstbl
>> (3 rows)
>
> [ squint... ] Does that really work? If so, how? It would need a whole
> lot more knowledge about the behavior of date_trunc than I think could
> possibly be reasonable to encode in a general mechanism.

I'm not entirely certain we can always consume multiple matching keys or if it has to be one for one. However, I get an empty diff from each of the following two query pairs.

select date_trunc('month'::text,date),date_trunc('year', date) from dt order by dt;
select date_trunc('month'::text,date),date_trunc('year', date) from dt order by 1,2;

and

select date_trunc('year'::text,date),date_trunc('month', date) from dt order by dt;
select date_trunc('year'::text,date),date_trunc('month', date) from dt order by 1,2;

with setup:

create table dt (date date);
insert into dt select d from generate_series('2018-01-01', '2018-12-31', '1 day'::interval) d;

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
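Why one index key can satisfy several derived pathkeys can be sketched outside the planner: if two functions are both monotone in ts, then a stream sorted on ts is also sorted lexicographically on the pair of their results. A plain-Python sketch (trunc is a hypothetical stand-in for date_trunc, not PostgreSQL code):

```python
from datetime import datetime

def trunc(unit, ts):
    # stand-in for date_trunc(unit, ts), only for the units used here
    if unit == 'year':
        return ts.replace(month=1, day=1, hour=0)
    if unit == 'month':
        return ts.replace(day=1, hour=0)
    raise ValueError(unit)

# input ordered on ts, as from an index scan on (ts, a)
rows = sorted(datetime(2018, m, d, h)
              for m in (1, 2, 7) for d in (1, 15) for h in (0, 13))

keys = [(trunc('year', ts), trunc('month', ts)) for ts in rows]
# both truncations are monotone in ts, so within a run of equal year
# truncations the month truncations are still non-decreasing: the stream
# is already sorted lexicographically on (year, month)
assert keys == sorted(keys)
```

This matches the empty diffs above: equality on the coarser key implies the rows are still in ts order, so the finer key is ordered too.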
Hi,

On 10/30/2018 11:41 PM, David Rowley wrote:
> On 31 October 2018 at 08:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> I've started working on something I've ended up calling "Super
>>> PathKeys". The idea here is to increase the likelihood of a Path with
>>> PathKeys being used for a purpose that requires a less strict sort
>>> order due to ordering being required from the return value of some
>>> precision loss function.
>>
>> I'm a little confused by the terminology here, or why this has anything
>> at all to do with a new sort of pathkey. It seems like the idea you're
>> driving at is to be able to mark a function as being order-preserving,
>> in the sense that if one input is known sorted then the output will
>> also be sorted (by the same or a related opclass). You probably need
>> some additional constraints, like any other inputs being constants,
>> before that really works. But given that, I don't see why you need a
>> new kind of pathkey: you just have a new way to conclude that some
>> path is sorted by the pathkey you already want.
>
> Thanks for chipping in on this.
>
> The additional pathkeys are not required to make it work, they're just
> required to make it work efficiently. The fact that we went to the
> trouble of making pathkeys canonical so we can perform pointer
> comparison rather than using equals() says to me that I'd better not
> do anything to slow this down too much. Doing it without the superkey
> idea seems to require quite a bit of analysis during
> pathkeys_contained_in() as we'd need to check for superkeys then,
> instead of when we're building the pathkey in the first place. As
> for the code that I did add to pathkeys_contained_in(), I doubt it's
> measurably slower for the normal case.
>
> The other fields being Const is a part I did miss. That will also be a
> requirement. I just failed to consider that date_trunc() could be used
> with a variable 1st arg.
>
>> Maybe if I read the patch it'd be clearer why you want to describe it
>> that way, but I'm too lazy to do that right now. One thing I would
>> say though is that a pg_proc column identifying the interesting input
>> parameter is insufficient; you'd need to *explicitly* say which opclass(es)
>> the sorting behavior guarantee applies for.

The other thing likely affecting this is locale / collation. Probably not for date_trunc, but certainly for things like substr()/trim(), mentioned by Simon upthread.

In some languages the rules are pretty complex, and there's no chance it'll survive an arbitrary substr() applied to the string. For example, in Czech we mostly sort character-by-character, but "ch" is an exception, sorted in between "h" and "i".

So essentially "hhhh <= hchh <= hihh". Obviously, substr($1,0,3) cuts the "ch" in half, changing the ordering:

create table x (y text collate "cs_CZ");
insert into x values ('hhhh'), ('hchh'), ('hihh');

test=# select y from x order by 1;
  y
------
 hhhh
 hchh
 hihh
(3 rows)

test=# select substr(y,0,3) from x order by 1;
 substr
--------
 hc
 hh
 hi
(3 rows)

I'm preeeeeeetty sure other languages have even funnier rules ...

>>> -- Test a more complex case where the superkey can be matched to
>>> multiple pathkeys
>>> explain (costs off) select date_trunc('year', ts), date_trunc('month',
>>> ts), a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
>>>                                  QUERY PLAN
>>> -----------------------------------------------------------------------------
>>>  GroupAggregate
>>>    Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
>>>    ->  Index Only Scan using tstbl_ts_a_idx on tstbl
>>> (3 rows)
>>
>> [ squint... ] Does that really work? If so, how? It would need a whole
>> lot more knowledge about the behavior of date_trunc than I think could
>> possibly be reasonable to encode in a general mechanism.
>
> I'm not entirely certain we can always consume multiple matching keys
> or if it has to be one for one.
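Tomas's "ch" digraph hazard can be sketched in plain Python. This is a hypothetical simplification of the cs_CZ rule, not real ICU/glibc collation: "ch" is treated as a single collating element ranking between "h" and "i", and a two-character prefix (what substr(y,0,3) returns) can split that element and reorder the set:

```python
def cs_key(s):
    # map each collating element to a sortable rank;
    # the digraph 'ch' ranks between 'h' and 'i' (simplified cs_CZ rule)
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] == 'ch':
            out.append(ord('h') + 0.5)  # between 'h' and 'i'
            i += 2
        else:
            out.append(float(ord(s[i])))
            i += 1
    return out

words = ['hhhh', 'hchh', 'hihh']
# full strings: the digraph keeps 'hchh' between the other two
assert sorted(words, key=cs_key) == ['hhhh', 'hchh', 'hihh']

prefixes = [w[:2] for w in words]  # what substr(y, 0, 3) yields
# truncation splits 'ch', so 'hc' now sorts first: order changed
assert sorted(prefixes, key=cs_key) == ['hc', 'hh', 'hi']
```

So even a prefix-taking function is not order-preserving under such a collation, which is why any per-function rule would need to be collation-aware.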
> However, I get an empty diff from each of the following two query pairs.
>
> select date_trunc('month'::text,date),date_trunc('year', date) from dt
> order by dt;
> select date_trunc('month'::text,date),date_trunc('year', date) from dt
> order by 1,2;
>
> and
>
> select date_trunc('year'::text,date),date_trunc('month', date) from dt
> order by dt;
> select date_trunc('year'::text,date),date_trunc('month', date) from dt
> order by 1,2;
>
> with setup:
>
> create table dt (date date);
> insert into dt select d from generate_series('2018-01-01',
> '2018-12-31', '1 day'::interval) d;

I'm mildly suspicious of this data set, because it's essentially perfectly sorted/deterministic, and the Sort node won't have to do anything. So perhaps it just works by chance.

Consider this instead:

create table dt (ts timestamp, x text);

insert into dt select * from
  (select d, (case when random() < 0.5 then 'month' else 'hour' end)
   from generate_series('2018-01-01'::timestamp,
                        '2018-12-31'::timestamp,
                        '1 hour'::interval) d) as foo
order by random();

test=# explain select date_trunc(x, ts) from dt order by 1;
                         QUERY PLAN
-------------------------------------------------------------
 Sort  (cost=729.18..751.02 rows=8737 width=8)
   Sort Key: (date_trunc(x, ts))
   ->  Seq Scan on dt  (cost=0.00..157.21 rows=8737 width=8)
(3 rows)

test=# select date_trunc(x, ts) from dt order by 1;
     date_trunc
---------------------
 2018-01-01 00:00:00
 2018-01-01 00:00:00
 2018-01-01 00:00:00
 [dozens more identical rows]
... seems ok ...

test=# explain select date_trunc(x, ts) from dt order by ts;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=729.18..751.02 rows=8737 width=16)
   Sort Key: ts
   ->  Seq Scan on dt  (cost=0.00..157.21 rows=8737 width=16)
(3 rows)

test=# select date_trunc(x, ts) from dt order by ts;
     date_trunc
---------------------
 2018-01-01 00:00:00
 2018-01-01 01:00:00
 2018-01-01 02:00:00
 2018-01-01 00:00:00
 2018-01-01 04:00:00
 2018-01-01 05:00:00
 2018-01-01 06:00:00
 2018-01-01 07:00:00
 2018-01-01 08:00:00
 2018-01-01 00:00:00
 [...]
 2018-01-02 13:00:00
... kaboooooom! ...

I suspect this runs into the date_trunc() precision parameter being variable, which is something Tom mentioned as a potential issue.

The already-mentioned substr/trim are another example of this issue. Not only must the parameters be constant (or generally use the same value for all rows), but it may work for some values and fail for others. For example substr($1,0,$2) might work, but substr($1,10,$2) likely not. Similarly for trim().

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
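The "kaboom" can be reproduced in miniature: with a per-row precision argument, date_trunc is no longer monotone in ts, so ts order does not imply output order. A plain-Python sketch (trunc is a stand-in for date_trunc, not PostgreSQL code):

```python
from datetime import datetime

def trunc(unit, ts):
    # stand-in for date_trunc(unit, ts) for the two units used below
    if unit == 'hour':
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == 'month':
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(unit)

rows = [  # ordered on ts, as an index scan would deliver
    (datetime(2018, 1, 1, 2), 'hour'),
    (datetime(2018, 1, 1, 3), 'month'),  # jumps back to Jan 1 00:00
    (datetime(2018, 1, 1, 4), 'hour'),
]
out = [trunc(x, ts) for ts, x in rows]
# 02:00, 00:00, 04:00 -- not sorted: the superkey optimisation is invalid here
assert out != sorted(out)
```

Hence the requirement that every argument other than the ordering one be a Const.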
On 31 October 2018 at 14:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The other thing likely affecting this is locale / collation. Probably
> not for date_trunc, but certainly for things like substr()/trim(),
> mentioned by Simon upthread.
>
> In some languages the rules are pretty complex, and there's no chance
> it'll survive arbitrary substr() applied to the string. For example, in
> Czech we mostly sort character-by-character, but "ch" is an exception
> sorted in between "h" and "i".
>
> So essentially "hhhh <= hchh <= hihh". Obviously, substr($1,0,3) cuts
> the "ch" in half, changing the ordering:
>
> create table x (y text collate "cs_CZ");
> insert into x values ('hhhh'), ('hchh'), ('hihh');
>
> test=# select y from x order by 1;
>   y
> ------
>  hhhh
>  hchh
>  hihh
> (3 rows)
>
> test=# select substr(y,0,3) from x order by 1;
>  substr
> --------
>  hc
>  hh
>  hi
> (3 rows)
>
> I'm preeeeeeetty sure other languages have even funnier rules ...

That's pretty interesting, but as mentioned in my initial email: "More careful thought would be needed for other numerical types and text types, I imagine, though I've not thought much about that."

I don't really think trim() or substr() would ever work, for the reason that they don't always operate on a prefix of the string. What you've mentioned seems to rule out LEFT().

> I'm mildly suspicious of this data set, because it's essentially
> perfectly sorted/deterministic, and the Sort node won't have to do
> anything. So perhaps it just works by chance.
>
> Consider this instead:
>
> create table dt (ts timestamp, x text);
>
> insert into dt select * from (select d, (case when random() < 0.5 then
> 'month' else 'hour' end) from generate_series('2018-01-01'::timestamp,
> '2018-12-31'::timestamp, '1 hour'::interval) d) as foo order by random();
>
> [...]
>
> 2018-01-01 00:00:00
> 2018-01-01 00:00:00
> 2018-01-01 00:00:00
> 2018-01-02 13:00:00
> ... kaboooooom! ...

Yeah. This is an issue. Thanks for the test case.
However, I acknowledged that in my reply to Tom. I did overlook it, which was completely down to a lack of imagination on my part. I'd never considered using date_trunc() without a const 1st argument before. It seems simple enough to disable the optimisation in that case. I've attached a patch which does that. Maybe that'll help us look beyond this and focus on any other reasons why this is not possible.

It's also true that this diminishes the usefulness of the idea, but part of the reason I'm posting the idea so early after having thought of it is precisely to see if it's going to float or sink. Maybe we'll decide the scope of usefulness is so small that it's not worth it, or that each function has such different requirements that we can't reasonably make it work by adding a few columns to pg_proc. I'm personally more interested in the cases that can work. I understand there is no shortage of cases where it can't.

Given that we require const arguments apart from the order key, perhaps it could be made to work for simple arithmetic OpExprs. I'm not sure the use cases are actually there for that sort of thing, and I've seen WHERE indexcol+0 = <n> used many times to disable the use of indexes, so making pathkeys see through those might be more annoying than useful... But it's a thought...

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
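The arithmetic-OpExpr idea rests on the same monotonicity property: x + c for a constant c preserves order, ignoring for the sketch the overflow and wraparound hazards that would make this hard for real SQL integer types. A minimal plain-Python illustration:

```python
xs = sorted([5, 1, 9, 3, 7])       # ordered on indexcol
shifted = [x + 1 for x in xs]      # indexcol + 1, constant other operand
# order is preserved, so a pathkey on indexcol could in principle
# satisfy a pathkey on indexcol + 1 (overflow aside)
assert shifted == sorted(shifted)
```

The hard part in SQL is exactly the "overflow aside" caveat, plus deciding this per operator rather than per function.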
Attachments
On 10/31/2018 04:32 AM, David Rowley wrote:
> On 31 October 2018 at 14:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> The other thing likely affecting this is locale / collation. Probably
>> not for date_trunc, but certainly for things like substr()/trim(),
>> mentioned by Simon upthread.
>>
>> In some languages the rules are pretty complex, and there's no chance
>> it'll survive arbitrary substr() applied to the string. For example, in
>> Czech we mostly sort character-by-character, but "ch" is an exception
>> sorted in between "h" and "i".
>>
>> So essentially "hhhh <= hchh <= hihh". Obviously, substr($1,0,3) cuts
>> the "ch" in half, changing the ordering:
>>
>> create table x (y text collate "cs_CZ");
>> insert into x values ('hhhh'), ('hchh'), ('hihh');
>>
>> test=# select y from x order by 1;
>>   y
>> ------
>>  hhhh
>>  hchh
>>  hihh
>> (3 rows)
>>
>> test=# select substr(y,0,3) from x order by 1;
>>  substr
>> --------
>>  hc
>>  hh
>>  hi
>> (3 rows)
>>
>> I'm preeeeeeetty sure other languages have even funnier rules ...
>
> That's pretty interesting, but as mentioned in my initial email...
> More careful thought would be needed for other numerical types and
> text types, I imagine, though I've not thought much about that.
>
> I don't really think trim() or substr() would ever work for the reason
> that they don't always operate on a prefix of the string. What you've
> mentioned seems to rule out LEFT().

Sure, but it wasn't very clear to me what the "more careful thought" would mean.

I think a direct consequence of Tom's point about opclasses is that storing this information in pg_proc is not going to fly - you would need a separate catalog for that, to allow mapping the function to multiple opclasses (possibly with different features?). But OK, that's doable.

But what if you also need to do that for collations? Firstly, it would add another degree of freedom, essentially making it (proc, opclass, collation, ... metadata ...), so there would be many such combinations. I can't really imagine defining that manually, but maybe it can be simplified.

But more importantly - the list of collations is kinda dynamic, and AFAIK the collation rules may change depending on the glibc/icu versions. So that seems pretty tricky. It's certainly not just a SMOP.

>> I'm mildly suspicious of this data set, because it's essentially
>> perfectly sorted/deterministic, and the Sort node won't have to do
>> anything. So perhaps it just works by chance.
>>
>> Consider this instead:
>>
>> create table dt (ts timestamp, x text);
>>
>> insert into dt select * from (select d, (case when random() < 0.5 then
>> 'month' else 'hour' end) from generate_series('2018-01-01'::timestamp,
>> '2018-12-31'::timestamp, '1 hour'::interval) d) as foo order by random();
>>
>> [...]
>>
>> 2018-01-01 00:00:00
>> 2018-01-01 00:00:00
>> 2018-01-01 00:00:00
>> 2018-01-02 13:00:00
>> ... kaboooooom! ...
>
> Yeah. This is an issue. Thanks for the test case. However, I
> acknowledged that in my reply to Tom. I did overlook it, which was
> completely down to lack of imagination on my part. I'd never
> considered using date_trunc() without a const 1st argument before. It
> seems simple enough to disable the optimisation in that case. I've
> attached a patch which does that. Maybe that'll help us look beyond
> this and focus on any other reasons why this is not possible.

Ah, sorry - I missed this bit in your response. In my defense, it's been quite late over here, and my caffeine level got a tad too low.

Anyway, the question is how strong the requirement would need to be, and how that would affect the applicability of this optimization in other cases. I agree it's probably not an issue for date_trunc() - I don't recall ever using it with a non-constant first parameter. I wonder if there are other functions where that's not the case - although that's not really an argument against this optimization.
> It's also true that this diminishes the usefulness of the idea, but
> part of the reason I'm posting the idea so early after having thought
> of it is precisely to see if it's going to float or sink.

Sure. And pgsql-hackers are very good at sinking ideas ;-)

> Maybe we'll decide the scope of usefulness is so small that it's not
> worth it, or that each function has such different requirements that
> we can't reasonably make it work by adding a few columns to pg_proc.
> I'm personally more interested in the cases that can work. I
> understand there is no shortage of cases where it can't.

I think it can't be done just by adding a couple of columns to pg_proc, as explained above. A separate catalog mapping procs to opclasses (and maybe collations) may be needed. For cases where it depends on the actual parameter values (like substr/trim/ltrim) an extra function might be needed (something like the selectivity functions for data types).

> Given that we require const arguments apart from the order key, perhaps
> it could be made to work for simple arithmetic OpExprs. I'm not sure
> the use cases are actually there for that sort of thing, and I've
> seen WHERE indexcol+0 = <n> used many times to disable the use of
> indexes, so making pathkeys see through those might be more annoying
> than useful... But it's a thought...

Well, that really depends on the definition of the "+" operator, which is pretty much just a function. So I don't see why that would simplify the situation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Oct 31, 2018 at 9:19 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I think it can't be made just by adding a couple of columns to pg_proc,
> as explained above. A separate catalog mapping procs to opclasses (and
> maybe collations) may be needed. For cases where it depends on the
> actual parameter values (like substr/trim/ltrim) an extra function might
> be needed (something like the selectivity functions for data types).

This kinda reminds me of commit 8f9fe6edce358f7904e0db119416b4d1080a83aa. We needed a way to provide the planner with knowledge about the behavior of specific functions. In that case, the specific need was to be able to tell the planner that a certain function call could be omitted or strength-reduced, and we did that by having the planner call a function that encoded the necessary knowledge. Here, we want to evaluate a function call and see whether it is order-preserving, which could depend on a whole bunch of stuff that isn't easily parameterized by catalog entries, but could be figured out by a C function written for that purpose. I'm not really sure how that would work in this case, or whether it's a good idea, but I thought I'd mention it just in case it's helpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> This kinda reminds me of commit
> 8f9fe6edce358f7904e0db119416b4d1080a83aa. We needed a way to provide
> the planner with knowledge about the behavior of specific functions.
> In that case, the specific need was to be able to tell the planner
> that a certain function call could be omitted or strength-reduced, and
> we did that by having the planner call a function that encoded the
> necessary knowledge. Here, we want to evaluate a function call and
> see whether it is order preserving, which could depend on a whole
> bunch of stuff that isn't easily parameterized by catalog entries, but
> could be figured out by a C function written for that purpose.

+1. If we're otherwise going to need multiple pg_proc columns, this is a better answer just from the standpoint of avoiding pg_proc bloat: we'd only need to add one OID column. And I concur with your point that we're going to have a really hard time parameterizing the mechanism adequately if there isn't dedicated per-function code.

regards, tom lane
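The support-function approach Robert and Tom describe (as in commit 8f9fe6edce35) would live in C, but the dispatch shape can be sketched in Python. Everything below is hypothetical illustration, not the patch's code: each function registers a callback that inspects the concrete call (are the non-ordering arguments Consts? which collation?) and reports whether the output preserves the input's sort order:

```python
def date_trunc_support(args, collation):
    # order-preserving only when the precision argument is a Const
    kind, _ = args[0]
    return kind == 'const'

def substr_support(args, collation):
    # hypothetical rule: prefix-taking substr, "C" collation only
    return args[1] == ('const', 0) and collation == 'C'

# one callback per function, instead of trying to describe every
# function's behaviour with fixed catalog columns
support_fns = {'date_trunc': date_trunc_support, 'substr': substr_support}

def derives_sort_order(func, args, collation='C'):
    fn = support_fns.get(func)
    return fn is not None and fn(args, collation)

assert derives_sort_order('date_trunc', [('const', 'year'), ('var', 'ts')])
assert not derives_sort_order('date_trunc', [('var', 'x'), ('var', 'ts')])
```

The arbitrary-code callback is what lets this handle cases that "aren't easily parameterized by catalog entries", like the variable-precision date_trunc above.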
On 1 November 2018 at 05:40, Robert Haas <robertmhaas@gmail.com> wrote:
> This kinda reminds me of commit
> 8f9fe6edce358f7904e0db119416b4d1080a83aa. We needed a way to provide
> the planner with knowledge about the behavior of specific functions.
> In that case, the specific need was to be able to tell the planner
> that a certain function call could be omitted or strength-reduced, and
> we did that by having the planner call a function that encoded the
> necessary knowledge. Here, we want to evaluate a function call and
> see whether it is order preserving, which could depend on a whole
> bunch of stuff that isn't easily parameterized by catalog entries, but
> could be figured out by a C function written for that purpose. I'm
> not really sure how that would work in this case, or whether it's a
> good idea, but I thought I'd mention it just in case it's helpful.

Agreed. That's a good idea. Thanks.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> On Oct 30, 2018, at 9:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Tue, 30 Oct 2018 at 07:58, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
>> I've started working on something I've ended up calling "Super
>> PathKeys". The idea here is to increase the likelihood of a Path with
>> PathKeys being used for a purpose that requires a less strict sort
>> order due to ordering being required from the return value of some
>> precision loss function.
>
> Anything left anchored would benefit, so SUBSTR(), TRIM() etc
>
> Main use for this would be where the partition condition is a function, so we can still order by partitions easily.

This would also be very helpful in many BI cases; it's very common to aggregate based on year, year/month, year/quarter, etc. The other thing that would be extremely useful would be pushing predicates through this, so you could do things like WHERE date_trunc('year', timestamp_field) = 2018.
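The predicate-pushdown idea mentioned here is the inverse rewrite: an equality on a truncated value is equivalent to a half-open range on the raw column, which an ordinary btree index on the timestamp can use. A hypothetical sketch (not proposed planner code) of the transformation for the 'year' case:

```python
from datetime import datetime

def year_equality_to_range(year_start):
    # WHERE date_trunc('year', ts) = year_start
    # can be rewritten as: ts >= lo AND ts < hi
    lo = year_start
    hi = year_start.replace(year=year_start.year + 1)
    return lo, hi

lo, hi = year_equality_to_range(datetime(2018, 1, 1))
# every ts in [2018-01-01, 2019-01-01) truncates to 2018-01-01, and no other does
assert (lo, hi) == (datetime(2018, 1, 1), datetime(2019, 1, 1))
```

This rewrite is a separate feature from the pathkey work, but it relies on the same monotonicity property of the truncation function.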
On 10/31/2018 10:07 PM, David Rowley wrote:
> On 1 November 2018 at 05:40, Robert Haas <robertmhaas@gmail.com> wrote:
>> This kinda reminds me of commit 8f9fe6edce358f7904e0db119416b4d1080a83aa.
>> We needed a way to provide the planner with knowledge about the behavior
>> of specific functions.  In that case, the specific need was to be able
>> to tell the planner that a certain function call could be omitted or
>> strength-reduced, and we did that by having the planner call a function
>> that encoded the necessary knowledge.  Here, we want to evaluate a
>> function call and see whether it is order preserving, which could depend
>> on a whole bunch of stuff that isn't easily parameterized by catalog
>> entries, but could be figured out by a C function written for that
>> purpose.  I'm not really sure how that would work in this case, or
>> whether it's a good idea, but I thought I'd mention it just in case
>> it's helpful.
>
> Agreed. That's a good idea. Thanks.

FWIW this is mostly what I had in mind when referring to the selectivity
estimation functions for operators, although I now realize it might not
have been explained very clearly. Anyway, I agree this seems like a much
better way than trying to store all the potentially relevant metadata in
catalogs.

I still have trouble imagining what exactly the function would do to
determine whether the optimization can be applied to substr() and
similar collation-dependent cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1 November 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I still have trouble imagining what exactly would the function do to
> determine if the optimization can be applied to substr() and similar
> collation-dependent cases.

I guess the function would have to check for a Const offset of 0, and a
collation, perhaps of "C", for the 1st arg. In any case, I wouldn't want
this idea to be hung up on the fact that we can't determine how to make
substr() work correctly with it.

I'm most interested in date_trunc() and friends. A first cut
implementation would not have to implement functions for everything
that's possible to implement.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
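[Editor's note: the substr() check being described can be modeled as below. This is an illustrative sketch only, not planner code; note that SQL substr() positions are 1-based, so "left-anchored" means a constant start position of 1, and byte-wise ordering means "C" collation.]

```python
# Sketch of the per-call test a substr() support hook would make: the
# result of substr(text_col, start) can only follow text_col's sort order
# when the substring is a prefix of the input (left-anchored) AND the
# sort order compares bytes, i.e. "C" collation.  Under a linguistic
# collation, a prefix of a string need not sort the same way the whole
# string does.

def substr_order_preserving(start, collation):
    # start must be a Const; SQL substr() is 1-based, so 1 == left-anchored.
    return start == 1 and collation == "C"

print(substr_order_preserving(1, "C"))      # True: prefix, byte-wise order
print(substr_order_preserving(2, "C"))      # False: not left-anchored
print(substr_order_preserving(1, "en_US"))  # False: linguistic collation
```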
Hi,

On 2018-11-01 12:19:32 +1300, David Rowley wrote:
> On 1 November 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > I still have trouble imagining what exactly would the function do to
> > determine if the optimization can be applied to substr() and similar
> > collation-dependent cases.
>
> I guess the function would have to check for a Const offset of 0, and
> a collation, perhaps of "C", for the 1st arg. In any case, I wouldn't
> want this idea to be hung up on the fact we can't determine how to
> make substr() work correctly with it.
>
> I'm most interested in date_trunc() and friends. A first cut
> implementation would not have to implement functions for everything
> that's possible to implement.

FWIW, I kind of wonder, if we built proper infrastructure to allow making
such inferences from function calls, whether it could also be made to
support the transformation of LIKEs into indexable <= >= clauses.

Greetings,

Andres Freund
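[Editor's note: the LIKE transformation Andres refers to is the well-known prefix trick, which PostgreSQL already applies in some index-matching cases. A simplified, ASCII-only sketch, ignoring escape characters:]

```python
# Simplified sketch of turning a left-anchored LIKE pattern into an
# indexable range:  col LIKE 'abc%'  =>  col >= 'abc' AND col < 'abd'.
# Returns None when the pattern has no usable literal prefix.
# ASCII-only: a real implementation must handle escapes, multi-byte
# characters, and byte-value overflow when incrementing.

def like_to_range(pattern):
    prefix = ""
    for ch in pattern:
        if ch in "%_":          # a wildcard ends the literal prefix
            break
        prefix += ch
    if not prefix:
        return None             # e.g. '%abc' is not left-anchored
    # Upper bound: increment the last character of the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return (prefix, upper)

print(like_to_range("abc%"))    # ('abc', 'abd')
print(like_to_range("%abc"))    # None
```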
On 1 November 2018 at 12:24, Andres Freund <andres@anarazel.de> wrote:
> FWIW, I kind of wonder if we built proper infrastructure to allow to
> make such inferences from function calls, whether it could also be made
> to support the transformation of LIKEs into indexable <= >= clauses.

Perhaps, but I doubt it would be the same function to do both. Surely I
need something that accepts details about the function call as arguments
and returns an Expr * of the argument that we can derive the order of
the return value from, or NULL. I think the transformation you need
might be more along the lines of returning a List * of quals that can
substitute for an OpExpr containing a function call. I'm not that clear
on how we'd know the new quals were better than the existing ones. For
example, extract('year', dt) = 2018 could be transformed to
dt >= '2018-01-01' AND dt < '2019-01-01', but how would we know that was
better? There might be an index on extract('year', dt).

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
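[Editor's note: the extract() rewrite used as the example above is mechanical enough to sketch. A hypothetical helper, covering only the 'year' field and ignoring time zones:]

```python
# Sketch of rewriting  extract('year', dt) = year  into the sargable
# range  dt >= lo AND dt < hi.  Year boundaries are half-open, which is
# what makes the rewrite exact: every timestamp in [lo, hi) truncates to
# the given year, and no others do.

def year_equality_to_range(year):
    lo = f"{year}-01-01"
    hi = f"{year + 1}-01-01"
    return (lo, hi)

print(year_equality_to_range(2018))  # ('2018-01-01', '2019-01-01')
```

Which of the original and rewritten quals is cheaper is exactly the open costing question raised above: either form might have a matching index.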
On 11/01/2018 02:40 AM, David Rowley wrote:
> On 1 November 2018 at 12:24, Andres Freund <andres@anarazel.de> wrote:
>> FWIW, I kind of wonder if we built proper infrastructure to allow to
>> make such inferences from function calls, whether it could also be made
>> to support the transformation of LIKEs into indexable <= >= clauses.
>
> Perhaps, but I doubt it would be the same function to do both. Surely
> I need something that accepts details about the function call as
> arguments and returns an Expr * of the argument that we can derive the
> order of the return value from, or NULL. I think the transformation
> you need might be more along the lines of returning a List * of quals
> that can substitute an OpExpr containing a function call. I'm not that
> clear on how we'd know the new quals were better than the existing
> ones. For example extract('year', dt) = 2018 could be transformed to
> dt >= '2018-01-01' AND dt < '2019-01-01', but how would we know that
> was better. There might be an index on extract('year', dt).

IMHO there is only a handful of "useful" transformations of this kind,
depending on the operator class of an index. So when building index
paths and checking which quals may be used as index conditions, we'd try
transforming incompatible quals and leave the rest up to the existing
create_index_path machinery (which already makes the decision about
which quals to use for index search etc.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services