Обсуждение: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)
On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote: > pg_stat_statements' fingerprinting logic considers the following two > statements as distinct: > > select 1 in (1, 2, 3); > select 1 in (1, 2, 3, 4); > > This is because the ArrayExpr jumble case jumbles any ArrayExpr's list > of elements recursively. In this case it's a list of Const nodes, and > the fingerprinting logic jumbles those nodes indifferently. > > Somebody told me that they think that pg_stat_statements should not do > that. This person felt that it would be preferable for such > expressions to be normalized without regard to the number of distinct > Const elements. I suppose that that would work by determing if the > ArrayExpr elements list was a list of Const nodes and only const > nodes. Iff that turned out to be the case, something else would be > jumbled (something other than the list) that would essentially be a > representation of "some list of zero or more (or maybe one or more) > Const nodes with consttype of, in this example, 23". I think that this > would make at least one person happy, because of course the two > statements above would have their costs aggregated within a single > pg_stat_statements entry. Baron Schwartz recently gave a talk at PGConf Silicon Valley about the proprietary query instrumentation tool, VividCortex. The slides are available from: http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf One specific justification he gave for not using pg_stat_statements was: "Doesn’t merge bind vars in IN()" (See slide #11) His theory is that you should allow a proprietary binary to run with root permissions, a binary that sniffs the wire protocol, mostly because pg_stat_statements has this limitation (all other pg_stat_statements limitations listed are pretty unconvincing, IMV). That doesn't seem like a great plan to me, but I think he has a point about pg_stat_statements. It's about time that we fixed this -- it isn't realistic to imagine that people are going to know to use an array constant like "= any ('{1,2,3}')" -- even a major contributor to Django that I talked to about this issue a couple of years ago didn't know about that. It also isn't portable across database systems. I wonder: * How do other people feel about this? Personally, I've seen enough problems of this kind in the field that "slippery slope" arguments against this don't seem very compelling. * How might altering the jumbling logic to make it recognize a variable number of constants as equivalent work in practice? Perhaps we should do something to flatten the representation based on which powers of two the number of constants is between. There are still some details to work out there, but that's my first idea. That seems like a good compromise between the current behavior, and completely disregarding the number of constants. -- Peter Geoghegan
On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote: > * How do other people feel about this? Personally, I've seen enough > problems of this kind in the field that "slippery slope" arguments > against this don't seem very compelling. I also always felt there should be some kind of ??? symbol to represent a list of constants. In my experience these lists are actually *more* likely to be variables being inlined than single constants since it's easy to use :1 etc for single constants and quite a bit trickier to do it for lists. I guess that might be changed these days since I think you can do =any(?::int[]) and construct an array literal as a parameter. But plenty of code actually constructs lists of question marks to interpolate. I have also seen code where I would have needed *not* to have this jumbling though. I think this is a general problem with jumbling that there needs to be some kind of intelligence that deduces when it's important to break out the statistics by constant. In my case it was an IN query where specific values were very common but others very rare. Partial indexes ended up being the solution and we had to identify which partial indexes were needed. Incidentally there's another feature pg_stat_statements *really* needs. A way to convert a jumbled statement into one that can be prepared easily. The use of ? instead of :1 :2 etc makes this a mechanical but annoying process. Adding ??? would make it even more annoying. Even just a function that does this (and takes an optional list of counts for lists I guess?) would be a big help. -- greg
On Tue, Nov 24, 2015 at 2:25 PM, Greg Stark <stark@mit.edu> wrote: > On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <pg@heroku.com> wrote: >> * How do other people feel about this? Personally, I've seen enough >> problems of this kind in the field that "slippery slope" arguments >> against this don't seem very compelling. +1 for changing jumbling logic to consider any number of constants as identical. > I have also seen code where I would have needed *not* to have this > jumbling though. I think this is a general problem with jumbling that > there needs to be some kind of intelligence that deduces when it's > important to break out the statistics by constant. In my case it was > an IN query where specific values were very common but others very > rare. Partial indexes ended up being the solution and we had to > identify which partial indexes were needed. This is an issue with jumbling in general and very vaguely related to merging the number of constants in IN. I think a better solution for this type of issue would be a way to sample the parameter values and possibly EXPLAIN ANALYZE results to logs. Having runtime intelligence in PostgreSQL that would be capable of distinguishing interesting variations from irrelevant doesn't seem like a feasible plan. In my view the best we could do is to aim to have entries roughly correspond to application query invocation points and leave the more complex statistical analysis use cases to more specialized tools. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Peter Geoghegan <pg@heroku.com> writes: > On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg@heroku.com> wrote: >> pg_stat_statements' fingerprinting logic considers the following two >> statements as distinct: >> >> select 1 in (1, 2, 3); >> select 1 in (1, 2, 3, 4); >> >> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list >> of elements recursively. In this case it's a list of Const nodes, and >> the fingerprinting logic jumbles those nodes indifferently. I think this is a vastly oversimplified explanation of the problem. In particular, because the planner will flatten an ArrayExpr containing only Const nodes to an array constant (see eval_const_expressions), I don't believe the case ever arises in exactly the form you posit here. A portion of the problem is possibly due to the heuristics in parse_expr.c's transformAExprIn(): * We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only * possible if there is a suitable arraytype available. If not, we fall * back to a boolean condition tree with multiple copies of the lefthand * expression. Also, any IN-list items that contain Vars are handled as * separate boolean conditions, because that givesthe planner more scope * for optimization on such clauses. If the original text actually involves a variable number of Vars, then you will end up with a boolean expression with a varying number of OR arms, even if the Vars later get flattened to constants. However, it's not clear to me that anyone would expect such cases to be treated as identical. Another possibility is a type clash, for example "x IN (42, 44.1)" will end up as a boolean tree for lack of a common type for the would-be array elements. That case might possibly be an issue in practice. But what seems more likely to be annoying people is cases in which the original text contains a varying number of Param markers. Those might or might not get folded to constants during planning depending on context, so that they might or might not look different to pg_stat_statements. So I suspect the real problem here is that we might want all of these things to look identical to pg_stat_statements: ARRAY[$1, $2, 42] ARRAY[$1, $2, $3, 47] '{1,2,3,47}'::int[] Don't see a very clean way to do that ... regards, tom lane
On 11/24/15 1:29 PM, Tom Lane wrote: > So I suspect the real problem here is that we might want all of these > things to look identical to pg_stat_statements: > > ARRAY[$1, $2, 42] > ARRAY[$1, $2, $3, 47] > '{1,2,3,47}'::int[] > > Don't see a very clean way to do that ... Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words, treating an integer as text. A lot of frameworks like to do that and just push the problem onto the database. I'm not sure what pg_stat_statements would ultimately see in that case.. Since there's a few different things people might want, maybe a good first step is to allow extending/changing the jumbling decision at the C level. That would make it easy for a knowledgeable enough person to come up with an alternative as a plugin that regular users could use. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words, > treating an integer as text. A lot of frameworks like to do that and just > push the problem onto the database. I'm not sure what pg_stat_statements > would ultimately see in that case.. They do? postgres=# select 5::int4 in ('5');?column? ──────────t (1 row) postgres=# select 5::int4 in ('5a'); ERROR: 22P02: invalid input syntax for integer: "5a" LINE 1: select 5::int4 in ('5a'); ^ -- Peter Geoghegan
I wrote: > Peter Geoghegan <pg@heroku.com> writes: >> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list >> of elements recursively. In this case it's a list of Const nodes, and >> the fingerprinting logic jumbles those nodes indifferently. > I think this is a vastly oversimplified explanation of the problem. > In particular, because the planner will flatten an ArrayExpr containing > only Const nodes to an array constant (see eval_const_expressions), > I don't believe the case ever arises in exactly the form you posit here. Um ... disregard that. For some reason I was thinking that pg_stat_statements looks at plan trees, but of course what it looks at is the query tree immediately post-parse-analysis. So the behavior of the const-folding pass is not relevant. The heuristics in transformAExprIn() are still relevant, though, and I suspect that the question of whether Params should be considered the same as Consts is also highly relevant. I wonder whether we could improve this by arranging things so that both Consts and Params contribute zero to the jumble hash, and a list of these things also contributes zero, regardless of the length of the list. regards, tom lane
On 11/24/15 7:46 PM, Peter Geoghegan wrote: > On Tue, Nov 24, 2015 at 5:39 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> Another not-uncommon case is IN ( '1', '2', ... , '2342' ); in other words, >> treating an integer as text. A lot of frameworks like to do that and just >> push the problem onto the database. I'm not sure what pg_stat_statements >> would ultimately see in that case.. > > They do? > > postgres=# select 5::int4 in ('5'); > ?column? > ────────── > t > (1 row) > > postgres=# select 5::int4 in ('5a'); > ERROR: 22P02: invalid input syntax for integer: "5a" > LINE 1: select 5::int4 in ('5a'); > ^ I'm not following your point. Obviously you can't compare int to text that doesn't convert back to an int, but that's not what I was talking about. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Nov 24, 2015 at 10:55 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > I'm not following your point. Obviously you can't compare int to text that > doesn't convert back to an int, but that's not what I was talking about. I didn't see what else you could have meant. In any case, the type text has no involvement in your example. pg_stat_statements sees the following SQL queries as equivalent: select integer '5'; select 6; select 7::int4; -- Peter Geoghegan
On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
One specific justification he gave for not using pg_stat_statements was:
"Doesn’t merge bind vars in IN()" (See slide #11)
I wonder:
* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.
As someone who runs a little monitoring service thats solely based on pg_stat_statements data, ignoring IN list length would certainly be a good change.
We currently do this in post-processing, together with other data cleanup (e.g. ignoring the length of a VALUES list in an INSERT statement).
Given the fact that pgss data is normalized & you don't know which plan was chosen, I don't think distinguishing based on the length of the list helps anyone really.
I see pg_stat_statements as a high-level overview of which queries have run, and which ones you might want to look into closer using e.g. auto_explain.
Best,
Lukas
Lukas Fittl
Skype: lfittl
Phone: +1 415 321 0630
Skype: lfittl
Phone: +1 415 321 0630
Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)
От
"Shulgin, Oleksandr"
Дата:
On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote:
On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote:One specific justification he gave for not using pg_stat_statements was:
"Doesn’t merge bind vars in IN()" (See slide #11)I wonder:
* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.
As someone who runs a little monitoring service thats solely based on pg_stat_statements data, ignoring IN list length would certainly be a good change.We currently do this in post-processing, together with other data cleanup (e.g. ignoring the length of a VALUES list in an INSERT statement).Given the fact that pgss data is normalized & you don't know which plan was chosen, I don't think distinguishing based on the length of the list helps anyone really.I see pg_stat_statements as a high-level overview of which queries have run, and which ones you might want to look into closer using e.g. auto_explain.
I still have the plans to try to marry pg_stat_statements and auto_explain for the next iteration of "online query plans" extension I was proposing a few months ago, and the first thing I was going to look into is rectifying this problem with IN() jumbling. So, have a +1 from me.
--
Alex
On Mon, Nov 30, 2015 at 02:41:02PM +0100, Shulgin, Oleksandr wrote: > On Wed, Nov 25, 2015 at 9:13 AM, Lukas Fittl <lukas@fittl.com> wrote: > > On Mon, Nov 23, 2015 at 11:53 PM, Peter Geoghegan <pg@heroku.com> wrote: > > One specific justification he gave for not using pg_stat_statements > was: > > "Doesn’t merge bind vars in IN()" (See slide #11) > > > I wonder: > > * How do other people feel about this? Personally, I've seen enough > problems of this kind in the field that "slippery slope" arguments > against this don't seem very compelling. > > > As someone who runs a little monitoring service thats solely based on > pg_stat_statements data, ignoring IN list length would certainly be a good > change. > > We currently do this in post-processing, together with other data cleanup > (e.g. ignoring the length of a VALUES list in an INSERT statement). > > Given the fact that pgss data is normalized & you don't know which plan was > chosen, I don't think distinguishing based on the length of the list helps > anyone really. > > I see pg_stat_statements as a high-level overview of which queries have > run, and which ones you might want to look into closer using e.g. > auto_explain. > > > I still have the plans to try to marry pg_stat_statements and auto_explain for > the next iteration of "online query plans" extension I was proposing a few > months ago, and the first thing I was going to look into is rectifying this > problem with IN() jumbling. So, have a +1 from me. Is this a TODO? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote: > Is this a TODO? It's on my (very long) personal TODO list. It would be great if someone else took it, though. So, yes. -- Peter Geoghegan
On Wed, Dec 30, 2015 at 05:21:05PM -0800, Peter Geoghegan wrote: > On Wed, Dec 30, 2015 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Is this a TODO? > > It's on my (very long) personal TODO list. It would be great if > someone else took it, though. So, yes. TODO added. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +