Обсуждение: queryId constant squashing does not support prepared statements
62d712ec added the ability to squash constants from an IN LIST for queryId computation purposes. This means that a similar queryId will be generated for the same queries that only different on the number of values in the IN-LIST. The patch lacks the ability to apply this optimization to values passed in as parameters ( i.e. parameter kind = PARAM_EXTERN ) which will be the case for SQL prepared statements and protocol level prepared statements, i.e. ``` select from t where id in (1, 2, 3) \bind ``` or ``` prepare prp(int, int, int) as select from t where id in ($1, $2, $3); ``` Here is the current state, ``` postgres=# create table t (id int); CREATE TABLE postgres=# prepare prp(int, int, int) as select from t where id in ($1, $2, $3); PREPARE postgres=# execute prp(1, 2, 3); postgres=# select from t where id in (1, 2, 3); -- (0 rows) postgres=# SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; query | calls -----------------------------------------------------------------------------------------------------------+------- ... .... select from t where id in ($1 /*, ... */) | 1 select from t where id in ($1, $2, $3) | 1 <<- prepared statement (6 rows) ``` but with the attached patch, the optimization applies. ``` create table t (id int) | 1 select from t where id in ($1 /*, ... */) | 2 (3 rows) ``` I think this is a pretty big gap as many of the common drivers such as JDBC, which use extended query protocol, will not be able to take advantage of the optimization in 18, which will be very disappointing. Thoughts? Sami Imseih Amazon Web Services (AWS)
Вложения
On Wed, Apr 30, 2025 at 04:52:06PM -0500, Sami Imseih wrote: > 62d712ec added the ability to squash constants from an IN LIST > for queryId computation purposes. This means that a similar > queryId will be generated for the same queries that only > different on the number of values in the IN-LIST. > > The patch lacks the ability to apply this optimization to values > passed in as parameters ( i.e. parameter kind = PARAM_EXTERN ) > which will be the case for SQL prepared statements and protocol level > prepared statements, i.e. > > I think this is a pretty big gap as many of the common drivers such as JDBC, > which use extended query protocol, will not be able to take advantage of > the optimization in 18, which will be very disappointing. > > Thoughts? Yes. Long IN/ANY clauses are as far as a more common pattern caused by ORMs, and I'd like to think that application developers would not hardcode such clauses in their right minds (well, okay, I'm likely wrong about this assumption, feel free to counter-argue). These also like relying on the extended query protocol. Not taking into account JDBC in that is a bummer, and it is very popular. I agree that the current solution we have in the tree feels incomplete because we are not taking into account the most common cases that users would care about. Now, allowing PARAM_EXTERN means that we allow any expression to be detected as a squashable thing, and this kinds of breaks the assumption of IsSquashableConst() where we want only constants to be allowed because EXECUTE parameters can be any kind of Expr nodes. At least that's the intention of the code on HEAD. Now, I am not actually objecting about PARAM_EXTERN included or not if there's a consensus behind it and my arguments are considered as not relevant. The patch is written so as it claims that a PARAM_EXTERN implies the expression to be a Const, but it may not be so depending on what the execution path is given for the parameter. Or at least the patch could be clearer and rename the parts about the "Const" squashable APIs around queryjumblefuncs.c. To be honest, the situation of HEAD makes me question whether we are using the right approach for this facility. I did mention a couple of months ago about an alternative, but it comes down to accept that any expressions would be normalized, unfortunately I never got down to study it in details as this touches around expr_list in the parser: we could detect in the parser the start and end locations of an expression list in a query string, then group all of them together based on their location in the string. This would be also cheaper than going through all the elements in the list, tweaking things when dealing with a subquery.. The PARAM_EXTERN part has been mentioned a couple of weeks ago here, btw: https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=VpEA@mail.gmail.com -- Michael
Вложения
> On Thu, May 01, 2025 at 09:29:13AM GMT, Michael Paquier wrote: > > I agree that the current solution we have in the tree feels incomplete > because we are not taking into account the most common cases that > users would care about. Now, allowing PARAM_EXTERN means that we > allow any expression to be detected as a squashable thing, and this > kinds of breaks the assumption of IsSquashableConst() where we want > only constants to be allowed because EXECUTE parameters can be any > kind of Expr nodes. At least that's the intention of the code on > HEAD. > > Now, I am not actually objecting about PARAM_EXTERN included or not if > there's a consensus behind it and my arguments are considered as not > relevant. The patch is written so as it claims that a PARAM_EXTERN > implies the expression to be a Const, but it may not be so depending > on what the execution path is given for the parameter. Or at least > the patch could be clearer and rename the parts about the "Const" > squashable APIs around queryjumblefuncs.c. > > [...] > > The PARAM_EXTERN part has been mentioned a couple of weeks ago here, > btw: > https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=VpEA@mail.gmail.com In fact, this has been discussed much earlier in the thread above, as essentially the same implementation with T_Params, which is submitted here, was part of the original patch. The concern was always whether or not it will break any assumption about query identification, because this way much broader scope of expressions will be considered equivalent for query id computation purposes. At the same time after thinking about this concern more, I presume it already happens at a smaller scale -- when two queries happen to have the same number of parameters, they will be indistinguishable even if parameters are different in some way. > To be honest, the situation of HEAD makes me question whether we are > using the right approach for this facility. I did mention a couple of > months ago about an alternative, but it comes down to accept that any > expressions would be normalized, unfortunately I never got down to > study it in details as this touches around expr_list in the parser: we > could detect in the parser the start and end locations of an > expression list in a query string, then group all of them together > based on their location in the string. This would be also cheaper > than going through all the elements in the list, tweaking things when > dealing with a subquery.. Not entirely sure how that would work exactly, but after my experiments with the squashing patch I found it could be very useful to be able to identify the end location of an expression list in the parser.
I spent a few hours looking into this today and to your points below: > > I agree that the current solution we have in the tree feels incomplete > > because we are not taking into account the most common cases that > > users would care about. Now, allowing PARAM_EXTERN means that we > > allow any expression to be detected as a squashable thing, and this > > kinds of breaks the assumption of IsSquashableConst() where we want > > only constants to be allowed because EXECUTE parameters can be any > > kind of Expr nodes. At least that's the intention of the code on > > HEAD. > > > > Now, I am not actually objecting about PARAM_EXTERN included or not if > > there's a consensus behind it and my arguments are considered as not > > relevant. The patch is written so as it claims that a PARAM_EXTERN > > implies the expression to be a Const, but it may not be so depending > > on what the execution path is given for the parameter. Or at least > > the patch could be clearer and rename the parts about the "Const" > > squashable APIs around queryjumblefuncs.c. > > > > [...] > > > > The PARAM_EXTERN part has been mentioned a couple of weeks ago here, > > btw: > > https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=VpEA@mail.gmail.com > > In fact, this has been discussed much earlier in the thread above, as > essentially the same implementation with T_Params, which is submitted > here, was part of the original patch. The concern was always whether or > not it will break any assumption about query identification, because > this way much broader scope of expressions will be considered equivalent > for query id computation purposes. > > At the same time after thinking about this concern more, I presume it > already happens at a smaller scale -- when two queries happen to have > the same number of parameters, they will be indistinguishable even if > parameters are different in some way. I don't think limiting this feature to Const only will suffice. I think what we should really allow the broader scope of expressions that are allowed via prepared statements, and this will make this implementation consistent between prepared vs non-prepared statements. I don't see why not. In fact, when we are examining the ArrayExpr, I think the only thing we should not squash is if we find a Sublink ( i.e. SELECT statement inside the array ). > > To be honest, the situation of HEAD makes me question whether we are > > using the right approach for this facility. I did mention a couple of > > months ago about an alternative, but it comes down to accept that any > > expressions would be normalized, unfortunately I never got down to > > study it in details as this touches around expr_list in the parser: we > > could detect in the parser the start and end locations of an > > expression list in a query string, then group all of them together > > based on their location in the string. This would be also cheaper > > than going through all the elements in the list, tweaking things when > > dealing with a subquery.. > > Not entirely sure how that would work exactly, but after my experiments > with the squashing patch I found it could be very useful to be able to > identify the end location of an expression list in the parser. I also came to the same conclusion, that we should track the start '(' and end ')' location of a expression list to allow us to hide the fields. But, I will look into other approaches as well. I am really leaning towards that we should revert this feature as the limitation we have now with parameters is a rather large one and I think we need to go back and address this issue. -- Sami
On Thu, May 01, 2025 at 03:57:16PM -0500, Sami Imseih wrote: > I think what we should really allow the broader scope of expressions that > are allowed via prepared statements, and this will make this implementation > consistent between prepared vs non-prepared statements. I don't see why > not. In fact, when we are examining the ArrayExpr, I think the only > thing we should > not squash is if we find a Sublink ( i.e. SELECT statement inside the array ). Likely so. I don't have anything else than Sublink in mind that would be worth a special case.. > I am really leaning towards that we should revert this feature as the > limitation we have now with parameters is a rather large one and I think > we need to go back and address this issue. I am wondering if this would not be the best move to do on HEAD. Let's see where the discussion drives us. -- Michael
Вложения
> On Fri, May 02, 2025 at 07:10:19AM GMT, Michael Paquier wrote: > > I am really leaning towards that we should revert this feature as the > > limitation we have now with parameters is a rather large one and I think > > we need to go back and address this issue. > > I am wondering if this would not be the best move to do on HEAD. > Let's see where the discussion drives us. Squashing constants was ment to be a first step towards doing the same for other types of queries (params, rte_values), reverting it to implement everything at once makes very little sense to me.
On Fri, May 02, 2025 at 09:13:39AM +0200, Dmitry Dolgov wrote: > Squashing constants was ment to be a first step towards doing the same > for other types of queries (params, rte_values), reverting it to > implement everything at once makes very little sense to me. That depends. If we conclude that tracking this information through the parser based on the start and end positions in a query string for a set of values is more relevant, then we would be redesigning the facility from the ground, so the old approach would not be really relevant.. -- Michael
Вложения
> On Fri, May 02, 2025 at 04:18:37PM GMT, Michael Paquier wrote: > On Fri, May 02, 2025 at 09:13:39AM +0200, Dmitry Dolgov wrote: > > Squashing constants was ment to be a first step towards doing the same > > for other types of queries (params, rte_values), reverting it to > > implement everything at once makes very little sense to me. > > That depends. If we conclude that tracking this information through > the parser based on the start and end positions in a query string > for a set of values is more relevant, then we would be redesigning the > facility from the ground, so the old approach would not be really > relevant.. If I understand you correctly, changing the way how element list is identified is not going to address the question whether or not to squash parameters, right?
On 2025-May-02, Michael Paquier wrote: > That depends. If we conclude that tracking this information through > the parser based on the start and end positions in a query string > for a set of values is more relevant, then we would be redesigning the > facility from the ground, so the old approach would not be really > relevant.. I disagree that a revert is warranted for this reason. If you want to change the implementation later, that's fine, as long as the user interface doesn't change. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
> On Thu, May 01, 2025 at 09:55:47PM GMT, Dmitry Dolgov wrote: > > On Thu, May 01, 2025 at 09:29:13AM GMT, Michael Paquier wrote: > > > > I agree that the current solution we have in the tree feels incomplete > > because we are not taking into account the most common cases that > > users would care about. Now, allowing PARAM_EXTERN means that we > > allow any expression to be detected as a squashable thing, and this > > kinds of breaks the assumption of IsSquashableConst() where we want > > only constants to be allowed because EXECUTE parameters can be any > > kind of Expr nodes. At least that's the intention of the code on > > HEAD. > > > > Now, I am not actually objecting about PARAM_EXTERN included or not if > > there's a consensus behind it and my arguments are considered as not > > relevant. The patch is written so as it claims that a PARAM_EXTERN > > implies the expression to be a Const, but it may not be so depending > > on what the execution path is given for the parameter. Or at least > > the patch could be clearer and rename the parts about the "Const" > > squashable APIs around queryjumblefuncs.c. > > > > [...] > > > > The PARAM_EXTERN part has been mentioned a couple of weeks ago here, > > btw: > > https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=VpEA@mail.gmail.com > > In fact, this has been discussed much earlier in the thread above, as > essentially the same implementation with T_Params, which is submitted > here, was part of the original patch. The concern was always whether or > not it will break any assumption about query identification, because > this way much broader scope of expressions will be considered equivalent > for query id computation purposes. > > At the same time after thinking about this concern more, I presume it > already happens at a smaller scale -- when two queries happen to have > the same number of parameters, they will be indistinguishable even if > parameters are different in some way. Returning to the topic of whether to squash list of Params. Originally squashing of Params wasn't included into the squashing patch due to concerns from reviewers about treating quite different queries as the same for the purposes of query identification. E.g. there is some assumption somewhere, which will be broken if we treat query with a list of integer parameters same as a query with a list of float parameters. For the sake of making progress I've decided to postpone answering this question and concentrate on more simple scenario. Now, as the patch was applied, I think it's a good moment to reflect on those concerns. It's not enough to say that we don't see any problems with squashing of Param, some more sound argumentation is needed. So, what will happen if parameters are squashed as constants? 1. One obvious impact is that more queries, that were considered distinct before, will have the same normalized query and hence the entry in pg_stat_statements. Since a Param could be pretty much anything, this can lead to a situation when two queries with quiet different performance profiles (e.g. one contains a parameter value, which is a heavy function, another one doesn't) are matched to one entry, making it less useful. But at the same time this already can happen if those two queries have the same number of parameters, since query parametrizing is intrinsically lossy in this sense. The only thing we do by squashing such queries is we loose information about the number of parameters, not properties of the parameters themselves. 2. Another tricky scenario is when queryId is used by some extension, which in turn makes assumption about it that are going to be rendered incorrect by squashing. The only type of assumptions I can imagine falling into this category is anything about equivalence of queries. For example, an extension can capture two queries, which have the same normalized entry in pgss, and assume all properties of those queries are the same. It's worth noting that normalized query is not transitive, i.e. if a query1 has the normalized version query_norm, and a query2 has the same normalized version query_norm, it doesn't mean query1 is equivalent query2 in all senses (e.g. they could have list of parameter values with different type and the same size). That means that such assumptions are already faulty, and could work most of the time only because it takes queries with a list of the same size to break the assumption. Squashing such queries will make them wrong more often. One can argue that we might want to be friendly to such extensions, and do not "break" them even further. But I don't think it's worth it, as number of such extensions is most likely low, if any. One more extreme case would be when an extension assumes that queries with the same entry in pgss have the same number of parameters, but I don't see how such assumption could be useful. 3. More annoying is the consequence that parameters are going to be treated as constants in pg_stat_statements. While mostly harmless, that would mean they're going to be replaced in the same way as constants. This means that the parameter order is going to be lost, e.g.: SELECT * FROM test_squash WHERE data IN ($4, $3, $2, $1) \bind 1 2 3 4 -- output SELECT * FROM test_squash WHERE data IN ($1 /*, ... */) SELECT * FROM test_squash WHERE data IN ($1, $2, $3, $4) AND id IN ($5, $6, $7, $8) \bind 1 2 3 4 5 6 7 8 -- output SELECT * FROM test_squash WHERE data IN ($1 /*, ... */) AND id IN ($2 /*, ... */) This representation could be confusing of course. It could be either explained in the documentation, or LocationLen has to be extended to carry information about whether it's a constant or a parameter, and do not replace the latter. In any case, anything more than the first parameter number will be lost, but it's probably not so dramatic. At the end of the day, I think the value of squashing for parameters outweighs the problems described above. As long as there is an agreement about that, it's fine by me. I've attached the more complete version of the patch (but without modifying LocationLen to not replace Param yet) in case if such agreemeng will be achieved.
Вложения
On Fri, 2 May 2025 14:56:56 +0200 Álvaro Herrera <alvherre@kurilemu.de> wrote: > On 2025-May-02, Michael Paquier wrote: > > > That depends. If we conclude that tracking this information through > > the parser based on the start and end positions in a query string > > for a set of values is more relevant, then we would be redesigning > > the facility from the ground, so the old approach would not be > > really relevant.. > > I disagree that a revert is warranted for this reason. If you want to > change the implementation later, that's fine, as long as the user > interface doesn't change. > FWIW, i'm +1 on leaving it in pg18. Prepared statements often look a little different in other ways, and there are a bunch of other quirks in how queryid's are calculated too. Didn't there used to be something with CALL being handled as a utility statement making stored procs look different from functions? -- To know the thoughts and deeds that have marked man's progress is to feel the great heart throbs of humanity through the centuries; and if one does not feel in these pulsations a heavenward striving, one must indeed be deaf to the harmonies of life. Helen Keller. Let Us Have Faith. Doubleday, Doran & Company, 1940.
Michael Paquier <michael@paquier.xyz> writes: > On Thu, May 01, 2025 at 03:57:16PM -0500, Sami Imseih wrote: >> I think what we should really allow the broader scope of expressions that >> are allowed via prepared statements, and this will make this implementation >> consistent between prepared vs non-prepared statements. I don't see why >> not. In fact, when we are examining the ArrayExpr, I think the only >> thing we should >> not squash is if we find a Sublink ( i.e. SELECT statement inside the array ). > Likely so. I don't have anything else than Sublink in mind that would > be worth a special case.. I think this is completely wrong. As simple examples, there is nothing even a little bit comparable between the behaviors of t1.x IN (1, 2, 3) t1.x IN (1, 2, t2.y) t1.x IN (1, 2, random()) Squashing these to look the same would be doing nobody any favors. I do agree that treating PARAM_EXTERN Params the same as constants for this purpose is a reasonable thing to do, on three arguments: 1. A PARAM_EXTERN Param actually behaves largely the same as a Const so far as a query is concerned: it does not change value across the execution of the query. (This is not true of other kinds of Params.) 2. It's very much dependent on the client-side stack whether a given value that is constant in the mind of the application will be passed to the backend as a Const or a Param. (This is okay because #1.) 3. Even if the value is passed as a Param, the planner might replace it by a Const by means of generating a custom query plan. So the boundary between PARAM_EXTERN Params and Consts is actually mighty squishy, and thus I think it makes sense for pg_stat_statements to mash them together. But this logic does not extend to Vars or function calls or much of anything else. Maybe in the future we could have a discussion about whether expressions involving only Params, Consts, and immutable functions (say, "$1 + 1") could be mashed as though they were constants, on the grounds that they'd have been reduced to a single constant if the planner had chosen to generate a custom plan. But I think it's too late to consider that for v18. I'd be okay with the rule "treat any list of Consts and PARAM_EXTERN Params the same as any other" for v18. I also agree with Alvaro that this discussion doesn't justify a revert. If the pre-v18 behavior wasn't chiseled on stone tablets, the new behavior isn't either. We can improve it some more later. regards, tom lane
Hi Dmitry, On Sun, May 4, 2025 at 6:19 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Thu, May 01, 2025 at 09:55:47PM GMT, Dmitry Dolgov wrote: > > > On Thu, May 01, 2025 at 09:29:13AM GMT, Michael Paquier wrote: > > > > > > I agree that the current solution we have in the tree feels incomplete > > > because we are not taking into account the most common cases that > > > users would care about. Now, allowing PARAM_EXTERN means that we > > > allow any expression to be detected as a squashable thing, and this > > > kinds of breaks the assumption of IsSquashableConst() where we want > > > only constants to be allowed because EXECUTE parameters can be any > > > kind of Expr nodes. At least that's the intention of the code on > > > HEAD. > > > > > > Now, I am not actually objecting about PARAM_EXTERN included or not if > > > there's a consensus behind it and my arguments are considered as not > > > relevant. The patch is written so as it claims that a PARAM_EXTERN > > > implies the expression to be a Const, but it may not be so depending > > > on what the execution path is given for the parameter. Or at least > > > the patch could be clearer and rename the parts about the "Const" > > > squashable APIs around queryjumblefuncs.c. > > > > > > [...] > > > > > > The PARAM_EXTERN part has been mentioned a couple of weeks ago here, > > > btw: > > > https://www.postgresql.org/message-id/CAA5RZ0tu6_KRiYJCFptf4_--wjFSu9cZMj1XNmOCqTNxu=VpEA@mail.gmail.com > > > > In fact, this has been discussed much earlier in the thread above, as > > essentially the same implementation with T_Params, which is submitted > > here, was part of the original patch. The concern was always whether or > > not it will break any assumption about query identification, because > > this way much broader scope of expressions will be considered equivalent > > for query id computation purposes. > > > > At the same time after thinking about this concern more, I presume it > > already happens at a smaller scale -- when two queries happen to have > > the same number of parameters, they will be indistinguishable even if > > parameters are different in some way. > > Returning to the topic of whether to squash list of Params. > > Originally squashing of Params wasn't included into the squashing patch due to > concerns from reviewers about treating quite different queries as the same for > the purposes of query identification. E.g. there is some assumption somewhere, > which will be broken if we treat query with a list of integer parameters same > as a query with a list of float parameters. For the sake of making progress > I've decided to postpone answering this question and concentrate on more simple > scenario. Now, as the patch was applied, I think it's a good moment to reflect > on those concerns. It's not enough to say that we don't see any problems with > squashing of Param, some more sound argumentation is needed. So, what will > happen if parameters are squashed as constants? > > 1. One obvious impact is that more queries, that were considered distinct > before, will have the same normalized query and hence the entry in > pg_stat_statements. Since a Param could be pretty much anything, this can lead > to a situation when two queries with quiet different performance profiles (e.g. > one contains a parameter value, which is a heavy function, another one doesn't) > are matched to one entry, making it less useful. > > But at the same time this already can happen if those two queries have the same > number of parameters, since query parametrizing is intrinsically lossy in this > sense. The only thing we do by squashing such queries is we loose information > about the number of parameters, not properties of the parameters themselves. > > 2. Another tricky scenario is when queryId is used by some extension, which in > turn makes assumption about it that are going to be rendered incorrect by > squashing. The only type of assumptions I can imagine falling into this > category is anything about equivalence of queries. For example, an extension > can capture two queries, which have the same normalized entry in pgss, and > assume all properties of those queries are the same. > > It's worth noting that normalized query is not transitive, i.e. if a query1 has > the normalized version query_norm, and a query2 has the same normalized version > query_norm, it doesn't mean query1 is equivalent query2 in all senses (e.g. > they could have list of parameter values with different type and the same > size). That means that such assumptions are already faulty, and could work most > of the time only because it takes queries with a list of the same size to break > the assumption. Squashing such queries will make them wrong more often. > > One can argue that we might want to be friendly to such extensions, and do not > "break" them even further. But I don't think it's worth it, as number of such > extensions is most likely low, if any. One more extreme case would be when an > extension assumes that queries with the same entry in pgss have the same number > of parameters, but I don't see how such assumption could be useful. > > 3. More annoying is the consequence that parameters are going to be treated as > constants in pg_stat_statements. While mostly harmless, that would mean they're > going to be replaced in the same way as constants. This means that the > parameter order is going to be lost, e.g.: > > SELECT * FROM test_squash WHERE data IN ($4, $3, $2, $1) \bind 1 2 3 4 > -- output > SELECT * FROM test_squash WHERE data IN ($1 /*, ... */) > > SELECT * FROM test_squash WHERE data IN ($1, $2, $3, $4) > AND id IN ($5, $6, $7, $8) \bind 1 2 3 4 5 6 7 8 > -- output > SELECT * FROM test_squash WHERE data IN ($1 /*, ... */) > AND id IN ($2 /*, ... */) > > This representation could be confusing of course. It could be either explained > in the documentation, or LocationLen has to be extended to carry information > about whether it's a constant or a parameter, and do not replace the latter. In > any case, anything more than the first parameter number will be lost, but it's > probably not so dramatic. > > At the end of the day, I think the value of squashing for parameters outweighs > the problems described above. As long as there is an agreement about that, it's > fine by me. I've attached the more complete version of the patch (but without > modifying LocationLen to not replace Param yet) in case if such agreemeng will > be achieved. Would it make sense to rename `RecordConstLocation` to something like `RecordExpressionLocation` instead? - /* Array of locations of constants that should be removed */ + /* Array of locations of constants that should be removed and parameters */ LocationLen *clocations; should be + /* Array of locations of constants and parameters that should be removed */ You could also consider renaming `clocations` to `elocations`, this may introduce some additional churn though. -- Regards Junwang Zhao
> On Tue, May 06, 2025 at 11:50:07PM GMT, Junwang Zhao wrote: > Would it make sense to rename `RecordConstLocation` to something like > `RecordExpressionLocation` instead? Yeah, naming is hard. RecordExpressionLocation is somehow more vague, but I see what you mean, maybe something along these lines would be indeed a better fit. > - /* Array of locations of constants that should be removed */ > + /* Array of locations of constants that should be removed and parameters */ > LocationLen *clocations; > > should be > > + /* Array of locations of constants and parameters that should be removed */ That was clumsy but intentional, because contrary to constants parameters do not need to be removed. I guess I have to change the wording a bit to make it clear.
> I also agree with Alvaro that this discussion doesn't justify a > revert. If the pre-v18 behavior wasn't chiseled on stone tablets, > the new behavior isn't either. We can improve it some more later. As I was looking further into what we currently have in v18 and HEAD the normalization could break if we pass a function. For example, """ select where 1 in (1, 2, int4(1)); """ the normalized string is, """ select where $1 in ($2 /*, ... */)) """ Notice the extra close parenthesis that is added after the comment. This is because although int4(1) is a function call it is rewritten as a Const and that breaks the assumptions being made by the location of the last expression. Also, something like: """ select where 1 in (1, 2, cast(4 as int)); """ is normalized as: """ select where $1 in ($2 /*, ... */ as int)) """ I don't think the current state is acceptable, if it results in pg_s_s storing an invalid normalized version of the sql. Now, with the attached v2 supporting external params, we see other normalization anomalies such as """ postgres=# select where $1 in ($3, $2) and 1 in ($4, cast($5 as int)) \bind 0 1 2 3 4 postgres-# ; -- (0 rows) postgres=# select toplevel, query, calls from pg_stat_statements; toplevel | query | calls ----------+-------------------------------------------------------------------------+------- t | select where $1 in ($2 /*, ... */) and $3 in ($4 /*, ... */($5 as int)) | 1 (1 row) """ Without properly accounting for the boundaries of the list of expressions, i.e., the start and end positions of '(' and ')' or '[' and ']' and normalizing the expressions in between, it will be very difficult for the normalization to behave sanely. thoughts? -- Sami Imseih Amazon Web Services (AWS)
> On Tue, May 06, 2025 at 01:32:48PM GMT, Sami Imseih wrote: > > I also agree with Alvaro that this discussion doesn't justify a > > revert. If the pre-v18 behavior wasn't chiseled on stone tablets, > > the new behavior isn't either. We can improve it some more later. > > As I was looking further into what we currently have in v18 and HEAD > the normalization could break if we pass a function. > > [...] > > Without properly accounting for the boundaries of the list of expressions, i.e., > the start and end positions of '(' and ')' or '[' and ']' and normalizing the > expressions in between, it will be very difficult for the normalization to > behave sanely. I don't think having the end location in this case would help -- when it comes to ParseFuncOrColumn, looks like for coerce functions it just replaces the original FuncCall with the argument expression. Meaning that when jumbling we have only the coerce argument expression (Const), which ends before the closing brace, not the parent expression. Maybe it would be possible to address thins in not too complicated way in fill_in_constant_lengths, since it already operates with parsed tokens.
> > Without properly accounting for the boundaries of the list of expressions, i.e., > > the start and end positions of '(' and ')' or '[' and ']' and normalizing the > > expressions in between, it will be very difficult for the normalization to > > behave sanely. > > I don't think having the end location in this case would help -- when it > comes to ParseFuncOrColumn, looks like for coerce functions it just > replaces the original FuncCall with the argument expression. Meaning > that when jumbling we have only the coerce argument expression (Const), > which ends before the closing brace, not the parent expression. If we are picking up the start and end points from gram.c and we add these positions to A_Expr or A_ArrayExpr and then make them available to ArrayExpr, then we know the exact boundary of the IN list. Even if a function call is simplified down to a constant, it will not really matter because we are going to normalize between the original opening and closing parentheses of the IN list. (Actually, we can even track the actual textual starting and end point of a List as well) Attached ( not in patch form ) is the idea for this. ``` postgres=# select where 1 in (1, int4(1)); -- (1 row) postgres=# select where 1 in (1, int4($1::int)) \bind 1 postgres-# ; -- (1 row) postgres=# select toplevel, query, calls from pg_stat_statements; toplevel | query | calls ----------+------------------------------------+------- t | select where $1 in ($2 /*, ... */) | 2 (1 row) ``` What do you think? -- Sami Imseih
Вложения
On Tue, May 06, 2025 at 01:32:48PM -0500, Sami Imseih wrote: > Without properly accounting for the boundaries of the list of expressions, i.e., > the start and end positions of '(' and ')' or '[' and ']' and normalizing the > expressions in between, it will be very difficult for the normalization to > behave sanely. FWIW, this is exactly the kind of issues we have spent time on when improving the location detection of sub-queries for some DDL patterns, and the parser is the only path I am aware of where this can be done sanely because the extra characters that may, or may not, be included in some of the expressions would be naturally discarded based on the @ locations we retrieve. For reference, this story is part of 499edb09741b, more precisely around this part of the thread: https://www.postgresql.org/message-id/CACJufxF9hqyfmKEdpiG%3DPbrGdKVNP2BQjHFJh4q6639sV7NmvQ%40mail.gmail.com (FWIW, I've seen assumptions around the detection of specific locations done outside the parser in pg_hint_plan as well, that did not finish well because the code makes assumptions that natural parsers are just better at because they're designed to detect such cases.) -- Michael
Вложения
> On Tue, May 06, 2025 at 03:01:32PM GMT, Sami Imseih wrote: > > > Without properly accounting for the boundaries of the list of expressions, i.e., > > > the start and end positions of '(' and ')' or '[' and ']' and normalizing the > > > expressions in between, it will be very difficult for the normalization to > > > behave sanely. > > > > I don't think having the end location in this case would help -- when it > > comes to ParseFuncOrColumn, looks like for coerce functions it just > > replaces the original FuncCall with the argument expression. Meaning > > that when jumbling we have only the coerce argument expression (Const), > > which ends before the closing brace, not the parent expression. > > If we are picking up the start and end points from gram.c and we add these > positions to A_Expr or A_ArrayExpr and then make them available to ArrayExpr, > then we know the exact boundary of the IN list. Even if a function > call is simplified down > to a constant, it will not really matter because we are going to normalize > between the original opening and closing parentheses of the IN list. > What do you think? Ah, I see what you mean. I think the idea is fine, it will simplify certain things as well as address the issue. But I'm afraid adding start/end location to A_Expr is a bit too invasive, as it's being used for many other purposes. How about introducing a new expression for this purposes, and use it only in in_expr/array_expr, and wrap the corresponding expressions into it? This way the change could be applied in a more targeted fashion.
On Thu, May 8, 2025 at 2:36 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Thu, May 08, 2025 at 02:22:00PM GMT, Michael Paquier wrote: > > On Wed, May 07, 2025 at 10:41:22AM +0200, Dmitry Dolgov wrote: > > > Ah, I see what you mean. I think the idea is fine, it will simplify > > > certain things as well as address the issue. But I'm afraid adding > > > start/end location to A_Expr is a bit too invasive, as it's being used > > > for many other purposes. How about introducing a new expression for this > > > purposes, and use it only in in_expr/array_expr, and wrap the > > > corresponding expressions into it? This way the change could be applied > > > in a more targeted fashion. > > > > Yes, that feels invasive. The use of two static variables to track > > the start and the end positions in an expression list can also be a > > bit unstable to rely on, I think. It seems to me that this part > > could be handled in a new Node that takes care of tracking the two > > positions, instead, be it a start/end couple or a location/length > > couple? That seems necessary to have when going through > > jumbleElements(). > > To clarify, I had in mind something like in the attached patch. The > idea is to make start/end location capturing relatively independent from > the constants squashing. The new parsing node conveys the location > information, which is then getting transformed to be a part of an > ArrayExpr. It's done for in_expr only here, something similar would be > needed for array_expr as well. Feedback is appreciated. Thanks! I took a quick look at v1-0001 and it feels like a much better approach than the quick hack I put together earlier. I will look thoroughly. -- Sami
Hi Dmitry, On Fri, May 9, 2025 at 3:36 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Thu, May 08, 2025 at 02:22:00PM GMT, Michael Paquier wrote: > > On Wed, May 07, 2025 at 10:41:22AM +0200, Dmitry Dolgov wrote: > > > Ah, I see what you mean. I think the idea is fine, it will simplify > > > certain things as well as address the issue. But I'm afraid adding > > > start/end location to A_Expr is a bit too invasive, as it's being used > > > for many other purposes. How about introducing a new expression for this > > > purposes, and use it only in in_expr/array_expr, and wrap the > > > corresponding expressions into it? This way the change could be applied > > > in a more targeted fashion. > > > > Yes, that feels invasive. The use of two static variables to track > > the start and the end positions in an expression list can also be a > > bit unstable to rely on, I think. It seems to me that this part > > could be handled in a new Node that takes care of tracking the two > > positions, instead, be it a start/end couple or a location/length > > couple? That seems necessary to have when going through > > jumbleElements(). > > To clarify, I had in mind something like in the attached patch. The > idea is to make start/end location capturing relatively independent from > the constants squashing. The new parsing node conveys the location > information, which is then getting transformed to be a part of an > ArrayExpr. It's done for in_expr only here, something similar would be > needed for array_expr as well. Feedback is appreciated. +/* + * A wrapper expression to record start and end location + */ +typedef struct LocationExpr +{ + NodeTag type; + + /* the node to be wrapped */ + Node *expr; + /* token location, or -1 if unknown */ + ParseLoc start_location; + /* token location, or -1 if unknown */ + ParseLoc end_location; +} LocationExpr; Why not a location and a length, it should be more natural, it seems we use this convention in some existing nodes, like RawStmt, InsertStmt etc. -- Regards Junwang Zhao
On Fri, May 09, 2025 at 11:05:43AM +0800, Junwang Zhao wrote: > Why not a location and a length, it should be more natural, it > seems we use this convention in some existing nodes, like > RawStmt, InsertStmt etc. These are new concepts as of Postgres 18 (aka only on HEAD), chosen mainly to match with the internals of pg_stat_statements as far as I recall. Doing the same here would not hurt, but it may be better depending on the cases to rely on a start/end. I suspect that switching from one to the other should not change much the internal squashing logic. -- Michael
Вложения
On Fri, May 9, 2025 at 1:35 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, May 09, 2025 at 11:05:43AM +0800, Junwang Zhao wrote: > > Why not a location and a length, it should be more natural, it > > seems we use this convention in some existing nodes, like > > RawStmt, InsertStmt etc. > > These are new concepts as of Postgres 18 (aka only on HEAD), chosen > mainly to match with the internals of pg_stat_statements as far as I > recall. Doing the same here would not hurt, but it may be better > depending on the cases to rely on a start/end. ISTM that for string manipulation, start_pos/length are more appropriate, start/end are often better suited for iterator use, where start refers to the first element and end marks the position one past the last element. Just my opinion, I can live with either way though. > I suspect that > switching from one to the other should not change much the internal > squashing logic. Yeah, not much difference, one can easily be computed from the other. > -- > Michael -- Regards Junwang Zhao
> On Fri, May 09, 2025 at 02:35:33PM GMT, Michael Paquier wrote: > On Fri, May 09, 2025 at 11:05:43AM +0800, Junwang Zhao wrote: > > Why not a location and a length, it should be more natural, it > > seems we use this convention in some existing nodes, like > > RawStmt, InsertStmt etc. > > These are new concepts as of Postgres 18 (aka only on HEAD), chosen > mainly to match with the internals of pg_stat_statements as far as I > recall. Doing the same here would not hurt, but it may be better > depending on the cases to rely on a start/end. I suspect that > switching from one to the other should not change much the internal > squashing logic. Right, switching from start/length to start/end wouldn't change much for squashing. I didn't have any strong reason to go with start/end from my side, so if start/length is more aligned with other nodes, let's change that.
> On Fri, May 09, 2025 at 08:47:58AM GMT, Michael Paquier wrote: > SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; > - query | calls > -----------------------------------------------------+------- > - SELECT ARRAY[$1 /*, ... */] | 1 > - SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 > + query | calls > +-------------------------------------------------------+------- > + SELECT ARRAY[$1, $2, $3, $4, $5, $6, $7, $8, $9, $10] | 1 > + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 > (2 rows) > > Yes, we are going to need more than that for such cases if we want to > cover all the ground we're aiming for. > > Putting that aside, the test coverage for ARRAY[] elements is also > very limited on HEAD with one single test only with a set of > constants. We really should improve that, tracking more patterns and > more mixed combinations to see what gets squashed and what is not. So > this should be extended with more cases, including expressions, > parameters and sublinks, with checks on pg_stat_statements.calls to > see how the counters are aggregated. That's going to be important > when people play with this code to track how things change when > manipulating the element jumbling. I'd suggest to do that separately > of the rest. Agree, I'll try to extend number of test cases here as a separate patch.
> > To clarify, I had in mind something like in the attached patch. The > > idea is to make start/end location capturing relatively independent from > > the constants squashing. The new parsing node conveys the location > > information, which is then getting transformed to be a part of an > > ArrayExpr. It's done for in_expr only here, something similar would be > > needed for array_expr as well. Feedback is appreciated. > > Thanks! I took a quick look at v1-0001 and it feels like a much better approach > than the quick hack I put together earlier. I will look thoroughly. I took a look at v1-0001 and I am wondering if this can be further simplified. We really need a new Node just to wrap the start/end locations of the List coming from the in_expr and this node should not really be needed past parsing. array_expr is even simpler because we have the boundaries available when makeAArrayExpr is called. So, I think we can create a new parse node ( parsenode.h ) that will only be used in parsing (and gram.c only ) to track the start/end locations and List and based on this node we can create A_ArrayExpr and A_Expr with the List of boundaries, and then all we have to do is update ArrayExpr with the boundaries during the respective transformXExpr call. This seems like a much simpler approach that also addresses Michael's concern of defining static variables in gram.y to track the boundaries. what do you think? -- Sami
> On Fri, May 09, 2025 at 12:47:19PM GMT, Sami Imseih wrote: > So, I think we can create a new parse node ( parsenode.h ) that will only be > used in parsing (and gram.c only ) to track the start/end locations > and List and > based on this node we can create A_ArrayExpr and A_Expr with the List > of boundaries, > and then all we have to do is update ArrayExpr with the boundaries during > the respective transformXExpr call. This seems like a much simpler approach > that also addresses Michael's concern of defining static variables in gram.y to > track the boundaries. The static variables was only part of the concern, another part was using A_Expr to carry this information, which will have impact on lots of unrelated code.
> > On Fri, May 09, 2025 at 12:47:19PM GMT, Sami Imseih wrote: > > So, I think we can create a new parse node ( parsenode.h ) that will only be > > used in parsing (and gram.c only ) to track the start/end locations > > and List and > > based on this node we can create A_ArrayExpr and A_Expr with the List > > of boundaries, > > and then all we have to do is update ArrayExpr with the boundaries during > > the respective transformXExpr call. This seems like a much simpler approach > > that also addresses Michael's concern of defining static variables in gram.y to > > track the boundaries. > > The static variables was only part of the concern, another part was > using A_Expr to carry this information, which will have impact on lots > of unrelated code. What would be the problem if A_Expr carries an extra pointer to a List? It already had other fields, rexpr, lexpr and location that could be no-op. Also, LocationExpr is not really an expression node, but a wrapper to an expression node, so I think it's wrong to define it as a Node and be required to add the necessary handling for it in nodeFuncs.c. I think we can just define it as a struct in gram.y so it can carry the locations of the expression and then set the List of the location boundaries in A_Expr and A_ArrayExpr. right? typedef struct LocationExpr { Node *expr; ParseLoc start_location; ParseLoc end_location; } LocationExpr; -- Sami
> On Mon, May 12, 2025 at 06:40:43PM GMT, Sami Imseih wrote: > > The static variables was only part of the concern, another part was > > using A_Expr to carry this information, which will have impact on lots > > of unrelated code. > > What would be the problem if A_Expr carries an extra pointer to a List? > It already had other fields, rexpr, lexpr and location that could be no-op. They can be empty sometimes, but the new fields will be empty 99% of the time. This is a clear sign to me that this informaton does not belong to a node for "infix, prefix, and postfix expressions", don't you think?
> On Tue, May 20, 2025 at 06:30:25AM GMT, Michael Paquier wrote: > On Mon, May 12, 2025 at 06:40:43PM -0400, Sami Imseih wrote: > > Also, LocationExpr is not really an expression node, but a wrapper to > > an expression node, so I think it's wrong to define it as a Node and be > > required to add the necessary handling for it in nodeFuncs.c. I think we > > can just define it as a struct in gram.y so it can carry the locations of the > > expression and then set the List of the location boundaries in > > A_Expr and A_ArrayExpr. right? > > Right. LocationExpr is not a full Node, so if we can do these > improvements without it we have less maintenance to worry about across > the board with less code paths. At the end, I think that we should > try to keep the amount of work done by PGSS as minimal as possible. > > I was a bit worried about not using a Node but Sami has reminded me > last week that we already have in gram.y the concept of using some > private structures to track intermediate results done by the parsing > that we sometimes do not want to push down to the code calling the > parser. If we can do the same, the result could be nicer. I believe it's worth to not only to keep amount of work to support LocationExpr as minimal as possible, but also impact on the existing code. What I see as a problem is keeping such specific information as the location boundaries in such a generic expression as A_Expr, where it will almost never be used. Do I get it right, you folks are ok with that? At the same time AFAICT there isn't much more code paths to worry about in case of a LocationExpr as a node -- in the end all options would have to embed the location information into ArrayExpr during transformation, independently from how this information was conveyed. Aside that the only extra code we've got is node functions (exprType, etc). Is there anything I'm missing here? > By the way, the new test cases for ARRAY lists are sent in the last > patch of the series posted on this thread: > https://www.postgresql.org/message-id/7zbzwk4btnxoo4o3xbtzefoqvht54cszjj4bol22fmej5nmgkf@dbcn4wtakw4y > > These should be first in the list, IMO, so as it is possible to track > what the behavior was before the new logic as of HEAD, and what the > behavior would become after the new logic. Sure, I can reshuffle that. BTW, I'm going to be away for a couple of weeks soon. So if you want to decide one way or another soonish, let's do it now.
> On Tue, May 20, 2025 at 11:03:52AM GMT, Dmitry Dolgov wrote: > > By the way, the new test cases for ARRAY lists are sent in the last > > patch of the series posted on this thread: > > https://www.postgresql.org/message-id/7zbzwk4btnxoo4o3xbtzefoqvht54cszjj4bol22fmej5nmgkf@dbcn4wtakw4y > > > > These should be first in the list, IMO, so as it is possible to track > > what the behavior was before the new logic as of HEAD, and what the > > behavior would become after the new logic. > > Sure, I can reshuffle that. Here is it, but the results are pretty much expected.
Вложения
> On Tue, May 20, 2025 at 04:50:12PM GMT, Sami Imseih wrote: > > At the same time AFAICT there isn't much more code paths > > to worry about in case of a LocationExpr as a node > > I can imagine there are others like value expressions, > row expressions, json array expressions, etc. that we may > want to also normalize. Exactly. When using a node, one can explicitly wrap whatever is needed into it, while otherwise one would need to find a new way to piggy back on A_Expr in a new context. > There are other examples of fields that are minimally used in other structs. > Here is one I randomly spotted in SelectStmt such as SortClause, Limit options, > etc. The way I see it, there is a difference -- I assume those structures were designed for such cases, where the location range would be just slapped on top of A_Expr. > Attached is a sketch of what I mean. There is a private struct that tracks > the list boundaries and this can wrap in_expr or whatever else makes > sense in the future. Just fyi, I don't think this thread is attached to any CF item, meaning it will not be pulled by the CF bot. In that case feel free to post diffs in the patch format. I'll take a look at the proposed change, but a bit later.
> On Wed, May 21, 2025 at 08:22:19PM GMT, Sami Imseih wrote: > > > > At the same time AFAICT there isn't much more code paths > > > > to worry about in case of a LocationExpr as a node > > > > > > I can imagine there are others like value expressions, > > > row expressions, json array expressions, etc. that we may > > > want to also normalize. > > > Exactly. When using a node, one can explicitly wrap whatever is needed > > into it, while otherwise one would need to find a new way to piggy back > > on A_Expr in a new context. > > Looking at the VALUES expression case, we will need to carry the info > with SelectStmt and ultimately to RangeTblEntry which is where the > values_list is, so either approach we take RangeTblEntry will need the > LocationExpr pointer or the additional ParseLoc info I am suggesting. > A_Expr is not used in the values list case. Right, that's precisely my point -- introducing a new node will allow to to use the same generalized mechanism in such scenarios as well, instead of every time inventing something new. > > I'll take a look at the proposed change, but a bit later. > > Here is a v4 to compare with v3. > > 0001- is the infrastructure to track the boundaries > 0002- the changes to jumbling Just to call this out, I don't think there is an agreement on squashing Params, which you have added into 0002. Let's discuss this change separately from the 18 open item. --- Here is a short summary of the open item: * An issue has been discovered with the squashing feature in 18, which can lead to invalid normalized queries in pg_stat_statement. * The proposed fix extends gram.y functionality capturing the end location for list expressions to address that. * There is a disagreement on how exactly to capture the location, the options are introducing a new node LocationExpr or piggy back on an existing A_Expr. I find the former more flexible and less invasive, but looks like there are also other opinions. Now, both flavour of the proposed solution could be still concidered too invasive to be applied as a bug fix. I personally don't see it like this, but I'm obviously biased. This leads us to following decisions to be made: * Is modifying parser (either adding a new node or modifying an existing one) acceptable at this stage? I guess it would be enough to collect couple of votes yes/no in this thread. * If it's not acceptable, the feature could be reverted in 18, and the fix could be applied to the master branch only. I'm fine with both outcomes (apply the fix to both 18 and master, or revert in 18 and apply the fix on master), and leave the decision to Álvaro (sorry for causing all the troubles). It's fair to say that reverting the feature will be the least risky move.
On 2025-May-22, Dmitry Dolgov wrote: > Just to call this out, I don't think there is an agreement on squashing > Params, which you have added into 0002. Actually I think we do have agreement on squashing PARAM_EXTERN Params. https://postgr.es/m/3086744.1746500983@sss.pgh.pa.us > Now, both flavour of the proposed solution could be still concidered too > invasive to be applied as a bug fix. I personally don't see it like > this, but I'm obviously biased. This leads us to following decisions to > be made: > > * Is modifying parser (either adding a new node or modifying an existing > one) acceptable at this stage? I guess it would be enough to collect > couple of votes yes/no in this thread. IMO adding a struct as suggested is okay, especially if it reduces the overall code complexity. But we don't want a node, just a bare struct. Adding a node would be more troublesome. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
> IMO adding a struct as suggested is okay, especially if it reduces the > overall code complexity. But we don't want a node, just a bare struct. > Adding a node would be more troublesome. In v4, a new private struct is added in gram.y, but we are also adding additional fields to track the expression boundaries to the required nodes. -- Sami
> For example with bind queries like that: > select where $1 in ($3, $2) and 1 in ($4, cast($5 as int)) > \bind 0 1 2 3 4 > > Should we have a bit more coverage, where we use multiple IN and/or > ARRAY lists with constants and/or external parameters? I will add more test coverage. All the tests we have for constants should also have a external parameter counterpart. > v4-0003 with the extra tests for ARRAY can be applied first, with the > test output slightly adjusted and the casts showing up. That was my mistake in rearranging the v3-0001 as v4-0003. I will fix in the next revision. > Now, looking > independently at v4-0001, it is a bit hard to say what's the direct > benefit of this patch, because nothing in the tests of pgss change > after applying it. Could the benefit of this patch be demonstrated so > as it is possible to compare what's the current vs the what-would-be > new behavior? You're right, this should not be an independent patch. I had intended to eventually merge these v4-0001 and v4-0002 but felt it was cleaner to review separately. I'll just combine them in the next rev. > The patterns generated when using casts is still a bit funny, but > perhaps nobody will bother much about the result generated as these > are uncommon. For example, this gets squashed, with the end of the > cast included: > Q: select where 2 in (1, 4) and 1 in (5, (cast(7 as int)), 6, cast(8 as int)); > R: select where $1 in ($2 /*, ... */) and $3 in ($4 /*, ... */ as int)) > > This does not get squashed: > Q: select where 2 in (1, 4) and > 1 in (5, cast(7 as int), 6, (cast(8 as int)), 9, 10, (cast(8 as text))::int); > R: select where $1 in ($2 /*, ... */) and > $3 in ($4, cast($5 as int), $6, (cast($7 as int)), $8, $9, (cast($10 as text))::int) > > This is the kind of stuff that should also have coverage for, IMO, or > we will never keep track of what the existing behavior is, and if > things break in some way in the future. This is interesting actually. This is the behavior on HEAD, and I don't get why the first list with the casts does not get squashed, while the second one does. I will check IsSquashableConst tomorrow unless Dmitry gets to it first. ``` test=# select where 2 in (1, 4) and 1 in (5, cast(7 as int), 6, (cast(8 as int)), 9, 10, (cast(8 as text))::int); -- (0 rows) test=# select where 1 in (5, cast(7 as int), 6); -- (0 rows) test=# select queryid, substr(query, 1, 100) as query from pg_stat_statements; queryid | query ----------------------+----------------------------------------------------------------------------------- ------------------- 2125518472894925252 | select where $1 in ($2 /*, ... */) and $3 in ($4, cast($5 as int), $6, (cast($7 as int)), $8, $9, (c -4436613157077978160 | select where $1 in ($2 /*, ... */) ``` > FWIW, with v4-0002 applied, I am seeing one diff in the dml tests, > where a IN list is not squashed for pgss_dml_tab. hmm, I did not observe the same diff. -- Sami
> On Thu, May 22, 2025 at 10:23:31PM GMT, Sami Imseih wrote: > > This does not get squashed: > > Q: select where 2 in (1, 4) and > > 1 in (5, cast(7 as int), 6, (cast(8 as int)), 9, 10, (cast(8 as text))::int); > > R: select where $1 in ($2 /*, ... */) and > > $3 in ($4, cast($5 as int), $6, (cast($7 as int)), $8, $9, (cast($10 as text))::int) > > This is interesting actually. This is the behavior on HEAD, and I don't get why > the first list with the casts does not get squashed, while the second one does. > I will check IsSquashableConst tomorrow unless Dmitry gets to it first. IsSquashableConst has intentionally a limited set of test for "constantness", in particular it does not recurse. The case above (cast(8 as text))::int features two CoerceViaIO expressions one inside another, hence IsSquashableConst returns false.
> > On Thu, May 22, 2025 at 10:23:31PM GMT, Sami Imseih wrote: > > > This does not get squashed: > > > Q: select where 2 in (1, 4) and > > > 1 in (5, cast(7 as int), 6, (cast(8 as int)), 9, 10, (cast(8 as text))::int); > > > R: select where $1 in ($2 /*, ... */) and > > > $3 in ($4, cast($5 as int), $6, (cast($7 as int)), $8, $9, (cast($10 as text))::int) > > > > This is interesting actually. This is the behavior on HEAD, and I don't get why > > the first list with the casts does not get squashed, while the second one does. > > I will check IsSquashableConst tomorrow unless Dmitry gets to it first. > > IsSquashableConst has intentionally a limited set of test for > "constantness", in particular it does not recurse. The case above > > (cast(8 as text))::int > > features two CoerceViaIO expressions one inside another, hence > IsSquashableConst returns false. Should we be doing something like this? to unwrap RelabelType or CoerceViaIO until we have a different type of node to check for later on. We can guard the loop and break out after x amount of times as well. At minimum, we should try to unwrap at least 2 times for some of the common real-world scenarios. What do you think? ``` while (IsA(element, RelabelType) || IsA(element, CoerceViaIO)) { if (IsA(element, RelabelType)) element = (Node *) ((RelabelType *) element)->arg; else if (IsA(element, CoerceViaIO)) element = (Node *) ((CoerceViaIO *) element)->arg; } ``` -- Sami
> On Fri, May 23, 2025 at 09:05:54AM GMT, Sami Imseih wrote: > > > On Thu, May 22, 2025 at 10:23:31PM GMT, Sami Imseih wrote: > > > > This does not get squashed: > > > > Q: select where 2 in (1, 4) and > > > > 1 in (5, cast(7 as int), 6, (cast(8 as int)), 9, 10, (cast(8 as text))::int); > > > > R: select where $1 in ($2 /*, ... */) and > > > > $3 in ($4, cast($5 as int), $6, (cast($7 as int)), $8, $9, (cast($10 as text))::int) > > > > > > This is interesting actually. This is the behavior on HEAD, and I don't get why > > > the first list with the casts does not get squashed, while the second one does. > > > I will check IsSquashableConst tomorrow unless Dmitry gets to it first. > > > > IsSquashableConst has intentionally a limited set of test for > > "constantness", in particular it does not recurse. The case above > > > > (cast(8 as text))::int > > > > features two CoerceViaIO expressions one inside another, hence > > IsSquashableConst returns false. > > Should we be doing something like this? to unwrap RelabelType or > CoerceViaIO until we have a different type of node to check > for later on. We can guard the loop and break out after x amount > of times as well. At minimum, we should try to unwrap at least > 2 times for some of the common real-world scenarios. > > What do you think? > > ``` > while (IsA(element, RelabelType) || IsA(element, CoerceViaIO)) > { > if (IsA(element, RelabelType)) > element = (Node *) ((RelabelType *) element)->arg; > else if (IsA(element, CoerceViaIO)) > element = (Node *) ((CoerceViaIO *) element)->arg; > } > ``` I think it's better to recursively call IsSquashableConst on the nested expression (arg or args for FuncExpr). Something like that was done in the original patch version and was concidered too much at that time, but since it looks like all the past concerns are lifted, why not. Do not forget check_stack_depth.
> On Fri, May 23, 2025 at 04:29:45PM +0200, Dmitry Dolgov wrote: > > I think it's better to recursively call IsSquashableConst on the nested > > expression (arg or args for FuncExpr). Something like that was done in > > the original patch version and was concidered too much at that time, but > > since it looks like all the past concerns are lifted, why not. Do not > > forget check_stack_depth. > > AFAIK, we have already a couple of check_stack_depth() calls during > some node transformations after-parsing. At this level of the code > that would be a new thing.. I think the recursion will simplify the logic inside IsSquashableConstants. I will probably add that as a separate patch that maybe will get applied to HEAD only. Something I want agreement on is the following. Since we assign new parameter symbols based on the highest external param from the original query, as stated in the docs [0] "The parameter symbols used to replace constants in representative query texts start from the next number after the highest $n parameter in the original query text", we could have gaps in assigning symbol values, such as the case below. ``` test=# select where 1 in ($1, $2, $3) and 1 = $4 test-# \bind 1 2 3 4 test-# ; -- (0 rows) test=# select query from pg_stat_statements; query ------------------------------------------------ select where $5 in ($6 /*, ... */) and $7 = $4 ``` I don't think there is much we can do here, without introducing some serious complexity. I think the docs make this scenario clear. Thoughts? [0] https://www.postgresql.org/docs/current/pgstatstatements.html -- Sami
On Fri, May 23, 2025 at 08:05:47PM -0500, Sami Imseih wrote: > Since we assign new parameter symbols based on the highest external param > from the original query, as stated in the docs [0] "The parameter > symbols used to replace > constants in representative query texts start from the next number after the > highest $n parameter in the original query text", we could have gaps > in assigning > symbol values, such as the case below. > > ``` > test=# select where 1 in ($1, $2, $3) and 1 = $4 > test-# \bind 1 2 3 4 > test-# ; > -- > (0 rows) > > test=# select query from pg_stat_statements; > query > ------------------------------------------------ > select where $5 in ($6 /*, ... */) and $7 = $4 > ``` > > I don't think there is much we can do here, without introducing some serious > complexity. I think the docs make this scenario clear. In v17, we are a bit smarter with the numbering, with a normalization giving the following, starting at $1: select where $5 in ($1, $2, $3) and $6 = $4 So your argument about the $n parameters is kind of true, but I think the numbering logic in v17 to start at $1 is a less-confusing result. I would imagine that the squashed logic should give the following result on HEAD in this case if we want a maximum of consistency with the squashing of the IN elements taken into account: select where $3 in ($1 /*, ... */) and $4 = $2 Starting the count of the parameters at $4 would be strange. -- Michael
Вложения
> In v17, we are a bit smarter with the numbering, with a normalization > giving the following, starting at $1: > select where $5 in ($1, $2, $3) and $6 = $4 > > So your argument about the $n parameters is kind of true, but I think > the numbering logic in v17 to start at $1 is a less-confusing result. > I would imagine that the squashed logic should give the following > result on HEAD in this case if we want a maximum of consistency with > the squashing of the IN elements taken into account: > select where $3 in ($1 /*, ... */) and $4 = $2 > > Starting the count of the parameters at $4 would be strange. yeah, I think the correct answer is we need to handle 2 cases. 1. If we don't have a squashed list, then we just do what we do now. 2. If we have 1 or more squashed lists, then we can't guarantee the $n parameter as was supplied by the user and we simply rename the $n starting from 1. therefore, a user supplied query like this: ``` select where $5 in ($1, $2, $3) and $6 = $4 and 1 = 2 ``` will be normalized to: ``` select where $1 in ($2 /*...*/) and $3 = $4 and $5 = $6 ``` To accomplish this, we will need to track the locations of external parameters to support the 2nd case, because we need to re-write the original location of the parameter with the new value. I played around with this this morning and it works as I described above. Any concerns with the behavior described above? -- Sami
On 2025-May-24, Sami Imseih wrote: > therefore, a user supplied query like this: > ``` > select where $5 in ($1, $2, $3) and $6 = $4 and 1 = 2 > ``` > > will be normalized to: > ``` > select where $1 in ($2 /*...*/) and $3 = $4 and $5 = $6 > ``` Hmm, interesting. I think this renumbering should not be a problem in practice; users with unordered parameters have little room to complain if the param numbers change on query normalization. At least that's how it seems to me. If renumbering everything in physical order makes the code simpler, then I don't disagree. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
I realized this thread did not have a CF entry, so here it is https://commitfest.postgresql.org/patch/5801/ -- Sami
Hello, I've spent a bunch of time looking at this series and here's my take on the second one. (The testing patch is unchanged from Sami's). The third patch (for PARAM_EXTERNs) should be a mostly trivial rebase on top of these two. I realized that the whole in_expr production in gram.y is pointless, and the whole private struct that was added was unnecessary. A much simpler solution is to remove in_expr, expand its use in a_expr to the two possibilities, and with that we can remove the need for a new struct. I also added a recursive call in IsSquashableExpression to itself. The check for stack depth can be done without throwing an error. I tested this by adding stack bloat in that function. I also renamed it to IsSquashableConstant. This changes one of the tests, because a cast sequence like 42::int::bigint::int is considered squashable. Other than that, the changes are cosmetic. Barring objections, I'll push this soon, then look at rebasing 0003 on top, which I expect to be an easy job. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Вложения
> I've spent a bunch of time looking at this series and here's my take on > the second one. Thanks! > I realized that the whole in_expr production in gram.y is pointless, and > the whole private struct that was added was unnecessary. A much simpler > solution is to remove in_expr, expand its use in a_expr to the two > possibilities, and with that we can remove the need for a new struct. Nice simplification. > I also added a recursive call in IsSquashableExpression to itself. The I agree with this. I was thinking about a follow-up patch for this based on the discussion above, but why not just add it now. > Barring objections, I'll push this soon, then look at rebasing 0003 on > top, which I expect to be an easy job. LGTM. -- Sami
On Mon, Jun 09, 2025 at 12:44:59PM +0200, Alvaro Herrera wrote: > I also added a recursive call in IsSquashableExpression to itself. The > check for stack depth can be done without throwing an error. I tested > this by adding stack bloat in that function. I also renamed it to > IsSquashableConstant. This changes one of the tests, because a cast > sequence like 42::int::bigint::int is considered squashable. > > Other than that, the changes are cosmetic. > > Barring objections, I'll push this soon, then look at rebasing 0003 on > top, which I expect to be an easy job. v9-0002 is failing in the CI for the freebsd task: https://github.com/michaelpq/postgres/runs/43784034162 Here is the link to the diffs, also attached to this message: https://api.cirrus-ci.com/v1/artifact/task/5378459897167872/testrun/build/testrun/pg_stat_statements/regress/regression.diffs I am also able to reproduce these failures locally, FWIW. For example, with a IN clause made of integer constants gets converted to an ArrayExpr, but in _jumbleElements() we fail to call RecordConstLocation() and the list is not squashed. I think that this is can be reproduced by -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST that I always include in my builds. The freebsd task uses the same with debug_copy_parse_plan_trees=on, debug_write_read_parse_plan_trees=on and debug_raw_expression_coverage_test=on. -- Michael
Вложения
On 2025-Jun-10, Michael Paquier wrote: > I think that this is can be reproduced by > -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES > -DRAW_EXPRESSION_COVERAGE_TEST that I always include in my builds. > The freebsd task uses the same with debug_copy_parse_plan_trees=on, > debug_write_read_parse_plan_trees=on and > debug_raw_expression_coverage_test=on. Ah, of course, I forgot to add read/write support for A_Expr. Pushed a new one, running at https://cirrus-ci.com/build/6249249819590656 -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "We’ve narrowed the problem down to the customer’s pants being in a situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
On Tue, Jun 10, 2025 at 07:25:27PM +0200, Alvaro Herrera wrote: > On 2025-Jun-10, Michael Paquier wrote: >> I think that this is can be reproduced by >> -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES >> -DRAW_EXPRESSION_COVERAGE_TEST that I always include in my builds. >> The freebsd task uses the same with debug_copy_parse_plan_trees=on, >> debug_write_read_parse_plan_trees=on and >> debug_raw_expression_coverage_test=on. > > Ah, of course, I forgot to add read/write support for A_Expr. Pushed a > new one, running at > https://cirrus-ci.com/build/6249249819590656 Ah, right. I completely forgot that we have a custom read/write function for this node. Yes, things should be OK once the function is updated. Thanks. -- Michael
Вложения
Hello, I have pushed that now, and here's a rebase of patch 0003 to add support for PARAM_EXTERN. I'm not really sure about this one yet ... -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo"
Вложения
Hello I spent a much longer time staring at this patch than I wanted to, and at a point I almost wanted to boot the whole thing to pg19, but because upthread we already had an agreement that we should get it in for this cycle, I decided that the best course of action was to just move forward with it. My reluctance mostly comes from this bit in generate_normalized_query: + /* + * If we have an external param at this location, but no lists are + * being squashed across the query, then we skip here; this will make + * us print print the characters found in the original query that + * represent the parameter in the next iteration (or after the loop is + * done), which is a bit odd but seems to work okay in most cases. + */ + if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists) + continue; Sami's patch didn't have a comment here and it was not immediately obvious what was going on; moreover I was quite surprised at what happened if I removed it: for example, one query text (in test level_tracking.sql) changes from - 2 | 2 | SELECT (i + $2)::INTEGER LIMIT $3 into + 2 | 2 | SELECT ($2 + $3)::INTEGER LIMIT $4 and if I understand this correctly, the reason is that the query is being executed from an SQL function, CREATE FUNCTION PLUS_ONE(i INTEGER) RETURNS INTEGER AS $$ SELECT (i + 1.0)::INTEGER LIMIT 1 $$ LANGUAGE SQL; so the 'i' is actually a parameter, so it makes sense that it gets turned into a parameter in the normalized query. With the 'if' test I mentioned above, we print it as 'i' literally only because we 'continued' and the next time through the loop we print the text from the original query. This may be considered not entirely correct ... note that the constants in the query are shown as parameters, and that the numbers do not start from 1. (Obviously, a few other queries also change.) I decided to move forward with it anyway because this weirdness seems more contained and less damaging than the unhelpfulness of the unpatched behavior. We may want to revisit this in pg19 -- or, if we're really unconfortable with this, we could even decide to revert this commit in pg18 and try again with Param support in pg19. But I'd like to move on from this and my judgement was that the situation with patch is better than without. I added one more commit, which iterates to peel as many layers of CoerceViaIO and RelabelType as there are around an expression. I noticed this lack while reading Sami's patch for that function, which just simplified the coding therein. For normal cases this would iterate just once (because repeated layers of casts are likely very unusual), so I think it's okay to do that. There was some discussion about handling this via recursion but it died inconclusively; I added recursive handling of FuncExpr's arguments in 0f65f3eec478, which was different, but I think handling this case by iterating is better than recursing. With these commits, IMO the open item can now be closed. Even if we ultimately end up reverting any of this, we would probably punt support of Params to pg19, so the open item would be gone anyway. Lastly, I decided not to do a catversion bump. As far as I can tell, changes in the jumbling functions do not need them. I tried an 'installcheck' run with a datadir initdb'd with the original code, and it works fine. I also tried an 'installcheck' with pg_stat_statements installed, and was surprised to realize that the Query Id reports in EXPLAIN make a large number of tests fail. If I take those lines from the original code into the expected output, and then run the tests with the new code, I notice that a few queries have changed queryId. I suppose this was to be expected, and shouldn't harm anything. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
On 2025-Jun-25, Michael Paquier wrote: > On Tue, Jun 24, 2025 at 07:45:15PM +0200, Alvaro Herrera wrote: > > + /* > > + * If we have an external param at this location, but no lists are > > + * being squashed across the query, then we skip here; this will make > > + * us print print the characters found in the original query that > > + * represent the parameter in the next iteration (or after the loop is > > + * done), which is a bit odd but seems to work okay in most cases. > > + */ > > + if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists) > > + continue; > > + * us print print the characters found in the original query that > > The final commit includes this comment, with a s/print print/print/ > required. Ugh. Fixed, thanks for noticing that. > > Lastly, I decided not to do a catversion bump. As far as I can tell, > > changes in the jumbling functions do not need them. I tried an > > 'installcheck' run with a datadir initdb'd with the original code, and > > it works fine. > > This reminds me of 4c7cd07aa62a and this thread: > https://www.postgresql.org/message-id/1364409.1727673407@sss.pgh.pa.us > > Doesn't the change in the Param structure actually require one because > it can change the representation of some SQL functions? I am not > completely sure. Hmm, but the Param structure didn't actually change; only its jumbling function did (and others that rely on LocationLen). So I think what could happen here if there's no catversion is that somebody has a pg_stat_statements populated with query Ids for some queries that have a different queryIds when computed with the new code. So they're going to have duplicates in pg_stat_statements. I think this is a pretty minor problem, so I'm not inclined to do a catversion bump for it. Anyway we have one due to 0cd69b3d7ef3, so it's moot now. (But it's a good discussion to have, for the future.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Niemand ist mehr Sklave, als der sich für frei hält, ohne es zu sein." Nadie está tan esclavizado como el que se cree libre no siéndolo (Johann Wolfgang von Goethe)