Обсуждение: [HACKERS] what's up with IDENTIFIER_LOOKUP_EXPR?
plpgsql has an enum called IdentifierLookup which includes a value IDENTIFIER_LOOKUP_EXPR which is declared like this: IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */ It regrettably does not explain what exactly is special about it, and AFAICT, neither does any other comment. If I replace every usage of IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression tests pass nonetheless. It was introduced by 01f7d29902cb27fb698e5078d72cb837398e074c, committed by Tom in 2010: commit 01f7d29902cb27fb698e5078d72cb837398e074c Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun Jan 10 17:15:18 2010 +0000 Improve plpgsql's handling of record field references by forcing all potential field references in SQL expressions to have RECFIELD datum-array entries at parse time. If it turns outthat the reference is actually to a SQL column, the RECFIELD entry is useless, but it costs little. This allows us to get rid of the previous use of FieldSelect applied to a whole-row Param for the record variable; which was not only slower than a direct RECFIELD reference, but failed for references to systemcolumns of a trigger's NEW or OLD record. Per report and fix suggestion from Dean Rasheed. The rule, as far as I can tell from reading the code, is that IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up nothing. But it's not clear to me exactly what the motivation for that is. plpgsql_parse_word says: /* * We should do nothing in DECLARE sections. In SQL expressions, there's * no need to do anything either ---lookup will happen when the * expression is compiled. */ ...but that doesn't really explain why we go out of our way to have this third state, because the wording implies that it doesn't particularly matter one way or the other. Any help appreciated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > plpgsql has an enum called IdentifierLookup which includes a value > IDENTIFIER_LOOKUP_EXPR which is declared like this: > IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */ > It regrettably does not explain what exactly is special about it, and > AFAICT, neither does any other comment. If I replace every usage of > IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression > tests pass nonetheless. AFAICS, the only place where IDENTIFIER_LOOKUP_EXPR acts differently from IDENTIFIER_LOOKUP_NORMAL is plpgsql_parse_word(), and what it does is to prevent a lookup in plpgsql's namestack from occurring, so that an identifier is reported as "not recognized" even if there is a matching variable. In the two places that currently select IDENTIFIER_LOOKUP_EXPR, that seems to be only an optimization, because they will act the same anyway for all the token types that plpgsql_parse_word could return. I don't remember if there originally was a bigger reason than that. But it seems reasonable to be able to signal the lookup machinery that we're in a SQL expression rather than someplace else; in principle that could have larger implications than it does right now. > The rule, as far as I can tell from reading the code, is that > IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and > triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only > double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up > nothing. But it's not clear to me exactly what the motivation for > that is. plpgsql_parse_word says: The doubleword and tripleword cases are slightly different: lookup can't be turned off completely, because we need to create matching RECFIELD datums for use later. It's still true that we don't especially care what token is returned, but the side-effect of creating a RECFIELD entry is important. Possibly we could shave a few more cycles by making the code know that explicitly, but it didn't seem worth the complexity at the time. Feel free to improve the comments if this bugs you ... regards, tom lane
On Thu, May 4, 2017 at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> plpgsql has an enum called IdentifierLookup which includes a value >> IDENTIFIER_LOOKUP_EXPR which is declared like this: >> IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */ >> It regrettably does not explain what exactly is special about it, and >> AFAICT, neither does any other comment. If I replace every usage of >> IDENTIFIER_LOOKUP_EXPR with IDENTIFIER_LOOKUP_NORMAL, the regression >> tests pass nonetheless. > > AFAICS, the only place where IDENTIFIER_LOOKUP_EXPR acts differently from > IDENTIFIER_LOOKUP_NORMAL is plpgsql_parse_word(), and what it does is to > prevent a lookup in plpgsql's namestack from occurring, so that an > identifier is reported as "not recognized" even if there is a matching > variable. In the two places that currently select IDENTIFIER_LOOKUP_EXPR, > that seems to be only an optimization, because they will act the same > anyway for all the token types that plpgsql_parse_word could return. > I don't remember if there originally was a bigger reason than that. > But it seems reasonable to be able to signal the lookup machinery that > we're in a SQL expression rather than someplace else; in principle that > could have larger implications than it does right now. Thanks for the explanation. >> The rule, as far as I can tell from reading the code, is that >> IDENTIFIER_LOOKUP_NORMAL looks up words, double-words (e.g. x.y), and >> triple-words (e.g x.y.z), IDENTIFIER_LOOKUP_EXPR looks up only >> double-words and triple-words, and IDENTIFIER_LOOKUP_DECLARE looks up >> nothing. But it's not clear to me exactly what the motivation for >> that is. plpgsql_parse_word says: > > The doubleword and tripleword cases are slightly different: lookup can't > be turned off completely, because we need to create matching RECFIELD > datums for use later. It's still true that we don't especially care what > token is returned, but the side-effect of creating a RECFIELD entry is > important. Possibly we could shave a few more cycles by making the code > know that explicitly, but it didn't seem worth the complexity at the time. The PLPGSQL_DTYPE_* constants are another thing that's not really documented. You've mentioned that we should get rid of PLPGSQL_DTYPE_ROW in favor of, uh, whatever's better than that, but it's not clear to me what that really means, why one way is better than the other way, or what is involved. I'm not really clear on what a PLpgSQL_datum_type is in general, or what any of the specific types actually mean. I'll guess that PLPGSQL_DTYPE_VAR is a variable, but beyond that... > Feel free to improve the comments if this bugs you ... I might try to do that at some point, but I don't think my understanding of all this is clear enough to attempt it at this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > The PLPGSQL_DTYPE_* constants are another thing that's not really > documented. Yeah :-(. Complain to Jan sometime. > You've mentioned that we should get rid of > PLPGSQL_DTYPE_ROW in favor of, uh, whatever's better than that, but > it's not clear to me what that really means, why one way is better > than the other way, or what is involved. I'm not really clear on what > a PLpgSQL_datum_type is in general, or what any of the specific types > actually mean. I'll guess that PLPGSQL_DTYPE_VAR is a variable, but > beyond that... Well, from memory ... PLpgSQL_datum is really a symbol table entry. The conflict against what we mean by "Datum" elsewhere is pretty unfortunate. VAR - variable of scalar type (well, any non-composite type --- arrays and ranges are this kind of PLpgSQL_datum too, for instance). REC - variable of composite type, stored as a HeapTuple. RECFIELD - reference to a field of a REC variable. The datum includes the field name and a link to the datum for the parent REC variable. Notice this implies a runtime lookup of the field name whenever we're accessing the datum; which sucks for performance but it makes life a lot easier when you think about the possibility of the composite type changing underneath you. ARRAYELEM - reference to an element of an array. The datum includes a subscript expression and a link to the datum for the parent array variable (which can be a VAR, and I think a RECFIELD too). ROW - this is where it gets fun. A ROW is effectively a variable of a possibly-anonymous composite type, and it is defined by a list (in its own datum) of links to PLpgSQL_datums representing the individual columns. Typically the member datums would be VARs but that's not the only possibility. As I mentioned earlier, the case that ROW is actually well adapted for is multiple targets in INTO and similar constructs. For example, if you have SELECT ...blah blah... INTO a,b,c then the target of the PLpgSQL_stmt_execsql is represented as a single ROW datum whose members are the datums for a, b, and c. That's totally determined by the text of the function and can't change under us. However ... somebody thought it'd be cute to use the ROW infrastructure for variables of named composite types, too. So if you have DECLARE foo some_composite_type; then the name "foo" refers to a ROW datum, and the plpgsql compiler generates additional anonymous VAR datums, one for each declared column in some_composite_type, which become the members of the ROW datum. The runtime representation is actually that each field value is stored separately in its datum, as though it were an independent VAR. Field references "foo.col1" are not compiled into RECFIELD datums; we just look up the appropriate member datum during compile and make the expression tree point to that datum directly. So, this representation is great for speed of access and modification of individual fields of the composite variable. It sucks when you want to assign to the composite as a whole or retrieve its value as a whole, because you have to deconstruct or reconstruct a tuple to do that. (The REC/RECFIELD approach has approximately the opposite strengths and weaknesses.) Also, dealing with changes in the named composite type is a complete fail, because we've built its structure into the function's symbol table at parse time. I forget why there's a dtype for EXPR. regards, tom lane
On Thu, May 4, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > PLpgSQL_datum is really a symbol table entry. The conflict against what > we mean by "Datum" elsewhere is pretty unfortunate. Yeah. That's particularly bad because datum is a somewhat vague word under the best of circumstances (like "thing"). Maybe I'm missing something, but isn't it more like a parse tree than a symbol table entry? The symbol table entries seem to be based on PLpgSQL_nsitem_type, not PLpgSQL_datum, and they only come in the flavors you'd expect to find in a symbol table: label, var, row, rec. The datums on the other hand seem to consist of every kind of thing that PLpgSQL might be expected to interpret itself, either as an rvalue or as an lvalue. > ROW - this is where it gets fun. A ROW is effectively a variable > of a possibly-anonymous composite type, and it is defined by a list > (in its own datum) of links to PLpgSQL_datums representing the > individual columns. Typically the member datums would be VARs > but that's not the only possibility. > > As I mentioned earlier, the case that ROW is actually well adapted > for is multiple targets in INTO and similar constructs. For example, > if you have > > SELECT ...blah blah... INTO a,b,c > > then the target of the PLpgSQL_stmt_execsql is represented as a single > ROW datum whose members are the datums for a, b, and c. That's totally > determined by the text of the function and can't change under us. > > However ... somebody thought it'd be cute to use the ROW infrastructure > for variables of named composite types, too. So if you have > > DECLARE foo some_composite_type; > > then the name "foo" refers to a ROW datum, and the plpgsql compiler > generates additional anonymous VAR datums, one for each declared column > in some_composite_type, which become the members of the ROW datum. > The runtime representation is actually that each field value is stored > separately in its datum, as though it were an independent VAR. Field > references "foo.col1" are not compiled into RECFIELD datums; we just look > up the appropriate member datum during compile and make the expression > tree point to that datum directly. Ugh. > So, this representation is great for speed of access and modification > of individual fields of the composite variable. It sucks when you > want to assign to the composite as a whole or retrieve its value as > a whole, because you have to deconstruct or reconstruct a tuple to > do that. (The REC/RECFIELD approach has approximately the opposite > strengths and weaknesses.) Also, dealing with changes in the named > composite type is a complete fail, because we've built its structure > into the function's symbol table at parse time. It would probably be possible to come up with a representation that allowed both things to be efficient, the way a slot can contain either a Datum/isnull array or a heap tuple or both. Call the resulting data structure a record slot. foo.col1 could retain the original column name and also a cache containing a pointer to the record slot and a column offset within the slot, so that it can say e.g. assign_record_slot_column(slot, col, val). But it could also register the reference with the slot, so that if the slot structure changes, the cached slot and column offset (or at least the column offset) are cleared and get recomputed on next access. I'm not volunteering to do the work, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 4, 2017 at 6:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> PLpgSQL_datum is really a symbol table entry. The conflict against what >> we mean by "Datum" elsewhere is pretty unfortunate. > Yeah. That's particularly bad because datum is a somewhat vague word > under the best of circumstances (like "thing"). Maybe I'm missing > something, but isn't it more like a parse tree than a symbol table > entry? The symbol table entries seem to be based on > PLpgSQL_nsitem_type, not PLpgSQL_datum, and they only come in the > flavors you'd expect to find in a symbol table: label, var, row, rec. Right, there's this separate name lookup table that maps from names to PLpgSQL_datums. I personally wouldn't call the PLpgSQL_datum items parse trees, since they're mostly pretty flat. But if you want to think of them that way I can't stop you. One other important, and unfortunate, thing is that the PLpgSQL_datums aren't (usually) read-only --- they not only hold static symbol-table-like info but also the actual current values, for those datums that correspond directly to a stored variable value (VAR and REC, but not ROW or RECFIELD or ARRAYELEM). This seems messy to me, and it forces a rather expensive copy step during plpgsql function startup. I'd like to see those duties separated, so that the actual r/w state is just arrays of Datum/isnull that we can trivially initialize during function start. >> So, this representation is great for speed of access and modification >> of individual fields of the composite variable. It sucks when you >> want to assign to the composite as a whole or retrieve its value as >> a whole, because you have to deconstruct or reconstruct a tuple to >> do that. (The REC/RECFIELD approach has approximately the opposite >> strengths and weaknesses.) Also, dealing with changes in the named >> composite type is a complete fail, because we've built its structure >> into the function's symbol table at parse time. > It would probably be possible to come up with a representation that > allowed both things to be efficient, Yeah, perhaps. It would be good to take two steps back and reconsider the whole data structure. > I'm not volunteering to do the work, though. It's not at the top of my priority list either, but I'd like to see it happen sometime. regards, tom lane