Обсуждение: Text <-> C string
Hi hackers, I've noticed that there is a lot of code, particularly in src/backend, that goes through the motions of making a text datum into a cstring to perform some work on it, and likewise for making a cstring into a text datum. Is there not a nice macro somewhere to handle this consistently? And if not, shouldn't there be? I noticed a comment for StrNCpy() in src/include/c.h that seems related: /* BTW: when you need to copy a non-null-terminated string (like a text datum) and add a null, do not do it with StrNCpy [snip] Do it honestly with "memcpy(dst,src,len); dst[len] = '\0'; instead." Okay, I can see why using StrNCpy is a bad idea, but why not "Do it honestly with TEXT_CSTRING(src, dst)", or similar? Surely having the exact same four lines of code written out in dozens of places is a Bad Thing, but perhaps there is some reasoning behind this that I am missing? Thanks for your time, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > Surely having the exact same four lines of code written out in dozens > of places is a Bad Thing, but perhaps there is some reasoning behind > this that I am missing? The canonical way to do it is with DatumGetCString(DirectFunctionCall1(textout, t)) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 9/22/07, Gregory Stark <stark@enterprisedb.com> wrote: > The canonical way to do it is with > > DatumGetCString(DirectFunctionCall1(textout, t)) Ah, I see. Thanks. In that case, would it be helpful if I submitted a patch for the various code fragments that do this locally, updating them to use DatumGetCString?
"Brendan Jurd" <direvus@gmail.com> writes: > On 9/22/07, Gregory Stark <stark@enterprisedb.com> wrote: >> The canonical way to do it is with >> >> DatumGetCString(DirectFunctionCall1(textout, t)) > > Ah, I see. Thanks. > > In that case, would it be helpful if I submitted a patch for the > various code fragments that do this locally, updating them to use > DatumGetCString? I would be interested in seeing just a list of such places if you have it handy. I don't think we consider it wrong to violate the text data type abstraction barrier like you describe though. I'm interested because any such code is possibly either failing to take into account toasted data or is unnecessarily detoasting packed varlenas. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 9/22/07, Gregory Stark <stark@enterprisedb.com> wrote: > The canonical way to do it is with > > DatumGetCString(DirectFunctionCall1(textout, t)) I just noticed a couple of macros defined in src/include/tsearch/ts_utils.h: #define TextPGetCString(t) DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(t))) #define CStringGetTextP(c) DatumGetTextP(DirectFunctionCall1(textin, CStringGetDatum(c))) Seems these would actually be convenient in quite a lot of places in the backend. Is there any downside to moving these two into src/include/postgres.h?
Well, a couple of specific cases that I came across are quote_identifier() in src/backend/utils/adt/quote.c, and do_to_timestamp() in src/backend/utils/adt/formatting.c (line 3349). I was getting a rough notion of how common the duplication was using $ egrep -Rn -C 2 'memcpy.*VARDATA' src/backend Not all of these are genuine duplications of textout and textin (you have to eyeball them individually to work that out) but it's a reasonable starting point. The files matched under src/backend are as follows. src/backend/libpq/be-fsstubs.c src/backend/utils/mb/mbutils.c src/backend/utils/adt/timestamp.c src/backend/utils/adt/nabstime.c src/backend/utils/adt/xml.c src/backend/utils/adt/quote.c src/backend/utils/adt/oracle_compat.c src/backend/utils/adt/varchar.c src/backend/utils/adt/ruleutils.c src/backend/utils/adt/varlena.c src/backend/utils/adt/tsginidx.c src/backend/utils/adt/cash.c src/backend/utils/adt/date.c src/backend/utils/adt/genfile.c src/backend/utils/adt/network.c src/backend/utils/adt/selfuncs.c src/backend/utils/adt/formatting.c src/backend/utils/adt/version.c src/backend/utils/adt/pgstatfuncs.c src/backend/access/heap/tuptoaster.c src/backend/access/common/heaptuple.c src/backend/storage/large_object/inv_api.c src/backend/executor/execQual.c src/backend/catalog/pg_conversion.c On 9/22/07, Gregory Stark <stark@enterprisedb.com> wrote: > > "Brendan Jurd" <direvus@gmail.com> writes: > > > On 9/22/07, Gregory Stark <stark@enterprisedb.com> wrote: > >> The canonical way to do it is with > >> > >> DatumGetCString(DirectFunctionCall1(textout, t)) > > > > Ah, I see. Thanks. > > > > In that case, would it be helpful if I submitted a patch for the > > various code fragments that do this locally, updating them to use > > DatumGetCString? > > I would be interested in seeing just a list of such places if you have it > handy. I don't think we consider it wrong to violate the text data type > abstraction barrier like you describe though. > > I'm interested because any such code is possibly either failing to take into > account toasted data or is unnecessarily detoasting packed varlenas. > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
"Brendan Jurd" <direvus@gmail.com> writes: > I just noticed a couple of macros defined in src/include/tsearch/ts_utils.h: > #define TextPGetCString(t) > DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(t))) > #define CStringGetTextP(c) DatumGetTextP(DirectFunctionCall1(textin, > CStringGetDatum(c))) > Seems these would actually be convenient in quite a lot of places in > the backend. Is there any downside to moving these two into > src/include/postgres.h? I think if you look around you'll find several similar things in various contrib modules. It would make some sense to try to unify all this. I'm not particularly for making it macros in postgres.h though --- that's no help if the macros require referencing stuff in builtins.h. On grounds of code-space savings I think it might be worth making these things be simple functions declared in builtins.h; that would also make it much easier to change their implementations. regards, tom lane
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Brendan Jurd" <direvus@gmail.com> writes: > > I just noticed a couple of macros defined in src/include/tsearch/ts_utils.h: > > > #define TextPGetCString(t) > > DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(t))) > > #define CStringGetTextP(c) DatumGetTextP(DirectFunctionCall1(textin, > > CStringGetDatum(c))) > > I think if you look around you'll find several similar things in various > contrib modules. It would make some sense to try to unify all this. > I'm not particularly for making it macros in postgres.h though --- > that's no help if the macros require referencing stuff in builtins.h. > > On grounds of code-space savings I think it might be worth making > these things be simple functions declared in builtins.h; that would > also make it much easier to change their implementations. You're right about finding similar things in various places. Even varlena.c has a set of these macros (PG_TEXT_GET_STR etc), but it doesn't look they've really been utilised. I'm happy to take a swing at this. Declaring in builtins.h makes sense. The thing that's got me confused at the moment is what naming convention to use for the functions. Looking in builtins.h you might get the impression that we use lower_underscore for functions that are called via fmgr, UPPER_UNDERSCORE for macros and CamelCase for ordinary internal C functions, but there are plenty of exceptions to disprove that rule. I see camel cased macros and lowercased internal functions. Camel cased identifiers sometimes start with uppercase, sometimes lowercase. So the name for the text -> cstring function could be any of: text_cstr text_to_cstr textToCString TextToCString Is there any kind of authoritative naming convention I can refer to? Thanks for your time, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > The thing that's got me confused at the moment is what naming > convention to use for the functions. Well, almost any convention you like has some precedent somewhere in the PG code, given all the contributors over the years. Almost the only thing we actively discourage is Hungarian notation, and I think there's even some of that in some corners. Personally I would vote against something like TextPGetCString because it would look like one of the family of macros that are named FooGetBar. Maybe use text_to_cstring and cstring_to_text? It's not real important though. regards, tom lane
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > On grounds of code-space savings I think it might be worth making > these things be simple functions declared in builtins.h; that would > also make it much easier to change their implementations. I've noticed that this pattern isn't exclusive to the text type; other varlena types like bytea and xmltype seem to have a common requirement to translate to and fro C strings for various jobs. Does it make sense to go one level lower, and make these functions work for any varlena? So far, I've got the following functions doing the work: char * text_cstring(text *t) char * text_cstring_limit(text *t, int len) text * cstring_text(char *s) It wouldn't be difficult at this point to make those functions 'varlena' rather than 'text', and then bytea and xmltype (and any other future types that want to inherit from varlena) can take advantage of them. Thanks for your time, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > So far, I've got the following functions doing the work: > char * text_cstring(text *t) > char * text_cstring_limit(text *t, int len) > text * cstring_text(char *s) > It wouldn't be difficult at this point to make those functions > 'varlena' rather than 'text', and then bytea and xmltype (and any > other future types that want to inherit from varlena) can take > advantage of them. Mmm, but the conversions are generally not identical --- for instance, bytea needs to do escaping/de-escaping, and I doubt that XML will stick to dumb flat-string representation for long, and for that matter text itself is likely to change someday for better locale support. Where the representations and conversions *are* identical, one can just cast. I'd vote for keeping the names focused on text ... regards, tom lane
As discussed on -hackers, I'm trying to get rid of some redundant code by creating a widely useful set of functions to convert between text and C string in the backend. The new extern functions, declared in include/utils/builtins.h and defined in backend/utils/adt/varlena.c, are: char * text_cstring(const text *t) char * text_cstring_limit(const text *t, int len) text * cstring_text(const char *s) text * cstring_text_limit(const char *s, int len) Within varlena.c, the actual conversions are performed by: char * do_text_cstring(const text *t, const int len) text * do_cstring_text(const char *s, const int len) These functions now do the work for the fmgr functions textin and textout, as well as being directly accessible by backend code. I've searched through the backend for any code which converted between text and C string manually (with memcpy and VARDATA), replacing with calls to one of the four new functions as appropriate. I came across some areas which were using the same, or similar, conversion technique on other varlena data types, such as bytea or xmltype. In cases where the conversion was completely identical I used the new functions. In cases with any differences (even if they seemed minor) I played it safe and left them alone. I'd now like to submit my work so far for review. This patch compiled cleanly on Linux and passed all parallel regression tests. It appears to be performance-neutral based on a few rough tests; I haven't tried to profile the changes in detail. There is still a lot of code out there using DirectFunctionCall1 to call text(in|out)). I've decided to wait for some community feedback on the patch as it stands before replacing those calls. There are a great many, and it would be a shame to have to go through them more than once. I would naively expect that replacing fmgr calls with direct calls would lead to a performance gain (no fmgr overhead), but honestly I'm not sure whether that would actually make a difference. Thanks for your time, BJ
Вложения
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Brendan Jurd wrote: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) > > Within varlena.c, the actual conversions are performed by: > > char * do_text_cstring(const text *t, const int len) > text * do_cstring_text(const char *s, const int len) > > These functions now do the work for the fmgr functions textin and > textout, as well as being directly accessible by backend code. > > I've searched through the backend for any code which converted between > text and C string manually (with memcpy and VARDATA), replacing with > calls to one of the four new functions as appropriate. > > I came across some areas which were using the same, or similar, > conversion technique on other varlena data types, such as bytea or > xmltype. In cases where the conversion was completely identical I > used the new functions. In cases with any differences (even if they > seemed minor) I played it safe and left them alone. > > I'd now like to submit my work so far for review. This patch compiled > cleanly on Linux and passed all parallel regression tests. It appears > to be performance-neutral based on a few rough tests; I haven't tried > to profile the changes in detail. > > There is still a lot of code out there using DirectFunctionCall1 to > call text(in|out)). I've decided to wait for some community feedback > on the patch as it stands before replacing those calls. There are a > great many, and it would be a shame to have to go through them more > than once. > > I would naively expect that replacing fmgr calls with direct calls > would lead to a performance gain (no fmgr overhead), but honestly I'm > not sure whether that would actually make a difference. > > Thanks for your time, > BJ [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Brendan Jurd wrote: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) > > Within varlena.c, the actual conversions are performed by: > > char * do_text_cstring(const text *t, const int len) > text * do_cstring_text(const char *s, const int len) > > These functions now do the work for the fmgr functions textin and > textout, as well as being directly accessible by backend code. > > I've searched through the backend for any code which converted between > text and C string manually (with memcpy and VARDATA), replacing with > calls to one of the four new functions as appropriate. > > I came across some areas which were using the same, or similar, > conversion technique on other varlena data types, such as bytea or > xmltype. In cases where the conversion was completely identical I > used the new functions. In cases with any differences (even if they > seemed minor) I played it safe and left them alone. > > I'd now like to submit my work so far for review. This patch compiled > cleanly on Linux and passed all parallel regression tests. It appears > to be performance-neutral based on a few rough tests; I haven't tried > to profile the changes in detail. > > There is still a lot of code out there using DirectFunctionCall1 to > call text(in|out)). I've decided to wait for some community feedback > on the patch as it stands before replacing those calls. There are a > great many, and it would be a shame to have to go through them more > than once. > > I would naively expect that replacing fmgr calls with direct calls > would lead to a performance gain (no fmgr overhead), but honestly I'm > not sure whether that would actually make a difference. > > Thanks for your time, > BJ [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Brendan Jurd" <direvus@gmail.com> writes: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) I started to look at applying this patch and immediately decided that I didn't like these names --- it's exceeding un-obvious which direction of conversion any one of the functions performs. Seems like every time you wanted to call one, you'd be going back to look at the source code to remember which to use. What do people think of text_to_cstring? Or should we go with TextPGetCString for consistency with the Datum-whacking macros? (I seem to recall having argued against the latter idea, but am reconsidering.) Or something else? BTW, I suspect that the _limit functions are mostly useless --- a quick look through the patch did not suggest that any of the places using them really needed a limit. The point of, for instance, TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and if you're going to pay for a palloc anyway then you might as well forget it. (There might be some value in a strlcpy equivalent that copies from a text datum into a limited-size caller-supplied buffer, but that's not what we've got here.) regards, tom lane
On Wed, Mar 19, 2008 at 12:51:35PM -0400, Tom Lane wrote: > "Brendan Jurd" <direvus@gmail.com> writes: > > char * text_cstring(const text *t) > > What do people think of text_to_cstring? I tend to put things the other way around in my code, i.e: char * cstring_of_text(const text *t) mainly because things read more easily---type definitions of new variables are next to the first part of the word. char * str = cstring_of_text(src_text); vs. char * str = text_to_cstring(src_text); I think I got my original inspiration for doing it this way around from the Caml language. Sam
On Wed, 19 Mar 2008, Sam Mason <sam@samason.me.uk> writes: > ... > char * str = cstring_of_text(src_text); > ... > > I think I got my original inspiration for doing it this way around from > the Caml language. Also, used in Common Lisp as class accessors: char *s = cstring_of(text); text *t = text_of(cstring); But I'd vote for TextPGetCString style Tom suggested for the eye-habit compatibility with the rest of the code. Regards.
Volkan YAZICI <yazicivo@ttmail.com> writes: > But I'd vote for TextPGetCString style Tom suggested for the eye-habit > compatibility with the rest of the code. If there are not additional votes, I'll go with TextPGetCString and CStringGetTextP. regards, tom lane
On 20/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I started to look at applying this patch and immediately decided that > I didn't like these names --- it's exceeding un-obvious which direction > of conversion any one of the functions performs. Seems like every time > you wanted to call one, you'd be going back to look at the source code > to remember which to use. > That's a fair criticism. I wanted to make the function names as compact as possible, but your comment about the directionality of the functions rings true. > What do people think of text_to_cstring? Or should we go with > TextPGetCString for consistency with the Datum-whacking macros? (I seem > to recall having argued against the latter idea, but am reconsidering.) > Or something else? > Your original argument against FooGetBar was that it would be *too* consistent with the Datum macros, leading people to think that these functions actually were macros. As long as we don't want people getting confused about the function/macro distinction, that argument still makes sense to me, and I'd be more inclined towards the foo_to_bar() convention. > BTW, I suspect that the _limit functions are mostly useless --- > a quick look through the patch did not suggest that any of the places > using them really needed a limit. The point of, for instance, > TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and > if you're going to pay for a palloc anyway then you might as well > forget it. What about callers like dotrim() in oracle_compat.c, which only want to copy characters from the source string up to a particular length? Doesn't that indicate a legitimate requirement for a cstring_to_text_limit() (the call site was palloc'ing the text value anyway)? On the other hand, we do have those call sites (TZ_STRLEN_MAX is a good example) where the caller just wanted to use a local buffer. In which case your strlcpy-equivalent function would probably be the right thing to offer. > (There might be some value in a strlcpy equivalent that > copies from a text datum into a limited-size caller-supplied buffer, > but that's not what we've got here.) > Perhaps we keep cstring_to_text_limit(), but make text_to_cstring_limit() behave like strlcpy() instead? One of the questions in the original patch submission was whether it would be worth changing all those DirectFunctionCall(textin) and (textout) calls to use the new functions. Is it worthwhile avoiding the fmgr overhead? Thanks for taking the time to review my patch and provide this feedback. I'm happy to send in an updated version once we settle on the naming convention for the functions. Last time I looked, the codebase had shifted quite a bit since I originally wrote the patch. So it probably needs some work to apply cleanly on the latest sources anyway. Regards, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > One of the questions in the original patch submission was whether it > would be worth changing all those DirectFunctionCall(textin) and > (textout) calls to use the new functions. Is it worthwhile avoiding > the fmgr overhead? I think that's worth doing just on notational clarity grounds. The small cycle savings doesn't excite me, but understanding DirectFunctionCall1(textin, CStringGetDatum(foo)) just involves more different bits of trivia than cstring_to_text(foo). > Last time I looked, the codebase had shifted quite a bit since I > originally wrote the patch. So it probably needs some work to apply > cleanly on the latest sources anyway. Yeah, with wide-impact patches like this you are always going to have that problem. One point though is that we don't have to improve every call site at the same time. I'd be inclined to put in the new functions and hit some representative sample of utils/adt/ files in the first commit, and then incrementally fix other stuff. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Volkan YAZICI <yazicivo@ttmail.com> writes: >> But I'd vote for TextPGetCString style Tom suggested for the eye-habit >> compatibility with the rest of the code. > > If there are not additional votes, I'll go with TextPGetCString > and CStringGetTextP. I would have voted for text_to_cstring etc. I can see the logic for the above but it's just such a pain to type... Fwiw I didn't actually find text_cstring confusing because all our sql cast functions are defined that way. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > Volkan YAZICI <yazicivo@ttmail.com> writes: > >> But I'd vote for TextPGetCString style Tom suggested for the eye-habit > >> compatibility with the rest of the code. > > > > If there are not additional votes, I'll go with TextPGetCString > > and CStringGetTextP. > > I would have voted for text_to_cstring etc. I can see the logic for the above > but it's just such a pain to type... +1 for text_to_cstring et al. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Gregory Stark wrote: >> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> If there are not additional votes, I'll go with TextPGetCString >>> and CStringGetTextP. >> >> I would have voted for text_to_cstring etc. I can see the logic for the above >> but it's just such a pain to type... > +1 for text_to_cstring et al. Well, that's all right with me too. It occurs to me that this proposal is still leaving something on the table. Consider a SQL-callable function that takes a text argument and wants to turn it into a C string for processing. With the proposal as it stands you'd have to do something like text *t = PG_GETARG_TEXT_P(0);char *c = text_to_cstring(t); Now you might be smart enough to optimize that to text *t = PG_GETARG_TEXT_PP(0);char *c = text_to_cstring(t); which would avoid a useless copy for short-header text datums, but it's still leaking an extra copy of the text if the input is compressed or toasted out-of-line. I'm imagining instead char *c = PG_GETARG_TEXT_AS_CSTRING(0); This would expand to a call on say text_datum_to_cstring(Datum d) which would detoast, convert, and then free the detoasted copy if needed. On the other hand, it could be argued that this is usually a waste of effort. (I frequently point out to people that retail pfree's in SQL-callable functions are often less efficient than letting the next context reset clean up.) And one thing I don't like about this notation is that the declared type of the local variable no longer matches up with the SQL-level declaration of the function. Comments anyone? regards, tom lane
I've been working some more on Brendan Jurd's patch to simplify text <-> C string conversions. It seems we have consensus on the names for the base operations: extern text *cstring_to_text(const char *s); extern char *text_to_cstring(const text *t); Brendan's patch also included "cstring_text_limit(const char *s, int len)" which was defined as copying Min(len, strlen(s)) bytes. I didn't find this to be particularly useful. In the first place, all potential callers are passing the exact desired length, so the strlen() call is just a waste of cycles. In the second place, at least some callers pass text that is not embedded in a known-to-be-null-terminated string (it could be a section of a text datum instead); which means there is a nonzero chance of the strlen running off the end of memory and dumping core. So I propose instead extern text *cstring_to_text_with_len(const char *s, int len); which just takes the given length as gospel. Brendan had also proposed "text_to_cstring_limit(const text *t, int len)" with similar Min() semantics, but what this was doing was replacing copies into limited-size local buffers with a palloc. If we did that we might as well just use text_to_cstring. What I think is more useful is a strlcpy()-like function that copies into a caller-supplied buffer of limited size. For lack of a better idea I propose defining it *exactly* like strlcpy: extern size_t textlcpy(char *dst, const text *src, size_t siz); I've also found that there are lots and lots of places where the text end of the conversion needs to be a Datum not a text *, so it seems worthwhile to introduce a couple of macros to minimize notation in that case: #define CStringGetTextDatum(s) PointerGetDatum(cstring_to_text(s)) #define TextDatumGetCString(d) text_to_cstring((text *) DatumGetPointer(d)) Lastly, the originally submitted text-to-something functions would work correctly on plain and 1-byte-header datums, but not on compressed or toasted-out-of-line datums. There are a whole lot of places where that's not good enough. Rather than expecting the caller to use the right detoasting macro everywhere, it seems best to make these functions cope with any variant. That also avoids memory leakage by allowing the intermediate copy to be pfree'd. (I had suggested that the pfree might be pointless, but I reconsidered --- if the text object is large enough to be compressed or toasted, we're talking about at least several K, so it's worth not leaking.) In short, the infrastructure I'm currently testing is the above definitions with the attached implementation. Last call for objections ... regards, tom lane /** cstring_to_text** Create a text value from a null-terminated C string.** The new text value is freshly palloc'd witha full-size VARHDR.*/ text * cstring_to_text(const char *s) {return cstring_to_text_with_len(s, strlen(s)); } /** cstring_to_text_with_len** Same as cstring_to_text except the caller specifies the string length;* the string need notbe null_terminated.*/ text * cstring_to_text_with_len(const char *s, int len) {text *result = (text *) palloc(len + VARHDRSZ); SET_VARSIZE(result, len + VARHDRSZ);memcpy(VARDATA(result), s, len); return result; } /** text_to_cstring** Create a palloc'd, null-terminated C string from a text value.** We support being passed a compressedor toasted text value.* This is a bit bogus since such values shouldn't really be referred to as* "text *", butit seems useful for robustness. If we didn't handle that* case here, we'd need another routine that did, anyway.*/ char * text_to_cstring(const text *t) {char *result;text *tunpacked = pg_detoast_datum_packed((struct varlena *) t);int len = VARSIZE_ANY_EXHDR(tunpacked); result = (char *) palloc(len + 1);memcpy(result, VARDATA_ANY(tunpacked), len);result[len] = '\0'; if (tunpacked != t) pfree(tunpacked);return result; } /** textlcpy --- exactly like strlcpy(), except source is a text value.** Copy src to string dst of size siz. At mostsiz-1 characters* will be copied. Always NUL terminates (unless siz == 0).* Returns strlen(src); if retval >= siz,truncation occurred.** We support being passed a compressed or toasted text value.* This is a bit bogus since such valuesshouldn't really be referred to as* "text *", but it seems useful for robustness. If we didn't handle that* case here,we'd need another routine that did, anyway.*/ size_t textlcpy(char *dst, const text *src, size_t siz) {text *srcunpacked = pg_detoast_datum_packed((struct varlena *) src);size_t srclen = VARSIZE_ANY_EXHDR(srcunpacked); if (siz > 0){ siz--; if (siz >= srclen) siz = srclen; else /* ensure truncation is encoding-safe*/ siz = pg_mbcliplen(VARDATA_ANY(srcunpacked), srclen, siz); memcpy(dst, VARDATA_ANY(srcunpacked),siz); dst[siz] = '\0';} if (srcunpacked != src) pfree(srcunpacked); return srclen; }
On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brendan's patch also included "cstring_text_limit(const char *s, int len)" > which was defined as copying Min(len, strlen(s)) bytes. I didn't find > this to be particularly useful. In the first place, all potential > callers are passing the exact desired length, so the strlen() call is > just a waste of cycles. In the second place, at least some callers pass > text that is not embedded in a known-to-be-null-terminated string (it > could be a section of a text datum instead); which means there is a > nonzero chance of the strlen running off the end of memory and dumping > core. So I propose instead > > extern text *cstring_to_text_with_len(const char *s, int len); > That all makes sense to me. I think the new name is good. It's pretty long, but I'm not seeing a shorter name that accurately describes the function. > which just takes the given length as gospel. Brendan had also proposed > "text_to_cstring_limit(const text *t, int len)" with similar Min() > semantics, but what this was doing was replacing copies into > limited-size local buffers with a palloc. If we did that we might > as well just use text_to_cstring. What I think is more useful is > a strlcpy()-like function that copies into a caller-supplied buffer > of limited size. For lack of a better idea I propose defining it > *exactly* like strlcpy: > > extern size_t textlcpy(char *dst, const text *src, size_t siz); > I'm all for providing a function with this behaviour, but is textlcpy() a bit ambiguous? It's not clear from the name whether the function copies text -> text, text -> cstring or cstring -> text. In fact, if I didn't already know better I'd probably assume that the function copied text -> text with length, in the same way strlcpy copies string -> string. A text_to_cstring_with_len() or text_to_cstring_limit() might be more to the point, and more consistent with the other functions in the family. On the other hand, maybe some difference in naming would help make it obvious to callers that, unlike its siblings, textlcpy() takes the destination string as an argument rather than returning it. text_to_cstring_lcpy()? > I've also found that there are lots and lots of places where the > text end of the conversion needs to be a Datum not a text *, > so it seems worthwhile to introduce a couple of macros to minimize > notation in that case: > > #define CStringGetTextDatum(s) PointerGetDatum(cstring_to_text(s)) > #define TextDatumGetCString(d) text_to_cstring((text *) DatumGetPointer(d)) > Yes, I recall coming across a number of sites where these macros would come in handy. > Lastly, the originally submitted text-to-something functions would > work correctly on plain and 1-byte-header datums, but not on > compressed or toasted-out-of-line datums. There are a whole lot of > places where that's not good enough. Rather than expecting the caller > to use the right detoasting macro everywhere, it seems best to make > these functions cope with any variant. That also avoids memory > leakage by allowing the intermediate copy to be pfree'd. (I had > suggested that the pfree might be pointless, but I reconsidered --- > if the text object is large enough to be compressed or toasted, > we're talking about at least several K, so it's worth not leaking.) > Excellent. My patch didn't contemplate dealing with compressed/toasted datums because, quite frankly, I didn't know *how* to deal with them correctly. Much to learn about varlenas, I still have. Cheers, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... What I think is more useful is >> a strlcpy()-like function that copies into a caller-supplied buffer >> of limited size. For lack of a better idea I propose defining it >> *exactly* like strlcpy: >> >> extern size_t textlcpy(char *dst, const text *src, size_t siz); > I'm all for providing a function with this behaviour, but is > textlcpy() a bit ambiguous? Fair enough, I'm not wedded to that name. Search-and-replace is still easy enough at this point ... > A text_to_cstring_with_len() or text_to_cstring_limit() might be more > to the point, and more consistent with the other functions in the > family. Hmm. The thing that's bothering me is that the length is the size of the *destination*, which is not like cstring_to_text_with_len, so using a closely similar name might be confusing. Of those two I'd go with text_to_cstring_limit. Another thought that comes to mind is void text_to_cstring_buffer(const text *src, char *dst, size_t dst_len) Anyone have other ideas? regards, tom lane
Tom Lane wrote: > > A text_to_cstring_with_len() or text_to_cstring_limit() might be more > > to the point, and more consistent with the other functions in the > > family. > > Hmm. The thing that's bothering me is that the length is the size > of the *destination*, which is not like cstring_to_text_with_len, > so using a closely similar name might be confusing. Of those two > I'd go with text_to_cstring_limit. Another thought that comes to > mind is > > void text_to_cstring_buffer(const text *src, char *dst, size_t dst_len) I think I like buffer. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane escribió: > "Brendan Jurd" <direvus@gmail.com> writes: > > A text_to_cstring_with_len() or text_to_cstring_limit() might be more > > to the point, and more consistent with the other functions in the > > family. > > Hmm. The thing that's bothering me is that the length is the size > of the *destination*, which is not like cstring_to_text_with_len, > so using a closely similar name might be confusing. Of those two > I'd go with text_to_cstring_limit. Another thought that comes to > mind is > > void text_to_cstring_buffer(const text *src, char *dst, size_t dst_len) text_to_cstring_buffer seems okay. I did wonder for a bit whether it should be void text_to_cstring_buffer(const text *src, char *buf, size_t buf_len) but then the src/dst pair seems better than src/buf. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
> > extern text *cstring_to_text_with_len(const char *s, int len); buffer_to_text ??? Regards Pavel Stehule
"Brendan Jurd" <direvus@gmail.com> writes: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. Applied with revisions --- the functions were modified as per recent discussion, and I fixed a lot more potential call sites. There are no textout/textin calls left, but I may have missed some places that were doing it the hard way with direct palloc/memcpy manipulations. It might be worth trolling all the VARDATA() references to see if any more are easily replaceable. I notice in particular that xfunc.sgml contains sample C functions to copy and concatenate text. While these aren't directly replaceable with the new functions, I wonder whether we ought to change the examples to make them less certain to break if we ever change text's representation. regards, tom lane
On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > There are no textout/textin calls left, but I may have missed some > places that were doing it the hard way with direct palloc/memcpy > manipulations. It might be worth trolling all the VARDATA() references > to see if any more are easily replaceable. > I think there are perhaps a dozen such sites, and I hope to submit a cleanup patch that gets rid of these soon. I did come across one site I'm not sure about in utils/adt/xml.c:291 /* * We need a null-terminated string to pass to parse_xml_decl(). Rather * than make a separate copy, make the temporary result one byte bigger * than it needs to be. */ result = palloc(nbytes + 1 + VARHDRSZ); SET_VARSIZE(result, nbytes + VARHDRSZ); memcpy(VARDATA(result), str, nbytes); str = VARDATA(result); str[nbytes] = '\0'; The author seemed pretty sure he didn't want to make a separate copy of the string, so I'm thinking there's not a whole lot I can do to improve this site. > I notice in particular that xfunc.sgml contains sample C functions to > copy and concatenate text. While these aren't directly replaceable > with the new functions, I wonder whether we ought to change the examples > to make them less certain to break if we ever change text's > representation. > Yes, these sample C functions are shown in tutorial/funcs.c and funcs_new.c as well. I agree that the examples could do with changing, but don't have any keen insight on what to change them to.
"Brendan Jurd" <direvus@gmail.com> writes: > On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There are no textout/textin calls left, but I may have missed some >> places that were doing it the hard way with direct palloc/memcpy >> manipulations. It might be worth trolling all the VARDATA() references >> to see if any more are easily replaceable. > I think there are perhaps a dozen such sites, and I hope to submit a > cleanup patch that gets rid of these soon. > I did come across one site I'm not sure about in utils/adt/xml.c:291 I intentionally didn't touch xml.c, nor anyplace that is not dealing in text, even if it happens to be binary-compatible with text. regards, tom lane
On 29/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I intentionally didn't touch xml.c, nor anyplace that is not dealing > in text, even if it happens to be binary-compatible with text. > Hmm, okay. My original submission did include a few such changes; for example, in xml_in and xml_out_internal I saw that the conversion between xmltype and cstring was identical to the text conversion, so I went ahead and used the text functions. Feedback upthread suggested that it was okay to go ahead with casting in identical cases. [1] I saw that these changes made it into the commit, so I assumed that it was the right call. If we don't want to meddle with xmltype/bytea/VarChar at all, we'll have to revert those changes, and I'll have to seriously scale back the cleanup patch I was about to post. Regards, BJ [1] http://archives.postgresql.org/pgsql-hackers/2007-09/msg01094.php
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 - -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sat, Mar 29, 2008 at 5:40 AM, Brendan Jurd wrote: > On 29/03/2008, Tom Lane wrote: > > I intentionally didn't touch xml.c, nor anyplace that is not dealing > > in text, even if it happens to be binary-compatible with text. > > > > Hmm, okay. My original submission did include a few such changes; for > example, in xml_in and xml_out_internal I saw that the conversion > between xmltype and cstring was identical to the text conversion, so I > went ahead and used the text functions. Feedback upthread suggested > that it was okay to go ahead with casting in identical cases. [1] > > I saw that these changes made it into the commit, so I assumed that it > was the right call. > > If we don't want to meddle with xmltype/bytea/VarChar at all, we'll > have to revert those changes, and I'll have to seriously scale back > the cleanup patch I was about to post. Still not sure where we stand on the above. To cast, or not to cast? Cheers, BJ - -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBaFy5YBsbHkuyV0RAsMmAKDHaEb7aMudKJgVxcf5RRcOtAJ+bwCgivLI 5B3xJze46NGWjEyOpq/TSaY= =BObd - -----END PGP SIGNATURE----- -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBaIl5YBsbHkuyV0RArDDAJ0QThLXAhzy+NX+2YsF+q4z/sIy1QCeJPiW s/rVns3FFQVP5g9DTOshDfQ= =4Tdh -----END PGP SIGNATURE-----
"Brendan Jurd" <direvus@gmail.com> writes: >> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll >> have to revert those changes, and I'll have to seriously scale back >> the cleanup patch I was about to post. > Still not sure where we stand on the above. To cast, or not to cast? I dunno. I know there was previously some handwaving about representing XML values in some more intelligent fashion than a plain text string, but I have no idea if anyone is likely to do anything about it in the foreseeable future. regards, tom lane
Tom Lane wrote: > "Brendan Jurd" <direvus@gmail.com> writes: > >>> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll >>> have to revert those changes, and I'll have to seriously scale back >>> the cleanup patch I was about to post. >>> > > >> Still not sure where we stand on the above. To cast, or not to cast? >> > > I dunno. I know there was previously some handwaving about representing > XML values in some more intelligent fashion than a plain text string, > but I have no idea if anyone is likely to do anything about it in the > foreseeable future. > > > It seems very unlikely to me. cheers andrew
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Apr 17, 2008 at 1:16 AM, Tom Lane wrote: > "Brendan Jurd" writes: > > >> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll > >> have to revert those changes, and I'll have to seriously scale back > >> the cleanup patch I was about to post. > > > Still not sure where we stand on the above. To cast, or not to cast? > > I dunno. I know there was previously some handwaving about representing > XML values in some more intelligent fashion than a plain text string, > but I have no idea if anyone is likely to do anything about it in the > foreseeable future. > Well ... if somebody does want to change the representation of xml down the road, he's going to have to touch all the sites where the code converts to and from cstring anyway, right? So the only real difference this will make is that, instead of having to replace four lines of VARDATA/memcpy per site, he'll have to replace a single function call. That doesn't seem like a negative. With that in mind, please find attached my followup patch. It cleans up another 21 sites manually copying between cstring and varlena, for a net reduction of 115 lines of code. I didn't attempt to work through every reference to VARDATA, but I did look at every hit from a `grep -Rn VARDATA . | grep memcpy`. All regression tests passed (contrib tests included) on gentoo. Patch added to commitfest queue. Cheers, BJ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBm105YBsbHkuyV0RAmqfAKCfyNyFdciqX4QV81sG9MhPt+KXuACfe694 3d/ICZF6yqV6K20X3TVX+So= =CvRM -----END PGP SIGNATURE-----
Вложения
"Brendan Jurd" <direvus@gmail.com> writes: > Well ... if somebody does want to change the representation of xml > down the road, he's going to have to touch all the sites where the > code converts to and from cstring anyway, right? True. > With that in mind, please find attached my followup patch. It cleans > up another 21 sites manually copying between cstring and varlena, for > a net reduction of 115 lines of code. I applied most of this, but not the parts that were dependent on the assumption that bytea and text are the same. That is unlikely to remain true if we ever get around to putting locale/encoding information into text values. Furthermore it's a horrid type pun, because bytea can contain embedded nulls whereas cstrings can't; you were making unwarranted assumptions about whether, say, cstring_to_text_with_len would allow embedded nulls to go by. And lastly, quite a few of those changes were just plain broken, eg several places in selfuncs.c where you allowed strlen() to be applied to a "bytea converted to cstring". regards, tom lane