Thread: The "char" type versus non-ASCII characters


The "char" type versus non-ASCII characters

From
Tom Lane
Date:
[ breaking off a different new thread ]

Chapman Flack <chap@anastigmatix.net> writes:
> Then there's "char". It's category S, but does not apply the server
> encoding. You could call it an 8-bit int type, but it's typically used
> as a character, making it well-defined for ASCII values and not so
> for others, just like SQL_ASCII encoding. You could as well say that
> the "char" type has a defined encoding of SQL_ASCII at all times,
> regardless of the database encoding.

This reminds me of something I've been intending to bring up, which
is that the "char" type is not very encoding-safe.  charout() for
example just regurgitates the single byte as-is.  I think we deemed
that okay the last time anyone thought about it, but that was when
single-byte encodings were the mainstream usage for non-ASCII data.
If you're using UTF8 or another multi-byte server encoding, it's
quite easy to get an invalidly-encoded string this way, which at
minimum is going to break dump/restore scenarios.

I can think of at least three ways we might address this:

* Forbid all non-ASCII values for type "char".  This results in
simple and portable semantics, but it might break usages that
work okay today.

* Allow such values only in single-byte server encodings.  This
is a bit messy, but it wouldn't break any cases that are not
problematic already.

* Continue to allow non-ASCII values, but change charin/charout,
char_text, etc so that the external representation is encoding-safe
(perhaps make it an octal or decimal number).

Either of the first two ways would have to contemplate what to do
with disallowed values that snuck into the DB via pg_upgrade.
That leads me to think that the third way might be the most
preferable, even though it's not terribly backward-compatible.

There's a nearby issue that we might do something about at the
same time, which is that chartoi4() and i4tochar() think that
the byte value of a "char" is signed, while all the other
operations treat it as unsigned.  I wouldn't be too surprised if
this behavior is the direct cause of the bug fixed in a6bd28beb.
The issue vanishes if we forbid non-ASCII values, but otherwise
I'd be inclined to change these functions to treat the byte
values as unsigned.

Thoughts?

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Andrew Dunstan
Date:
On 12/3/21 14:12, Tom Lane wrote:
> [ breaking off a different new thread ]
>
> Chapman Flack <chap@anastigmatix.net> writes:
>> Then there's "char". It's category S, but does not apply the server
>> encoding. You could call it an 8-bit int type, but it's typically used
>> as a character, making it well-defined for ASCII values and not so
>> for others, just like SQL_ASCII encoding. You could as well say that
>> the "char" type has a defined encoding of SQL_ASCII at all times,
>> regardless of the database encoding.
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe.  charout() for
> example just regurgitates the single byte as-is.  I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
>
> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char".  This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings.  This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).
>
> Either of the first two ways would have to contemplate what to do
> with disallowed values that snuck into the DB via pg_upgrade.
> That leads me to think that the third way might be the most
> preferable, even though it's not terribly backward-compatible.
>


I don't like #2. Is #3 going to change the external representation only
for non-ASCII values? If so, that seems OK.  Changing it for ASCII
values seems ugly. #1 is the simplest to implement and to understand,
and I suspect it would break very little in practice, but others might
disagree with that assessment.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 12/3/21 14:12, Tom Lane wrote:
>> I can think of at least three ways we might address this:
>> 
>> * Forbid all non-ASCII values for type "char".  This results in
>> simple and portable semantics, but it might break usages that
>> work okay today.
>> 
>> * Allow such values only in single-byte server encodings.  This
>> is a bit messy, but it wouldn't break any cases that are not
>> problematic already.
>> 
>> * Continue to allow non-ASCII values, but change charin/charout,
>> char_text, etc so that the external representation is encoding-safe
>> (perhaps make it an octal or decimal number).

> I don't like #2.

Yeah, it's definitely messy --- for example, maybe é works in
a latin1 database but is rejected when you try to restore into
a DB with utf8 encoding.

> Is #3 going to change the external representation only
> for non-ASCII values? If so, that seems OK.

Right, I envisioned that ASCII behaves the same but we'd use
a numeric representation for high-bit-set values.  These
cases could be told apart fairly easily by charin(), since
the numeric representation would always be three digits.

> #1 is the simplest to implement and to understand,
> and I suspect it would break very little in practice, but others might
> disagree with that assessment.

We'd still have to decide what to do with pg_upgrade'd
non-ASCII values, so there's messiness there too.
Having charout() throw an error seems not very nice.

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Andrew Dunstan
Date:
On 12/3/21 14:42, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 12/3/21 14:12, Tom Lane wrote:
>>> I can think of at least three ways we might address this:
>>>
>>> * Forbid all non-ASCII values for type "char".  This results in
>>> simple and portable semantics, but it might break usages that
>>> work okay today.
>>>
>>> * Allow such values only in single-byte server encodings.  This
>>> is a bit messy, but it wouldn't break any cases that are not
>>> problematic already.
>>>
>>> * Continue to allow non-ASCII values, but change charin/charout,
>>> char_text, etc so that the external representation is encoding-safe
>>> (perhaps make it an octal or decimal number).
>> Is #3 going to change the external representation only
>> for non-ASCII values? If so, that seems OK.
> Right, I envisioned that ASCII behaves the same but we'd use
> a numeric representation for high-bit-set values.  These
> cases could be told apart fairly easily by charin(), since
> the numeric representation would always be three digits.


OK, this seems the most attractive. Can we also allow 2 hex digits?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 12/3/21 14:42, Tom Lane wrote:
>> Right, I envisioned that ASCII behaves the same but we'd use
>> a numeric representation for high-bit-set values.  These
>> cases could be told apart fairly easily by charin(), since
>> the numeric representation would always be three digits.

> OK, this seems the most attractive. Can we also allow 2 hex digits?

I think we should pick one base and stick to it.  I don't mind
hex if you have a preference for that.

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Chapman Flack
Date:
On 12/03/21 14:12, Tom Lane wrote:
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe.  charout() for
> example just regurgitates the single byte as-is.

I wonder if maybe what to do about that lies downstream of some other
thought about encoding-related type properties.

ISTM we don't, at present, have a clear story for types that have an
encoding (or repertoire) property that isn't one of (inapplicable,
server_encoding).

And yet such things exist, and more such things could or should exist
(NCHAR, healthier versions of xml or json, ...). "char" is an existing
example, because its current behavior is exactly as if it declared
"I am one byte of SQL_ASCII regardless of server setting".

Which is no trouble at all when the server setting is also SQL_ASCII.
But what does it mean when the server setting and the inherent
repertoire property of a type can be different? The present answer
isn't pretty.

When can charout() be called? typoutput functions don't have any
'internal' parameters, so nothing stops user code from calling them;
I don't know how often that's done, and that's a complication.
The canonical place for it to be called is inside printtup(), when
the client driver has requested format 0 for that attribute.

Up to that point, we could have known it was a type with SQL_ASCII
wired in, but after charout() we have a cstring, and printtup treats
that type as having the server encoding, and it goes through encoding
conversion from that to the client encoding in pq_sendcountedtext.

Indeed, cstring behaves completely as if it is a type with the server
encoding. If you send a cstring with format 1 rather than format 0,
while it is no longer subject to the encoding conversion done in
pq_sendcountedtext, it will dutifully perform the same conversion
in its own cstring_send. unknownsend is the same way.

But of course a "char" column in format 1 would never go through cstring;
char_send would be called, and just plop the byte in the buffer unchanged
(which is the same operation as an encoding conversion from SQL_ASCII
to anything).

Ever since I figured out I have to look at the send/recv functions
for a type to find out if it is encoding-dependent, I have to walk myself
through those steps again every time I forget why that is. Having
the type's character-encoding details show up in its send/recv functions
and not in its in/out functions never stops being counterintuitive to me.
But for server-encoding-dependent types, that's how it is: you don't
see it in the typoutput function, because on the format-0 path,
the transcoding happens in pq_sendcountedtext. But on the format-1 path,
the same transcoding happens, this time under the type's own control
in its typsend function.

That was the second thing that surprised me: we have what we call
a text and a binary path, but for an encoding-dependent type, neither
one is a path where transcoding doesn't happen!

The difference is, the format-0 transcoding is applied blindly,
in pq_sendcountedtext, with no surviving information about the data
type (which has become cstring by that point). In contrast, on the
format-1 path, the type's typsend is in control. In theory, that would
allow type-aware conversion; a smarter xml_send could use &#n; form
for characters that won't go in the client encoding, while the blind
pq transcoding on format 0 would just botch the data.

XML, in an ideal world, might live on disk in a form that cares nothing
for the server encoding, and be sent directly over the wire to a client
(it declares what encoding it's in) and presented to the application
over an XML-aware API that isn't hamstrung by the client's default
text encoding either.

But in the present world, we have somehow arrived at a setup where
there are only two paths that can take, and either one is a funnel
that can only be passed by data that survives both the client and
the server encoding.

The FE/BE docs have said "Text has format code zero, binary has format
code one, and all other format codes are reserved for future definition"
ever since 7.4. Maybe the time will come for a format 2, where you say
"here's an encoding ID and some bytes"?

This rambled on a bit far afield from "what should charout do with
non-ASCII values?". But honestly, either nobody is storing non-ASCII
values in "char", and we could make any choice there and nothing would
break, or somebody is doing that and their stuff would be broken by any
choice of change.

So, is the current "char" situation so urgent that it demands some
one-off solution be chosen for it, or could it be neglected with minimal
risk until someday we've defined what "this datatype has encoding X that's
different from the server encoding" means, and that takes care of it?

Regards,
-Chap



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Chapman Flack <chap@anastigmatix.net> writes:
> On 12/03/21 14:12, Tom Lane wrote:
>> This reminds me of something I've been intending to bring up, which
>> is that the "char" type is not very encoding-safe.  charout() for
>> example just regurgitates the single byte as-is.

> I wonder if maybe what to do about that lies downstream of some other
> thought about encoding-related type properties.

As you mentioned upthread, it's probably wrong to think of "char" as
character data at all.  The catalogs use it as a poor man's enum type,
and it's just for convenience that we assign readable ASCII codes for
the enum values of a given column.  The only reason to think of it as
encoding-dependent would be if you have ambitions to store a non-ASCII
character in a "char".  But I think that's something we want to
strongly discourage, even if we don't prohibit it altogether.  The
whole point of the type is to be one byte, so only in legacy encodings
can it possibly represent a non-ASCII character.

So I'm visualizing it as a uint8 that we happen to like to store
ASCII codes in, and that's what prompts the thought of using a
numeric representation for non-ASCII values.  I think you're just
in for pain if you want to consider such values as character data
rather than numbers.

> ... "char" is an existing
> example, because its current behavior is exactly as if it declared
> "I am one byte of SQL_ASCII regardless of server setting".

But it's not quite that.  If we treated it as SQL_ASCII, we'd refuse
to convert it to some other encoding unless the value passes encoding
verification, which is exactly what charout() is not doing.

> Indeed, cstring behaves completely as if it is a type with the server
> encoding.

Yup, cstring is definitely presumed to be in the server's encoding.

> So, is the current "char" situation so urgent that it demands some
> one-off solution be chosen for it, or could it be neglected with minimal
> risk until someday we've defined what "this datatype has encoding X that's
> different from the server encoding" means, and that takes care of it?

I'm not willing to leave it broken in the rather faint hope that
someday there will be a more general solution, especially since
I don't buy the premise that "char" ought to participate in any
such solution.

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Chapman Flack
Date:
On 12/04/21 11:34, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> "I am one byte of SQL_ASCII regardless of server setting".
> 
> But it's not quite that.  If we treated it as SQL_ASCII, we'd refuse
> to convert it to some other encoding unless the value passes encoding
> verification, which is exactly what charout() is not doing.

Ah, good point. I remembered noticing pg_do_encoding_conversion returning
the src pointer unchanged when SQL_ASCII is involved, but see that it does
verify the dest_encoding when SQL_ASCII is the source.

> encoding-dependent would be if you have ambitions to store a non-ASCII
> character in a "char".  But I think that's something we want to
> strongly discourage, even if we don't prohibit it altogether. ...
> So I'm visualizing it as a uint8 that we happen to like to store
> ASCII codes in, and that's what prompts the thought of using a
> numeric representation for non-ASCII values.

I'm in substantial agreement, though I also see that it is nearly always
set from a quoted literal, and tested against a quoted literal, and calls
itself "char", so I guess I am thinking for consistency's sake it might
be better not to invent some all-new convention for its text representation,
but adopt something that's already familiar, like bytea escaped format.
So it would always look and act like a one-octet bytea. Maybe have charin
accept either bytea-escaped or bytea-hex form too. (Or, never mind; when
restricted to one octet, bytea-hex and the \xhh bytea-escape form are
indistinguishable anyway.)

Then for free we get the property that if somebody today uses 'ű' as
an enum value, it might start appearing as '\xfb' now in dumps, etc.,
but their existing CASE WHEN thing = 'ű' code doesn't stop working
(as long as they haven't done something silly like change the encoding),
and they have the flexibility to update it to WHEN thing = '\xfb' as
time permits if they choose. If they don't, they accept the risk that
by switching to another encoding in the future, they may either see
their existing tests stop matching, or their existing literals fail
to parse, but there won't be invalidly-encoded strings created.

> Yup, cstring is definitely presumed to be in the server's encoding.

Without proposing to change it, I observe that by defining both cstring
and unknown in this way (with the latter being expressly the type of
any literal from the client destined for a type we don't know yet), we're
a bit painted into the corner as far as supporting types like NCHAR.
(I suppose clients could be banned from sending such values as literals,
and required to use extended form and bind them with a binary message.)
It's analogous to the way format-0 and format-1 both act as filters that
no encoding-dependent data can squish through without surviving both the
client and the server encoding, even if it is of a type that's defined
to be independent of either.

Regards,
-Chap



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Chapman Flack <chap@anastigmatix.net> writes:
> On 12/04/21 11:34, Tom Lane wrote:
>> So I'm visualizing it as a uint8 that we happen to like to store
>> ASCII codes in, and that's what prompts the thought of using a
>> numeric representation for non-ASCII values.

> I'm in substantial agreement, though I also see that it is nearly always
> set from a quoted literal, and tested against a quoted literal, and calls
> itself "char", so I guess I am thinking for consistency's sake it might
> be better not to invent some all-new convention for its text representation,
> but adopt something that's already familiar, like bytea escaped format.
> So it would always look and act like a one-octet bytea.

Hmm.  I don't have any great objection to that ... except that
I observe that bytea rejects a bare backslash:

regression=# select '\'::bytea;
ERROR:  invalid input syntax for type bytea

which would be incompatible with "char"'s existing behavior.  But as
long as we don't do that, I'd be okay with having high-bit-set char
values map to backslash-followed-by-three-octal-digits, which is
what bytea escape format would produce.

> Maybe have charin
> accept either bytea-escaped or bytea-hex form too.

That seems like more complexity than is warranted, although I suppose
that allowing easy interchange between char and bytea is worth
something.

One other point in this area is that charin does not currently object
to multiple input characters, it just discards the extra:

regression=# select 'foo'::"char";
 char 
------
 f
(1 row)

I think that was justified by analogy to

regression=# select 'foo'::char(1);
 bpchar 
--------
 f
(1 row)

but I think it would be a bad idea to preserve it once we introduce
any sort of mapping, because it'd mask mistakes.  So I'm envisioning
that charin should accept any single-byte string (including non-ASCII,
for backwards compatibility), but for multi-byte input throw an error
if it doesn't look like whatever numeric-ish mapping we settle on.

>> Yup, cstring is definitely presumed to be in the server's encoding.

> Without proposing to change it, I observe that by defining both cstring
> and unknown in this way (with the latter being expressly the type of
> any literal from the client destined for a type we don't know yet), we're
> a bit painted into the corner as far as supporting types like NCHAR.

Yeah, I'm not sure what to do about that.  We convert the query text
to server encoding before ever attempting to parse it, and I don't
think I want to contemplate trying to postpone that (... especially
not if the client encoding is an unsafe one like SJIS, as you
probably could not avoid SQL-injection hazards).  So an in-line
literal in some other encoding is basically impossible, or at least
pointless.  I'm inclined to think that NCHAR is another one in a
rather long list of not-that-well-thought-out SQL features.

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Chapman Flack
Date:
On 12/05/21 12:01, Tom Lane wrote:
> regression=# select '\'::bytea;
> ERROR:  invalid input syntax for type bytea
> 
> which would be incompatible with "char"'s existing behavior.  But as
> long as we don't do that, I'd be okay with having high-bit-set char
> values map to backslash-followed-by-three-octal-digits, which is
> what bytea escape format would produce.

Is that a proposal to change nothing about the current treatment
of values < 128, or just to avoid rejecting bare '\'?

It seems defensible to relax the error treatment of bare backslash,
as it isn't inherently ambiguous; it functions more as an "are you sure
you weren't starting to write an escape sequence here?" check. If it's
a backslash with nothing after it and you assume the user wrote what
they meant, then it's not hard to tell what they meant.

If there's a way to factor out and reuse the good parts of byteain,
that would mean '\\' would also be accepted to mean a backslash,
and the \r \n \t usual escapes would be accepted too, and \ooo and
\xhh.

>> Maybe have charin
>> accept either bytea-escaped or bytea-hex form too.
> 
> That seems like more complexity than is warranted

I think it ends up being no more complexity at all, because a single
octet in bytea-hex form looks like \xhh, which is exactly what
a single \xhh in bytea-escape form looks like.

I suppose it's important to consider what comparisons like c = '\'
and c = '\\' mean, which should be just fine when the type analysis
produces char = char or char = unknown, but could be surprising if it
doesn't.

Regards,
-Chap



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Chapman Flack <chap@anastigmatix.net> writes:
> On 12/05/21 12:01, Tom Lane wrote:
>> regression=# select '\'::bytea;
>> ERROR:  invalid input syntax for type bytea
>> 
>> which would be incompatible with "char"'s existing behavior.  But as
>> long as we don't do that, I'd be okay with having high-bit-set char
>> values map to backslash-followed-by-three-octal-digits, which is
>> what bytea escape format would produce.

> Is that a proposal to change nothing about the current treatment
> of values < 128, or just to avoid rejecting bare '\'?

I intended to change nothing about charin's treatment of ASCII
characters, nor anything about bytea's behavior.  I don't think
we should relax the error checks in the latter.  That does mean
that backslash becomes a problem for the idea of transparent
conversion from char to bytea or vice versa.  We could think
about emitting backslash as '\\' in charout, I suppose.  I'm
not really convinced though that bytea compatibility is worth
changing a case that's non-problematic today.

> If there's a way to factor out and reuse the good parts of byteain,
> that would mean '\\' would also be accepted to mean a backslash,
> and the \r \n \t usual escapes would be accepted too, and \ooo and
> \xhh.

Uh, what?

regression=# select '\n'::bytea;
ERROR:  invalid input syntax for type bytea

But I doubt that sharing code here would be worth the trouble.
The vast majority of byteain is concerned with managing the
string length, which is a nonissue for charin.

> I think it ends up being no more complexity at all, because a single
> octet in bytea-hex form looks like \xhh, which is exactly what
> a single \xhh in bytea-escape form looks like.

I'm confused by this statement too.  AFAIK the alternatives in
bytea are \xhh or \ooo:

regression=# select '\xEE'::bytea;
 bytea 
-------
 \xee
(1 row)

regression=# set bytea_output to escape;
SET
regression=# select '\xEE'::bytea;
 bytea 
-------
 \356
(1 row)

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Chapman Flack
Date:
On 12/05/21 14:51, Tom Lane wrote:
> Uh, what?
> 
> regression=# select '\n'::bytea;
> ERROR:  invalid input syntax for type bytea

Wow, I was completely out to lunch there somehow. Sorry. None of those
other escaped forms are known to byteain, other than '\\' and ''''
according to table 8.7. I can't even explain why I thought that.

> I'm confused by this statement too.  AFAIK the alternatives in
> bytea are \xhh or \ooo:

Here I think I can at least tell where I went wrong; I saw both an
octal and a hex column in table 8.7, which I saw located under the
"bytea escape format" heading, and without testing carefully enough,
I assumed it was telling me that either format would be recognized on
input, which would certainly be possible, but clearly I was carrying
over too many assumptions from other escape formats where I'm used to
that being the case. If I wanted to prevent another reader making my
exact mistake, I might re-title those two table columns to be
"In bytea escape format" and "In bytea hex format" to make it more clear
the table is combining information for both formats.

I'm sure I did test SELECT '\x41'::bytea, but that proved nothing,
being simply interpreted as the hex input format. I should have
tried SELECT 'A\x41'::bytea, and would have immediately seen it rejected.

I've just looked at datatypes.sgml, where I was expecting to see that
table 8.7 actually falls outside of the sect2 for "bytea escape format",
and that I had simply misinterpreted it because the structural nesting
isn't obvious in the rendered HTML.

But what I found was that the table actually /is/ nested inside the
"bytea escape format" section, and in the generated HTML it is within
the div for that section, and the table's own div has the ID
DATATYPE-BINARY-SQLESC.

The change history there appears complex. The table already existed
at the time of a2a8c7a, which made a "bytea escape format" sect2 out
of the existing text that included the table, and added a separate
"bytea hex format" sect2. But the table at that point showed only the
input and output representations for the escape format, didn't say
anything about hex format, and wasn't touched in that commit.

Nine years later, f77de4b changed the values in the rightmost column
to hex form, but only because that was then the "output representation"
column and the default output format had been changed to hex.

Five months after that, f10a20e changed the heading of that column
from "output representation" to "hex representation", probably because
the values in that column by then were hex. So it ended up as a table
that is structurally part of the "bytea escape format" section,
whose rightmost column shows a hex format, and therefore (ahem)
could suggest to a reader (who doesn't rush to psql and test it
thoroughly) that a hex format is accepted there.

Still, I could have avoided that if I had better tested my reading
first.

Regards,
-Chap



Re: The "char" type versus non-ASCII characters

From
Peter Eisentraut
Date:
On 03.12.21 21:13, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 12/3/21 14:42, Tom Lane wrote:
>>> Right, I envisioned that ASCII behaves the same but we'd use
>>> a numeric representation for high-bit-set values.  These
>>> cases could be told apart fairly easily by charin(), since
>>> the numeric representation would always be three digits.
> 
>> OK, this seems the most attractive. Can we also allow 2 hex digits?
> 
> I think we should pick one base and stick to it.  I don't mind
> hex if you have a preference for that.

I think we could consider char to be a single-byte bytea and use the 
escape format of bytea for char.  That way there is some precedent and 
we don't add yet another encoding or escape format.



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> I think we could consider char to be a single-byte bytea and use the 
> escape format of bytea for char.  That way there is some precedent and 
> we don't add yet another encoding or escape format.

Do you want to take that as far as changing backslash to print
as '\\' ?

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
I wrote:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> I think we could consider char to be a single-byte bytea and use the
>> escape format of bytea for char.  That way there is some precedent and
>> we don't add yet another encoding or escape format.

> Do you want to take that as far as changing backslash to print
> as '\\' ?

This came up again today [1], so here's a concrete proposal.
Let's use \ooo for high-bit-set chars, but keep backslash as just
backslash (so it's only semi-compatible with bytea).

            regards, tom lane

[1] https://www.postgresql.org/message-id/CAFM5RapGbBQm%2BdH%3D7K80HcvBvEWiV5Tm7N%3DNRaYURfm98YWc8A%40mail.gmail.com

diff --git a/src/backend/utils/adt/char.c b/src/backend/utils/adt/char.c
index 0df41c2253..e50293bf14 100644
--- a/src/backend/utils/adt/char.c
+++ b/src/backend/utils/adt/char.c
@@ -20,6 +20,11 @@
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"

+#define ISOCTAL(c)   (((c) >= '0') && ((c) <= '7'))
+#define TOOCTAL(c)   ((c) + '0')
+#define FROMOCTAL(c) ((unsigned char) (c) - '0')
+
+
 /*****************************************************************************
  *     USER I/O ROUTINES                                                         *
  *****************************************************************************/
@@ -27,31 +32,53 @@
 /*
  *        charin            - converts "x" to 'x'
  *
- * Note that an empty input string will implicitly be converted to \0.
+ * This accepts the formats charout produces.  If we have multibyte input
+ * that is not in the form '\ooo', then we take its first byte as the value
+ * and silently discard the rest; this is a backwards-compatibility provision.
  */
 Datum
 charin(PG_FUNCTION_ARGS)
 {
     char       *ch = PG_GETARG_CSTRING(0);

+    if (strlen(ch) == 4 && ch[0] == '\\' &&
+        ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3]))
+        PG_RETURN_CHAR((FROMOCTAL(ch[1]) << 6) +
+                       (FROMOCTAL(ch[2]) << 3) +
+                       FROMOCTAL(ch[3]));
+    /* This will do the right thing for a zero-length input string */
     PG_RETURN_CHAR(ch[0]);
 }

 /*
  *        charout            - converts 'x' to "x"
  *
- * Note that if the char value is \0, the resulting string will appear
- * to be empty (null-terminated after zero characters).  So this is the
- * inverse of the charin() function for such data.
+ * The possible output formats are:
+ * 1. 0x00 is represented as an empty string.
+ * 2. 0x01..0x7F are represented as a single ASCII byte.
+ * 3. 0x80..0xFF are represented as \ooo (backslash and 3 octal digits).
+ * Case 3 is meant to match the traditional "escape" format of bytea.
  */
 Datum
 charout(PG_FUNCTION_ARGS)
 {
     char        ch = PG_GETARG_CHAR(0);
-    char       *result = (char *) palloc(2);
+    char       *result = (char *) palloc(5);

-    result[0] = ch;
-    result[1] = '\0';
+    if (IS_HIGHBIT_SET(ch))
+    {
+        result[0] = '\\';
+        result[1] = TOOCTAL(((unsigned char) ch) >> 6);
+        result[2] = TOOCTAL((((unsigned char) ch) >> 3) & 07);
+        result[3] = TOOCTAL(((unsigned char) ch) & 07);
+        result[4] = '\0';
+    }
+    else
+    {
+        /* This produces acceptable results for 0x00 as well */
+        result[0] = ch;
+        result[1] = '\0';
+    }
     PG_RETURN_CSTRING(result);
 }

@@ -176,15 +203,20 @@ Datum
 text_char(PG_FUNCTION_ARGS)
 {
     text       *arg1 = PG_GETARG_TEXT_PP(0);
+    char       *ch = VARDATA_ANY(arg1);
     char        result;

     /*
-     * An empty input string is converted to \0 (for consistency with charin).
-     * If the input is longer than one character, the excess data is silently
-     * discarded.
+     * Conversion rules are the same as in charin(), but here we need to
+     * handle the empty-string case honestly.
      */
-    if (VARSIZE_ANY_EXHDR(arg1) > 0)
-        result = *(VARDATA_ANY(arg1));
+    if (VARSIZE_ANY_EXHDR(arg1) == 4 && ch[0] == '\\' &&
+        ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3]))
+        result = (FROMOCTAL(ch[1]) << 6) +
+            (FROMOCTAL(ch[2]) << 3) +
+            FROMOCTAL(ch[3]);
+    else if (VARSIZE_ANY_EXHDR(arg1) > 0)
+        result = ch[0];
     else
         result = '\0';

@@ -195,13 +227,21 @@ Datum
 char_text(PG_FUNCTION_ARGS)
 {
     char        arg1 = PG_GETARG_CHAR(0);
-    text       *result = palloc(VARHDRSZ + 1);
+    text       *result = palloc(VARHDRSZ + 4);

     /*
-     * Convert \0 to an empty string, for consistency with charout (and
-     * because the text stuff doesn't like embedded nulls all that well).
+     * Conversion rules are the same as in charout(), but here we need to be
+     * honest about converting 0x00 to an empty string.
      */
-    if (arg1 != '\0')
+    if (IS_HIGHBIT_SET(arg1))
+    {
+        SET_VARSIZE(result, VARHDRSZ + 4);
+        (VARDATA(result))[0] = '\\';
+        (VARDATA(result))[1] = TOOCTAL(((unsigned char) arg1) >> 6);
+        (VARDATA(result))[2] = TOOCTAL((((unsigned char) arg1) >> 3) & 07);
+        (VARDATA(result))[3] = TOOCTAL(((unsigned char) arg1) & 07);
+    }
+    else if (arg1 != '\0')
     {
         SET_VARSIZE(result, VARHDRSZ + 1);
         *(VARDATA(result)) = arg1;

Re: The "char" type versus non-ASCII characters

From
Nikolay Shaplov
Date:
In a message of Friday, December 3, 2021 22:12:10 MSK, Tom Lane
wrote:
> which
> is that the "char" type is not very encoding-safe.  charout() for
> example just regurgitates the single byte as-is.  I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
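The hazard described above is easy to reproduce outside the server. A minimal Python sketch (an illustration, not part of the thread), assuming a UTF-8 database encoding:

```python
# U+00E9 ("é") occupies two bytes in UTF-8: 0xC3 0xA9.
# A "char" column keeps only the first byte.
first_byte = "é".encode("utf-8")[0]          # 0xC3

def is_valid_utf8(bs: bytes) -> bool:
    """Check whether a byte string is well-formed UTF-8."""
    try:
        bs.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Pre-patch charout() regurgitates the raw byte: 0xC3 alone is not
# valid UTF-8, so a dump containing it is invalidly encoded.
assert not is_valid_utf8(bytes([first_byte]))

# The proposed fix emits high-bit-set bytes as backslash plus three
# octal digits, which is pure ASCII and therefore safe in any encoding.
def new_charout(b: int) -> str:
    return "\\%03o" % b if b >= 0x80 else chr(b)

assert new_charout(first_byte) == "\\303"
```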

As I've mentioned in another thread, I was very surprised, and also very
confused, when I first saw the "char" type name.

This leads me to an idea: perhaps, as we fix the "char" behaviour, we should
also rename it to something more self-explanatory, like ascii_char or
something similar.
Or, better, add an ascii_char type with the behaviour we need, update the
system catalogs to use it, and keep "char" with its old behaviour in
"deprecated" status in case somebody is still using it, giving them time to
migrate to something more decent: ascii_char or char(1).

I've also talked to someone who knows Postgres history very well; he told me
that "char" has existed since at least Postgres version 3.1, which also had
"char16", and that in v4 "char2", "char4", and "char8" were added. But later
on they were all removed, and only "char" remains.

Also, "char" has nothing in common with the SQL standard, and it actually
looks quite unnatural.  Maybe it is time to get rid of it too, if we are
changing this part of the code...

> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char".  This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings.  This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).

This would give us a solid option #1 for ascii_char, with deprecation and
removal of "char" later on.

--
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su

Attachments

Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Nikolay Shaplov <dhyan@nataraj.su> writes:
> This leads me to an idea that may be as we fix "char" behaviour, we should also
> change it's name to something more speaking for itself.

I don't think this is going to happen.  It's especially not going to
happen in the back branches.  But in any case, what I'm looking for is
the minimum compatibility breakage needed to fix the encoding-unsafety
problem.  Renaming the type goes far beyond that.  It'd likely break
some client code that examines the system catalogs, for little gain.

            regards, tom lane



Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
I wrote:
> This came up again today [1], so here's a concrete proposal.
> Let's use \ooo for high-bit-set chars, but keep backslash as just
> backslash (so it's only semi-compatible with bytea).

Hearing no howls of protest, here's a fleshed out, potentially-committable
version.  I added some regression test coverage for the modified code.
(I also fixed an astonishingly obsolete comment about what the regular
char type does.)  I looked at the SGML docs too, but I don't think there
is anything to change there.  The docs say "single-byte internal type"
and are silent about "char" beyond that.  I think that's exactly where
we want to be: any more detail would encourage people to use the type,
which we don't really want.  Possibly we could change the text to
"single-byte internal type, meant to hold ASCII characters" but I'm
not sure that's better.

The next question is what to do with this.  I propose to commit it into
HEAD and v15 before next week's beta3 release.  If we don't get a lot
of pushback, we could consider back-patching further for the November
releases; but I'm hesitant to shove something like this into stable
branches with only a week's notice.

            regards, tom lane

diff --git a/src/backend/utils/adt/char.c b/src/backend/utils/adt/char.c
index 0df41c2253..e50293bf14 100644
--- a/src/backend/utils/adt/char.c
+++ b/src/backend/utils/adt/char.c
@@ -20,6 +20,11 @@
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"

+#define ISOCTAL(c)   (((c) >= '0') && ((c) <= '7'))
+#define TOOCTAL(c)   ((c) + '0')
+#define FROMOCTAL(c) ((unsigned char) (c) - '0')
+
+
 /*****************************************************************************
  *     USER I/O ROUTINES                                                         *
  *****************************************************************************/
@@ -27,31 +32,53 @@
 /*
  *        charin            - converts "x" to 'x'
  *
- * Note that an empty input string will implicitly be converted to \0.
+ * This accepts the formats charout produces.  If we have multibyte input
+ * that is not in the form '\ooo', then we take its first byte as the value
+ * and silently discard the rest; this is a backwards-compatibility provision.
  */
 Datum
 charin(PG_FUNCTION_ARGS)
 {
     char       *ch = PG_GETARG_CSTRING(0);

+    if (strlen(ch) == 4 && ch[0] == '\\' &&
+        ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3]))
+        PG_RETURN_CHAR((FROMOCTAL(ch[1]) << 6) +
+                       (FROMOCTAL(ch[2]) << 3) +
+                       FROMOCTAL(ch[3]));
+    /* This will do the right thing for a zero-length input string */
     PG_RETURN_CHAR(ch[0]);
 }

 /*
  *        charout            - converts 'x' to "x"
  *
- * Note that if the char value is \0, the resulting string will appear
- * to be empty (null-terminated after zero characters).  So this is the
- * inverse of the charin() function for such data.
+ * The possible output formats are:
+ * 1. 0x00 is represented as an empty string.
+ * 2. 0x01..0x7F are represented as a single ASCII byte.
+ * 3. 0x80..0xFF are represented as \ooo (backslash and 3 octal digits).
+ * Case 3 is meant to match the traditional "escape" format of bytea.
  */
 Datum
 charout(PG_FUNCTION_ARGS)
 {
     char        ch = PG_GETARG_CHAR(0);
-    char       *result = (char *) palloc(2);
+    char       *result = (char *) palloc(5);

-    result[0] = ch;
-    result[1] = '\0';
+    if (IS_HIGHBIT_SET(ch))
+    {
+        result[0] = '\\';
+        result[1] = TOOCTAL(((unsigned char) ch) >> 6);
+        result[2] = TOOCTAL((((unsigned char) ch) >> 3) & 07);
+        result[3] = TOOCTAL(((unsigned char) ch) & 07);
+        result[4] = '\0';
+    }
+    else
+    {
+        /* This produces acceptable results for 0x00 as well */
+        result[0] = ch;
+        result[1] = '\0';
+    }
     PG_RETURN_CSTRING(result);
 }

@@ -176,15 +203,20 @@ Datum
 text_char(PG_FUNCTION_ARGS)
 {
     text       *arg1 = PG_GETARG_TEXT_PP(0);
+    char       *ch = VARDATA_ANY(arg1);
     char        result;

     /*
-     * An empty input string is converted to \0 (for consistency with charin).
-     * If the input is longer than one character, the excess data is silently
-     * discarded.
+     * Conversion rules are the same as in charin(), but here we need to
+     * handle the empty-string case honestly.
      */
-    if (VARSIZE_ANY_EXHDR(arg1) > 0)
-        result = *(VARDATA_ANY(arg1));
+    if (VARSIZE_ANY_EXHDR(arg1) == 4 && ch[0] == '\\' &&
+        ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3]))
+        result = (FROMOCTAL(ch[1]) << 6) +
+            (FROMOCTAL(ch[2]) << 3) +
+            FROMOCTAL(ch[3]);
+    else if (VARSIZE_ANY_EXHDR(arg1) > 0)
+        result = ch[0];
     else
         result = '\0';

@@ -195,13 +227,21 @@ Datum
 char_text(PG_FUNCTION_ARGS)
 {
     char        arg1 = PG_GETARG_CHAR(0);
-    text       *result = palloc(VARHDRSZ + 1);
+    text       *result = palloc(VARHDRSZ + 4);

     /*
-     * Convert \0 to an empty string, for consistency with charout (and
-     * because the text stuff doesn't like embedded nulls all that well).
+     * Conversion rules are the same as in charout(), but here we need to be
+     * honest about converting 0x00 to an empty string.
      */
-    if (arg1 != '\0')
+    if (IS_HIGHBIT_SET(arg1))
+    {
+        SET_VARSIZE(result, VARHDRSZ + 4);
+        (VARDATA(result))[0] = '\\';
+        (VARDATA(result))[1] = TOOCTAL(((unsigned char) arg1) >> 6);
+        (VARDATA(result))[2] = TOOCTAL((((unsigned char) arg1) >> 3) & 07);
+        (VARDATA(result))[3] = TOOCTAL(((unsigned char) arg1) & 07);
+    }
+    else if (arg1 != '\0')
     {
         SET_VARSIZE(result, VARHDRSZ + 1);
         *(VARDATA(result)) = arg1;
diff --git a/src/test/regress/expected/char.out b/src/test/regress/expected/char.out
index 2d78f90f3b..ea9b0b8eeb 100644
--- a/src/test/regress/expected/char.out
+++ b/src/test/regress/expected/char.out
@@ -1,8 +1,8 @@
 --
 -- CHAR
 --
--- fixed-length by value
--- internally passed by value if <= 4 bytes in storage
+-- Per SQL standard, CHAR means character(1), that is a varlena type
+-- with a constraint restricting it to one character (not byte)
 SELECT char 'c' = char 'c' AS true;
  true
 ------
@@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL;
  abcd
 (4 rows)

+--
+-- Also test "char", which is an ad-hoc one-byte type.  It can only
+-- really store ASCII characters, but we allow high-bit-set characters
+-- to be accessed via bytea-like escapes.
+--
+SELECT 'a'::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\101'::"char";
+ char
+------
+ A
+(1 row)
+
+SELECT '\377'::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT 'a'::"char"::text;
+ text
+------
+ a
+(1 row)
+
+SELECT '\377'::"char"::text;
+ text
+------
+ \377
+(1 row)
+
+SELECT '\000'::"char"::text;
+ text
+------
+
+(1 row)
+
+SELECT 'a'::text::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\377'::text::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT ''::text::"char";
+ char
+------
+
+(1 row)
+
diff --git a/src/test/regress/expected/char_1.out b/src/test/regress/expected/char_1.out
index fa6644d692..ffd31551de 100644
--- a/src/test/regress/expected/char_1.out
+++ b/src/test/regress/expected/char_1.out
@@ -1,8 +1,8 @@
 --
 -- CHAR
 --
--- fixed-length by value
--- internally passed by value if <= 4 bytes in storage
+-- Per SQL standard, CHAR means character(1), that is a varlena type
+-- with a constraint restricting it to one character (not byte)
 SELECT char 'c' = char 'c' AS true;
  true
 ------
@@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL;
  abcd
 (4 rows)

+--
+-- Also test "char", which is an ad-hoc one-byte type.  It can only
+-- really store ASCII characters, but we allow high-bit-set characters
+-- to be accessed via bytea-like escapes.
+--
+SELECT 'a'::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\101'::"char";
+ char
+------
+ A
+(1 row)
+
+SELECT '\377'::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT 'a'::"char"::text;
+ text
+------
+ a
+(1 row)
+
+SELECT '\377'::"char"::text;
+ text
+------
+ \377
+(1 row)
+
+SELECT '\000'::"char"::text;
+ text
+------
+
+(1 row)
+
+SELECT 'a'::text::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\377'::text::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT ''::text::"char";
+ char
+------
+
+(1 row)
+
diff --git a/src/test/regress/expected/char_2.out b/src/test/regress/expected/char_2.out
index 09434a44cd..56818f824b 100644
--- a/src/test/regress/expected/char_2.out
+++ b/src/test/regress/expected/char_2.out
@@ -1,8 +1,8 @@
 --
 -- CHAR
 --
--- fixed-length by value
--- internally passed by value if <= 4 bytes in storage
+-- Per SQL standard, CHAR means character(1), that is a varlena type
+-- with a constraint restricting it to one character (not byte)
 SELECT char 'c' = char 'c' AS true;
  true
 ------
@@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL;
  abcd
 (4 rows)

+--
+-- Also test "char", which is an ad-hoc one-byte type.  It can only
+-- really store ASCII characters, but we allow high-bit-set characters
+-- to be accessed via bytea-like escapes.
+--
+SELECT 'a'::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\101'::"char";
+ char
+------
+ A
+(1 row)
+
+SELECT '\377'::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT 'a'::"char"::text;
+ text
+------
+ a
+(1 row)
+
+SELECT '\377'::"char"::text;
+ text
+------
+ \377
+(1 row)
+
+SELECT '\000'::"char"::text;
+ text
+------
+
+(1 row)
+
+SELECT 'a'::text::"char";
+ char
+------
+ a
+(1 row)
+
+SELECT '\377'::text::"char";
+ char
+------
+ \377
+(1 row)
+
+SELECT ''::text::"char";
+ char
+------
+
+(1 row)
+
diff --git a/src/test/regress/sql/char.sql b/src/test/regress/sql/char.sql
index 9c83c45e34..120fed53e5 100644
--- a/src/test/regress/sql/char.sql
+++ b/src/test/regress/sql/char.sql
@@ -2,8 +2,8 @@
 -- CHAR
 --

--- fixed-length by value
--- internally passed by value if <= 4 bytes in storage
+-- Per SQL standard, CHAR means character(1), that is a varlena type
+-- with a constraint restricting it to one character (not byte)

 SELECT char 'c' = char 'c' AS true;

@@ -71,3 +71,19 @@ DROP TABLE CHAR_TBL;
 INSERT INTO CHAR_TBL (f1) VALUES ('abcde');

 SELECT * FROM CHAR_TBL;
+
+--
+-- Also test "char", which is an ad-hoc one-byte type.  It can only
+-- really store ASCII characters, but we allow high-bit-set characters
+-- to be accessed via bytea-like escapes.
+--
+
+SELECT 'a'::"char";
+SELECT '\101'::"char";
+SELECT '\377'::"char";
+SELECT 'a'::"char"::text;
+SELECT '\377'::"char"::text;
+SELECT '\000'::"char"::text;
+SELECT 'a'::text::"char";
+SELECT '\377'::text::"char";
+SELECT ''::text::"char";

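For reference, the round trip the patched charin()/charout() pair implements can be sketched in a few lines of Python (a sketch of the \ooo convention described in the patch comments, not part of the patch itself):

```python
def charout(b: int) -> str:
    """Render one "char" byte the way the patched charout() does."""
    if b == 0:
        return ""                      # 0x00 becomes an empty string
    if b < 0x80:
        return chr(b)                  # 0x01..0x7F: a single ASCII byte
    return "\\%03o" % b                # 0x80..0xFF: backslash + 3 octal digits

def charin(s: str) -> int:
    """Accept the formats charout produces, like the patched charin()."""
    if len(s) == 4 and s[0] == "\\" and all(c in "01234567" for c in s[1:]):
        return int(s[1:], 8)           # \ooo escape
    return ord(s[0]) if s else 0       # first byte, or 0x00 for empty input

# Every possible byte value survives a round trip:
assert all(charin(charout(b)) == b for b in range(256))
```

Note that a bare backslash (0x5C) is below 0x80, so charout() emits it as just a backslash, which charin() reads back unchanged; this is the sense in which the format is only semi-compatible with bytea's "escape" form.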
Re: The "char" type versus non-ASCII characters

From
Andrew Dunstan
Date:
On 2022-07-31 Su 18:25, Tom Lane wrote:
> I wrote:
>> This came up again today [1], so here's a concrete proposal.
>> Let's use \ooo for high-bit-set chars, but keep backslash as just
>> backslash (so it's only semi-compatible with bytea).
> Hearing no howls of protest, here's a fleshed out, potentially-committable
> version.  I added some regression test coverage for the modified code.
> (I also fixed an astonishingly obsolete comment about what the regular
> char type does.)  I looked at the SGML docs too, but I don't think there
> is anything to change there.  The docs say "single-byte internal type"
> and are silent about "char" beyond that.  I think that's exactly where
> we want to be: any more detail would encourage people to use the type,
> which we don't really want.  Possibly we could change the text to
> "single-byte internal type, meant to hold ASCII characters" but I'm
> not sure that's better.
>
> The next question is what to do with this.  I propose to commit it into
> HEAD and v15 before next week's beta3 release.  If we don't get a lot
> of pushback, we could consider back-patching further for the November
> releases; but I'm hesitant to shove something like this into stable
> branches with only a week's notice.
>
>             


Maybe we should add some words to the docs explicitly discouraging its
use in user tables.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: The "char" type versus non-ASCII characters

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2022-07-31 Su 18:25, Tom Lane wrote:
>> ... I looked at the SGML docs too, but I don't think there
>> is anything to change there.  The docs say "single-byte internal type"
>> and are silent about "char" beyond that.  I think that's exactly where
>> we want to be: any more detail would encourage people to use the type,
>> which we don't really want.  Possibly we could change the text to
>> "single-byte internal type, meant to hold ASCII characters" but I'm
>> not sure that's better.

> Maybe we should add some words to the docs explicitly discouraging its
> use in user tables.

Hmm, I thought we already did --- but you're right, the intro para
for Table 8.5 only explicitly discourages use of "name".  We
probably want similar wording for both types.  Maybe like

    There are two other fixed-length character types in PostgreSQL, shown
    in Table 8.5.  Both are used in the system catalogs and are not
    intended for use in user tables.  The name type ...

            regards, tom lane