Re: Unicode string literals versus the world
From | Sam Mason |
---|---|
Subject | Re: Unicode string literals versus the world |
Date | |
Msg-id | 20090416160808.GO12225@frubble.xen.chris-lamb.co.uk |
In reply to | Re: Unicode string literals versus the world (Marko Kreen <markokr@gmail.com>) |
List | pgsql-hackers |
On Thu, Apr 16, 2009 at 06:34:06PM +0300, Marko Kreen wrote:
> Which hints that you can as well enter the pairs directly: \uxx\uxx.
> If I'd be language designer, I would not see any reason to disallow it.
>
> And anyway, at least mono seems to support it:
>
>   using System;
>   public class HelloWorld {
>       public static void Main() {
>           Console.WriteLine("<\uD800\uDF02>\n");
>       }
>   }
>
> It will output single UTF8 character.  I think this should settle it.

I don't have any .net stuff installed so can't test; but C# is defined to use UTF-16 as its internal representation, so it would make sense if the above gets treated as a single character internally.  However, if it used any other encoding the above should be treated as an error.

> The de-facto about Postgres is stdstr=off.  Even if not, E'' strings
> are still better for various things, so it would be good if they also
> acquired unicode-capabilities.

OK, this seems independent of the U&'lit' discussion that started the thread.  Note that PG already supports UTF8; if you want the character I've been using in my examples up-thread, you can do:

  SELECT E'\xF0\x90\x8C\x82';

I have a feeling that this is predicated on the server_encoding being set to "utf8", and this can only be done at database creation time.  Another alternative would be to use the convert_from function, i.e.:

  SELECT convert_from(E'\xF0\x90\x8C\x82', 'UTF8');

Never had to do this though, so there may be better options available.

> Python's internal representation is *not* UTF-16, but plain UCS2/UCS4,
> that is - plain 16 or 32-bit values.  Seems your python is compiled with
> UCS2, not UCS4.

Cool, I didn't know that.  I believe mine is UCS4 as I can do:

  ord(u'\U00010302')

and I get 66306 back rather than an error.

> As I understand, in UCS2 mode it simply takes surrogate
> values as-is.

UCS2 doesn't have surrogate pairs, or at least I believe it's considered a bug if you don't get an error when you present it with one.
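The numbers quoted so far (the pair \uD800\uDF02, the code point 66306, and the bytes F0 90 8C 82) all describe the same character, U+10302. A minimal sketch of the arithmetic, assuming Python 3 rather than the Python 2 builds discussed here; the helper names `combine_surrogates` and `split_code_point` are made up for illustration and are not part of any library:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(code_point):
    """Inverse operation: split a supplementary code point into a pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


code_point = combine_surrogates(0xD800, 0xDF02)    # 0x10302, i.e. 66306
utf8_bytes = chr(code_point).encode("utf-8")       # b'\xf0\x90\x8c\x82'
pair = split_code_point(code_point)                # (0xD800, 0xDF02)
```

The four bytes in `utf8_bytes` are the ones spelled out in the E'' literal above, which is why the SELECT yields the same character that the C# example writes as a surrogate pair.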
> From ord() docs:
>
>   If a unicode argument is given and Python was built with UCS2 Unicode,
>   then the character’s code point must be in the range [0..65535]
>   inclusive; otherwise the string length is two, and a TypeError will
>   be raised.
>
> So only in UCS4 mode it detects surrogates and converts them to internal
> representation.  (Which in Postgres case would be UTF8.)

I think you mean UTF-16 instead of UCS4; but otherwise, yes.

> Or perhaps it is partially UTF16 aware - eg. I/O routines do understand
> UTF16 but low-level string routines do not:
>
>   print "<%s>" % u'\uD800\uDF02'
>
> seems to handle it properly.

Yes, I get this as well.  It's all a bit weird, which is why I was asking if "this a bug in Python, my understanding, or something else".

When I do:

  python <<EOF | hexdump -C
  print u"\uD800\uDF02"
  EOF

to see what it's doing, I get an error which I'm not expecting; hence I think it's probably my understanding.

-- 
  Sam  http://samason.me.uk/
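One guess at the error in the hexdump experiment is that Python 2 falls back to the ASCII codec when stdout is a pipe rather than a terminal, so the implicit encode in print fails. That is an assumption, not something verified against the build used in this thread. The sketch below sidesteps the question by encoding explicitly. It assumes Python 3, where a "\uD800\uDF02" literal is two lone surrogates rather than one character, and `emit` is a hypothetical helper, not a library function:

```python
import sys

# The character from the thread as a single code point (U+10302).
ch = "\U00010302"

# The same character written as two surrogate escapes.  In Python 3 this
# is a two-element string of lone surrogates, not one character.
pair = "\ud800\udf02"

utf8 = ch.encode("utf-8")         # four bytes: f0 90 8c 82
utf16 = ch.encode("utf-16-be")    # the surrogate pair: d8 00 df 02

# Round-tripping the lone surrogates through UTF-16 joins them into the
# single supplementary character; "surrogatepass" permits the encode step.
joined = pair.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


def emit(text, stream=None):
    """Write text as UTF-8 bytes, independent of the locale or of whether
    the output is a terminal or a pipe (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(text.encode("utf-8") + b"\n")
```

Calling `emit(ch)` from a script piped into hexdump -C should show the four UTF-8 bytes followed by a newline, whereas `emit(pair)` raises UnicodeEncodeError because lone surrogates are not valid in UTF-8.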