Re: Unicode string literals versus the world
From | Andrew Dunstan |
---|---|
Subject | Re: Unicode string literals versus the world |
Date | |
Msg-id | 49E75003.9050003@dunslane.net |
In response to | Re: Unicode string literals versus the world (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
>> I'd never heard of UTF-16 surrogate pairs before this discussion and
>> hence didn't realise that it's valid to have a surrogate pair in place
>> of a single code point. The docs say that <D800 DF02> corresponds to
>> U+10302, Python would appear to follow my intuitions in that:
>>
>>   ord(u'\uD800\uDF02')
>>
>> results in an error instead of giving back 66306, as I'd expect. Is
>> this a bug in Python, my understanding, or something else?
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2. We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.

You mustn't encode the surrogate, but it's up to us how we allow people
to designate a given code point. Frankly, I think we shouldn't provide
for using surrogates at all. I would prefer something like \uXXXX for
BMP items and \UXXXXXXXX as the straight 32bit designation of a higher
codepoint.

cheers

andrew
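For readers following the arithmetic in this thread, here is a minimal sketch, assuming Python 3 (the quoted snippet used a Python 2 `u''` literal). It shows how the pair <D800 DF02> maps to U+10302, how the same code point can be written directly with a \UXXXXXXXX escape, and that a lone surrogate is refused when encoding to UTF-8. The helper names `combine_surrogates` and `utf8_rejects_surrogate` are made up for illustration and are not part of Python or PostgreSQL.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_rejects_surrogate(unit: int) -> bool:
    """Return True if encoding the lone surrogate to UTF-8 raises an error."""
    try:
        chr(unit).encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# The pair discussed in the thread.
code_point = combine_surrogates(0xD800, 0xDF02)

# The same code point designated directly, in the \UXXXXXXXX style.
direct = ord("\U00010302")

print(hex(code_point), code_point, direct)
print(utf8_rejects_surrogate(0xD800))
```

Writing the code point directly avoids the question of pairing surrogates at all, which is the attraction of the \uXXXX plus \UXXXXXXXX scheme proposed above.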