Re: String encoding during connection "handshake"
From | Trevor Talbot |
---|---|
Subject | Re: String encoding during connection "handshake" |
Date | |
Msg-id | 90bce5730711281114p4d8720aeke7c69ee152c8d44e@mail.gmail.com |
In response to | Re: String encoding during connection "handshake" (sulfinu@gmail.com) |
List | pgsql-hackers |
On 11/28/07, sulfinu@gmail.com <sulfinu@gmail.com> wrote:
> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used. The only definition that stands up is that a "char"
> is a byte. Its interpretation is unsure and unsafe (see my original problem).

It's not really that simple. Java, for instance, does not actually support Unicode characters / codepoints at the base level; it merely deals in UTF-16 code units. (The critical difference is in surrogate pairs.) You're still stuck dealing with a specific encoding even in many modern languages.

PostgreSQL's encoding support is not just about languages though, it's also about client convenience. It could simply choose a single encoding and parrot data to and from the client, but it also does on-the-fly conversion when a client requests it. It's a very useful feature, and many mature networked applications support similar things. An easy example is the World Wide Web itself.

> I implied that a cluster should have a single encoding that covers the whole
> Unicode set. That would certainly satisfy everybody.

Note that it might not. Unicode does not encode *every* character, and in some cases there is no round-trip mapping between it and other character sets. The result could be a loss of semantic data. I suspect it actually would satisfy everyone in PostgreSQL's case, but it's not something you can assume without checking.

> > This has nothing to do with C by the way. C has many features that
> > allow you to work with different encodings. It just doesn't force you
> > to use any particular one.
> Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
> not an ASCII-only user ;)

For a networked application, you're stuck worrying about the encoding regardless of language. UTF-8 is the most common Internet transport, for instance, but that's not the native internal encoding used by Java and most other Unicode processing platforms to date. That's fairly simple since it's still only a single character set, but if your application domain predates Unicode, you can't avoid dealing with the legacy encodings at some level anyway.

As I implied earlier, I do think it would be worthwhile for PostgreSQL to move toward handling it better, so I'm not saying this is a bad idea. It's just that it's a much more complex topic than it might seem at first glance.

I'm glad you got something working for you.
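
To make the code unit versus code point distinction concrete, here is a minimal sketch in Python (not Java, and not anything from PostgreSQL). It counts how many UTF-16 code units a character outside the Basic Multilingual Plane needs, which is the quantity a UTF-16 based string type exposes, and it checks whether text survives conversion to a narrower encoding. The helper names `utf16_code_units` and `round_trips` and the sample characters are my own assumptions, chosen only for illustration.

```python
# Illustrative sketch only: helper names and sample characters are assumptions.


def utf16_code_units(text: str) -> int:
    """Return how many 16-bit UTF-16 code units are needed to store text."""
    # UTF-16-LE emits exactly two bytes per code unit and no byte order mark.
    return len(text.encode("utf-16-le")) // 2


def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives conversion to `encoding` and back."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The target encoding has no representation for some character.
        return False


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# One code point, but a surrogate pair (two code units) in UTF-16.
print("code points:      ", len(clef))
print("UTF-16 code units:", utf16_code_units(clef))

# The same character on a UTF-8 transport uses a different byte layout.
print("UTF-8 bytes:      ", len(clef.encode("utf-8")))

# Conversion to a legacy single-byte encoding works only for some text.
print("e-acute via latin-1:", round_trips("\u00e9", "latin-1"))
print("euro via latin-1:   ", round_trips("\u20ac", "latin-1"))
```

The last two lines only show the simpler failure, where the target encoding lacks a character entirely; the lossy round trips between Unicode and some legacy character sets mentioned above are a subtler case that this sketch does not reproduce.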