Re: BUG #5532: Valid UTF8 sequence errors as invalid
From | Tom Lane |
---|---|
Subject | Re: BUG #5532: Valid UTF8 sequence errors as invalid |
Date | |
Msg-id | 12210.1277916285@sss.pgh.pa.us |
In response to | BUG #5532: Valid UTF8 sequence errors as invalid ("Michael Lewis" <mikelikespie@gmail.com>) |
Responses | Re: BUG #5532: Valid UTF8 sequence errors as invalid |
List | pgsql-bugs |
"Michael Lewis" <mikelikespie@gmail.com> writes: > I'm using Python to sanitize my logs from invalid UTF8 characters before > COPYing them into postgres. I came across this one sequence that seems to > be valid UTF8 (in the extended range I believe). It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequence beginning with ED must have a second byte in the range 80-9F to be legal, and this doesn't. The example you give would decode as U+DF2D, ie part of a surrogate pair, which is specifically disallowed in UTF8 --- you're supposed to code the original character directly, not via a surrogate pair. The primary reason for this rule is that otherwise there are multiple ways to encode the same character, which can be a security hazard. > It goes through both pythons encoding as well as iconv without an error You should file bugs against those tools. regards, tom lane