Bug in COPY FROM backslash escaping multi-byte chars
From | Heikki Linnakangas |
---|---|
Subject | Bug in COPY FROM backslash escaping multi-byte chars |
Date | |
Msg-id | a897f84f-8dca-8798-3139-07da5bb38728@iki.fi |
Responses | Re: Bug in COPY FROM backslash escaping multi-byte chars |
List | pgsql-hackers |
Hi,

While playing with COPY FROM refactorings in another thread, I noticed a corner case where I think backslash escaping doesn't work correctly. Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string contains a sequence of: a backslash, a multi-byte character and a dot.

The documentation says:

> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it usually does.

However, let's consider the same string in Big5 encoding (in hex escaped format):

\x5ca45c2e666f6f

The first byte, 0x5c, is the backslash. The multi-byte character consists of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash. That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR:  end-of-copy marker corrupt
CONTEXT:  COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead at the next byte, and when it's not a dot, it skips over it:

> else if (!cstate->opts.csv_mode)
>
>     /*
>      * If we are here, it means we found a backslash followed by
>      * something other than a period.  In non-CSV mode, anything
>      * after a backslash is special, so we skip over that second
>      * character too.  If we didn't do that \\. would be
>      * considered an eof-of copy, while in non-CSV mode it is a
>      * literal backslash followed by a period.  In CSV mode,
>      * backslashes are not special, so we want to process the
>      * character after the backslash just like a normal character,
>      * so we don't increment in those cases.
>      */
>     raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ASCII characters, it should skip over the next *character*, not the next byte. With the byte-wise skip, the parser lands on the trailing 0x5c of the Big5 character, treats it as a new backslash, and then misreads the following dot as the start of an end-of-copy marker.

Attached is a pretty straightforward patch to fix that. Anyone see a problem with this?

- Heikki
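The difference between skipping a byte and skipping a character after a backslash can be sketched in standalone C as follows. This is not the attached patch and does not use PostgreSQL's own functions: `big5_char_len` and `sees_end_marker` are hypothetical names made up for illustration, and the sketch assumes that in Big5 any byte with the high bit set starts a two-byte character.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PostgreSQL function): length in bytes of the
 * Big5 character starting at s, assuming a high-bit byte begins a two-byte
 * character and everything else is a single byte.
 */
static size_t
big5_char_len(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Hypothetical scanner: returns 1 if a backslash immediately followed by a
 * dot is found at a position the scanner treats as the start of a character,
 * 0 otherwise.
 *
 * skip_whole_char == 0: after a backslash, skip one byte (the behaviour
 *                       described in the message above).
 * skip_whole_char != 0: after a backslash, skip one whole character.
 */
static int
sees_end_marker(const unsigned char *buf, size_t len, int skip_whole_char)
{
    size_t i = 0;

    while (i < len)
    {
        if (buf[i] == '\\')
        {
            if (i + 1 < len && buf[i + 1] == '.')
                return 1;       /* looks like an end-of-copy marker */

            i++;                /* step past the backslash itself */
            if (i < len)
                i += skip_whole_char ? big5_char_len(&buf[i]) : 1;
        }
        else
        {
            /* unescaped data is always consumed one character at a time */
            i += big5_char_len(&buf[i]);
        }
    }
    return 0;
}
```

Fed the seven bytes from the example (5c a4 5c 2e 66 6f 6f), the byte-wise variant stops on the second 0x5c and reports a marker, while the character-wise variant steps over both bytes of the Big5 character and reports none.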