Hi,
While playing with COPY FROM refactorings in another thread, I noticed a
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:
\么.foo
I hope that came through in this email correctly as UTF-8. The string
contains the sequence: a backslash, a multibyte character, and a dot.
The documentation says:
> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters
So I believe escaping multi-byte characters is supposed to work, and it
usually does.
However, let's consider the same string in Big5 encoding (in hex-escaped
format):
\x5ca45c2e666f6f
The first byte, 0x5c, is the backslash. The multi-byte character consists
of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.
That confuses the parser in CopyReadLineText, so that you get an error:
postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR: end-of-copy marker corrupt
CONTEXT: COPY copytest, line 1
What happens is that when the parser sees the backslash, it looks ahead
at the next byte, and when it's not a dot, it skips over it:
> else if (!cstate->opts.csv_mode)
>
> /*
> * If we are here, it means we found a backslash followed by
> * something other than a period. In non-CSV mode, anything
> * after a backslash is special, so we skip over that second
> * character too. If we didn't do that \\. would be
> * considered an eof-of copy, while in non-CSV mode it is a
> * literal backslash followed by a period. In CSV mode,
> * backslashes are not special, so we want to process the
> * character after the backslash just like a normal character,
> * so we don't increment in those cases.
> */
> raw_buf_ptr++;
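However, in a multi-byte encoding that might "embed" ASCII characters,
it should skip over the next *character*, not byte. Here only 0xa4 gets
skipped, so the trailing 0x5c is then taken as a new backslash, and
together with the dot that follows it looks like an end-of-copy marker
followed by garbage.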
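To illustrate the difference, here is a minimal standalone sketch. It is
not the attached patch and not the real CopyReadLineText code; the names
big5_sample, big5_char_len and sees_backslash_dot are made up for this
example, and the Big5 length rule (high bit set means a two-byte
character) is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The Big5 bytes from the example above: backslash, 0xa4 0x5c, dot, "foo". */
static const unsigned char big5_sample[] =
    {0x5c, 0xa4, 0x5c, 0x2e, 0x66, 0x6f, 0x6f};

/*
 * Hypothetical helper. Assumption: in Big5, a byte with the high bit set
 * starts a two-byte character; anything else is a single byte.
 */
static size_t
big5_char_len(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Simplified scan, loosely modelled on the backslash handling quoted
 * above. Returns true if the scanner believes it has seen "\." in the
 * data. With skip_whole_char = false, only one byte is skipped after a
 * backslash; with true, the whole escaped character is skipped.
 */
static bool
sees_backslash_dot(const unsigned char *buf, size_t len, bool skip_whole_char)
{
    size_t      i = 0;

    assert(buf != NULL);

    while (i < len)
    {
        unsigned char c = buf[i];

        if (c == '\\' && i + 1 < len)
        {
            if (buf[i + 1] == '.')
                return true;
            /* Skip the backslash plus the escaped byte or character. */
            i += 1 + (skip_whole_char ? big5_char_len(&buf[i + 1]) : 1);
            continue;
        }
        /* Unescaped data: step over the whole character. */
        i += big5_char_len(&buf[i]);
    }
    return false;
}
```

With the sample bytes, sees_backslash_dot(big5_sample,
sizeof(big5_sample), false) returns true, because the second byte of the
Big5 character is mistaken for a backslash in front of the dot. Passing
true for the last argument returns false, since the whole two-byte
character is stepped over.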
Attached is a pretty straightforward patch to fix that. Anyone see a
problem with this?
- Heikki