Bug in COPY FROM backslash escaping multi-byte chars

From: Heikki Linnakangas
Subject: Bug in COPY FROM backslash escaping multi-byte chars
Msg-id: a897f84f-8dca-8798-3139-07da5bb38728@iki.fi
Replies: Re: Bug in COPY FROM backslash escaping multi-byte chars  (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers

Hi,

While playing with COPY FROM refactorings in another thread, I noticed a 
corner case where I think backslash escaping doesn't work correctly. 
Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string 
consists of a backslash, a multi-byte character, and a dot.

The documentation says:

> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it 
usually does.

However, let's consider the same string in Big5 encoding (in hex-escaped 
format):

\x5ca45c2e666f6f

The first byte, 0x5c, is the backslash. The multi-byte character consists 
of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.
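
To make the byte layout concrete, here is a minimal sketch that walks a 
buffer character by character and counts 0x5c bytes that are really the 
trailing byte of a two-byte character. The helper names (big5_char_len, 
count_embedded_backslashes) are hypothetical and not PostgreSQL functions, 
and the length rule is a simplifying assumption: any byte with the high 
bit set is treated as the start of a two-byte character.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a PostgreSQL function: byte length of the
 * character starting at s.  Simplifying assumption: a byte with the
 * high bit set starts a two-byte character, anything else is one byte.
 */
static int
big5_char_len(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Count bytes equal to 0x5c (ASCII backslash) that are the second byte
 * of a two-byte character, i.e. "embedded" backslashes.
 */
static int
count_embedded_backslashes(const unsigned char *buf, size_t len)
{
    int     count = 0;
    size_t  i = 0;

    while (i < len)
    {
        int clen = big5_char_len(buf + i);

        if (clen == 2 && i + 1 < len && buf[i + 1] == 0x5c)
            count++;
        i += clen;
    }
    return count;
}
```

For the seven bytes 5c a4 5c 2e 66 6f 6f, the first 0x5c is a real 
backslash and the second one is counted as embedded.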

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR:  end-of-copy marker corrupt
CONTEXT:  COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead 
at the next byte, and when it's not a dot, it skips over it:

>             else if (!cstate->opts.csv_mode)
> 
>                 /*
>                  * If we are here, it means we found a backslash followed by
>                  * something other than a period.  In non-CSV mode, anything
>                  * after a backslash is special, so we skip over that second
>                  * character too.  If we didn't do that \\. would be
>                  * considered an eof-of copy, while in non-CSV mode it is a
>                  * literal backslash followed by a period.  In CSV mode,
>                  * backslashes are not special, so we want to process the
>                  * character after the backslash just like a normal character,
>                  * so we don't increment in those cases.
>                  */
>                 raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ASCII characters, 
it should skip over the next *character*, not byte.
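
The difference between the two skipping strategies can be sketched with a 
toy scanner. This is not the attached patch and not the real 
CopyReadLineText logic; sketch_mblen and finds_backslash_dot are 
hypothetical names, and the character-length rule (high bit set means a 
two-byte character) is a simplifying assumption. The scanner only reports 
whether it ever sees a backslash immediately followed by a dot.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: assume a high-bit byte starts a 2-byte character. */
static int
sketch_mblen(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Toy scanner: returns true if a backslash directly followed by '.' is
 * seen.  After a backslash followed by something else, it skips either
 * one byte (skip_whole_char = false) or one whole character
 * (skip_whole_char = true).
 */
static bool
finds_backslash_dot(const unsigned char *buf, size_t len, bool skip_whole_char)
{
    size_t i = 0;

    while (i < len)
    {
        if (buf[i] == '\\')
        {
            if (i + 1 >= len)
                break;          /* trailing backslash, nothing to look at */
            if (buf[i + 1] == '.')
                return true;    /* looks like an end-of-copy marker */
            /* skip the backslash plus the escaped byte or character */
            i += 1 + (skip_whole_char ? sketch_mblen(buf + i + 1) : 1);
        }
        else
            i += sketch_mblen(buf + i);
    }
    return false;
}
```

With the bytes 5c a4 5c 2e 66 6f 6f, the byte-wise variant lands on the 
embedded 0x5c, sees the dot after it, and reports a marker; the 
character-wise variant skips both bytes of the character and reports none.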

Attached is a pretty straightforward patch to fix that. Anyone see a 
problem with this?

- Heikki
