Обсуждение: Bug in COPY FROM backslash escaping multi-byte chars
Hi, While playing with COPY FROM refactorings in another thread, I noticed corner case where I think backslash escaping doesn't work correctly. Consider the following input: \么.foo I hope that came through in this email correctly as UTF-8. The string contains a sequence of: backslash, multibyte-character and a dot. The documentation says: > Backslash characters (\) can be used in the COPY data to quote data > characters that might otherwise be taken as row or column delimiters So I believe escaping multi-byte characters is supposed to work, and it usually does. However, let's consider the same string in Big5 encoding (in hex escaped format): \x5ca45c2e666f6f The first byte 0x5c, is the backslash. The multi-byte character consists of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash. That confuses the parser in CopyReadLineText, so that you get an error: postgres=# create table copytest (t text); CREATE TABLE postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5'); ERROR: end-of-copy marker corrupt CONTEXT: COPY copytest, line 1 What happens is that when the parser sees the backslash, it looks ahead at the next byte, and when it's not a dot, it skips over it: > else if (!cstate->opts.csv_mode) > > /* > * If we are here, it means we found a backslash followed by > * something other than a period. In non-CSV mode, anything > * after a backslash is special, so we skip over that second > * character too. If we didn't do that \\. would be > * considered an eof-of copy, while in non-CSV mode it is a > * literal backslash followed by a period. In CSV mode, > * backslashes are not special, so we want to process the > * character after the backslash just like a normal character, > * so we don't increment in those cases. > */ > raw_buf_ptr++; However, in a multi-byte encoding that might "embed" ascii characters, it should skip over the next *character*, not byte. Attached is a pretty straightforward patch to fix that. Anyone see a problem with this? - Heikki
Вложения
On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Hi,
>
> While playing with COPY FROM refactorings in another thread, I noticed
> corner case where I think backslash escaping doesn't work correctly.
> Consider the following input:
>
> \么.foo
I've seen multibyte delimiters in the wild, so it's not as outlandish as it seems. The fix is simple enough, so +1.
--
John Naylor
EDB: http://www.enterprisedb.com
On 03/02/2021 15:38, John Naylor wrote: > On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi > <mailto:hlinnaka@iki.fi>> wrote: > > > > Hi, > > > > While playing with COPY FROM refactorings in another thread, I noticed > > corner case where I think backslash escaping doesn't work correctly. > > Consider the following input: > > > > \么.foo > > I've seen multibyte delimiters in the wild, so it's not as outlandish as > it seems. We don't actually support multi-byte characters as delimiters or quote or escape characters: postgres=# copy copytest from 'foo' with (delimiter '么'); ERROR: COPY delimiter must be a single one-byte character > The fix is simple enough, so +1. Thanks, I'll commit and backpatch shortly. - Heikki
At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 03/02/2021 15:38, John Naylor wrote: > > On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi > > <mailto:hlinnaka@iki.fi>> wrote: > > > > > > Hi, > > > > > > While playing with COPY FROM refactorings in another thread, I noticed > > > corner case where I think backslash escaping doesn't work correctly. > > > Consider the following input: > > > > > > \么.foo > > I've seen multibyte delimiters in the wild, so it's not as outlandish > > as it seems. > > We don't actually support multi-byte characters as delimiters or quote > or escape characters: > > postgres=# copy copytest from 'foo' with (delimiter '么'); > ERROR: COPY delimiter must be a single one-byte character > > > The fix is simple enough, so +1. > > Thanks, I'll commit and backpatch shortly. I'm not sure the assumption in the second hunk always holds, but that's fine at least with Shift-JIS and -2004 since they are two-byte encoding. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 04/02/2021 03:50, Kyotaro Horiguchi wrote: > At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in >> On 03/02/2021 15:38, John Naylor wrote: >>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi >>> <mailto:hlinnaka@iki.fi>> wrote: >>> > >>> > Hi, >>> > >>> > While playing with COPY FROM refactorings in another thread, I noticed >>> > corner case where I think backslash escaping doesn't work correctly. >>> > Consider the following input: >>> > >>> > \么.foo >>> I've seen multibyte delimiters in the wild, so it's not as outlandish >>> as it seems. >> >> We don't actually support multi-byte characters as delimiters or quote >> or escape characters: >> >> postgres=# copy copytest from 'foo' with (delimiter '么'); >> ERROR: COPY delimiter must be a single one-byte character >> >>> The fix is simple enough, so +1. >> >> Thanks, I'll commit and backpatch shortly. > > I'm not sure the assumption in the second hunk always holds, but > that's fine at least with Shift-JIS and -2004 since they are two-byte > encoding. The assumption is that a multi-byte character cannot have a special meaning, as far as the loop in CopyReadLineText is concerned. The characters with special meaning are '\\', '\n' and '\r'. That hold regardless of encoding. Thinking about this a bit more, I think the attached patch is slightly better. Normally in the loop, raw_buf_ptr points to the next byte to consume, and 'c' is the last consumed byte. At the end of the loop, we check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd and 4th byte if necessary. The crux of the bug is that after the "raw_buf_ptr++;" to skip the character after the backslash, we left c to '\\', even though we already consumed the first byte of the next character. Because of that, the end-of-the-loop check didn't correctly treat it as a multi-byte character. So a more straightforward fix is to set 'c' to the byte we skipped over. - Heikki