Обсуждение: Bug in COPY FROM backslash escaping multi-byte chars

Поиск
Список
Период
Сортировка

Bug in COPY FROM backslash escaping multi-byte chars

От
Heikki Linnakangas
Дата:
Hi,

While playing with COPY FROM refactorings in another thread, I noticed 
corner case where I think backslash escaping doesn't work correctly. 
Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string 
contains a sequence of: backslash, multibyte-character and a dot.

The documentation says:

> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it 
usually does.

However, let's consider the same string in Big5 encoding (in hex escaped 
format):

\x5ca45c2e666f6f

The first byte 0x5c, is the backslash. The multi-byte character consists 
of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR:  end-of-copy marker corrupt
CONTEXT:  COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead 
at the next byte, and when it's not a dot, it skips over it:

>             else if (!cstate->opts.csv_mode)
> 
>                 /*
>                  * If we are here, it means we found a backslash followed by
>                  * something other than a period.  In non-CSV mode, anything
>                  * after a backslash is special, so we skip over that second
>                  * character too.  If we didn't do that \\. would be
>                  * considered an eof-of copy, while in non-CSV mode it is a
>                  * literal backslash followed by a period.  In CSV mode,
>                  * backslashes are not special, so we want to process the
>                  * character after the backslash just like a normal character,
>                  * so we don't increment in those cases.
>                  */
>                 raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ascii characters, 
it should skip over the next *character*, not byte.

Attached is a pretty straightforward patch to fix that. Anyone see a 
problem with this?

- Heikki

Вложения

Re: Bug in COPY FROM backslash escaping multi-byte chars

От
John Naylor
Дата:

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Hi,
>
> While playing with COPY FROM refactorings in another thread, I noticed
> corner case where I think backslash escaping doesn't work correctly.
> Consider the following input:
>
> \么.foo

I've seen multibyte delimiters in the wild, so it's not as outlandish as it seems. The fix is simple enough, so +1.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Bug in COPY FROM backslash escaping multi-byte chars

От
Heikki Linnakangas
Дата:
On 03/02/2021 15:38, John Naylor wrote:
> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>  >
>  > Hi,
>  >
>  > While playing with COPY FROM refactorings in another thread, I noticed
>  > corner case where I think backslash escaping doesn't work correctly.
>  > Consider the following input:
>  >
>  > \么.foo
> 
> I've seen multibyte delimiters in the wild, so it's not as outlandish as 
> it seems.

We don't actually support multi-byte characters as delimiters or quote 
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR:  COPY delimiter must be a single one-byte character

> The fix is simple enough, so +1.

Thanks, I'll commit and backpatch shortly.

- Heikki



Re: Bug in COPY FROM backslash escaping multi-byte chars

От
Kyotaro Horiguchi
Дата:
At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 03/02/2021 15:38, John Naylor wrote:
> > On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
> > <mailto:hlinnaka@iki.fi>> wrote:
> >  >
> >  > Hi,
> >  >
> >  > While playing with COPY FROM refactorings in another thread, I noticed
> >  > corner case where I think backslash escaping doesn't work correctly.
> >  > Consider the following input:
> >  >
> >  > \么.foo
> > I've seen multibyte delimiters in the wild, so it's not as outlandish
> > as it seems.
> 
> We don't actually support multi-byte characters as delimiters or quote
> or escape characters:
> 
> postgres=# copy copytest from 'foo' with (delimiter '么');
> ERROR:  COPY delimiter must be a single one-byte character
> 
> > The fix is simple enough, so +1.
> 
> Thanks, I'll commit and backpatch shortly.

I'm not sure the assumption in the second hunk always holds, but
that's fine at least with Shift-JIS and -2004 since they are two-byte
encoding.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Bug in COPY FROM backslash escaping multi-byte chars

От
Heikki Linnakangas
Дата:
On 04/02/2021 03:50, Kyotaro Horiguchi wrote:
> At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
>> On 03/02/2021 15:38, John Naylor wrote:
>>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
>>> <mailto:hlinnaka@iki.fi>> wrote:
>>>   >
>>>   > Hi,
>>>   >
>>>   > While playing with COPY FROM refactorings in another thread, I noticed
>>>   > corner case where I think backslash escaping doesn't work correctly.
>>>   > Consider the following input:
>>>   >
>>>   > \么.foo
>>> I've seen multibyte delimiters in the wild, so it's not as outlandish
>>> as it seems.
>>
>> We don't actually support multi-byte characters as delimiters or quote
>> or escape characters:
>>
>> postgres=# copy copytest from 'foo' with (delimiter '么');
>> ERROR:  COPY delimiter must be a single one-byte character
>>
>>> The fix is simple enough, so +1.
>>
>> Thanks, I'll commit and backpatch shortly.
> 
> I'm not sure the assumption in the second hunk always holds, but
> that's fine at least with Shift-JIS and -2004 since they are two-byte
> encoding.

The assumption is that a multi-byte character cannot have a special 
meaning, as far as the loop in CopyReadLineText is concerned. The 
characters with special meaning are '\\', '\n' and '\r'. That hold 
regardless of encoding.

Thinking about this a bit more, I think the attached patch is slightly 
better. Normally in the loop, raw_buf_ptr points to the next byte to 
consume, and 'c' is the last consumed byte. At the end of the loop, we 
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd 
and 4th byte if necessary. The crux of the bug is that after the 
"raw_buf_ptr++;" to skip the character after the backslash, we left c to 
'\\', even though we already consumed the first byte of the next 
character. Because of that, the end-of-the-loop check didn't correctly 
treat it as a multi-byte character. So a more straightforward fix is to 
set 'c' to the byte we skipped over.

- Heikki

Вложения