Re: Bug in COPY FROM backslash escaping multi-byte chars
От | Heikki Linnakangas |
---|---|
Тема | Re: Bug in COPY FROM backslash escaping multi-byte chars |
Дата | |
Msg-id | c78113ac-8c25-5892-1269-d174d11e3a9f@iki.fi обсуждение исходный текст |
Ответ на | Re: Bug in COPY FROM backslash escaping multi-byte chars (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Список | pgsql-hackers |
On 04/02/2021 03:50, Kyotaro Horiguchi wrote: > At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in >> On 03/02/2021 15:38, John Naylor wrote: >>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi >>> <mailto:hlinnaka@iki.fi>> wrote: >>> > >>> > Hi, >>> > >>> > While playing with COPY FROM refactorings in another thread, I noticed >>> > corner case where I think backslash escaping doesn't work correctly. >>> > Consider the following input: >>> > >>> > \么.foo >>> I've seen multibyte delimiters in the wild, so it's not as outlandish >>> as it seems. >> >> We don't actually support multi-byte characters as delimiters or quote >> or escape characters: >> >> postgres=# copy copytest from 'foo' with (delimiter '么'); >> ERROR: COPY delimiter must be a single one-byte character >> >>> The fix is simple enough, so +1. >> >> Thanks, I'll commit and backpatch shortly. > > I'm not sure the assumption in the second hunk always holds, but > that's fine at least with Shift-JIS and -2004 since they are two-byte > encoding. The assumption is that a multi-byte character cannot have a special meaning, as far as the loop in CopyReadLineText is concerned. The characters with special meaning are '\\', '\n' and '\r'. That hold regardless of encoding. Thinking about this a bit more, I think the attached patch is slightly better. Normally in the loop, raw_buf_ptr points to the next byte to consume, and 'c' is the last consumed byte. At the end of the loop, we check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd and 4th byte if necessary. The crux of the bug is that after the "raw_buf_ptr++;" to skip the character after the backslash, we left c to '\\', even though we already consumed the first byte of the next character. Because of that, the end-of-the-loop check didn't correctly treat it as a multi-byte character. So a more straightforward fix is to set 'c' to the byte we skipped over. - Heikki
Вложения
В списке pgsql-hackers по дате отправления: