Idea for making COPY data Microsoft-proof
От | Tom Lane |
---|---|
Тема | Idea for making COPY data Microsoft-proof |
Дата | |
Msg-id | 25767.1013394173@sss.pgh.pa.us обсуждение исходный текст |
Ответы |
Re: Idea for making COPY data Microsoft-proof
|
Список | pgsql-hackers |
Okay, I think I've seen one too many reports of corrupted pg_dump data caused by Microslothish newline-to-CR/LF translation. We need to find a way to make COPY data proof against that brain damage. The best idea that comes to mind is as follows: 1. During COPY OUT, convert carriage returns and newlines in data being copied out to the two-character sequences \ r and \ n. (This will already work properly with the existing COPY IN logic, since it recognizes those escape sequences.) 2. During COPY IN, discard \r if it appears just before \n. (This must occur before de-backslashing, so that the sequence \ \r \n will be interpreted as a quoted newline. That way, an existing dump file that represents a data newline as \ \n will be correctly read even if it's been Microsoft-munged to \ \r \n.) The second part of this would have a small backwards-compatibility problem: when reading an old dump file (one made by COPY OUT before the above change) it would be possible for it to discard a \r that is legitimately part of the data. That would only happen if the \r was the last real data character on the line. Notice that if the \r precedes a data \n character, the \n will be represented as \ \n by existing COPY OUT, or by \ n by COPY OUT with the above change; either way, the \r is followed by a backslash and will not be dropped. So this change to COPY IN would not cause it to fail with old data values containing \r\n sequences, only with those ending with a bare \r. We could provide a workaround for that case by offering a SET variable that enables or disables discarding of \r, though I'd want it to default to ON. If this seems like a reasonable approach, then I'd like to apply the COPY OUT part of the change immediately (ie, for 7.2.1). Converting \r and \n to \ r and \ n will not hurt anyone in either forward or backward direction, and if we do that then dumps made with 7.2.1 or later will work correctly with 7.3 regardless of how one sets the discard-\r variable. A stronger change would make COPY IN regard \n, \r \n, or \r as equivalent representations of newline. If we do this then we'll also prevent Unix-to-Mac newline conversions from breaking COPY data (Mac represents newlines by just \r). However, in this approach any old-style data \r would break, not only those at end of data line. So the SET variable to revert to the old COPY IN behavior would be needed in many more cases. Comments? Anyone see a better way? regards, tom lane
В списке pgsql-hackers по дате отправления: