Re: Perform COPY FROM encoding conversions in larger chunks
От | Heikki Linnakangas |
---|---|
Тема | Re: Perform COPY FROM encoding conversions in larger chunks |
Дата | |
Msg-id | 11d39e63-b80a-5f8d-8043-fff04201fadc@iki.fi обсуждение исходный текст |
Ответ на | Re: Perform COPY FROM encoding conversions in larger chunks (John Naylor <john.naylor@enterprisedb.com>) |
Ответы |
Re: Perform COPY FROM encoding conversions in larger chunks
|
Список | pgsql-hackers |
On 30/01/2021 20:47, John Naylor wrote: > I took a look at v2, and for the first encoding I tried, it fails to > report the error for invalid input: That's embarassing... > I believe the problem is in UtfToLocal(). I've attached a fix formatted > as a text file to avoid confusing the cfbot. The fix keeps the debugging > ereport() in case you find it useful. Thanks. I fixed it slightly differently, and also changed LocalToUtf() to follow the same pattern, even though LocalToUtf() did not have the same bug. > Some additional test coverage might be good here, but not sure how > much work that would be. I didn't check any other conversions yet. I added a bunch of tests for various built-in conversions. > v2-0002 seems fine to me, I just have cosmetic comments here: > > + * the same, no conversion is required by we must still validate that the > > s/by/but/ > > This comment in copyfrom_internal.h above the *StateData struct is the > same as the corresponding one in copyto.c: > > * Multi-byte encodings: all supported client-side encodings encode > multi-byte > * characters by having the first byte's high bit set. Subsequent bytes > of the > * character can have the high bit not set. When scanning data in such an > * encoding to look for a match to a single-byte (ie ASCII) character, > we must > * use the full pg_encoding_mblen() machinery to skip over multibyte > * characters, else we might find a false match to a trailing byte. In > * supported server encodings, there is no possibility of a false > match, and > * it's faster to make useless comparisons to trailing bytes than it is to > * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii > is true > * when we have to do it the hard way. > > The references to pg_encoding_mblen() and encoding_embeds_ascii, are out > of date for copy-from. I'm not sure the rest is relevant to copy-from > anymore, either. Can you confirm? Yeah, that comment is obsolete for COPY FROM, the encoding conversion works differently now. Removed it from copyfrom_internal.h. > This comment inside the struct is now out of date as well: > > * Similarly, line_buf holds the whole input line being processed. The > * input cycle is first to read the whole line into line_buf, convert it > * to server encoding there, and then extract the individual attribute > > HEAD has this macro already: > > /* Shorthand for number of unconsumed bytes available in raw_buf */ > #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - > (cstate)->raw_buf_index) > > It might make sense to create a CONVERSION_BUF_BYTES equivalent since > the patch calculates cstate->conversion_buf_len - > cstate->conversion_buf_index in a couple places. Thanks for the review! I spent some time refactoring and adding comments all around the patch, hopefully making it all more clear. One notable difference is that I renamed 'raw_buf' (which exists in master too) to 'input_buf', and renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this patch again another day with fresh eyes, and also try to add some tests for the corner cases at buffer boundaries. Attached is a new set of patches. I added some regression tests for the built-in conversion functions, which cover the bug you found, and many other interesting cases that did not have test coverage yet. It comes in two patches: the first patch uses just the existing convert_from() SQL function, and the second one uses the new "noError" variants of the conversion functions. I also kept the bug-fixes compared to the previous patch version as a separate commit, for easier review. - Heikki
Вложения
- v3-0001-Add-regression-tests-for-built-in-encoding-conver.patch
- v3-0002-Change-conversion-function-signature.patch
- v3-0003-Add-tests-for-the-new-noError-variants-of-built-i.patch
- v3-0004-Fix-bugs-in-the-commit-to-change-conversion-funct.patch
- v3-0005-Do-COPY-FROM-encoding-conversion-verification-in-.patch
В списке pgsql-hackers по дате отправления: