Re: Perform COPY FROM encoding conversions in larger chunks
От | Heikki Linnakangas |
---|---|
Тема | Re: Perform COPY FROM encoding conversions in larger chunks |
Дата | |
Msg-id | b9e3167f-f84b-7aa4-5738-be578a4db924@iki.fi обсуждение исходный текст |
Ответ на | Perform COPY FROM encoding conversions in larger chunks (Heikki Linnakangas <hlinnaka@iki.fi>) |
Список | pgsql-hackers |
One of the patches in this patch set is worth calling out separately: 0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per commit message: Add direct conversion routines between EUC_TW and Big5. Conversions between EUC_TW and Big5 were previously implemented by converting the whole input to MIC first, and then from MIC to the target encoding. Implement functions to convert directly between the two. The reason to do this now is that the next patch will change the change the conversion function signature so that if the input is invalid, we convert as much as we can and return the number of bytes successfully converted. That's not possible if we use an intermediary format, because if an error happens in the intermediary -> final conversion, we lose track of the location of the invalid character in the original input. Avoiding the intermediate step should be faster too. This patch is fairly independent of the others. It could be reviewed and applied separately. In order to verify that the new code is correct, I wrote some helper plpgsql functions to generate all valid EUC_TW and Big5 byte sequences that encode one character, and tested converting each of them. Then I compared the the results with unpatched server, to check that the new code performs the same conversion. This is perhaps overkill, but since its pretty straightforward to enumerate all the input characters, might as well do it. For the sake of completeness, I wrote similar helpers for all the other encodings and conversions. Except for UTF-8, there are too many formally valid codepoints for that to feasible. This does test round-trip conversions of all codepoints from all the other encodings to UTF-8 and back, though, so there's pretty good coverage of UTF-8 too. This test suite is probably too large to add to the source tree, but for the sake of the archives, I'm attaching it here. The first patch adds the test suite, including the expected output of each conversion. The second patch contains expected output changes for the above patch to add direct conversions between EUC_TW and Big5. It affected the error messages for some byte sequences that cannot be converted. For example, on unpatched master: postgres=# select convert('\xfdcc', 'euc_tw', 'big5'); ERROR: character with byte sequence 0x95 0xfd 0xcc in encoding "MULE_INTERNAL" has no equivalent in encoding "BIG5" With the patch: postgres=# select convert('\xfdcc', 'euc_tw', 'big5'); ERROR: character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has no equivalent in encoding "BIG5" The old message talked about "MULE_INTERNAL" which exposes the implementation detail that we used it as an intermediate in the conversion. That can be confusing to a user, the new message makes more sense. So that's also nice. - Heikki
Вложения
В списке pgsql-hackers по дате отправления: