Discussion: Re: [HACKERS] UNICODE characters above 0x10000
Ahh, but that's not the case. You cannot just delete the check, since not all combinations of bytes are valid UTF8. The bytes FE and FF never appear in a UTF8 byte sequence, for instance.

UTF8 is more than two bytes, btw; up to 6 bytes are used to represent a UTF8 character. The 5 and 6 byte characters are currently not in use, though.

I didn't actually notice the difference in UTF8 width between my original patch and my last, so attached is an updated patch.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <john@geeknet.com.au> writes:
> My apologies for not reading the code properly.
> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also does bounds checks.

I think you missed my point. If we don't need this limitation, the correct patch is simply to delete the whole check (ie, delete lines 827-836 of wchar.c, and for that matter we'd then not need the encoding local variable). What's really at stake here is whether anything else breaks if we do that. What else, if anything, assumes that UTF characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level.

regards, tom lane
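The byte-structure points raised above (the lead byte fixes the sequence length, and FE and FF never occur) can be illustrated with a minimal sketch. The function name utf8_lead_len is hypothetical and is not PostgreSQL's pg_utf_mblen; it follows the original 6-byte UTF-8 scheme mentioned in the message, whereas RFC 3629 limits UTF-8 to 4 bytes.

```c
/*
 * Hypothetical helper, not PostgreSQL code: return the sequence length
 * implied by a UTF-8 lead byte, or 0 if the byte cannot start a sequence.
 * Lengths 5 and 6 exist only in the original UTF-8 definition; RFC 3629
 * restricts UTF-8 to at most 4 bytes.
 */
static int
utf8_lead_len(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if (b < 0xC0)
        return 0;               /* 10xxxxxx: continuation byte, not a lead */
    if (b < 0xE0)
        return 2;               /* 110xxxxx */
    if (b < 0xF0)
        return 3;               /* 1110xxxx */
    if (b < 0xF8)
        return 4;               /* 11110xxx */
    if (b < 0xFC)
        return 5;               /* 111110xx: original scheme only */
    if (b < 0xFE)
        return 6;               /* 1111110x: original scheme only */
    return 0;                   /* 0xFE and 0xFF never appear in UTF-8 */
}
```

Note that a length lookup of this kind says nothing about whether the following bytes are valid continuation bytes, which is the gap the thread is arguing about.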
"John Hansen" <john@geeknet.com.au> writes: > Ahh, but that's not the case. You cannot just delete the check, since > not all combinations of bytes are valid UTF8. UTF bytes FE & FF never > appear in a byte sequence for instance. Well, this is still working at the wrong level. The code that's in pg_verifymbstr is mainly intended to enforce the *system wide* assumption that multibyte characters must have the high bit set in every byte. (We do not support encodings without this property in the backend, because it breaks code that looks for ASCII characters ... such as the main parser/lexer ...) It's not really intended to check that the multibyte character is actually legal in its encoding. The "special UTF-8 check" was never more than a very quick-n-dirty hack that was in the wrong place to start with. We ought to be getting rid of it not institutionalizing it. If you want an exact encoding-specific check on the legitimacy of a multibyte sequence, I think the right way to do it is to add another function pointer to pg_wchar_table entries to let each encoding have its own check routine. Perhaps this could be defined so as to avoid a separate call to pg_mblen inside the loop, and thereby not add any new overhead. I'm thinking about an API something like int validate_mbchar(const unsigned char *str, int len) with result +N if a valid character N bytes long is present at *str, and -N if an invalid character is present at *str and it would be appropriate to display N bytes in the complaint. (N must be <= len in either case.) This would reduce the main loop of pg_verifymbstr to a call of this function and an error-case-handling block. regards, tom lane