Re: Assert failure with ICU support
From | Jeff Davis |
---|---|
Subject | Re: Assert failure with ICU support |
Date | |
Msg-id | ece303a3068026c6d6f10cab2a7f3f0836f4adb7.camel@j-davis.com |
In reply to | Assert failure with ICU support (Richard Guo <guofenglinux@gmail.com>) |
Responses |
Re: Assert failure with ICU support
|
List | pgsql-bugs |
On Wed, 2023-04-19 at 18:30 +0800, Richard Guo wrote:
> I happened to run into an assert failure by chance with ICU support.
> Here is the query:
>
>     SELECT '1' SIMILAR TO '\൧';
>
> The failure happens in lexescape(),
>
>     default:
>         assert(iscalpha(c));
>         FAILW(REG_EESCAPE); /* unknown alphabetic escape */
>         break;
>
> Without ICU support, the same query just gives an error.
>
>     # SELECT '1' SIMILAR TO '\൧';
>     ERROR: invalid regular expression: invalid escape \ sequence
>
> FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

Thank you for the report and analysis!

The problem exists all the way back if you do:

    SELECT '1' COLLATE "en-US-x-icu" SIMILAR TO '\൧';

The root cause (which you allude to) is that the code makes the
assumption that digits only include 0-9, but u_isdigit('൧') == true,
which violates that assumption.

For Linux[1] specifically, it seems that the assumption should hold for
iswdigit(). But looking here[2], it seems that the behavior of
iswdigit() depends on the locale and I don't think it's correct to make
that assumption.

I did some experimentation on ICU and I found (pseudocode -- the real
code needs to create a UChar32 from an encoded string first):

    char name: MALAYALAM DIGIT ONE
    u_isalnum('൧'): true
    u_isalpha('൧'): false
    u_isdigit('൧'): true
    u_charType('൧') == U_DECIMAL_DIGIT_NUMBER: true
    u_hasBinaryProperty('൧', UCHAR_POSIX_XDIGIT): true
    u_hasBinaryProperty('൧', UCHAR_POSIX_ALNUM): true

The docs[3] for ICU say:

"There are also functions that provide easy migration from C/POSIX
functions like isblank(). Their use is generally discouraged because
the C/POSIX standards do not define their semantics beyond the ASCII
range, which means that different implementations exhibit very
different behavior. Instead, Unicode properties should be used
directly."

We should probably just check that it's plain ASCII. Unfortunately I
would not be surprised if there are more bugs similar to this one.
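To illustrate the "plain ASCII" idea, here is a minimal sketch of locale-independent checks. It is not taken from the regex code or from any actual patch: the helper names is_ascii_digit() and is_ascii_alpha() are hypothetical, and the input is assumed to be an already-decoded code point.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a PostgreSQL function): true only for the
 * ASCII digits '0'..'9', regardless of locale or collation provider.
 */
static bool
is_ascii_digit(uint32_t c)
{
    return c >= '0' && c <= '9';
}

/*
 * Hypothetical helper: true only for ASCII letters 'a'..'z' and
 * 'A'..'Z'. Any code point above 0x7F is rejected.
 */
static bool
is_ascii_alpha(uint32_t c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

With checks like these, MALAYALAM DIGIT ONE (U+0D67) is neither a digit nor a letter, so whether an escape counts as a digit escape or an alphabetic escape would no longer depend on what u_isdigit() or iswdigit() reports for non-ASCII characters.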
Regards,
    Jeff Davis

[1] https://man7.org/linux/man-pages/man3/iswdigit.3.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswdigit.html
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details