Built-in CTYPE provider
From: Jeff Davis
Subject: Built-in CTYPE provider
Date:
Msg-id: ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
Replies: Re: Built-in CTYPE provider (Jeremy Schneider <schneider@ardentperf.com>)
         Re: Built-in CTYPE provider ("Daniel Verite" <daniel@manitou-mail.org>)
         Re: Built-in CTYPE provider (Jeremy Schneider <schneider@ardentperf.com>)
List: pgsql-hackers
CTYPE, which handles character classification and upper/lowercasing behavior, may be simpler than it first appears. We may be able to get a net decrease in complexity by just building in most (or perhaps all) of the functionality. Unicode offers relatively simple rules for CTYPE-like functionality based on data files. There are a few exceptions and a few options, which I'll address below. (In contrast, collation varies a lot from locale to locale, and has a lot more options and nuance than ctype.)

=== Proposal ===

Parse some Unicode data files into static lookup tables in .h files (similar to what we already do for normalization) and provide functions to perform the right lookups according to Unicode recommendations[1][2]. Then expose the functionality as either a specially-named locale for the libc provider, or as part of the built-in collation provider which I previously proposed[3]. (The attached patches don't expose the functionality yet; I'm looking for feedback first.)

Using libc or ICU for a CTYPE provider would still be supported, but as I explain below, there's not nearly as much reason to do so as you might expect. As far as I can tell, using an external provider for CTYPE functionality is mostly unnecessary complexity and magic. There's still plenty of reason to use the plain "C" semantics, if desired, but those semantics are already built-in.
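To make the lookup-table idea concrete, here is a minimal sketch (in Python for brevity, not the proposed C implementation) of how a static table of code point ranges, generated from Unicode data files, could back a classification function via binary search. The ranges shown are real Unicode "Nd" (decimal digit) assignments, but the table is a tiny illustrative excerpt, not the full generated table:

```python
from bisect import bisect_right

# Tiny excerpt of a static range table, as might be generated into a .h
# file from Unicode data files: (first, last) code point ranges whose
# members belong to the "digit" class (general category Nd).
DIGIT_RANGES = [
    (0x0030, 0x0039),   # ASCII DIGIT ZERO..NINE
    (0x0660, 0x0669),   # ARABIC-INDIC DIGIT ZERO..NINE
    (0x09E6, 0x09EF),   # BENGALI DIGIT ZERO..NINE
]
_STARTS = [lo for lo, _ in DIGIT_RANGES]

def is_digit(cp: int) -> bool:
    """Binary-search the sorted range table for code point cp."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= DIGIT_RANGES[i][1]
```

For example, `is_digit(ord('7'))` and `is_digit(0x0665)` (ARABIC-INDIC DIGIT FIVE) are true, while `is_digit(ord('A'))` is false. The same range-table-plus-binary-search shape works for any code point property.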
=== Benefits ===

* platform-independent ctype semantics based on Unicode, not tied to any dependency's implementation
* ability to combine fast memcmp() collation with rich ctype semantics
* user-visible semantics can be documented and tested
* stability within a PG major version
* transparency of changes: tables would be checked in to .h files, so whoever runs the "update-unicode" build target would see if there are unexpected or impactful changes that should be addressed in the release notes
* the built-in tables themselves can be tested exhaustively by comparing with ICU, so we can detect trivial parsing errors and the like

=== Character Classification ===

Character classification is used for regexes, e.g. whether a character is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" class. Unicode defines what character properties map into these classes in TR #18 [1], specifying both a "Standard" variant and a "POSIX Compatible" variant. The main difference with the POSIX variant is that symbols count as punctuation.

Character classification in Unicode does not vary from locale to locale. The same character is considered to be a member of the same classes regardless of locale (in other words, there's no "tailoring"). There is no strong compatibility guarantee around the classification of characters, but it doesn't seem to change much in practice (I could collect more data here if it matters).

In glibc, character classification is not affected by the locale as far as I can tell -- all non-"C" locales behave like "C.UTF-8" (perhaps other libc implementations, versions, or custom locales behave differently -- corrections welcome). There are some differences between "C.UTF-8" and what Unicode seems to recommend, and I'm not entirely sure why those differences exist, or whether they are important for anything other than compatibility.
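The locale independence described above is easy to see from the data: a character's general category is a fixed property of the code point, and the TR #18 classes are defined purely in terms of such properties. A small Python sketch using the stdlib unicodedata module (for illustration only; the "POSIX Compatible" rule below is a rough approximation of the TR #18 definition):

```python
import unicodedata

def classify(ch: str) -> str:
    """Return the Unicode general category of a single character.

    The category is a property of the code point itself, independent
    of any locale: 'Nd' = decimal digit, 'Po' = other punctuation,
    'Sc' = currency symbol, and so on.
    """
    return unicodedata.category(ch)

def is_posix_punct(ch: str) -> bool:
    """Roughly, TR #18's "POSIX Compatible" punct: punctuation (P*)
    plus symbols (S*); the "Standard" variant excludes the symbols."""
    cat = unicodedata.category(ch)
    return cat.startswith('P') or cat.startswith('S')
```

So `classify('1')` is `'Nd'` and `classify('€')` is `'Sc'`; the euro sign counts as punct only under the POSIX-compatible variant, which is exactly the "symbols count as punctuation" difference noted above.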
Note: ICU offers character classification based on Unicode standards, too, but the fact that it's an external dependency makes it a difficult-to-test black box that is not tied to a PG major version. Also, we currently don't use the APIs that Unicode recommends; so in Postgres today, ICU-based character classification is further from Unicode than glibc character classification.

=== LOWER()/INITCAP()/UPPER() ===

The LOWER() and UPPER() functions are defined in the SQL spec with surprising detail, relying on specific Unicode General Category assignments. How to map characters seems to be left (implicitly) up to Unicode. If the input string is normalized, the output string must be normalized, too. Weirdly, there's no room in the SQL spec to localize LOWER()/UPPER() at all to handle issues like [1]. Also, the standard specifies one example, which is that "ß" becomes "SS" when folded to upper case. INITCAP() is not in the SQL spec.

In Unicode, lowercasing and uppercasing behavior is a mapping[2], and is also backed by a strong compatibility guarantee that "case pairs" will always remain case pairs[4]. The mapping may be "simple" (context-insensitive, locale-insensitive, not adding any code points), or "full" (may be context-sensitive, locale-sensitive, and one code point may turn into 1-3 code points).

Titlecasing (INITCAP() in Postgres) in Unicode is similar to upper/lowercasing, except that it has the additional complexity of finding word boundaries, which have a non-trivial definition. To simplify, we'd either use the Postgres definition (alphanumeric) or the "word" character class specified in [1]. If someone wants more sophisticated word segmentation they could use ICU.

While "full" case mapping sounds more complex, there are actually very few cases to consider, and they are covered in another (small) data file. That data file covers ~100 code points that convert to multiple code points when the case changes (e.g. "ß" -> "SS"), 7 code points that have context-sensitive mappings, and three locales ("lt", "tr", and "az") that have special conversions for a few code points.

ICU can do the simple case mapping (u_tolower(), etc.) or the full mapping (u_strToLower(), etc.). I see one difference in ICU that I can't yet explain for the full titlecase mapping of a singular U+0345.

glibc in UTF8 (at least in my tests) just does the simple upper/lower case mapping, extended with simple mappings for the locales with special conversions (which I think are exactly the same 3 locales mentioned above). libc doesn't do titlecase. If the resulting character isn't representable in the server encoding, I think libc just maps the character to itself, though I should test this assumption.

=== Encodings ===

It's easiest to implement these rules in UTF8, but possible for any encoding where we can decode to a Unicode code point.

=== Patches ===

0001 & 0002 are just cleanup. I intend to commit them unless someone has a comment. 0003 implements character classification ("Standard" and "POSIX Compatible" variants) but doesn't actually use it for anything. 0004 implements "simple" case mapping, and a partial implementation of "full" case mapping. Again, it does not use them yet.

=== Questions ===

* Is a built-in ctype provider a reasonable direction for Postgres as a project?
* Does it feel like it would be simpler or more complex than what we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few options (e.g. "standard" or "posix" character classification; "simple" or "full" case mapping)?
Regards,
    Jeff Davis

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3] https://www.postgresql.org/message-id/flat/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
[4] https://www.unicode.org/policies/stability_policy.html#Case_Pair

--
Jeff Davis
PostgreSQL Contributor Team - AWS