Thread: Trying out native UTF-8 locales on Windows

Trying out native UTF-8 locales on Windows

From:
Thomas Munro
Date:
Here's a very short patch to experiment with the idea of using
Windows' native UTF-8 support when possible, ie when using
"en-US.UTF-8" in a UTF-8 database.  Otherwise it continues to use the
special Windows-only wchar_t conversion that allows for locales whose
encoding doesn't match the database encoding, ie the reason you can use
"English_United States.1252" in a UTF-8 database on that OS, something
we wouldn't allow on Unix.
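
To make the two paths concrete, here's roughly what they look like (an
illustrative sketch, not the patch itself; function names and buffer
sizes are placeholders, error handling omitted, and the underscore
prefixes are the Windows CRT spellings):

/*
 * Sketch of the two Windows collation paths discussed above.
 */
#include <windows.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Proposed path: collate the UTF-8 bytes directly (Windows 10+). */
static int
compare_utf8_native(const char *a, const char *b)
{
    _locale_t loc = _create_locale(LC_COLLATE, "en-US.UTF-8");
    int r = _strcoll_l(a, b, loc);

    _free_locale(loc);
    return r;
}

/* Current path: widen to wchar_t first, then collate.  This is what
 * lets "English_United States.1252" work in a UTF-8 database. */
static int
compare_via_wchar(const char *a, const char *b)
{
    wchar_t wa[256], wb[256];   /* illustrative fixed-size buffers */
    _locale_t loc = _create_locale(LC_COLLATE,
                                   "English_United States.1252");
    int r;

    /* cbMultiByte = -1 widens up to and including the NUL */
    MultiByteToWideChar(CP_UTF8, 0, a, -1, wa, 256);
    MultiByteToWideChar(CP_UTF8, 0, b, -1, wb, 256);
    r = _wcscoll_l(wa, wb, loc);

    _free_locale(loc);
    return r;
}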

As I understand it, that mechanism dates from the pre-Windows 10 era,
when Windows had no .UTF-8 locales but users wanted or needed to use UTF-8
databases.  I think some locales used encodings that we don't even
support as server encodings, eg SJIS in Japan, so that was a
workaround.  I assume you could use "ja-JP.UTF-8" these days.

CI tells me it compiles and passes, but I am not a Windows person; I'm
primarily interested in code cleanup and removing weird platform
differences.  I wonder if someone directly interested in Windows would
like to experiment with this and report whether (1) it works as
expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
(which I guess uses WIN1252 encoding and triggers the conversion
path?), and similarly for other locale pairs you might be interested
in?

It's possible that strcoll_l() converts the whole string to wchar_t
internally anyway, in which case it might turn out to be marginally
slower.  In the regular strcoll_l() path we often have to copy the
char strings up front ourselves in order to null-terminate them,
something that is skipped in the wchar_t conversion path, which
combines widening with null-termination in one step.  Not sure if
that'd kill the idea, but it'd at least be nice to know if we might
eventually be able to drop the special code paths and the strange
configuration possibilities we allow compared to Unix, and use it in
less performance-critical paths.  At the very least, the comments are
wrong...
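
Sketched out, the null-termination asymmetry looks something like this
(again illustrative, assuming counted, non-NUL-terminated inputs like
our text datums; helper names are made up):

#include <windows.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

static int
compare_counted_strcoll(const char *a, size_t alen,
                        const char *b, size_t blen, _locale_t loc)
{
    /* strcoll_l() needs NUL-terminated input: one extra copy per
     * string, just to add the terminator */
    char *ca = malloc(alen + 1);
    char *cb = malloc(blen + 1);
    int r;

    memcpy(ca, a, alen); ca[alen] = '\0';
    memcpy(cb, b, blen); cb[blen] = '\0';
    r = _strcoll_l(ca, cb, loc);
    free(ca);
    free(cb);
    return r;
}

static int
compare_counted_wcscoll(const char *a, int alen,
                        const char *b, int blen, _locale_t loc)
{
    wchar_t wa[256], wb[256];   /* illustrative fixed-size buffers */

    /* widening and termination happen in one pass: an explicit byte
     * count leaves the output unterminated, so we append the NUL at
     * the converted length ourselves */
    int na = MultiByteToWideChar(CP_UTF8, 0, a, alen, wa, 255);
    int nb = MultiByteToWideChar(CP_UTF8, 0, b, blen, wb, 255);

    wa[na] = L'\0';
    wb[nb] = L'\0';
    return _wcscoll_l(wa, wb, loc);
}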

Attachments

Re: Trying out native UTF-8 locales on Windows

From:
Bryan Green
Date:
On 10/27/2025 10:22 PM, Thomas Munro wrote:
> Here's a very short patch to experiment with the idea of using
> Windows' native UTF-8 support when possible, ie when using
> "en-US.UTF-8" in a UTF-8 database.  Otherwise it continues to use the
> special Windows-only wchar_t conversion that allows for locales whose
> encoding doesn't match the database encoding, ie the reason you can use
> "English_United States.1252" in a UTF-8 database on that OS, something
> we wouldn't allow on Unix.
> 
> As I understand it, that mechanism dates from the pre-Windows 10 era,
> when Windows had no .UTF-8 locales but users wanted or needed to use UTF-8
> databases.  I think some locales used encodings that we don't even
> support as server encodings, eg SJIS in Japan, so that was a
> workaround.  I assume you could use "ja-JP.UTF-8" these days.
> 
> CI tells me it compiles and passes, but I am not a Windows person; I'm
> primarily interested in code cleanup and removing weird platform
> differences.  I wonder if someone directly interested in Windows would
> like to experiment with this and report whether (1) it works as
> expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
> (which I guess uses WIN1252 encoding and triggers the conversion
> path?), and similarly for other locale pairs you might be interested
> in?

I wrote a standalone test to check this. Results on Windows 11 x64,
16 cores, ACP=1252.

(1) Correctness: PASS. strcoll_l() with a UTF-8 locale matches wcscoll_l()
    for all 26 test cases (ASCII, accents, umlauts, ß, Greek, etc.).
    Sorting 38 German/French words with both methods produces identical
    order.
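
The pairwise check boils down to the following (a trimmed sketch with
a few illustrative pairs, not the full 26-case program):

#include <windows.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

static int sign(int x) { return (x > 0) - (x < 0); }

int
main(void)
{
    /* a few illustrative pairs (UTF-8 escapes: ß, é) */
    const char *pairs[][2] = {
        {"apple", "banana"},
        {"stra\xc3\x9f" "e", "strasse"},
        {"\xc3\xa9" "cole", "ecole"},
    };
    _locale_t loc = _create_locale(LC_COLLATE, "en-US.UTF-8");

    for (int i = 0; i < 3; i++)
    {
        wchar_t wa[64], wb[64];
        int narrow, wide;

        MultiByteToWideChar(CP_UTF8, 0, pairs[i][0], -1, wa, 64);
        MultiByteToWideChar(CP_UTF8, 0, pairs[i][1], -1, wb, 64);

        /* the two paths must agree on the ordering of every pair */
        narrow = sign(_strcoll_l(pairs[i][0], pairs[i][1], loc));
        wide = sign(_wcscoll_l(wa, wb, loc));
        printf("%-10s vs %-10s: %s\n", pairs[i][0], pairs[i][1],
               narrow == wide ? "PASS" : "FAIL");
    }
    _free_locale(loc);
    return 0;
}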

(2) Performance vs WIN1252: It depends on the data.

Basic comparison with short real-world strings (1M iterations each):

  Test                   UTF8-new   UTF8-cur   WIN1252
  ----                   --------   --------   -------
  'hello' vs 'world'       82 ms     108 ms     76 ms
  'apple' vs 'banana'      85 ms     110 ms     77 ms
  'PostgreSQL' vs 'MySQL'  89 ms     113 ms     83 ms

UTF8-new = strcoll_l with UTF-8 locale (proposed patch)
UTF8-cur = wcscoll_l via conversion (current PostgreSQL)
WIN1252  = strcoll_l with legacy locale (baseline)
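
The timing loop is essentially this (a simplified sketch, showing only
the strcoll_l() side; the wcscoll_l() conversion path is measured the
same way):

#include <windows.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Time 'iters' comparisons of a vs b under the named locale, in ms. */
static double
bench_strcoll(const char *a, const char *b, const char *locname,
              long iters)
{
    _locale_t loc = _create_locale(LC_COLLATE, locname);
    LARGE_INTEGER freq, t0, t1;
    volatile int sink = 0;      /* keep the call from being hoisted */

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (long i = 0; i < iters; i++)
        sink += _strcoll_l(a, b, loc);
    QueryPerformanceCounter(&t1);

    _free_locale(loc);
    (void) sink;
    return (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
}

int
main(void)
{
    printf("UTF8-new: %.0f ms\n",
           bench_strcoll("hello", "world", "en-US.UTF-8", 1000000L));
    printf("WIN1252:  %.0f ms\n",
           bench_strcoll("hello", "world",
                         "English_United States.1252", 1000000L));
    return 0;
}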

For ASCII strings, there's a crossover around 15-20 characters
(500K iterations each):

  Length   UTF8     WIN1252   Ratio
  ------   ----     -------   -----
       5   43 ms     40 ms    0.93x  (UTF8 7% slower)
      10   50 ms     48 ms    0.96x  (UTF8 4% slower)
      20   57 ms     65 ms    1.13x  (UTF8 13% faster)
      50  104 ms    122 ms    1.17x  (UTF8 17% faster)
     100  150 ms    195 ms    1.30x  (UTF8 30% faster)
     500  550 ms    783 ms    1.43x  (UTF8 43% faster)

For accented characters (á = 2 bytes in UTF-8, 1 byte in WIN1252),
UTF-8 is consistently ~2x slower, as expected from the byte count
(ratio = WIN1252 time / UTF-8 time; 500K iterations each):

  Chars    UTF8     WIN1252   Ratio
  -----    ----     -------   -----
      5    55 ms     42 ms    0.76x
     50   233 ms    117 ms    0.50x
    200   694 ms    342 ms    0.49x

With 200-char ASCII strings, UTF-8 beats WIN1252 even when the
difference is at position 0 (500K iterations each):

  Difference at    UTF8     WIN1252
  -------------    ----     -------
  Position 0       168 ms   260 ms
  Position 199     252 ms   342 ms
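
For reference, the strings for this test are built along these lines
(sketch):

#include <string.h>

/* Build two 200-char ASCII strings differing only at position pos. */
static void
make_pair(char a[201], char b[201], int pos)
{
    memset(a, 'x', 200);
    a[200] = '\0';
    memcpy(b, a, 201);
    b[pos] = 'y';           /* pos = 0 or 199 in the table above */
}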

This suggests WIN1252's strcoll_l has poor scaling characteristics
that the UTF-8 implementation avoids. I don't have an explanation for
why that is.

The patch is correct, and the new strcoll_l() path is 10-25% faster than
the current wcscoll_l() conversion path. Whether a UTF-8 locale is faster
or slower than WIN1252 depends on string length and content - but users
choosing UTF-8 locales presumably want Unicode support, not WIN1252
compatibility.

I can test more if needed.  I can also provide the test program for
anyone who wants it.

-- 
Bryan Green
EDB: https://www.enterprisedb.com