Обсуждение: Problem in 'ORDER BY' of a column using a created collation?
Hi,
We encountered a strange behavior in ordering a column using a
We encountered a strange behavior in ordering a column using a
created collation. Here are the experiments details:
Experiment 1:-
SQL File : PG_Exp_1.sql
Actual Output : PG_Exp_1.out
Created COLLATION : CREATE COLLATION test_coll (
Experiment 1:-
SQL File : PG_Exp_1.sql
Actual Output : PG_Exp_1.out
Created COLLATION : CREATE COLLATION test_coll (
provider = icu, locale = 'ja-u-kr-latn-digit');
Expectation : All alphabets should come before all digits.
Seen Behavior : Column 1 in the 'ORDER BY' i.e value1 of the first
Experiment 2:-
SQL File : PG_Exp_2.sql
Actual Output : PG_Exp_2.out
Created 'COLLATION' : CREATE COLLATION test_coll (
'SELECT' Queries : Same as 'Experiment 1'.
Expectation : All digits should come before all alphabets.
Seen Behavior : Matching with expectation. Column 1 in the
We did debug 'Experiment 1' and we find that:-
Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,
Need help in confirming why 'Experiment 1' is behaving as mentioned
'SELECT' Queries :
SELECT * FROM test_table ORDER BY value1 COLLATE
SELECT * FROM test_table ORDER BY value1 COLLATE
test_coll, value2 COLLATE test_coll;
SELECT * FROM test_table ORDER BY value2 COLLATE
test_coll, value1 COLLATE test_coll;
Expectation : All alphabets should come before all digits.
Seen Behavior : Column 1 in the 'ORDER BY' i.e value1 of the first
'SELECT' and Column 1 in the 'ORDER BY' i.e value2 of the second
'SELECT' is giving the correct order. But Column 2 in the 'ORDER BY'
i.e value2 in the first 'SELECT' and Column 2 in the 'ORDER BY' i.e
value1 in the second 'SELECT' is NOT giving the correct order.
Experiment 2:-
SQL File : PG_Exp_2.sql
Actual Output : PG_Exp_2.out
Created 'COLLATION' : CREATE COLLATION test_coll (
provider = icu, locale = 'ja-u-kr-digit-latn');
'SELECT' Queries : Same as 'Experiment 1'.
Expectation : All digits should come before all alphabets.
Seen Behavior : Matching with expectation. Column 1 in the
'ORDER BY' i.e value1 of the first 'SELECT' and Column 1 in the
'ORDER BY' i.e value2 of the second 'SELECT' is giving the correct
order. And Column 2 in the 'ORDER BY' i.e value2 in the first
'SELECT' and Column 2 in the 'ORDER BY' i.e value1 in the
second 'SELECT' is giving the correct order.
We did debug 'Experiment 1' and we find that:-
Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,
because it uses abbreviated sort optimization due to which its data
datum gets converted to abbreviated datum using
"varstr_abbrev_convert()" function, and then the comparator
function selected is
"ssup->comparator = ssup_datum_unsigned_cmp()"
for sorting operation. But in case of column 2 in 'ORDER BY' (which
is showing incorrect result for 'Experiment 1') does not use
abbreviated sort optimization and here comparator function selected
is "ssup->comparator = varlenafastcmp_locale" -->
"strncoll_icu_utf8()", which appears, uses the third party ICU
library function for comparison and does not work as expected.
Need help in confirming why 'Experiment 1' is behaving as mentioned
above -
1. If our expectation of 'Experiment 1' is wrong?
2. Bug in abbreviated sort optimization?
3. Bug in third party comparator function "strncoll_icu_utf8()"?
4. Any other aspects which we are missing?
5. Or everything appears good?
PFA, the experiment files.
Thanks & Regards,
Nishant Sharma.
EnterpriseDB, Pune, India.
1. If our expectation of 'Experiment 1' is wrong?
2. Bug in abbreviated sort optimization?
3. Bug in third party comparator function "strncoll_icu_utf8()"?
4. Any other aspects which we are missing?
5. Or everything appears good?
PFA, the experiment files.
Thanks & Regards,
Nishant Sharma.
EnterpriseDB, Pune, India.
Вложения
On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma <nishant.sharma@enterprisedb.com> wrote: > > > Experiment 1:- > SQL File : PG_Exp_1.sql > > Actual Output : PG_Exp_1.out > > Created COLLATION : CREATE COLLATION test_coll ( > provider = icu, locale = 'ja-u-kr-latn-digit'); > > Experiment 2:- > SQL File : PG_Exp_2.sql > > Actual Output : PG_Exp_2.out > > Created 'COLLATION' : CREATE COLLATION test_coll ( > provider = icu, locale = 'ja-u-kr-digit-latn'); > > 'SELECT' Queries : Same as 'Experiment 1'. > > Expectation : All digits should come before all alphabets. > > Need help in confirming why 'Experiment 1' is behaving as mentioned > above - > 1. If our expectation of 'Experiment 1' is wrong? I am not sure. but for the 'Experiment 1', I can use the following collation to get the expected result you want, I think. CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit'); ------both two SELECTS, numeric first then alphabets SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x; SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;
On Mon, Aug 25, 2025 at 8:59 PM jian he <jian.universality@gmail.com> wrote:
On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
>
>
> Experiment 1:-
> SQL File : PG_Exp_1.sql
>
> Actual Output : PG_Exp_1.out
>
> Created COLLATION : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-latn-digit');
>
> Experiment 2:-
> SQL File : PG_Exp_2.sql
>
> Actual Output : PG_Exp_2.out
>
> Created 'COLLATION' : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-digit-latn');
>
> 'SELECT' Queries : Same as 'Experiment 1'.
>
> Expectation : All digits should come before all alphabets.
>
> Need help in confirming why 'Experiment 1' is behaving as mentioned
> above -
> 1. If our expectation of 'Experiment 1' is wrong?
I am not sure.
but for the 'Experiment 1', I can use the following collation to get
the expected result you want, I think.
CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit');
------both two SELECTS, numeric first then alphabets
SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x;
SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;
Thanks for your response!
Yeah, we can get the correct result with below create collation as well:
CREATE COLLATION test_coll (provider = icu,locale = 'ja',
rules = $$& a <*AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ <*0-9$$);
That is by explicitly giving the priority rule. But we are curious to know
the root cause of 'Experiment 1' behaviour.
I will wait for any response for the same.
Regards,
Nishant.
On Tue, Aug 26, 2025 at 4:34 PM Nishant Sharma <nishant.sharma@enterprisedb.com> wrote:
On Mon, Aug 25, 2025 at 8:59 PM jian he <jian.universality@gmail.com> wrote:On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
>
>
> Experiment 1:-
> SQL File : PG_Exp_1.sql
>
> Actual Output : PG_Exp_1.out
>
> Created COLLATION : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-latn-digit');
>
> Experiment 2:-
> SQL File : PG_Exp_2.sql
>
> Actual Output : PG_Exp_2.out
>
> Created 'COLLATION' : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-digit-latn');
>
> 'SELECT' Queries : Same as 'Experiment 1'.
>
> Expectation : All digits should come before all alphabets.
>
> Need help in confirming why 'Experiment 1' is behaving as mentioned
> above -
> 1. If our expectation of 'Experiment 1' is wrong?
I am not sure.
but for the 'Experiment 1', I can use the following collation to get
the expected result you want, I think.
CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit');
------both two SELECTS, numeric first then alphabets
SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x;
SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;Thanks for your response!Yeah, we can get the correct result with below create collation as well:CREATE COLLATION test_coll (provider = icu,locale = 'ja',rules = $$& a <*AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ <*0-9$$);That is by explicitly giving the priority rule. But we are curious to knowthe root cause of 'Experiment 1' behaviour.I will wait for any response for the same.Regards,Nishant.
Any help from experts on the confirmation about the ORDER BY with
COLLATION behaviour mentioned in the first email/comment?
Thanks in advance!
On Mon, Aug 25, 2025 at 3:52 AM Nishant Sharma <nishant.sharma@enterprisedb.com> wrote: > We did debug 'Experiment 1' and we find that:- > Whatever is the Column 1 in 'ORDER BY' gets correctly ordered, > because it uses abbreviated sort optimization due to which its data > datum gets converted to abbreviated datum using > "varstr_abbrev_convert()" function, and then the comparator > function selected is > "ssup->comparator = ssup_datum_unsigned_cmp()" > for sorting operation. But in case of column 2 in 'ORDER BY' (which > is showing incorrect result for 'Experiment 1') does not use > abbreviated sort optimization and here comparator function selected > is "ssup->comparator = varlenafastcmp_locale" --> > "strncoll_icu_utf8()", which appears, uses the third party ICU > library function for comparison and does not work as expected. Generally, there are two methods for performing collation-aware string comparisons, and they are expected to produce equivalent results. For libc collations, the first method is to use strcoll() or strcoll_l() or similar to directly compared the strings, and the other is to use strxfrm() or similar to convert the string to a normalize representation which can then be compared using memcmp(). Equivalent facilities exist for ICU; see collate_methods_icu and collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those should be getting used here. Even though these two methods are expected to produce equivalent results, if there's a bug, they might not. That bug could either exist in our code or it could exist in ICU; I suspect the latter is more likely, but the former is certainly possible. What I think you want to do is try to track down two specific strings where transforming them via strnxfrm_icu() produces results that compare in one order using memcmp(), but passing the same strings to strncoll_icu() or strncoll_icu_utf8() -- whichever is appropriate -- produces a different result. If you're able to find such strings, then you can probably rephrase those examples in terms of whatever underlying functions strncoll_icu() and strxfrm_icu() are calling and we can maybe report the problem to the ICU maintainers. If no such strings seem to exist, then there's probably a problem in the PostgreSQL code someplace, and you should try to figure out why we're getting the wrong answer despite ICU apparently doing the right thing. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Sep 26, 2025 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
Generally, there are two methods for performing collation-aware string
comparisons, and they are expected to produce equivalent results. For
libc collations, the first method is to use strcoll() or strcoll_l()
or similar to directly compared the strings, and the other is to use
strxfrm() or similar to convert the string to a normalize
representation which can then be compared using memcmp(). Equivalent
facilities exist for ICU; see collate_methods_icu and
collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
should be getting used here.
Even though these two methods are expected to produce equivalent
results, if there's a bug, they might not. That bug could either exist
in our code or it could exist in ICU; I suspect the latter is more
likely, but the former is certainly possible. What I think you want to
do is try to track down two specific strings where transforming them
via strnxfrm_icu() produces results that compare in one order using
memcmp(), but passing the same strings to strncoll_icu() or
strncoll_icu_utf8() -- whichever is appropriate -- produces a
different result. If you're able to find such strings, then you can
probably rephrase those examples in terms of whatever underlying
functions strncoll_icu() and strxfrm_icu() are calling and we can
maybe report the problem to the ICU maintainers. If no such strings
seem to exist, then there's probably a problem in the PostgreSQL code
someplace, and you should try to figure out why we're getting the
wrong answer despite ICU apparently doing the right thing.
--
Robert Haas
EDB: http://www.enterprisedb.com
Thanks Robert for your reply!
I was able to create an independent C src file. With no relation to PG in it.
Did a simple comparison of the two methods mentioned by Robert for '1'
& 'a' as input strings. The ASCENDING SORT ORDER of them is printed.
Here is how I compiled the C src code:-
"gcc icu_issue_repro.c -L/usr/lib -licui18n -licuuc -licudata -o
icu_issue_repro.o"
Here is its output:-
------------------------------------------------------------------------------------------
Testing sort order for '1' & 'a' using ICU library with collation = 'ja-u-kr-latn-digit'
With Method ucol_strcollUTF8():
SORT ORDER ASC : '1', 'a'
With Method ucol_nextSortKeyPart() (i.e transform and memcmp):
SORT ORDER ASC : 'a', '1'
Testing Ends
------------------------------------------------------------------------------------------
We can clearly see that both methods are showing different sort order
for the same input strings.
So, I think this helps confirm that the problem lies in the ICU library
method "ucol_strcollUTF8()".
Robert, if you agree with my shared details, can you please help me
with details on how I can share this issue to ICU maintainers for the
correction?
Thank you!
PFA, the C src code.
Regards,
Nishant Sharma,
EDB, Pune.
Вложения
Nishant Sharma wrote: > Testing sort order for '1' & 'a' using ICU library with collation = > 'ja-u-kr-latn-digit' > > With Method ucol_strcollUTF8(): > SORT ORDER ASC : '1', 'a' > > With Method ucol_nextSortKeyPart() (i.e transform and memcmp): > SORT ORDER ASC : 'a', '1' Yes, ucol_strcoll() with kr-latn-digit has been flawed for a long time [1] and still is in the development branch of ICU. There's a recent ticket about this [2] with a reproducer similar to yours. [1] https://www.postgresql.org/message-id/flat/153201618542.1404.3611626898935613264%40wrigleys.postgresql.org [2] https://unicode-org.atlassian.net/browse/ICU-23016 Best regards, -- Daniel Vérité https://postgresql.verite.pro/