Обсуждение: Problem in 'ORDER BY' of a column using a created collation?

Поиск

Список

Период

Сортировка

Problem in 'ORDER BY' of a column using a created collation?

От

Nishant Sharma

Дата:

25 августа, 10:51:45

Hi,

We encountered a strange behavior in ordering a column using a

created collation. Here are the experiments details:

Experiment 1:-
SQL File : PG_Exp_1.sql

Actual Output : PG_Exp_1.out

Created COLLATION : CREATE COLLATION test_coll (

provider = icu, locale = 'ja-u-kr-latn-digit');

'SELECT' Queries :
SELECT * FROM test_table ORDER BY value1 COLLATE

test_coll, value2 COLLATE test_coll;

SELECT * FROM test_table ORDER BY value2 COLLATE

test_coll, value1 COLLATE test_coll;

Expectation : All alphabets should come before all digits.

Seen Behavior : Column 1 in the 'ORDER BY' i.e value1 of the first

'SELECT' and Column 1 in the 'ORDER BY' i.e value2 of the second

'SELECT' is giving the correct order. But Column 2 in the 'ORDER BY'

i.e value2 in the first 'SELECT' and Column 2 in the 'ORDER BY' i.e

value1 in the second 'SELECT' is NOT giving the correct order.

Experiment 2:-
SQL File : PG_Exp_2.sql

Actual Output : PG_Exp_2.out

Created 'COLLATION' : CREATE COLLATION test_coll (

provider = icu, locale = 'ja-u-kr-digit-latn');

'SELECT' Queries : Same as 'Experiment 1'.

Expectation : All digits should come before all alphabets.

Seen Behavior : Matching with expectation. Column 1 in the

'ORDER BY' i.e value1 of the first 'SELECT' and Column 1 in the

'ORDER BY' i.e value2 of the second 'SELECT' is giving the correct

order. And Column 2 in the 'ORDER BY' i.e value2 in the first

'SELECT' and Column 2 in the 'ORDER BY' i.e value1 in the

second 'SELECT' is giving the correct order.

We did debug 'Experiment 1' and we find that:-
Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,

because it uses abbreviated sort optimization due to which its data

datum gets converted to abbreviated datum using

"varstr_abbrev_convert()" function, and then the comparator

function selected is

"ssup->comparator = ssup_datum_unsigned_cmp()"

for sorting operation. But in case of column 2 in 'ORDER BY' (which

is showing incorrect result for 'Experiment 1') does not use

abbreviated sort optimization and here comparator function selected

is "ssup->comparator = varlenafastcmp_locale" -->

"strncoll_icu_utf8()", which appears, uses the third party ICU

library function for comparison and does not work as expected.

Need help in confirming why 'Experiment 1' is behaving as mentioned

above -
1. If our expectation of 'Experiment 1' is wrong?
2. Bug in abbreviated sort optimization?
3. Bug in third party comparator function "strncoll_icu_utf8()"?
4. Any other aspects which we are missing?
5. Or everything appears good?

PFA, the experiment files.

Thanks & Regards,
Nishant Sharma.
EnterpriseDB, Pune, India.

Вложения

Re: Problem in 'ORDER BY' of a column using a created collation?

От

jian he

Дата:

25 августа, 18:28:49

On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
>
>
> Experiment 1:-
> SQL File : PG_Exp_1.sql
>
> Actual Output : PG_Exp_1.out
>
> Created COLLATION : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-latn-digit');
>
> Experiment 2:-
> SQL File : PG_Exp_2.sql
>
> Actual Output : PG_Exp_2.out
>
> Created 'COLLATION' : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-digit-latn');
>
> 'SELECT' Queries : Same as 'Experiment 1'.
>
> Expectation : All digits should come before all alphabets.
>
> Need help in confirming why 'Experiment 1' is behaving as mentioned
> above -
> 1. If our expectation of 'Experiment 1' is wrong?

I am not sure.
but for the 'Experiment 1', I can use the following collation to get
the expected result you want, I think.
CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit');

------both two SELECTS, numeric first then alphabets
SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x;
SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;

Re: Problem in 'ORDER BY' of a column using a created collation?

От

Nishant Sharma

Дата:

26 августа, 14:04:36

On Mon, Aug 25, 2025 at 8:59 PM jian he <jian.universality@gmail.com> wrote:

On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
>
>
> Experiment 1:-
> SQL File : PG_Exp_1.sql
>
> Actual Output : PG_Exp_1.out
>
> Created COLLATION : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-latn-digit');
>
> Experiment 2:-
> SQL File : PG_Exp_2.sql
>
> Actual Output : PG_Exp_2.out
>
> Created 'COLLATION' : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-digit-latn');
>
> 'SELECT' Queries : Same as 'Experiment 1'.
>
> Expectation : All digits should come before all alphabets.
>
> Need help in confirming why 'Experiment 1' is behaving as mentioned
> above -
> 1. If our expectation of 'Experiment 1' is wrong?

I am not sure.
but for the 'Experiment 1', I can use the following collation to get
the expected result you want, I think.
CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit');

------both two SELECTS, numeric first then alphabets
SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x;
SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;

Thanks for your response!

Yeah, we can get the correct result with below create collation as well:

CREATE COLLATION test_coll (provider = icu,locale = 'ja',

rules = $$& a <*AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ <*0-9$$);

That is by explicitly giving the priority rule. But we are curious to know

the root cause of 'Experiment 1' behaviour.

I will wait for any response for the same.

Regards,

Nishant.

Re: Problem in 'ORDER BY' of a column using a created collation?

От

Nishant Sharma

Дата:

09 сентября, 08:06:51

On Tue, Aug 26, 2025 at 4:34 PM Nishant Sharma <nishant.sharma@enterprisedb.com> wrote:

On Mon, Aug 25, 2025 at 8:59 PM jian he <jian.universality@gmail.com> wrote:
On Mon, Aug 25, 2025 at 3:52 PM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
>
>
> Experiment 1:-
> SQL File : PG_Exp_1.sql
>
> Actual Output : PG_Exp_1.out
>
> Created COLLATION : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-latn-digit');
>
> Experiment 2:-
> SQL File : PG_Exp_2.sql
>
> Actual Output : PG_Exp_2.out
>
> Created 'COLLATION' : CREATE COLLATION test_coll (
> provider = icu, locale = 'ja-u-kr-digit-latn');
>
> 'SELECT' Queries : Same as 'Experiment 1'.
>
> Expectation : All digits should come before all alphabets.
>
> Need help in confirming why 'Experiment 1' is behaving as mentioned
> above -
> 1. If our expectation of 'Experiment 1' is wrong?

I am not sure.
but for the 'Experiment 1', I can use the following collation to get
the expected result you want, I think.
CREATE COLLATION x (provider = icu, locale = 'und-u-latn-digit');

------both two SELECTS, numeric first then alphabets
SELECT * FROM test_table ORDER BY value1 COLLATE x, value2 COLLATE x;
SELECT * FROM test_table ORDER BY value2 COLLATE x, value1 COLLATE x;

Thanks for your response!

Yeah, we can get the correct result with below create collation as well:

CREATE COLLATION test_coll (provider = icu,locale = 'ja',
rules = $$& a <*AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ <*0-9$$);

That is by explicitly giving the priority rule. But we are curious to know
the root cause of 'Experiment 1' behaviour.

I will wait for any response for the same.

Regards,
Nishant.

Any help from experts on the confirmation about the ORDER BY with

COLLATION behaviour mentioned in the first email/comment?

Thanks in advance!

Re: Problem in 'ORDER BY' of a column using a created collation?

От

Robert Haas

Дата:

26 сентября, 16:15:51

On Mon, Aug 25, 2025 at 3:52 AM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
> We did debug 'Experiment 1' and we find that:-
> Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,
> because it uses abbreviated sort optimization due to which its data
> datum gets converted to abbreviated datum using
> "varstr_abbrev_convert()" function, and then the comparator
> function selected is
> "ssup->comparator = ssup_datum_unsigned_cmp()"
> for sorting operation. But in case of column 2 in 'ORDER BY' (which
> is showing incorrect result for 'Experiment 1') does not use
> abbreviated sort optimization and here comparator function selected
> is "ssup->comparator = varlenafastcmp_locale" -->
> "strncoll_icu_utf8()", which appears, uses the third party ICU
> library function for comparison and does not work as expected.

Generally, there are two methods for performing collation-aware string
comparisons, and they are expected to produce equivalent results. For
libc collations, the first method is to use strcoll() or strcoll_l()
or similar to directly compared the strings, and the other is to use
strxfrm() or similar to convert the string to a normalize
representation which can then be compared using memcmp(). Equivalent
facilities exist for ICU; see collate_methods_icu and
collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
should be getting used here.

Even though these two methods are expected to produce equivalent
results, if there's a bug, they might not. That bug could either exist
in our code or it could exist in ICU; I suspect the latter is more
likely, but the former is certainly possible. What I think you want to
do is try to track down two specific strings where transforming them
via strnxfrm_icu() produces results that compare in one order using
memcmp(), but passing the same strings to strncoll_icu() or
strncoll_icu_utf8() -- whichever is appropriate -- produces a
different result. If you're able to find such strings, then you can
probably rephrase those examples in terms of whatever underlying
functions strncoll_icu() and strxfrm_icu() are calling and we can
maybe report the problem to the ICU maintainers. If no such strings
seem to exist, then there's probably a problem in the PostgreSQL code
someplace, and you should try to figure out why we're getting the
wrong answer despite ICU apparently doing the right thing.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Problem in 'ORDER BY' of a column using a created collation?

От

Nishant Sharma

Дата:

01 октября, 13:02:33

On Fri, Sep 26, 2025 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote:

Generally, there are two methods for performing collation-aware string
comparisons, and they are expected to produce equivalent results. For
libc collations, the first method is to use strcoll() or strcoll_l()
or similar to directly compared the strings, and the other is to use
strxfrm() or similar to convert the string to a normalize
representation which can then be compared using memcmp(). Equivalent
facilities exist for ICU; see collate_methods_icu and
collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
should be getting used here.

Even though these two methods are expected to produce equivalent
results, if there's a bug, they might not. That bug could either exist
in our code or it could exist in ICU; I suspect the latter is more
likely, but the former is certainly possible. What I think you want to
do is try to track down two specific strings where transforming them
via strnxfrm_icu() produces results that compare in one order using
memcmp(), but passing the same strings to strncoll_icu() or
strncoll_icu_utf8() -- whichever is appropriate -- produces a
different result. If you're able to find such strings, then you can
probably rephrase those examples in terms of whatever underlying
functions strncoll_icu() and strxfrm_icu() are calling and we can
maybe report the problem to the ICU maintainers. If no such strings
seem to exist, then there's probably a problem in the PostgreSQL code
someplace, and you should try to figure out why we're getting the
wrong answer despite ICU apparently doing the right thing.

--
Robert Haas
EDB: http://www.enterprisedb.com

Thanks Robert for your reply!

I was able to create an independent C src file. With no relation to PG in it.

Did a simple comparison of the two methods mentioned by Robert for '1'

& 'a' as input strings. The ASCENDING SORT ORDER of them is printed.

Here is how I compiled the C src code:-

"gcc icu_issue_repro.c -L/usr/lib -licui18n -licuuc -licudata -o

icu_issue_repro.o"

Here is its output:-

------------------------------------------------------------------------------------------

Testing sort order for '1' & 'a' using ICU library with collation = 'ja-u-kr-latn-digit'

With Method ucol_strcollUTF8():
SORT ORDER ASC : '1', 'a'

With Method ucol_nextSortKeyPart() (i.e transform and memcmp):
SORT ORDER ASC : 'a', '1'

Testing Ends

------------------------------------------------------------------------------------------

We can clearly see that both methods are showing different sort order

for the same input strings.

So, I think this helps confirm that the problem lies in the ICU library

method "ucol_strcollUTF8()".

Robert, if you agree with my shared details, can you please help me

with details on how I can share this issue to ICU maintainers for the

correction?

Thank you!

PFA, the C src code.

Regards,

Nishant Sharma,

EDB, Pune.

Вложения

icu_issue_repro.c

Re: Problem in 'ORDER BY' of a column using a created collation?

От

"Daniel Verite"

Дата:

02 октября, 14:11:13

    Nishant Sharma wrote:

> Testing sort order for '1' & 'a' using ICU library with collation =
> 'ja-u-kr-latn-digit'
>
> With Method ucol_strcollUTF8():
> SORT ORDER ASC : '1', 'a'
>
> With Method ucol_nextSortKeyPart() (i.e transform and memcmp):
> SORT ORDER ASC : 'a', '1'

Yes, ucol_strcoll() with kr-latn-digit has been flawed for a
long time [1] and still is in the development branch of ICU.

There's a recent ticket about this [2] with a reproducer similar
to yours.


[1]
https://www.postgresql.org/message-id/flat/153201618542.1404.3611626898935613264%40wrigleys.postgresql.org
[2] https://unicode-org.atlassian.net/browse/ICU-23016


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Problem in 'ORDER BY' of a column using a created collation?

Problem in 'ORDER BY' of a column using a created collation?

Вложения

Re: Problem in 'ORDER BY' of a column using a created collation?

Re: Problem in 'ORDER BY' of a column using a created collation?

Re: Problem in 'ORDER BY' of a column using a created collation?

Re: Problem in 'ORDER BY' of a column using a created collation?

Re: Problem in 'ORDER BY' of a column using a created collation?

Вложения

Re: Problem in 'ORDER BY' of a column using a created collation?