On Tue, 2023-02-14 at 12:17 +0100, Dominique Devienne wrote:
> On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > > Surely sorting should be "constant left-to-right", no? What are we missing?
> >
> > No, it isn't. That's not how natural language collations work.
>
> Honestly, who expects the same prefix to sort differently based on what comes
> after, in left-to-right languages?
> How does one even find out what the (capricious?) rules for sorting in a given
> collation are?

Look at the documentation / implementation.
As far as ICU is concerned, here: https://unicode.org/reports/tr10/

> > > I'm already surprised (star) comes before (space), when the latter "comes
> > > before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> > > prefixed pairs are not clustered after sorting is just mystifying to me. So how come?
> >
> > Because they compare identical on the first three levels. Any difference in
> > letters, accents or case carries more weight, even if it occurs to the right
> > of these substrings.
>
> That's completely unintuitive...

Well, you can complain to GNU and the Unicode Consortium, but that's pretty
much the way it is.

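To make the "levels" idea concrete, here is a toy model in Python. It is
only a sketch under simplifying assumptions, not the real UCA or glibc
algorithm: toy_key() is a made-up helper, and the weights it assigns are
invented for illustration (it happens to put space before star, unlike
your locale). The whole string is compared on letters first, then accents,
then case, and space or punctuation only break ties at the very end.

```python
import unicodedata

def toy_key(s):
    """Hypothetical four-level sort key; NOT the actual UCA weights."""
    decomposed = unicodedata.normalize("NFD", s)
    # Level 1: base letters and digits only, case-folded.
    primary = tuple(c.casefold() for c in decomposed if c.isalnum())
    # Level 2: accents (combining marks split off by NFD).
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    # Level 3: case (False sorts first, so lower case wins ties).
    tertiary = tuple(c.isupper() for c in decomposed if c.isalpha())
    # Level 4: everything else, such as space and punctuation.
    quaternary = tuple(
        c for c in decomposed
        if not c.isalnum() and not unicodedata.combining(c)
    )
    return (primary, secondary, tertiary, quaternary)

words = ["Foo*B", "Foo A", "Foo*A", "Foo B"]

# Multi-level order: the letter after the prefix decides first.
print(sorted(words, key=toy_key))  # ['Foo A', 'Foo*A', 'Foo B', 'Foo*B']

# Plain code point order: the prefixes stay clustered.
print(sorted(words))               # ['Foo A', 'Foo B', 'Foo*A', 'Foo*B']
```

In the first result the "Foo " and "Foo*" pairs are interleaved, because
"A" versus "B" is a first-level difference and outranks the fourth-level
difference between space and star.
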
> > Yes, it sounds like the "C" collation may be best for you. That is, if you don't
> > mind that "Z" < "a".
>
> I would mind if I asked for case-insensitive comparisons.
>
> So the "C" collation is fine with general UTF-8 encoding?
> I.e. it will be codepoint ordered OK?

Yes, exactly.

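The reason is that the "C" collation compares the raw bytes, and UTF-8 is
designed so that byte order and code point order agree. A quick way to
convince yourself outside the database is this small Python check (the
sample strings are arbitrary; Python compares str values by code point):

```python
words = ["a", "Z", "\u00e9", "z", "\u20ac", "Foo*", "Foo ", "\U0001F600"]

# Order by Unicode code point (Python's default for str).
by_codepoint = sorted(words)

# Order by the UTF-8 encoded bytes, which mimics a bytewise comparison.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Both orders agree, including for characters outside ASCII and the BMP.
assert by_codepoint == by_utf8_bytes

# Upper case sorts before lower case, and space sorts before star.
assert by_codepoint[:4] == ["Foo ", "Foo*", "Z", "a"]
```
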
Yours,
Laurenz Albe