Discussion: BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'

BUG #18216: Unaccent function is unable to remove accents (diacritic signs) from Japanese character 'ド'

From: PG Bug reporting form

The following bug has been logged on the website:

Bug reference:      18216
Logged by:          Shailesh Totale
Email address:      shailesh.totale@sailpoint.com
PostgreSQL version: 13.8
Operating system:   Linux
Description:

Hello team,

PostgreSQL's unaccent module does not use Unicode normalisation, but only a
simple search-and-replace dictionary. The dictionary, unaccent.rules
(https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules),
does not contain these Japanese characters, so it is unable to remove
the diacritic signs. Can someone please advise when we can expect these
Japanese characters to be added?

I also checked the latest versions of PostgreSQL; they still do not
support these Japanese characters.

https://pgpedia.info/u/unaccent.html

Thanks,
Shailesh
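
A minimal Python sketch of the distinction the report draws between a rule-table lookup and Unicode decomposition. The names `SAMPLE_RULES`, `apply_rules`, and `strip_marks` are hypothetical and exist only for this illustration; the table is a tiny stand-in rather than the shipped unaccent.rules, and the real module's matching is more involved than a per-character lookup.

```python
import unicodedata

# Hypothetical stand-in for a few rule entries; NOT the real unaccent.rules.
SAMPLE_RULES = {"é": "e", "ñ": "n", "Ž": "Z"}


def apply_rules(text, rules):
    """Replace each character that has a rule; leave everything else untouched."""
    return "".join(rules.get(ch, ch) for ch in text)


def strip_marks(text):
    """Decompose to NFD, drop combining marks, and recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return unicodedata.normalize("NFC", kept)


sample = "café ド"
print(apply_rules(sample, SAMPLE_RULES))  # 'ド' has no rule, so it passes through
print(strip_marks(sample))                # the combining voiced mark is dropped
```

With a table lookup, a character without an entry is simply returned unchanged, which matches the behaviour described in the report. Decomposition handles 'ド' because it canonically decomposes into 'ト' plus the combining mark U+3099.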


PG Bug reporting form <noreply@postgresql.org> writes:
> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
> simple search-and-replace dictionary. The dictionary, unaccent.rules
> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules)
>   , does not contain these Japanese  characters, thus  its unable to remove
> the diacritic signs.  Can someone please guide when we can expect these
> Japanese characters will be added.

unaccent.rules, as distributed, is just an example.  It is not meant
to be exhaustive or authoritative.  Feel free to add your own entries
to your copy.

            regards, tom lane
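
A hedged Python sketch of how one might generate such extra entries for kana instead of typing them by hand. It assumes the rules-file format described in the unaccent documentation (one rule per line: the source character, whitespace, the replacement). The helper names `kana_rules` and `format_rules` are hypothetical, and this is not the script the project uses to build its shipped file.

```python
import unicodedata

# Combining voiced (dakuten) and semi-voiced (handakuten) sound marks.
KANA_MARKS = {"\u3099", "\u309A"}


def kana_rules():
    """Yield (marked, base) pairs for code points in the Hiragana and Katakana blocks."""
    for cp in range(0x3040, 0x3100):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep only characters that decompose into a base plus one kana mark.
        if len(decomposed) == 2 and decomposed[1] in KANA_MARKS:
            yield ch, decomposed[0]


def format_rules(pairs):
    """Render pairs as 'source<TAB>target' lines, one rule per line."""
    return "".join(f"{src}\t{dst}\n" for src, dst in pairs)


rules_text = format_rules(kana_rules())
print(rules_text, end="")
```

The output could be appended to a private copy of the rules file in the server's tsearch_data directory and selected with the dictionary's RULES option, as the unaccent documentation describes. Whether folding these marks is linguistically desirable is a separate question.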



On Tue, Nov 28, 2023 at 09:58:35AM -0500, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
>> simple search-and-replace dictionary. The dictionary, unaccent.rules
>> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules)
>>   , does not contain these Japanese  characters, thus  its unable to remove
>> the diacritic signs.  Can someone please guide when we can expect these
>> Japanese characters will be added.
>
> unaccent.rules, as distributed, is just an example.  It is not meant
> to be exhaustive or authoritative.

FWIW, I'm quite fluent in Japanese and discussed this a bit with people
around me. Like me, they were troubled by the idea
that these marks should be considered "accents", because removing them would
entirely change what each Hiragana and Katakana character means.
I am not sure it would make sense to apply such an operation in an
expression index or similar, either.  As a whole, adding that to the
in-core unaccent.rules would be a bad idea if we were to consider it.

> Feel free to add your own entries to your copy.

Indeed.  The way to write a .rules should be clearly documented.
--
Michael



On Tue, Nov 28, 2023 at 8:06 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 28, 2023 at 09:58:35AM -0500, Tom Lane wrote:
>> PG Bug reporting form <noreply@postgresql.org> writes:
>>> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
>>> simple search-and-replace dictionary. The dictionary, unaccent.rules
>>> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules)
>>>   , does not contain these Japanese  characters, thus  its unable to remove
>>> the diacritic signs.  Can someone please guide when we can expect these
>>> Japanese characters will be added.
>>
>> unaccent.rules, as distributed, is just an example.  It is not meant
>> to be exhaustive or authoritative.
>
> FWIW, I'm quite fluent in Japanese and was discussing a bit this
> around me and, like me, folks were kind of troubled with the concept
> that these should be considered as "accents", because it would
> entirely change the meaning of what each Hiragana and Katakana means.

But isn't it generally the case that removing accents might make you land on a different word with a different meaning?

'ano' and  'año' for example mean different things in Spanish (but unaccent removes it anyway, at least in one out of four attempts to get the non-7-bit-ASCII wedged through my terminal and into the function).

That doesn't mean that unaccent is required to do it, of course. But the possibility of changing the meaning doesn't seem like a reason not to do it.

Cheers,

Jeff



Hi Jeff:

On Wed, 29 Nov 2023 at 03:40, Jeff Janes <jeff.janes@gmail.com> wrote:

I am not going to generally discuss this:
> But isn't it generally the case that removing accents might make you land on a different word with a different meaning?

But this one is a bad example,
> 'ano' and  'año' for example mean different things in Spanish (but unaccent removes it anyway, at least in one out of four attempts to get the non-7-bit-ASCII wedged through my terminal and into the function).

N and Ñ are different letters in Spanish. Ñ looks like an accented N, can
be typed as such, and some unaccent rules in some programs may make
them equal, but Ñ is as different from N as it is from Z (I am Spanish,
and in case you want an authoritative link, see
https://www.rae.es/dpd/%C3%B1 ). It has its own pages in the dictionary
(even on paper; I just checked in case my memory failed).

We also used to have CH and LL as letters, but they were dropped
"recently" (meaning this century; I'm getting old).

As for the other "accents", á, é, í, ó, ú can generally be unaccented without
problems; although they may change the meaning in some corner cases, I do
not remember seeing them do that since the special examples in school.
Another matter is ü, which is used in our "special" handling of hard/soft
vowels after g, i.e., you do not pronounce the u in "reguero" (but it
modifies how you pronounce the g, unlike in "agente"), but in
"agüero" you do pronounce it.

But Ñ is a proper letter, you cannot break it. Our alphabet goes m-n-ñ-o-p-q.

Francisco Olarte.

P.S. To really sound Spanish, we would have picked "cono" for the
examples :-p

FO



Hi

On Wed, 29 Nov 2023 at 9:13, Francisco Olarte <folarte@peoplecall.com> wrote:
> Hi Jeff:
>
> On Wed, 29 Nov 2023 at 03:40, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I am not going to generally discuss this:
>> But isn't it generally the case that removing accents might make you land on a different word with a different meaning?
>
> But this one is a bad example,
>> 'ano' and  'año' for example mean different things in Spanish (but unaccent removes it anyway, at least in one out of four attempts to get the non-7-bit-ASCII wedged through my terminal and into the function).
>
> N and Ñ are different letters in spanish. It looks like an accent, can
> be typed as such and some unaccent rules in some programs may make
> them equal, Ñ is as different from N as it is from Z ( I am spanish,
> and in case you want some authority link see
> https://www.rae.es/dpd/%C3%B1 ). It has it own pages in the dictionary
> ( even on paper, I just checked in case my memory fails ).
>
> We used to have also CH and LL as letters, but they were dropped
> "recently" ( that meaning this century, I'm getting old ).
>
> On the other "accents", à,è,ì,ò, ù  can generally be unaccented w/o
> problem, although they may change meaning in some corner cases I do
> not remember seen them do that since the special examples in school.
> Other thing is ü, which is used on our "special" handling of hard/soft
> vowels after g, i.e., you do not pronounce the u in "reguero" ( bot
> modify how you pronounce the g, differently from agente ), but in
> "agüero" you do pronounce it.
>
> But Ñ is a proper letter, you cannot break it. Our alphabet goes m-n-ñ-o-p-q.

Some users use unaccent for transformation to 7-bit ASCII.

In the Czech language I can find more examples where removing diacritics means a significant loss, and the meaning of the word has to be inferred from context alone.

Žár (the heat) -> zar
Zář (the shine) -> zar
Být (to be) -> byt
Byt (the flat) -> byt

And with unaccent we expect this loss.

So my question is: can the unaccent function be used for transformation to 7-bit ASCII, or is that wrong usage?

Regards

Pavel
 

> Francisco Olarte.
>
> P.S. to really sound spanish, we would have picked up "cono" for the
> examples :-p
>
> FO
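
A small Python sketch of the lossy 7-bit ASCII folding asked about above, using the Czech words from the message. It relies on Unicode decomposition rather than on unaccent itself, so it only approximates what a rules file would do; `to_ascii` is a hypothetical helper name.

```python
import unicodedata


def to_ascii(text):
    """Decompose (NFKD), then drop every code point outside 7-bit ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


words = ["Žár", "Zář", "Být", "Byt"]
# Lower-case the result so the collisions are easy to see.
folded = {word: to_ascii(word).lower() for word in words}
print(folded)
```

Distinct words collapse onto the same key, which is the loss described in the message. Note that this approach silently deletes characters that have no decomposition (for example 'ß' or 'ø'), whereas a rules file can map them to an explicit replacement.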


On 28.11.23 08:15, PG Bug reporting form wrote:
> PostgreSQL's unaccent module does not use Unicode normalisation, but only a
> simple search-and-replace dictionary. The dictionary, unaccent.rules
> (https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules)
>    , does not contain these Japanese  characters, thus  its unable to remove
> the diacritic signs.  Can someone please guide when we can expect these
> Japanese characters will be added.
> 
> Also tried to check with latest versions of Postgresql still the latest
> version does not have support for the Japanese characters.
> 
> https://pgpedia.info/u/unaccent.html

As the subsequent discussion shows, it's not quite clear to everybody 
what the exact mandate of the unaccent extension is.  Maybe we'll arrive 
at some conclusion.

In the meantime, I suggest you also consider solving this with 
collations.  You might find that those have a more principled approach 
to this problem, and they also have a lot of customization capabilities. 
  The documentation contains examples of accent-insensitive collations 
(e.g., [0]).  Maybe that will work for you, or serve as the basis for 
customization.

[0]: 
https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC



Hi Pavel.

On Wed, 29 Nov 2023 at 09:45, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> On Wed, 29 Nov 2023 at 9:13, Francisco Olarte <folarte@peoplecall.com> wrote:
...
>> But Ñ is a proper letter, you cannot break it. Our alphabet goes m-n-ñ-o-p-q.
> Some users use unaccent for transformation to 7bit ASCII.

Right, I've done it manually sometimes. But I did not normally just
suppress the ~; I turned año into anno (IIRC nn was the predecessor of
Ñ, and it is used in a similar place, as in "Anno Domini") or into agno
(which sounds similar in French, and as in "agnus dei qui
tollit peccata mundi", although that one has a very different meaning).

I was trying to say that normally you can suppress tildes (accent marks)
in Spanish without much problem, as in avión. Most of them just mark how
to pronounce the word; they are useful if you do not know the word, but
useless if you know it. Some of them are used to differentiate things like
adverbs and pronouns, but in that case you can deduce it from the whole
phrase. But not with n/ñ: ñoño and nono are completely different and
unrelated words, and they even go in different "chapters" of the
dictionary.

> In the Czech language I can find more examples, where removing diacritics means significant loss and the meaning of the world should be based only on context.
...
That seems even more complex than French, and I've never been able to
cope with them!
> And for unaccent we expected this loss.
> So my question is, can the unaccent function be used for transformation to 7bit ASCII or is it wrong usage?

You may need to turn chars to sequences.

Francisco Olarte.
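
A minimal Python sketch of "turning chars to sequences": replacing one character with several so that distinct words stay distinct after folding. The mapping `SEQUENCE_RULES` and the function `transliterate` are hypothetical and chosen only for illustration. The unaccent documentation states that a rule's replacement may be a longer string, so a similar mapping could presumably be expressed in a custom rules file, but that is an assumption to verify against your server version.

```python
# Illustrative one-to-many mapping; an assumption for this example only.
SEQUENCE_RULES = {"ñ": "nn", "Ñ": "NN", "ß": "ss", "æ": "ae"}


def transliterate(text, rules=SEQUENCE_RULES):
    """Replace each mapped character with its (possibly longer) ASCII sequence."""
    return "".join(rules.get(ch, ch) for ch in text)


print(transliterate("año"))
print(transliterate("ñoño"), transliterate("nono"))
```

Unlike simply dropping the tilde, this keeps "año" and "ano" apart, at the cost of producing spellings that are not real words in the language.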





On Wed, 29 Nov 2023 at 10:16, Francisco Olarte <folarte@peoplecall.com> wrote:
> Hi Pavel.
>
> On Wed, 29 Nov 2023 at 09:45, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> On Wed, 29 Nov 2023 at 9:13, Francisco Olarte <folarte@peoplecall.com> wrote:
> ...
>>> But Ñ is a proper letter, you cannot break it. Our alphabet goes m-n-ñ-o-p-q.
>> Some users use unaccent for transformation to 7bit ASCII.
>
> Right, I've done it manually sometimes. But I did not normaly just
> supress the ~ , I turned año to anno ( IIRC nn was the predecessor of
> Ñ, and it is used in similar place like "Anno domini" ) or to agno (
> which sounds similar in French, and in things like "agnus dei qui
> tollit pecata mundi" ( although that one has a much different meanig )
> ).

Š, S, Ž, Z are different characters with different sounds; some languages use two characters for these sounds (see "Polish Digraphs and Trigraphs" at https://www.optilingo.com/blog/polish/everything-about-polish-language/).

> I was trying that normally you can supress tildes in spanish without
> much problem, like in aviòn. Most of them just marks how to pronounce
> them, they are useful if you do not know the word, but useless if you
> know it. Some of them are used to differentiate things like adverbs
> and pronoums, but in this case you can deduce it from the whole
> phrase. But not with n/ñ. ñoño and nono are completely different and
> unrelated words, and they even go in different "chapters" of the
> dictionary.
>
>> In the Czech language I can find more examples, where removing diacritics means significant loss and the meaning of the world should be based only on context.
> ...
> That seems even more complex than French, and I've never been able to
> cope with them!
>> And for unaccent we expected this loss.
>> So my question is, can the unaccent function be used for transformation to 7bit ASCII or is it wrong usage?
>
> You may need to turn chars to sequences.

In the Czech language we don't do that; probably nobody could read it. We are trained to read Czech text written without accents. Many people routinely write that way because they use keyboards without Czech characters, and for the Czech language it is not too big a problem. It may be wrong for other languages, but we do it.

Pavel



> Francisco Olarte,.



On Wed, 2023-11-29 at 10:15 +0100, Francisco Olarte wrote:
> On Wed, 29 Nov 2023 at 09:45, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> > In the Czech language I can find more examples, where removing diacritics means
> > significant loss and the meaning of the world should be based only on context.
> ...
> That seems even more complex than French, and I've never been able to
> cope with them!

This is far from unusual; see German:

 wurde (became)  <> würde (would)
 schon (already) <> schön (beautiful)

Yours,
Laurenz Albe