Discussion: upper and UTF-8
I just used the upper(text) function on a database which is UTF-8 encoded and which has Spanish text.
All of the regular characters were properly converted, except for characters which had accents.
On Mon, Jul 26, 2010 at 3:03 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
> I just used the upper(text) function on a database which is UTF-8 encoded and
> which has Spanish text.
>
> All of the regular characters were properly converted, except for characters
> which had accents.

What are your various LC_* variables for that database?

--
To understand recursion, one must first understand recursion.
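One way to answer this question from inside the database is to read the per-database locale settings from the pg_database catalog. The query below is a minimal sketch, assuming PostgreSQL 8.4 or later, where the datcollate and datctype columns exist; it is not taken from the thread.

```sql
-- List the encoding, collation and character classification of every database.
-- Assumes PostgreSQL 8.4+, where locale settings are stored per database.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
ORDER BY datname;
```

A database created with LC_COLLATE = 'C' and LC_CTYPE = 'C' would show C in both of the last two columns.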
CREATE DATABASE ishield
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'
       LC_CTYPE = 'C'
       CONNECTION LIMIT = -1;

> -----Original Message-----
> From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
> Sent: Monday, July 26, 2010 3:17 PM
> To: Benjamin Krajmalnik
> Cc: pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] upper and UTF-8
>
> What are your various LC_* variables for that database?
I'd try creating a db with en_US or, even better, whatever the Spanish locale is for lc_collate, and see what happens.

On Mon, Jul 26, 2010 at 3:18 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
> CREATE DATABASE ishield
>   WITH OWNER = postgres
>        ENCODING = 'UTF8'
>        LC_COLLATE = 'C'
>        LC_CTYPE = 'C'
>        CONNECTION LIMIT = -1;
Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering. It is not a big deal; I just found it interesting that it did not uppercase the accented letters.

The reason I came across it is that I created a table of all the ISO countries. I had found a MySQL script which created it, and it had the fields in both upper case and mixed case. Since our platform is multilingual, we expanded the table to add the language code and started adding the translation. After I finished the translation, I figured for consistency I would upper case the one field into the other, and this is where I saw the inconsistency.

Operationally, it does not affect me in any way, but I found it strange that it did not handle the accented characters. For now we are keeping the column to facilitate the translation to other languages; ultimately it will be dropped.

> -----Original Message-----
> From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
> Sent: Monday, July 26, 2010 3:39 PM
> To: Benjamin Krajmalnik
> Cc: pgsql-admin@postgresql.org
> Subject: Re: [ADMIN] upper and UTF-8
>
> I'd try creating a db with en_US or even better whatever is spanish
> encoding for lc_collate and see what happens.
On Mon, Jul 26, 2010 at 3:47 PM, Benjamin Krajmalnik <kraj@servoyant.com> wrote:
> Unfortunately, the database has to accept data in multiple languages, since it is a SaaS offering.

The encoding determines that, not the collation. UTF-8 allows you to insert various languages in that encoding.

> It is not a big deal - I just found it interesting that it did not uppercase the accented letters.

Just tested it and the lc_collate seems to make the difference.
On Mon, Jul 26, 2010 at 3:51 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Just tested it and the lc_collate seems to make the difference.

To be more specific, when my lc_collate is en_US, it works properly. I didn't have to use a Spanish collation to make it work. Note that changing the collation will change sort order, some matching rules, and things like that. Also, a db is usually noticeably faster working with text in the C locale, because it then treats the data mostly as though it's in byte order.
Excerpts from Benjamin Krajmalnik's message of Mon Jul 26 17:03:54 -0400 2010:
> I just used the upper(text) function on a database which is UTF-8 encoded
> and which has Spanish text.
>
> All of the regular characters were properly converted, except for
> characters which had accents.

FWIW it works fine for me:

alvherre=# show lc_collate ;
 lc_collate
------------
 es_CL.utf8
(1 fila)

alvherre=# select upper('benjamín');
  upper
----------
 BENJAMÍN
(1 fila)

I suspect that the problem is an incorrect client_encoding setting.
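The client_encoding suspicion can be checked from the same session. The statements below are a small sketch, not something the poster ran; they assume the terminal or client application really sends UTF-8, which has to be verified separately.

```sql
-- Encoding the client session claims to be sending and receiving.
SHOW client_encoding;

-- Encoding the database itself stores text in.
SHOW server_encoding;

-- Assumption: the client really sends UTF-8. If so, make the session say so.
SET client_encoding TO 'UTF8';
```

If the two values already agree with what the client sends, the encoding is unlikely to be the cause and the locale settings are the next thing to look at.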
On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> I suspect that the problem is an incorrect client_encoding setting.

Yeah, OP had set lc_collate to C under the mistaken impression that collation controlled the character sets you could insert into the database. If you create a db with lc_collate='C', then upper() only works on basic ASCII characters, as near as I can tell.
Excerpts from Scott Marlowe's message of Mon Jul 26 23:12:08 -0400 2010:
> On Mon, Jul 26, 2010 at 8:09 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > I suspect that the problem is an incorrect client_encoding setting.
>
> Yeah, OP had set lc_collate to C under the mistaken impression that
> collation controlled the character sets you could insert into the
> database. If you create a db with lc_collate='C' then the upper only
> works on basic ascii characters near as I can tell.

Makes sense. The code seems to say that it's lc_ctype that's important, though; see str_toupper in formatting.c. So I think you could still set collation to C and use a language-specific lc_ctype.
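The suggestion of keeping collation at C while using a language-specific lc_ctype could be tried roughly as follows. This is a sketch under assumptions: the database name ctype_test is hypothetical, and the locale name 'es_ES.UTF-8' must match a locale actually installed on the server's operating system (names vary by platform).

```sql
-- Hypothetical test database: byte-order collation, Spanish character classification.
-- TEMPLATE = template0 is needed when the locale differs from the template database's.
CREATE DATABASE ctype_test
  WITH OWNER = postgres
       TEMPLATE = template0
       ENCODING = 'UTF8'
       LC_COLLATE = 'C'          -- keeps sorting and comparisons in byte order
       LC_CTYPE = 'es_ES.UTF-8'; -- assumption: this locale exists on the server

-- After connecting to ctype_test, repeat the test from earlier in the thread.
SELECT upper('benjamín');
```

If lc_ctype is indeed the setting that matters, as the reading of str_toupper suggests, the accented letter should come back uppercased here even though the collation is still C.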
Benjamin,
We're using the contrib module citext for all text columns so that we can do case-insensitive searches, and so far we haven't found any text that it fails to match.
Best Regards
Mike Gould