Discussion: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Tom Lane
Date:
I got tired of reading complaints about how upper/lower don't work with
Unicode, so I went and prototyped a solution.  The attached code uses
the C99-standard functions mbstowcs and wcstombs to convert to and from
a "wchar_t[]" representation that can be fed to the also-C99 functions
towupper, towlower, etc.
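
For readers who want to try the idea outside the backend, here is a minimal standalone sketch of the same round trip. It is not the attached patch: upper_mb is a hypothetical name, it uses malloc rather than palloc, and it reports failure by returning NULL instead of raising an error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Upper-case a NUL-terminated multibyte string according to the current
 * LC_CTYPE locale.  Returns a malloc'd string, or NULL if the input is not
 * valid in that locale or memory runs out.
 * (upper_mb is a hypothetical helper, not part of the patch.)
 */
static char *
upper_mb(const char *src)
{
    size_t      nbytes = strlen(src);
    size_t      ncodes;
    size_t      outlen;
    size_t      i;
    wchar_t    *wbuf;
    char       *out;

    /* A string never decodes to more wide characters than it has bytes */
    wbuf = malloc((nbytes + 1) * sizeof(wchar_t));
    if (wbuf == NULL)
        return NULL;

    ncodes = mbstowcs(wbuf, src, nbytes + 1);
    if (ncodes == (size_t) -1)
    {
        /* Invalid multibyte sequence for this locale */
        free(wbuf);
        return NULL;
    }
    assert(ncodes <= nbytes);

    for (i = 0; i < ncodes; i++)
        wbuf[i] = (wchar_t) towupper((wint_t) wbuf[i]);

    /* Each wide character needs at most MB_CUR_MAX bytes on the way back */
    outlen = (ncodes + 1) * MB_CUR_MAX;
    out = malloc(outlen);
    if (out != NULL && wcstombs(out, wbuf, outlen) == (size_t) -1)
    {
        free(out);
        out = NULL;
    }

    free(wbuf);
    return out;
}
```

A caller would normally run setlocale(LC_CTYPE, "") first so that the conversion follows the environment's locale; without that call the program stays in the "C" locale. Note that towupper maps one wide character to one wide character, so case mappings that change the number of characters cannot be expressed this way.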

This code will only work if the database is running under an LC_CTYPE
setting that implies the same encoding specified by server_encoding.
However, I don't see that as a fatal objection, because in point of fact
the existing upper/lower code assumes the same thing.  When they don't
match, this code may deliver an "invalid multibyte character" error
rather than silently producing a wrong answer, but is that really a step
backward?
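
The mismatch case can be probed in isolation. The following is only a sketch under assumptions: locale_can_decode is a made-up name, and it uses mbsrtowcs (the restartable sibling of mbstowcs) because the C standard explicitly allows a NULL destination there, which makes it a pure validity check.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/*
 * Report whether the current LC_CTYPE locale can decode a NUL-terminated
 * string.  With a NULL destination, mbsrtowcs stores nothing and only
 * counts wide characters; it returns (size_t) -1 on an invalid sequence.
 * (locale_can_decode is a hypothetical helper.)
 */
static bool
locale_can_decode(const char *s)
{
    mbstate_t   state;
    const char *p = s;

    memset(&state, 0, sizeof(state));   /* start in the initial shift state */
    return mbsrtowcs(NULL, &p, 0, &state) != (size_t) -1;
}
```

A string that is valid in the database encoding but fails this check under the server's locale is exactly the situation in which the patch would raise "invalid multibyte character for locale".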

Note this patch is *not* meant for application to CVS yet.  It's not
autoconfiscated.  But if you have a platform that has mbstowcs and
friends, please try it and let me know about any portability gotchas
you see.

Also, as a character-set-impaired American, I'm probably not the best
qualified person to judge whether the patch actually does what's wanted.
It seemed to do the right sorts of conversions in my limited testing,
but does it do what *you* want it to do?

            regards, tom lane

PS: the patch works against either 7.4 or CVS tip.

*** src/backend/utils/adt/oracle_compat.c.orig    Sat Feb 28 12:53:23 2004
--- src/backend/utils/adt/oracle_compat.c    Wed May 12 21:19:33 2004
***************
*** 15,21 ****
   */
  #include "postgres.h"

! #include <ctype.h>

  #include "utils/builtins.h"
  #include "mb/pg_wchar.h"
--- 15,22 ----
   */
  #include "postgres.h"

! #include <wchar.h>
! #include <wctype.h>

  #include "utils/builtins.h"
  #include "mb/pg_wchar.h"
***************
*** 26,31 ****
--- 27,124 ----
         bool doltrim, bool dortrim);


+ /*
+  * Convert a TEXT value into a palloc'd wchar string.
+  */
+ static wchar_t *
+ texttowcs(const text *txt)
+ {
+     int            nbytes = VARSIZE(txt) - VARHDRSZ;
+     char       *workstr;
+     wchar_t       *result;
+     size_t        ncodes;
+
+     /* Overflow paranoia */
+     if (nbytes < 0 ||
+         nbytes > (int) (INT_MAX / sizeof(wchar_t)) - 1)
+         ereport(ERROR,
+                 (errcode(ERRCODE_OUT_OF_MEMORY),
+                  errmsg("out of memory")));
+
+     /* Need a null-terminated version of the input */
+     workstr = (char *) palloc(nbytes + 1);
+     memcpy(workstr, VARDATA(txt), nbytes);
+     workstr[nbytes] = '\0';
+
+     /* Output workspace cannot have more codes than input bytes */
+     result = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
+
+     /* Do the conversion */
+     ncodes = mbstowcs(result, workstr, nbytes + 1);
+
+     if (ncodes == (size_t) -1)
+     {
+         /*
+          * Invalid multibyte character encountered.  We try to give a useful
+          * error message by letting pg_verifymbstr check the string.  But
+          * it's possible that the string is OK to us, and not OK to mbstowcs
+          * --- this suggests that the LC_CTYPE locale is different from the
+          * database encoding.  Give a generic error message if verifymbstr
+          * can't find anything wrong.
+          */
+         pg_verifymbstr(workstr, nbytes, false);
+         ereport(ERROR,
+                 (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+                  errmsg("invalid multibyte character for locale")));
+     }
+
+     Assert(ncodes <= (size_t) nbytes);
+
+     return result;
+ }
+
+
+ /*
+  * Convert a wchar string into a palloc'd TEXT value.  The wchar string
+  * must be zero-terminated, but we also require the caller to pass the string
+  * length, since it will know it anyway in current uses.
+  */
+ static text *
+ wcstotext(const wchar_t *str, int ncodes)
+ {
+     text       *result;
+     size_t        nbytes;
+
+     /* Overflow paranoia */
+     if (ncodes < 0 ||
+         ncodes > (int) ((INT_MAX - VARHDRSZ) / MB_CUR_MAX) - 1)
+         ereport(ERROR,
+                 (errcode(ERRCODE_OUT_OF_MEMORY),
+                  errmsg("out of memory")));
+
+     /* Make workspace certainly large enough for result */
+     result = (text *) palloc((ncodes + 1) * MB_CUR_MAX + VARHDRSZ);
+
+     /* Do the conversion */
+     nbytes = wcstombs((char *) VARDATA(result), str,
+                       (ncodes + 1) * MB_CUR_MAX);
+
+     if (nbytes == (size_t) -1)
+     {
+         /* Invalid multibyte character encountered ... shouldn't happen */
+         ereport(ERROR,
+                 (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+                  errmsg("invalid multibyte character for locale")));
+     }
+
+     Assert(nbytes <= (size_t) (ncodes * MB_CUR_MAX));
+
+     VARATT_SIZEP(result) = nbytes + VARHDRSZ;
+
+     return result;
+ }
+
+
  /********************************************************************
   *
   * lower
***************
*** 43,63 ****
  Datum
  lower(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P_COPY(0);
!     char       *ptr;
!     int            m;
!
!     /* Since we copied the string, we can scribble directly on the value */
!     ptr = VARDATA(string);
!     m = VARSIZE(string) - VARHDRSZ;

!     while (m-- > 0)
!     {
!         *ptr = tolower((unsigned char) *ptr);
!         ptr++;
!     }

!     PG_RETURN_TEXT_P(string);
  }


--- 136,156 ----
  Datum
  lower(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P(0);
!     text       *result;
!     wchar_t       *workspace;
!     int            i;

!     workspace = texttowcs(string);
!
!     for (i = 0; workspace[i] != 0; i++)
!         workspace[i] = towlower(workspace[i]);
!
!     result = wcstotext(workspace, i);
!
!     pfree(workspace);

!     PG_RETURN_TEXT_P(result);
  }


***************
*** 78,98 ****
  Datum
  upper(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P_COPY(0);
!     char       *ptr;
!     int            m;
!
!     /* Since we copied the string, we can scribble directly on the value */
!     ptr = VARDATA(string);
!     m = VARSIZE(string) - VARHDRSZ;

!     while (m-- > 0)
!     {
!         *ptr = toupper((unsigned char) *ptr);
!         ptr++;
!     }

!     PG_RETURN_TEXT_P(string);
  }


--- 171,191 ----
  Datum
  upper(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P(0);
!     text       *result;
!     wchar_t       *workspace;
!     int            i;

!     workspace = texttowcs(string);
!
!     for (i = 0; workspace[i] != 0; i++)
!         workspace[i] = towupper(workspace[i]);
!
!     result = wcstotext(workspace, i);
!
!     pfree(workspace);

!     PG_RETURN_TEXT_P(result);
  }


***************
*** 116,147 ****
  Datum
  initcap(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P_COPY(0);
!     char       *ptr;
!     int            m;
!
!     /* Since we copied the string, we can scribble directly on the value */
!     ptr = VARDATA(string);
!     m = VARSIZE(string) - VARHDRSZ;

!     if (m > 0)
!     {
!         *ptr = toupper((unsigned char) *ptr);
!         ptr++;
!         m--;
!     }

!     while (m-- > 0)
      {
!         /* Oracle capitalizes after all non-alphanumeric */
!         if (!isalnum((unsigned char) ptr[-1]))
!             *ptr = toupper((unsigned char) *ptr);
          else
!             *ptr = tolower((unsigned char) *ptr);
!         ptr++;
      }

!     PG_RETURN_TEXT_P(string);
  }


--- 209,236 ----
  Datum
  initcap(PG_FUNCTION_ARGS)
  {
!     text       *string = PG_GETARG_TEXT_P(0);
!     text       *result;
!     wchar_t       *workspace;
!     int            wasalnum = 0;
!     int            i;

!     workspace = texttowcs(string);

!     for (i = 0; workspace[i] != 0; i++)
      {
!         if (wasalnum)
!             workspace[i] = towlower(workspace[i]);
          else
!             workspace[i] = towupper(workspace[i]);
!         wasalnum = iswalnum(workspace[i]);
      }

!     result = wcstotext(workspace, i);
!
!     pfree(workspace);
!
!     PG_RETURN_TEXT_P(result);
  }



Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Jean-Michel POURE
Date:
On Thursday 13 May 2004 04:42, Tom Lane wrote:
> I got tired of reading complaints about how upper/lower don't work with
> Unicode, so I went and prototyped a solution.  The attached code uses
> the C99-standard functions mbstowcs and wcstombs to convert to and from
> a "wchar_t[]" representation that can be fed to the also-C99 functions
> towupper, towlower, etc.

This is really good news, thanks.
Jean-Michel Pouré


Re: Rough draft for Unicode-aware

From
Markus Bertheau
Date:
On Thu, 13.05.2004, at 04:42, Tom Lane wrote:

> But if you have a platform that has mbstowcs and
> friends, please try it and let me know about any portability gotchas
> you see.

I can't test it, because with a clean 7.4.2 with the patch applied I get:
[bert@yarrow postgresql-7.4.2]$ LANG=C make install
make -C doc install
make[1]: Entering directory `/home/bert/src/postgresql-7.4.2/doc'
mkdir /home/bertheau/pg742
mkdir /home/bertheau/pg742/doc
mkdir /home/bertheau/pg742/doc/postgresql
mkdir /home/bertheau/pg742/doc/postgresql/html
make[1]: *** [installdirs] Error 1
make[1]: Leaving directory `/home/bert/src/postgresql-7.4.2/doc'
make: *** [install] Error 2
[bert@yarrow postgresql-7.4.2]$

make and make check worked ok.

--
Markus Bertheau <twanger@bluetwanger.de>



Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Marko Karppinen
Date:
Tom Lane wrote:
> This code will only work if the database is running under an LC_CTYPE
> setting that implies the same encoding specified by server_encoding.
> However, I don't see that as a fatal objection, because in point of fact
> the existing upper/lower code assumes the same thing.

I think this interaction between the locale and server_encoding is
confusing. Is there any use case for running an incompatible mix?
If not, would it not make sense to fetch initdb's default database
encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

initdb could even emit a warning if the --encoding option was
used without also specifying --no-locale.

Using nl_langinfo(CODESET) was discussed and quietly dismissed a
year ago (although the topic was the client encoding back then).
But I think that the idea is worth revisiting because it would
allow UPPER() and LOWER() to work correctly with international
alphabets -- out of the box and without configuration -- on a
wide variety of modern systems.
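
As a rough illustration of what such a lookup involves (a sketch only, assuming a POSIX system with <langinfo.h>; locale_codeset is a hypothetical helper and not anything in initdb):

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Copy the codeset name implied by the given LC_CTYPE locale into buf.
 * Pass "" to take the locale from the environment (LANG, LC_ALL, ...).
 * Returns 0 on success, -1 if the locale is unknown or buf is too small.
 * Side effect: changes the process-wide LC_CTYPE setting.
 * (locale_codeset is a hypothetical helper.)
 */
static int
locale_codeset(const char *locale_name, char *buf, size_t buflen)
{
    const char *codeset;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    codeset = nl_langinfo(CODESET);
    if (codeset == NULL || strlen(codeset) >= buflen)
        return -1;

    strcpy(buf, codeset);
    return 0;
}
```

The string that comes back is whatever the platform's C library chooses to call the codeset, so it still has to be translated into one of our encoding names before it is of any use.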

mk



Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Peter Eisentraut
Date:
Marko Karppinen wrote:
> I think this interaction between the locale and server_encoding is
> confusing. Is there any use case for running an incompatible mix?
> If not, would it not make sense to fetch initdb's default database
> encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

This would be fine and dandy if we had any sort of idea about what sort 
of strings nl_langinfo(CODESET) returns and how to map them to our 
encoding names.
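
To make the mapping problem concrete, here is a sketch of one possible approach: normalize the codeset string (drop punctuation, fold case) and look it up in a table. Everything in it is an assumption for illustration. The function name is made up, the table is deliberately tiny, and the encoding names on the right are what I believe the 7.4-era spellings to be; a real table would need per-platform research.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative entries only; keys are upper-cased with punctuation removed */
static const struct
{
    const char *codeset;        /* normalized nl_langinfo(CODESET) string */
    const char *pg_encoding;    /* assumed PostgreSQL encoding name */
}           codeset_map[] =
{
    {"UTF8", "UNICODE"},
    {"ISO88591", "LATIN1"},
    {"ISO88592", "LATIN2"},
    {"EUCJP", "EUC_JP"},
    {"ANSIX341968", "SQL_ASCII"},   /* glibc's name for the C locale codeset */
    {"USASCII", "SQL_ASCII"},
    {"ASCII", "SQL_ASCII"},
};

/*
 * Map a codeset string to an encoding name, or return NULL if unknown.
 * "UTF-8", "utf8" and "Utf_8" all normalize to the same key.
 * (codeset_to_pg_encoding is a hypothetical helper.)
 */
static const char *
codeset_to_pg_encoding(const char *codeset)
{
    char        key[32];
    size_t      n = 0;
    size_t      i;

    for (; *codeset != '\0'; codeset++)
    {
        unsigned char c = (unsigned char) *codeset;

        if (!isalnum(c))
            continue;           /* skip '-', '_', '.', etc. */
        if (n + 1 >= sizeof(key))
            return NULL;        /* too long to be anything we know */
        key[n++] = (char) toupper(c);
    }
    key[n] = '\0';

    for (i = 0; i < sizeof(codeset_map) / sizeof(codeset_map[0]); i++)
    {
        if (strcmp(key, codeset_map[i].codeset) == 0)
            return codeset_map[i].pg_encoding;
    }
    return NULL;
}
```

Returning NULL for anything unrecognized leaves the caller free to fall back to the current default rather than guess.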



Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Marko Karppinen
Date:
> Marko Karppinen wrote:
>> I think this interaction between the locale and server_encoding is
>> confusing. Is there any use case for running an incompatible mix?
>> If not, would it not make sense to fetch initdb's default database
>> encoding with nl_langinfo(CODESET) instead of using SQL_ASCII?

Peter Eisentraut wrote:
> This would be fine and dandy if we had any sort of idea about what sort
> of strings nl_langinfo(CODESET) returns and how to map them to our
> encoding names.

Karel Zak posted an answer to this last year, here on pgsql-hackers:
http://archives.postgresql.org/pgsql-hackers/2003-05/msg00744.php
It's not complete, but it's sort of an idea.

The code is under LGPL, but copyright doesn't reach down to the
actual information about the encoding strings used by various
operating systems, so it's possible to reappropriate. I'd imagine
that it covers many, if not most, of the likely cases.

The current situation of upper/lower/collating/etc just being
broken by default on many non-C locales is bad enough to warrant
bailing out during initdb when this situation is detected
(with a reasonably cautious heuristic).

It used to be that you got what you deserved if you were stupid
enough to define a non-C, non-ASCII-based locale. You had only
yourself to blame for everything breaking. These days, however,
millions of systems get shipped and installed with UTF-8 locales
on by default, so it's not possible to portray this as a user error.

Requiring every one of these people to configure initdb's encoding
manually would be harsh, however, so I think that a heuristic
that'd work with most modern systems would strike an appropriate
balance of correctness and path-of-least-surprise.

mk



Re: Rough draft for Unicode-aware

From
Tatsuo Ishii
Date:
> initdb could even emit a warning if the --encoding option was
> used without also specifying --no-locale.

Please don't do that. Most Asian charsets do not work with a
locale-enabled PostgreSQL installation, i.e. SELECT returns WRONG
results. I've told this to Japanese users hundreds of times when
they ask me why SELECT returns wrong results. If that kind of
warning is added, I think more Japanese users will be confused.
--
Tatsuo Ishii


Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Marko Karppinen
Date:
Tatsuo Ishii wrote:
>> initdb could even emit a warning if the --encoding option was
>> used without also specifying --no-locale.
>
> Please don't do that. Most Asian charsets do not work with a
> locale-enabled PostgreSQL installation, i.e. SELECT returns WRONG
> results. I've told this to Japanese users hundreds of times when
> they ask me why SELECT returns wrong results. If that kind of
> warning is added, I think more Japanese users will be confused.

You've advocated a default of --no-locale yourself for this reason.
If using a Japanese --encoding setting without --no-locale emitted
a warning suggesting the use of --no-locale, I'd imagine you wouldn't
have had to give human support to most of those hundreds of people?

Wouldn't that be a halfway point to your goal?

mk



Re: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From
Tom Lane
Date:
Marko Karppinen <marko@karppinen.fi> writes:
> I think this interaction between the locale and server_encoding is
> confusing. Is there any use case for running an incompatible mix?

In hindsight we should probably not have invented per-database encoding
selection, since it's so fragile to use in combination with cluster-wide
locale settings.  However I believe that a lot of people in the Far East
are using multiple database encodings successfully, since they don't
much care about upper()/lower() etc ...

The long-term answer is to write our own locale support so we can
eliminate the cluster-wide-locale restriction.  In the meantime I don't
want to remove flexibility that is useful to some people.
        regards, tom lane