Re: Unicode normalization
From | Andreas Kalsch |
---|---|
Subject | Re: Unicode normalization |
Date | |
Msg-id | 4AB13F3D.20202@aka-fotos.de |
In reply to | Re: Unicode normalization (Andreas Kalsch <andreaskalsch@gmx.de>) |
List | pgsql-general |
Update: The error is of course: The function tries to return "str" instead
of unicode. It is not str.decode('UTF-8') which causes the error.

Andreas Kalsch wrote:
> No,
>
> I need a solution which is as generic as possible. I use UTF-8 encoded
> unicode strings on all levels. This is what I have done so far:
>
>
> 1) Writing a separate Python command line script for testing - works
> as expected:
>
> #!/usr/bin/python
>
> import sys
> import unicodedata
>
> str = sys.argv[1].decode('UTF-8')
> str = unicodedata.normalize('NFKD', str)
> str = ''.join(c for c in str if unicodedata.combining(c) == 0)
> print str
>
>
> 2) Transferring this to PL/Python:
>
> CREATE OR REPLACE FUNCTION test (str text)
>   RETURNS text
> AS $$
>   import unicodedata
>   return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;
>
> Problem: plpython throws an error, where my commandline script did it
> correctly:
>
> # select test('aÄÖÜ');
>
> ERROR: plpython: function "test" could not create return value
> DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
> encode character u'\u0308' in position 2: ordinal not in range(128)
>
>
>
> I use PG 8.3 and Python 2.5.2. How can I make plpython behave like
> in a normal python environment?
>
>
> In the end it should look like this:
>
> CREATE TABLE t (
>   ...
>   ts ts_vector NOT NULL
> );
>
> INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
>
> Andi
>
>
> David Fetter wrote:
>> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>>
>>> Has somebody integrated Unicode normalization into Postgres? if not,
>>> I would have to implement my own function by using this CPAN
>>> module: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>>
>>> I need a function which removes all diacritics (1) and transforms
>>> some characters to a more compatible form (2) to get a better index
>>> on strings.
>>>
>>> Best,
>>>
>>> Andi
>>>
>>>
>>> 1) à,ä, ... => a
>>> 2) ø => o, ƒ => f, ª => a
>>>
>>
>> You mean something like this?
>>
>> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>>
>>
>> Cheers,
>> David.
>>
>
>
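
For illustration, here is a minimal sketch of the decode, normalize, strip and re-encode round trip discussed above. It is written for Python 3 so it runs standalone, and it is not the exact PL/Python function from the thread. The function name `strip_diacritics` and the `EXTRA_MAP` table are hypothetical. The table is there because, as far as I can tell, characters such as ø and ƒ have no NFKD decomposition and need an explicit mapping, while ª decomposes to a by itself. The point relevant to the error above is the last step: the result is encoded back to UTF-8 bytes before it is returned. Under Python 2 in plpythonu the analogous fix would presumably be to call `.encode('UTF-8')` on the return value instead of returning a unicode object.

```python
import unicodedata

# Hypothetical extra mappings for characters that NFKD leaves untouched.
EXTRA_MAP = str.maketrans({"ø": "o", "Ø": "O", "ƒ": "f"})


def strip_diacritics(raw: bytes) -> bytes:
    """Take UTF-8 bytes, return UTF-8 bytes without diacritics."""
    # Decode the incoming byte string into Unicode text.
    text = raw.decode("utf-8")
    # NFKD splits e.g. 'Ä' into 'A' followed by the combining mark U+0308.
    text = unicodedata.normalize("NFKD", text)
    # Drop every combining mark (combining class != 0).
    text = "".join(c for c in text if unicodedata.combining(c) == 0)
    # Apply the explicit mappings for characters without a decomposition.
    text = text.translate(EXTRA_MAP)
    # Encode explicitly, so no implicit ASCII conversion can happen later.
    return text.encode("utf-8")


if __name__ == "__main__":
    print(strip_diacritics("aÄÖÜ".encode("utf-8")).decode("utf-8"))
```

Keeping bytes at both ends mirrors the situation in the thread, where the database hands over and expects a UTF-8 encoded string.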