Re: Unicode normalization

From	Andreas Kalsch
Subject	Re: Unicode normalization
Date	16 September 2009, 17:37:04
Msg-id	4AB13F3D.20202@aka-fotos.de
In reply to	Re: Unicode normalization (Andreas Kalsch <andreaskalsch@gmx.de>)
List	pgsql-general

Update: The error is, of course, that the function tries to return "str"
instead of unicode. It is not str.decode('UTF-8') that causes the error.
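
A minimal sketch of the point above, written for Python 3 so it can be run
outside the database (the thread itself uses Python 2.5 and plpythonu). The
helper names strip_diacritics and to_utf8_bytes are illustrative and not from
the original message. Encoding the result to UTF-8 before returning it is an
assumed workaround for a UTF-8 database, where the implicit conversion of a
unicode return value would otherwise go through the ASCII codec and fail on
the combining characters.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose text with NFKD, then drop all combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if unicodedata.combining(c) == 0)


def to_utf8_bytes(text):
    """Return the stripped text as explicit UTF-8 bytes.

    Assumption: this mirrors appending .encode('UTF-8') to the return
    value inside the Python 2 PL/Python function, so that no implicit
    ASCII encoding of a unicode object takes place.
    """
    return strip_diacritics(text).encode('UTF-8')


if __name__ == '__main__':
    print(strip_diacritics('aÄÖÜ'))
    print(to_utf8_bytes('aÄÖÜ'))
```

Note that NFKD only covers characters that have a decomposition: 'ä' and 'ª'
reduce to 'a', but 'ø' and 'ƒ' have none and pass through unchanged, so the
second group of mappings requested in the thread would still need a separate
replacement table.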

Andreas Kalsch schrieb:
> No,
>
> I need a solution which is as generic as possible. I use UTF-8 encoded
> unicode strings on all levels. This is what I have done so far:
>
>
> 1) Writing a separate Python command line script for testing - works
> as expected:
>
> #!/usr/bin/python
>
> import sys
> import unicodedata
>
> str = sys.argv[1].decode('UTF-8')
> str = unicodedata.normalize('NFKD', str)
> str = ''.join(c for c in str if unicodedata.combining(c) == 0)
> print str
>
>
> 2) Transferring this to PL/Python:
>
> CREATE OR REPLACE FUNCTION test (str text)
>  RETURNS text
> AS $$
>    import unicodedata
>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;
>
> Problem: plpython throws an error where my command-line script worked
> correctly:
>
> # select test('aÄÖÜ');
>
> ERROR:  plpython: function "test" could not create return value
> DETAIL:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
> encode character u'\u0308' in position 2: ordinal not in range(128)
>
>
>
> I use PG 8.3 and Python 2.5.2. How can I make plpython behave as it
> does in a normal Python environment?
>
>
> In the end it should look like this:
>
> CREATE TABLE t (
> ...
> ts tsvector NOT NULL
> );
>
> INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
>
> Andi
>
>
> David Fetter schrieb:
>> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>>
>>> Has somebody integrated Unicode normalization into Postgres? If not,
>>> I would have to implement my own function by using this CPAN
>>> module: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>>
>>> I need a function which removes all diacritics (1) and transforms
>>> some characters to a more compatible form (2) to get a better index
>>> on strings.
>>>
>>> Best,
>>>
>>> Andi
>>>
>>>
>>> 1) à,ä, ... => a
>>> 2) ø => o, ƒ => f, ª => a
>>>
>>
>> You mean something like this?
>>
>> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>>
>>
>> Cheers,
>> David.
>>
>
>
