How to simplify unicode strings
From | Andreas Kalsch |
---|---|
Subject | How to simplify unicode strings |
Date | |
Msg-id | 4AB176CB.9080004@gmx.de |
In reply to | Re: Unicode normalization (Sam Mason <sam@samason.me.uk>) |
List | pgsql-general |
Thank you Sam, this led to the correct solution:

CREATE OR REPLACE FUNCTION simplify (str text)
RETURNS text
AS $$
import unicodedata
s = unicodedata.normalize('NFKD', str.decode('UTF-8'))
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;

test=# select simplify('Français va à Paris, () {} [] µ @ º Ångstrøm Phiat-im hû-hō sī phiat tī 1-ê ki-chhó· jī-bó bīn-téng ê hû-hō. Siōng phó·-phiàn ê kong-lêng sī kái-piàn ki-chhó· jī-bó ê hoat-im.');
 simplify
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Francais va a Paris, () {} [] μ @ o Angstrøm Phiat-im hu-ho si phiat ti 1-e ki-chho· ji-bo bin-teng e hu-ho. Siong pho·-phian e kong-leng si kai-pian ki-chho· ji-bo e hoat-im.
(1 row)

One question remains: how is the performance of PL/Python? When there are syntax errors in the Python code, they are not reported on CREATE, because the function seems to be recompiled on every call.

This leads to the next question: when will the unicode stuff be included in the main distribution?

Andi

Sam Mason wrote:
> On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:
>
>> CREATE OR REPLACE FUNCTION test (str text)
>> RETURNS text
>> AS $$
>> import unicodedata
>> return unicodedata.normalize('NFKD', str.decode('UTF-8'))
>> $$ LANGUAGE plpythonu;
>>
>
> I'd guess you want that to be:
>
>   return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8');
>
> If you're converting from a utf8 encoding, you probably need to go
> back again! This could certainly be made easier though, PG knows what
> encoding its strings are stored in, why doesn't it work with unicode
> strings by default?
>
>
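
The normalization step in simplify() can also be tried outside the database. What follows is only a minimal sketch in plain Python 3, not the function from the post: it assumes the input is already a Python `str`, so the `decode`/`encode` round trip used under `plpythonu` (Python 2) is dropped. The name `simplify` is reused from the post; everything else is illustrative.

```python
import unicodedata


def simplify(text: str) -> str:
    """Strip combining marks after compatibility decomposition (NFKD).

    Assumption: `text` is already a Unicode string (Python 3 `str`),
    so no explicit UTF-8 decode/encode is needed here.
    """
    # NFKD splits characters such as "ç" into a base letter plus a
    # combining mark, and maps compatibility characters (e.g. the
    # micro sign) to their plain equivalents.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for non-combining characters,
    # so this keeps base characters and drops the accents.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    print(simplify("Français va à Paris"))
```

Characters without a decomposition, such as "ø", pass through unchanged, which matches the "Angstrøm" seen in the query output above. Letters like that would need an explicit mapping table if they should be reduced as well.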