Re: Is there a similarity-function that minds national charsets?
От | Craig Ringer |
---|---|
Тема | Re: Is there a similarity-function that minds national charsets? |
Дата | |
Msg-id | 4FE28C8C.9080300@ringerc.id.au обсуждение исходный текст |
Ответ на | Is there a similarity-function that minds national charsets? (Andreas <maps.on@gmx.net>) |
Список | pgsql-sql |
On 06/21/2012 12:30 AM, Andreas wrote: > Hi, > > Is there a similarity-function that minds national charsets? > > Over here we've got some special cases that screw up the results on > similarity(). > > Our characters: ä, ö, ü, ß > could as well be written as: ae, oe, ue, ss > > e.g. > > select similarity ( 'Müller', 'Mueller' ) > results to: 0.363636 > > In normal cases everything below 0.5 would be to far apart to be > considered a match. That's not just charset aware, that's looking for awareness of language-and-dialect specific transliteration rules for representing accented chars in 7-bit ASCII. My understanding was that these rules and conventions vary and are specific to each language - or even region. tsearch2 has big language dictionaries to try to handle some issues like this (though I don't know about this issue specifically). It's possible you could extend the tsearch2 dictionaries with synonyms, possibly algorithmically generated. If you have what you consider to be an acceptable 1:1 translation rule you could build a functional index on it and test against that, eg: CREATE INDEX blah ON thetable ( (flatten_accent(target_column) ); SELECT similarity( flatten_accent('Müller'), target_column ); Note that the flatten_accent function must be IMMUTABLE and can't access or refer to data in other tables, columns, etc nor SET (GUC) variables that might change at runtime. -- Craig Ringer
В списке pgsql-sql по дате отправления: