Re: Searching for "bare" letters
От | Eduardo Morras |
---|---|
Тема | Re: Searching for "bare" letters |
Дата | |
Msg-id | 4E5EF2B1008AF013@> (added by postmaster@resmaa14.ono.com) обсуждение исходный текст |
Ответ на | Searching for "bare" letters ("Reuven M. Lerner" <reuven@lerner.co.il>) |
Список | pgsql-general |
At 01:25 02/10/2011, Reuven M. Lerner wrote: >Hi, everyone. I'm working on a project on >PostgreSQL 9.0 (soon to be upgraded to 9.1, >given that we haven't yet launched). The >project will involve numerous text fields >containing English, Spanish, and >Portuguese. Some of those text fields will be >searchable by the user. That's easy enough to >do; for our purposes, I was planning to use some >combination of LIKE searches; the database is >small enough that this doesn't take very much >time, and we don't expect the number of >searchable records (or columns within those records) to be all that large. > >The thing is, the people running the site want >searches to work on what I'm calling (for lack >of a better term) "bare" letters. That is, if >the user searches for "n", then the search >should also match Spanish words containing >"ñ". I'm told by Spanish-speaking members of >the team that this is how they would expect >searches to work. However, when I just did a >quick test using a UTF-8 encoded 9.0 database, I >found that PostgreSQL didn't see the two >characters as identical. (I must say, this is >the behavior that I would have expected, had the >Spanish-speaking team member not said anything on the subject.) > >So my question is whether I can somehow wrangle >PostgreSQL into thinking that "n" and "ñ" are >the same character for search purposes, or if I >need to do something else -- use regexps, keep a >"naked," searchable version of each column >alongside the native one, or something else entirely -- to get this to work. > >Any ideas? You can use perceptual hashing for that. There are multiple algorithms, some of them can be tuned for specific languages. See this documentation: http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description, http://en.wikipedia.org/wiki/Soundex is the first one developed, very old, http://en.wikipedia.org/wiki/Metaphone is a family of several modern algorithms. Remember that they are hashing algorithms, some words can collide because they have the same pronunciation but write different. I remember that datapark search engine uses them with dictionaries. You can check it too. http://www.dataparksearch.org/ >Thanks, > >Reuven HTH
В списке pgsql-general по дате отправления: