Latin vs non-Latin words in text search parsing
From | Tom Lane
---|---
Subject | Latin vs non-Latin words in text search parsing
Date | |
Msg-id | 29209.1192999663@sss.pgh.pa.us
Responses | Re: Latin vs non-Latin words in text search parsing
| Re: Latin vs non-Latin words in text search parsing
List | pgsql-hackers
If I am reading the state machine in wparser_def.c correctly, the three classifications of words that the default parser knows are:

lword — Composed entirely of ASCII letters
nlword — Composed entirely of non-ASCII letters (where "letter" is defined by iswalpha())
word — Entirely alphanumeric (per iswalnum()), but not the above cases

This classification is probably sane enough for dealing with mixed Russian/English text --- IIUC, Russian words will come entirely from the Cyrillic alphabet, which has no overlap with ASCII letters. But I'm thinking it'll be quite inconvenient for other European languages whose alphabets include the base ASCII letters plus other stuff such as accented letters. They will have a lot of words that fall into the catchall "word" category, which means they will have to index mixed alpha-and-number words in order to catch all native words.

ISTM that perhaps a more generally useful definition would be:

lword — Only ASCII letters
nlword — Entirely letters per iswalpha(), but not lword
word — Entirely alphanumeric per iswalnum(), but not nlword (hence, includes at least one digit)

However, I am no linguist and maybe I'm missing something. Comments?

regards, tom lane