Discussion: Latin vs non-Latin words in text search parsing
If I am reading the state machine in wparser_def.c correctly, the three classifications of words that the default parser knows are

    lword    Composed entirely of ASCII letters
    nlword   Composed entirely of non-ASCII letters
             (where "letter" is defined by iswalpha())
    word     Entirely alphanumeric (per iswalnum()), but not above cases

This classification is probably sane enough for dealing with mixed Russian/English text --- IIUC, Russian words will come entirely from the Cyrillic alphabet which has no overlap with ASCII letters. But I'm thinking it'll be quite inconvenient for other European languages whose alphabets include the base ASCII letters plus other stuff such as accented letters. They will have a lot of words that fall into the catchall "word" category, which will mean they have to index mixed alpha-and-number words in order to catch all native words.

ISTM that perhaps a more generally useful definition would be

    lword    Only ASCII letters
    nlword   Entirely letters per iswalpha(), but not lword
    word     Entirely alphanumeric per iswalnum(), but not nlword
             (hence, includes at least one digit)

However, I am no linguist and maybe I'm missing something.

Comments?

regards, tom lane
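The two classifications above can be sketched in a few lines of Python. This is only an illustration, not the wparser_def.c state machine: Python's str.isalpha() and str.isalnum() stand in for iswalpha() and iswalnum() (whose results depend on the C locale), and the function names classify_current and classify_proposed are made up for this sketch.

```python
def is_ascii_letter(ch):
    """True for a-z / A-Z only."""
    return ch.isascii() and ch.isalpha()


def classify_current(token):
    """Classification as described for the existing parser (assumed reading)."""
    if not token:
        return None
    if all(is_ascii_letter(ch) for ch in token):
        return "lword"   # entirely ASCII letters
    if all(ch.isalpha() and not ch.isascii() for ch in token):
        return "nlword"  # entirely non-ASCII letters
    if token.isalnum():
        return "word"    # catchall: mixed ASCII/non-ASCII letters, or digits
    return None          # not a word-like token in this sketch


def classify_proposed(token):
    """Classification under the proposed redefinition."""
    if not token:
        return None
    if all(is_ascii_letter(ch) for ch in token):
        return "lword"   # only ASCII letters
    if token.isalpha():
        return "nlword"  # all letters, at least one non-ASCII
    if token.isalnum():
        return "word"    # alphanumeric with at least one non-letter
    return None


for tok in ("caracteres", "carácter", "à", "beta1"):
    print(tok, classify_current(tok), classify_proposed(tok))
```

Under the current rules "carácter" lands in the catchall "word" category while "caracteres" is an lword; under the proposed rules "carácter" becomes an nlword, and only tokens that are not purely letters, such as "beta1", remain "word".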
Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
>
> lword    Only ASCII letters
> nlword   Entirely letters per iswalpha(), but not lword
> word     Entirely alphanumeric per iswalnum(), but not nlword
>          (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories. I am not sure I agree with this particular definition though. I would think that a "latin word" should include ASCII letters and accented letters, and a non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias | Description   | Token     | Dictionaries   | Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 lignes)

I would think those would all fit in the "latin word" category.

This example is more interesting because it shows a word categorized differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias | Description   | Token      | Dictionaries   | Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 lignes)

I am not sure if there are any Western European languages where words can only be formed with non-ASCII chars. At least in Spanish, accents tend to be rare.

However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  | Description    | Token | Dictionaries  | Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 ligne)

I don't think this is much of a problem, this particular word being (most likely) a stopword.

So, how about

    lword    Entirely letters per iswalpha, with at least one ASCII
    nlword   Entirely letters per iswalpha
    word     Entirely alphanumeric per iswalnum, but not nlword

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> ISTM that perhaps a more generally useful definition would be >> >> lword Only ASCII letters >> nlword Entirely letters per iswalpha(), but not lword >> word Entirely alphanumeric per iswalnum(), but not nlword > ... how about > lword Entirely letters per iswalpha, with at least one ASCII > nlword Entirely letters per iswalpha > word Entirely alphanumeric per iswalnum, but not nlword Hmm. Then we have no category for "entirely ASCII", which is an interesting category at least from the English standpoint, and I think also in a lot of computer-oriented contexts. I think you may be putting too much emphasis on the "Latin" aspect of the category name, which I find to be a bit historical. I'm not sure if it's too late to consider renaming the categories; if we were willing to do that I'd propose categories "aword", "naword", "word", defined as above. Another thing that bothers me about your suggestion is that (at least in some locales) iswalpha will return true for things that are neither ASCII letters nor accented versions of them, eg Cyrillic letters. So I'm not sure the surprise factor is any less with your approach than mine: you could still get "lword" for something decidedly not Latin-derived. regards, tom lane
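For comparison, here is a rough Python sketch of the alternative definition quoted above (lword = all letters with at least one ASCII letter). As before, str.isalpha() and str.isalnum() are stand-ins for the locale-dependent iswalpha() and iswalnum(), and classify_alt is a hypothetical name. The mixed Cyrillic/ASCII token is a contrived input, chosen only to show how the "at least one ASCII" rule behaves.

```python
def classify_alt(token):
    """Sketch of the alternative rule: any all-letter token with at least
    one ASCII letter is an lword; all-letter tokens without one are nlword."""
    if token.isalpha():
        if any(ch.isascii() for ch in token):
            return "lword"
        return "nlword"
    if token.isalnum():
        return "word"
    return None  # empty or not word-like


# Pure ASCII and accented Spanish words now share one category, so there
# is no longer a way to pick out "entirely ASCII" tokens.
print(classify_alt("caracteres"), classify_alt("carácter"))

# A Cyrillic word is nlword, but appending a single ASCII letter makes the
# same, decidedly non-Latin, token an lword.
cyrillic = "привет"
print(classify_alt(cyrillic), classify_alt(cyrillic + "a"))
```

The sketch shows both objections: "caracteres" and "carácter" become indistinguishable, and the lword label can still end up on a token that is mostly Cyrillic.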
Alvaro Herrera wrote: > Tom Lane wrote: > >> ISTM that perhaps a more generally useful definition would be >> >> lword Only ASCII letters >> nlword Entirely letters per iswalpha(), but not lword >> word Entirely alphanumeric per iswalnum(), but not nlword >> (hence, includes at least one digit) > ... > I am not sure if there are any western european languages were words can > only be formed with non-ascii chars. There is at least in Swedish: "ö" (island) and å (river). They're both a bit special because they're just one letter each. > lword Entirely letters per iswalpha, with at least one ASCII > nlword Entirely letters per iswalpha > word Entirely alphanumeric per iswalnum, but not nlword I don't like this categorization much more than the original. The distinction between lword and nlword is useless for most European languages. I suppose that Tom's argument that it's useful to distinguish words made of purely ASCII characters in computer-oriented stuff is valid, though I can't immediately think of a use case. For things like parsing a programming language, that's not really enough, so you'd probably end up writing your own parser anyway. I'm also not clear what the use case for the distinction between words with digits or not is. I don't think there's any natural languages where a word can contain digits, so it must be a computer-oriented thing as well. I like the "aword" name more than "lword", BTW. If we change the meaning of the classes, surely we can change the name as well, right? Note that the default parser is useless for languages like Japanese, where words are not separated by whitespace, anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Alvaro Herrera wrote: > > Tom Lane wrote: > > > >> ISTM that perhaps a more generally useful definition would be > >> > >> lword Only ASCII letters > >> nlword Entirely letters per iswalpha(), but not lword > >> word Entirely alphanumeric per iswalnum(), but not nlword > >> (hence, includes at least one digit) > > ... > > I am not sure if there are any western european languages were words can > > only be formed with non-ascii chars. > > There is at least in Swedish: "ö" (island) and å (river). They're both a > bit special because they're just one letter each. > > > lword Entirely letters per iswalpha, with at least one ASCII > > nlword Entirely letters per iswalpha > > word Entirely alphanumeric per iswalnum, but not nlword > > I don't like this categorization much more than the original. The > distinction between lword and nlword is useless for most European > languages. > > I suppose that Tom's argument that it's useful to distinguish words made > of purely ASCII characters in computer-oriented stuff is valid, though I > can't immediately think of a use case. For things like parsing a > programming language, that's not really enough, so you'd probably end up > writing your own parser anyway. I'm also not clear what the use case for > the distinction between words with digits or not is. I don't think > there's any natural languages where a word can contain digits, so it > must be a computer-oriented thing as well. > > I like the "aword" name more than "lword", BTW. If we change the meaning > of the classes, surely we can change the name as well, right? > > Note that the default parser is useless for languages like Japanese, > where words are not separated by whitespace, anyway. The above is true, but that does not necessarily mean that Tsearch is not used for Japanese at all. I overcome the problem above with a pre-processing step which separates Japanese sentences into words divided by white space. I wish I could write a new parser which could do the job for 8.4 or later... Please change the word definition very carefully. -- Tatsuo Ishii SRA OSS, Inc. Japan
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Alvaro Herrera wrote: >> Tom Lane wrote: >> >>> ISTM that perhaps a more generally useful definition would be >>> >>> lword Only ASCII letters >>> nlword Entirely letters per iswalpha(), but not lword >>> word Entirely alphanumeric per iswalnum(), but not nlword >>> (hence, includes at least one digit) >> ... >> I am not sure if there are any western european languages were words can >> only be formed with non-ascii chars. > > There is at least in Swedish: "ö" (island) and å (river). They're both a > bit special because they're just one letter each. For what it's worth I did the same search last night and found three French words including "çà" -- which admittedly is likely to be a noise word. Other dictionaries such as Italian and Irish also have one-letter words like this. The only other with multi-letter words is actually Faroese with "íð" and "óð". > I like the "aword" name more than "lword", BTW. If we change the meaning > of the classes, surely we can change the name as well, right? I'm not very familiar with the use case here. Is there a good reason to want to abbreviate these names? I think I would expect "ascii", "word", and "token" for the three categories Tom describes. > Note that the default parser is useless for languages like Japanese, > where words are not separated by whitespace, anyway. I also wonder about languages like Arabic and Hindi which do have words but I'm not sure if they use white space as simply as in latin languages. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Alvaro Herrera wrote: >> lword Entirely letters per iswalpha, with at least one ASCII >> nlword Entirely letters per iswalpha >> word Entirely alphanumeric per iswalnum, but not nlword > I don't like this categorization much more than the original. The > distinction between lword and nlword is useless for most European > languages. Right. That's not an objection in itself, since you can just add the same dictionary mappings to both token types, but the question is when would such a distinction actually be useful? AFAICS the only case where it'd make sense to put different mappings on lword and nlword with the above definitions is when dealing with Russian or similar languages, where the entire alphabet is non-ASCII. However, my proposal (pure ASCII vs not pure ASCII) seems to work just as well for that case as this proposal does. > ... I'm also not clear what the use case for > the distinction between words with digits or not is. I don't think > there's any natural languages where a word can contain digits, so it > must be a computer-oriented thing as well. Well, that's exactly why we *should* distinguish words-with-digits; it's unlikely that any standard dictionary will do sane things with them, so if you want to index them they need to go down a different dictionary chain. A more drastic change would be to not treat a string like "beta1" as a single token at all, so that the alphanumeric-word category would go away entirely. However I'm disinclined to tinker with the parser that much. It's seen enough use in the contrib module that I'm prepared to grant that the design is generally useful. I'm just worried that the subcategories of "word" need a bit of adjustment for languages other than Russian and English. regards, tom lane
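The idea of sending words-with-digits "down a different dictionary chain" can be pictured with a toy model in Python. Everything here is hypothetical: stemmer_stub and simple_stub are stand-ins written for this sketch (they are not the real spanish_stem or any real dictionary), and CHAINS is an invented mapping using the lword/nlword/word definitions proposed earlier in the thread. The only behaviour borrowed from text search is that a token is offered to each dictionary in its chain until one recognizes it.

```python
def stemmer_stub(token):
    """Stand-in for a language stemmer: only handles purely alphabetic
    tokens, and merely lowercases them instead of really stemming."""
    return [token.lower()] if token.isalpha() else None


def simple_stub(token):
    """Stand-in for a catch-all dictionary: accepts anything, lowercased."""
    return [token.lower()]


# Hypothetical mapping of token type -> dictionary chain.
CHAINS = {
    "lword": [stemmer_stub],   # same mapping on both letter-only types
    "nlword": [stemmer_stub],
    "word": [simple_stub],     # tokens with digits bypass the stemmer
}


def lexize(token_type, token):
    """Return the lexemes from the first dictionary that recognizes the
    token, or None if the token type is unmapped or nothing recognizes it."""
    for dictionary in CHAINS.get(token_type, []):
        lexemes = dictionary(token)
        if lexemes is not None:
            return lexemes
    return None


print(lexize("lword", "Caracteres"))
print(lexize("word", "Beta1"))
print(lexize("blank", " "))
```

Giving lword and nlword the same chain reflects the point that languages with no use for the distinction can simply map both types identically, while "word" gets its own chain because a standard dictionary is unlikely to do anything sane with "beta1".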
Gregory Stark <stark@enterprisedb.com> writes: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> I like the "aword" name more than "lword", BTW. If we change the meaning >> of the classes, surely we can change the name as well, right? > I'm not very familiar with the use case here. Is there a good reason to want > to abbreviate these names? I think I would expect "ascii", "word", and "token" > for the three categories Tom describes. Please look at the first nine rows of the table here: http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html It's not clear to me where we'd go with the names for the hyphenated-word and hyphenated-word-part categories. Also, ISTM that we should use related names for these three categories, since they are all considered valid parts of hyphenated words. Another point: "token" is probably unreasonably confusing as a name for a token type. "Is that a token token or a word token?" Maybe "aword", "word", and "numword"? regards, tom lane
I wrote:
> Maybe "aword", "word", and "numword"?

Does the lack of response mean people are satisfied with that?

Fleshing the proposal out to include the hyphenated-word categories:

    aword          All ASCII letters
    word           All letters according to iswalpha()
    numword        Mixed letters and digits (all iswalnum())
    ahword         Hyphenated word, all ASCII letters
    hword          Hyphenated word, all letters
    numhword       Hyphenated word, mixed letters and digits
    apart_hword    Part of hyphenated word, all ASCII letters
    part_hword     Part of hyphenated word, all letters
    numpart_hword  Part of hyphenated word, mixed letters and digits

(As an example, "foo-beta1" is a numhword, with component tokens "foo" an aword and "beta1" a numword. This is how it works now modulo the redefinition of the base categories.)

I'm not totally thrilled with these short names for the hyphenation categories, but they will seem at least somewhat familiar to users of contrib/tsearch2, and it's probably not worth changing them just to make them look prettier.

regards, tom lane
I wrote: > (As an example, "foo-beta1" is a numhword, with component tokens > "foo" an aword and "beta1" a numword. This is how it works now > modulo the redefinition of the base categories.) Argh... need more caffeine. Obviously the component tokens would be apart_hword and numpart_hword. They'd be the others only if they were *not* part of a hyphenated word. regards, tom lane
On Oct 23, 2007, at 10:42 , Tom Lane wrote: > apart_hword Part of hyphenated word, all ASCII letters > part_hword Part of hyphenated word, all letters > numpart_hword Part of hyphenated word, mixed letters and digits Is there a rationale for using these instead of hword_apart, hword_part and hword_numpart? I find the latter to be more readable as variable names. Or was your thought to be able to identify the content from the first part of the variable name? Michael Glaesemann grzm seespotcode net
Michael Glaesemann <grzm@seespotcode.net> writes: > On Oct 23, 2007, at 10:42 , Tom Lane wrote: >> apart_hword Part of hyphenated word, all ASCII letters >> part_hword Part of hyphenated word, all letters >> numpart_hword Part of hyphenated word, mixed letters and digits > Is there a rationale for using these instead of hword_apart, > hword_part and hword_numpart? Only that the category names were constructed that way in the contrib module, and so this would seem familiar to existing tsearch2 users. However, we are changing enough other details of the tsearch configuration that maybe that's not a very strong consideration. I have no objection in principle to choosing nicer names, except that I would like to avoid a long-drawn-out discussion. Is there general approval of Michael's suggestion? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > I wrote: >> Maybe "aword", "word", and "numword"? > > Does the lack of response mean people are satisfied with that? Sorry, I had a couple responses partially written but never finished. If we were doing it from scratch I would suggest using longer names. At the least I would still suggest using "ascii" or "asciiword" instead of "aword". > Fleshing the proposal out to include the hyphenated-word categories: > > aword All ASCII letters > word All letters according to iswalpha() > numword Mixed letters and digits (all iswalnum()) This does bring up another idea. Using the ctype names. They could be named asciiword, alphaword, alnumword. Frankly I don't think this is any nicer than numword anyways. > I'm not totally thrilled with these short names for the hyphenation > categories, but they will seem at least somewhat familiar to users > of contrib/tsearch2, and it's probably not worth changing them just > to make them look prettier. I tried thinking of better words for this and couldn't think of any. The only other word for a hyphenated word I could think of is probably "compound" and the word for parts of a compound word is "lexeme", but that's certainly not going to be clearer (and technically it's not quite right anyway). So in short I would still suggest using "ascii" instead of just "a" but otherwise I think your suggestion is best. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Michael Glaesemann <grzm@seespotcode.net> writes: > > On Oct 23, 2007, at 10:42 , Tom Lane wrote: > >> apart_hword Part of hyphenated word, all ASCII letters > >> part_hword Part of hyphenated word, all letters > >> numpart_hword Part of hyphenated word, mixed letters and digits > > > Is there a rationale for using these instead of hword_apart, > > hword_part and hword_numpart? > > Only that the category names were constructed that way in the contrib > module, and so this would seem familiar to existing tsearch2 users. > However, we are changing enough other details of the tsearch > configuration that maybe that's not a very strong consideration. > > I have no objection in principle to choosing nicer names, except > that I would like to avoid a long-drawn-out discussion. Is there > general approval of Michael's suggestion? +1 -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > I wrote: > >> Maybe "aword", "word", and "numword"? > > > > Does the lack of response mean people are satisfied with that? > > Sorry, I had a couple responses partially written but never finished. > > If we were doing it from scratch I would suggest using longer names. At the > least I would still suggest using "ascii" or "asciiword" instead of "aword". +1 for asciiword; "aword" sounds too much like "a word" which is not the meaning I think we're trying to convey. It is a bit longer, but there are longer names already so I don't think it's a problem. (It's not like it's something anyone needs to type often). -- Alvaro Herrera http://www.PlanetPostgreSQL.org/ "En el principio del tiempo era el desencanto. Y era la desolación. Y era grande el escándalo, y el destello de monitores y el crujir de teclas." ("Sean los Pájaros Pulentios",Daniel Correa)
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Gregory Stark wrote:
>> If we were doing it from scratch I would suggest using longer names. At the
>> least I would still suggest using "ascii" or "asciiword" instead of "aword".
> +1 for asciiword; "aword" sounds too much like "a word" which is not the
> meaning I think we're trying to convey.

OK, so with that and Michael's suggestion we have

    asciiword
    word
    numword

    asciihword
    hword
    numhword

    hword_asciipart
    hword_part
    hword_numpart

Sold?

regards, tom lane
Tom Lane wrote: > OK, so with that and Michael's suggestion we have > > asciiword > word > numword > > asciihword > hword > numhword > > hword_asciipart > hword_part > hword_numpart > > Sold? Sold here. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > hword_asciipart > hword_part > hword_numpart Out of curiosity would the foo in foo-bär or the foo-beta1 be a hword_asciipart or a hword_part/hword_numpart? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Out of curiosity would the foo in foo-bär or the foo-beta1 be a > hword_asciipart or a hword_part/hword_numpart? foo would be hword_asciipart independently of what was in the other parts of the hword. AFAICS this is what you want for the purpose, which is to know which dictionary stack to push the token through. regards, tom lane
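The answer above can be illustrated with a small Python sketch that uses the nine agreed names. It is an approximation under stated assumptions, not the real parser: str.isalpha()/str.isalnum() stand in for iswalpha()/iswalnum(), the other token types of the default parser (numbers, URLs and so on) are ignored, and the rule that the whole hyphenated word takes the "widest" category among its parts is inferred from the "foo-beta1" example in the thread. The helper names base_kind and classify are invented here.

```python
WORD = {"ascii": "asciiword", "alpha": "word", "num": "numword"}
WHOLE = {"ascii": "asciihword", "alpha": "hword", "num": "numhword"}
PART = {"ascii": "hword_asciipart", "alpha": "hword_part", "num": "hword_numpart"}
WIDTH = {"ascii": 0, "alpha": 1, "num": 2}  # narrowest to widest


def base_kind(piece):
    """Classify one hyphen-free piece, or return None if it is not word-like."""
    if piece and all(ch.isascii() and ch.isalpha() for ch in piece):
        return "ascii"
    if piece.isalpha():
        return "alpha"
    if piece.isalnum():
        return "num"
    return None


def classify(token):
    """Return (text, token_type) pairs for a plain or hyphenated word."""
    pieces = token.split("-")
    kinds = [base_kind(p) for p in pieces]
    if any(k is None for k in kinds):
        return []  # outside the scope of this sketch
    if len(pieces) == 1:
        return [(token, WORD[kinds[0]])]
    # Assumption: the whole word takes the widest category of its parts.
    whole = max(kinds, key=WIDTH.get)
    # Each part is classified on its own, independently of its neighbours.
    return [(token, WHOLE[whole])] + [(p, PART[k]) for p, k in zip(pieces, kinds)]


for tok in ("foo-bär", "foo-beta1", "foo"):
    print(classify(tok))
```

In both hyphenated inputs "foo" comes out as hword_asciipart, because each part is classified by looking only at its own characters; outside a hyphenated word the same text is a plain asciiword.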
On Oct 23, 2007, at 12:09 , Alvaro Herrera wrote: > Tom Lane wrote: > >> OK, so with that and Michael's suggestion we have >> >> asciiword >> word >> numword >> >> asciihword >> hword >> numhword >> >> hword_asciipart >> hword_part >> hword_numpart >> >> Sold? > > Sold here. No huge preference, but I see benefit in what Gregory was saying re: asciiword, alphaword, alnumword. word itself is pretty general, while alphaword ties it much closer to its intended meaning. They've got pretty consistent lengths as well. Maybe it leans too Hungarian. I'll take your answer off the air :) Michael Glaesemann grzm seespotcode net
Michael Glaesemann <grzm@seespotcode.net> writes: >> Tom Lane wrote: >>> asciiword >>> word >>> numword > No huge preference, but I see benefit in what Gregory was saying re: > asciiword, alphaword, alnumword. word itself is pretty general, while > alphaword ties it much closer to its intended meaning. They've got > pretty consistent lengths as well. Maybe it leans too Hungarian. I stuck with the previous proposal, mainly because I was already pretty well into making the edits by the time I saw your message. But I think that with this definition "word" matches pretty well with everyone's understanding of that, and the other two are supersets and subsets that might have specific uses. regards, tom lane
Just for clarification. Are you going to make these changes in the 8.3 beta test period? -- Tatsuo Ishii SRA OSS, Inc. Japan > If I am reading the state machine in wparser_def.c correctly, the > three classifications of words that the default parser knows are > > lword Composed entirely of ASCII letters > nlword Composed entirely of non-ASCII letters > (where "letter" is defined by iswalpha()) > word Entirely alphanumeric (per iswalnum()), but not above > cases > > This classification is probably sane enough for dealing with mixed > Russian/English text --- IIUC, Russian words will come entirely from > the Cyrillic alphabet which has no overlap with ASCII letters. But > I'm thinking it'll be quite inconvenient for other European languages > whose alphabets include the base ASCII letters plus other stuff such > as accented letters. They will have a lot of words that fall into > the catchall "word" category, which will mean they have to index > mixed alpha-and-number words in order to catch all native words. > > ISTM that perhaps a more generally useful definition would be > > lword Only ASCII letters > nlword Entirely letters per iswalpha(), but not lword > word Entirely alphanumeric per iswalnum(), but not nlword > (hence, includes at least one digit) > > However, I am no linguist and maybe I'm missing something. > > Comments? > > regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes: > Just for clarification. > Are you going to make these changes in the 8.3 beta test period? Yes, I committed them a couple hours ago. regards, tom lane