Re: Bunch of tsearch fixes and cleanup
From | Heikki Linnakangas |
---|---|
Subject | Re: Bunch of tsearch fixes and cleanup |
Date | |
Msg-id | 46CEC388.1050301@enterprisedb.com |
In response to | Re: Bunch of tsearch fixes and cleanup ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
Responses | Re: Bunch of tsearch fixes and cleanup; Re: Bunch of tsearch fixes and cleanup |
List | pgsql-patches |
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Something that was annoying me yesterday was that it was not clear
>> whether we had fixed every single place that uses a tsearch config file
>> to assume that the file is in UTF8 and should be converted to database
>> encoding. So I was thinking of hardwiring the "recode" part into
>> readstopwords, and using wordop just for the "lowercase" part, which
>> seemed to me like a saner division of labor. That is, UTF8 is a policy
>> that we want to enforce globally, but lowercasing maybe not, and this
>> still leaves the door open for more processing besides lowercasing.
>
> I think we also want to always run input files through pg_verify_mbstr.
> We do it for stopwords, and synonym files (though incorrectly), but not
> for thesaurus files or ispell files. It's probably best to do that
> within the recode-function as well.

Ok, here's an updated version of the patch.

- ispell initialization crashed on an empty dictionary file
- ispell initialization crashed on an affix file with prefixes but no
  suffixes
- the stop words file was run through pg_verify_mbstr with the database
  encoding, but it's later interpreted as being UTF-8. It is now verified
  as UTF-8, regardless of database encoding.
- introduces a new t_readline function that reads a line from a file,
  verifies that it's valid UTF-8, and converts it to database encoding.
  Modified all places that read tsearch config files to use this function
  instead of calling fgets directly.
- readstopwords now sorts the stop words after loading them. Removed the
  separate sortstopwords function.
- moved the wordop input parameter from the StopList struct to a direct
  argument to readstopwords. That seems cleaner to me: the struct is now
  purely an output of readstopwords, not mixed input/output. readstopwords
  now recodes the input implicitly using t_readline.
- bunch of comments added, typos fixed, and other cleanup

PS. It's a bank holiday here in the UK on Monday, so I won't be around
until Tuesday if something comes up.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
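
For readers without the patch at hand, the read/verify/sort flow described in the list above can be sketched roughly as follows. This is a standalone illustration, not the patch's code: every `sketch_*` name and the `SketchStopList` type are hypothetical, the UTF-8 check is hand-written rather than a call into PostgreSQL's multibyte routines, and the conversion to database encoding is only marked with a comment (the sketch behaves as if the database encoding were UTF-8). Lines longer than the fixed buffer are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return true if the len bytes at s are well-formed UTF-8. */
bool sketch_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;            /* continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return false;               /* invalid lead byte */
        if (len - i <= n) return false;  /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, values above U+10FFFF */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return false;
        i += n + 1;
    }
    return true;
}

/* Read one line into a malloc'd copy. Returns NULL at EOF (*ok == true)
 * or on invalid UTF-8 / allocation failure (*ok == false). */
char *sketch_readline(FILE *fp, bool *ok)
{
    char buf[4096];
    assert(fp != NULL && ok != NULL);
    *ok = true;
    if (fgets(buf, sizeof(buf), fp) == NULL) return NULL;
    size_t len = strlen(buf);
    char *copy = NULL;
    if (sketch_is_valid_utf8((const unsigned char *) buf, len))
        copy = malloc(len + 1);
    if (copy == NULL) { *ok = false; return NULL; }
    /* A real implementation would convert UTF-8 to the database
     * encoding here; this sketch copies the bytes unchanged. */
    memcpy(copy, buf, len + 1);
    return copy;
}

typedef char *(*sketch_wordop)(char *);   /* per-word hook, e.g. lowercasing */
typedef struct { char **words; size_t len; } SketchStopList;

static int sketch_cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Example wordop: ASCII-only lowercasing, in place. */
char *sketch_lower_ascii(char *w)
{
    for (char *p = w; *p; p++) *p = (char) tolower((unsigned char) *p);
    return w;
}

/* Load one word per line, apply wordop, then sort. out is output only. */
bool sketch_readstopwords(FILE *fp, SketchStopList *out, sketch_wordop wordop)
{
    size_t cap = 16;
    bool ok = true;
    char *line;
    out->len = 0;
    out->words = malloc(cap * sizeof(char *));
    if (out->words == NULL) return false;
    while ((line = sketch_readline(fp, &ok)) != NULL) {
        line[strcspn(line, " \t\r\n")] = '\0';   /* keep the first token */
        if (line[0] == '\0') { free(line); continue; }
        if (wordop != NULL) line = wordop(line);
        if (out->len == cap) {
            char **grown = realloc(out->words, 2 * cap * sizeof(char *));
            if (grown == NULL) { free(line); return false; }
            out->words = grown;
            cap *= 2;
        }
        out->words[out->len++] = line;
    }
    if (!ok) return false;
    qsort(out->words, out->len, sizeof(char *), sketch_cmp);
    return true;
}

/* Binary search; relies on the list having been sorted by the loader. */
bool sketch_searchstoplist(const SketchStopList *s, const char *key)
{
    return s->len > 0 &&
        bsearch(&key, s->words, s->len, sizeof(char *), sketch_cmp) != NULL;
}
```

Passing the word hook as an argument keeps the list struct purely an output, which mirrors the division of labor discussed above: UTF-8 verification always happens inside the line reader, while lowercasing stays optional per caller.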