Re: Generalized edit function?
От | Robert Haas |
---|---|
Тема | Re: Generalized edit function? |
Дата | |
Msg-id | AANLkTi=NEHXY89xZhN1FCmpx=a7TeRgXWz313jxL8k6U@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Generalized edit function? (fork <forkandwait@gmail.com>) |
Список | pgsql-hackers |
On Sat, Feb 26, 2011 at 7:40 PM, fork <forkandwait@gmail.com> wrote: >> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff >> in contrib/fuzzystrmatch still is. > > I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts yet on > multi-byteness so far. I will have to figure out the multibyte character work > once I get the basic algorithm working -- any thoughts on that? Any pitfalls in > porting? The main thing with levenshtein() is that if you're working with single byte characters then you can reference the i'th character as x[i], whereas if you have multi-byte characters then you need to build an offset table and look at length[i] bytes beginning at &x[offset[i]]. That turns out to be significantly more expensive. As initially proposed, the patch to add multi-byte awareness built this lookup table for any multi-byte encoding and used the faster technique for single-byte encodings, but that wasn't actually so hot, because the most widely used encoding these days is probably UTF-8, which of course is a multi-byte encoding. What we ended up with is a fast-path for the case where both strings contain only single-byte characters, which will always be true in a single-byte encoding but might easily also be true in a multi-byte encoding, especially for English speakers. I don't know if that's exactly right for what you're trying to do - you'll probably need to try some different things and benchmark. I would however recommend that you look at the master-branch implementation of levenshtein() rather than the old 9.0.x one, because it's significantly different, and forward-porting your changes will probably be hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
В списке pgsql-hackers по дате отправления: