Re: multibyte character set in levenshtein function
| From | Robert Haas |
| --- | --- |
| Subject | Re: multibyte character set in levenshtein function |
| Msg-id | AANLkTim0ZZs8zp6DM7qowV4jwyYhy1Dy+_fN6ApEWBf5@mail.gmail.com |
| In reply to | Re: multibyte character set in levenshtein function (Itagaki Takahiro <itagaki.takahiro@gmail.com>) |
| List | pgsql-hackers |
2010/8/2 Alexander Korotkov <aekorotkov@gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it. In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch. This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed. It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
>
> I'm sorry. This is my first patch, I will be more accurate next time.
> But I think there is no unified opinion about some changes, like
> replacing "!m" with "m == 0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters). It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present. It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases. It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster
> than mine when multi-byte characters are present? My tests show the
> reverse. For example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text NOT NULL PRIMARY KEY);
COPY words FROM '/usr/share/dict/words';
SELECT SUM(levenshtein(a, 'foo')) FROM words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
and 129 ms. With your patch, I get about 134 ms, 222 ms, and 130 ms.
Perhaps this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte words
in it. Can you send me a file?

> I think that the cause of the slowdown is the memcmp function. This
> function is very fast for long stretches of memory, but for a few
> bytes an inline function is faster, because the compiler has more
> freedom for optimization. After replacing memcmp with an inline
> function like the following in your implementation:

Yeah, that does seem to be slightly better. I've done a version of this
in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
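[Editor's note: for readers following the thread, here is a minimal,
standalone C sketch of the fast-path idea Robert describes above. It is
written for this archive page, not taken from the patch itself; all
names are illustrative, char_len() stands in for PostgreSQL's
pg_mblen(), input is assumed to be valid UTF-8, and error handling is
omitted. The point is the shape of the loop: a plain byte-comparing
inner loop when every character is a single byte, and a character-aware
inner loop otherwise.]

    #include <stdlib.h>
    #include <string.h>

    #define MIN3(a, b, c) \
        ((a) < (b) ? ((a) < (c) ? (a) : (c)) : ((b) < (c) ? (b) : (c)))

    /* Byte length of the (assumed valid) UTF-8 character starting at
     * *s; stands in for PostgreSQL's pg_mblen(). */
    static int
    char_len(const char *s)
    {
        unsigned char c = (unsigned char) *s;

        if (c < 0x80)
            return 1;
        if (c < 0xE0)
            return 2;
        if (c < 0xF0)
            return 3;
        return 4;
    }

    /* Record where each character starts; return the character count. */
    static int
    count_chars(const char *s, int bytes, const char **pos)
    {
        const char *p = s;
        int n = 0;

        while (p < s + bytes)
        {
            pos[n++] = p;
            p += char_len(p);
        }
        pos[n] = p;
        return n;
    }

    int
    levenshtein(const char *s, int s_bytes, const char *t, int t_bytes)
    {
        const char **s_pos = malloc((s_bytes + 1) * sizeof(const char *));
        const char **t_pos = malloc((t_bytes + 1) * sizeof(const char *));
        int m = count_chars(s, s_bytes, s_pos);
        int n = count_chars(t, t_bytes, t_pos);

        /* Every character is one byte long iff the character count
         * equals the byte count for BOTH inputs; this is the corrected
         * form of the condition Alexander flagged above. */
        int fast = (m == s_bytes && n == t_bytes);
        int *prev = malloc((n + 1) * sizeof(int));
        int *curr = malloc((n + 1) * sizeof(int));
        int i, j, distance;

        for (j = 0; j <= n; j++)
            prev[j] = j;

        for (i = 1; i <= m; i++)
        {
            int *tmp;

            curr[0] = i;
            if (fast)
            {
                /* Fast path: plain byte comparison in the inner loop. */
                for (j = 1; j <= n; j++)
                {
                    int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;

                    curr[j] = MIN3(prev[j] + 1, curr[j - 1] + 1,
                                   prev[j - 1] + cost);
                }
            }
            else
            {
                /* Slow path: compare possibly multi-byte characters. */
                int s_len = (int) (s_pos[i] - s_pos[i - 1]);

                for (j = 1; j <= n; j++)
                {
                    int t_len = (int) (t_pos[j] - t_pos[j - 1]);
                    int cost = (s_len == t_len &&
                                memcmp(s_pos[i - 1], t_pos[j - 1],
                                       s_len) == 0) ? 0 : 1;

                    curr[j] = MIN3(prev[j] + 1, curr[j - 1] + 1,
                                   prev[j - 1] + cost);
                }
            }
            tmp = prev;
            prev = curr;
            curr = tmp;
        }

        distance = prev[n];
        free(s_pos);
        free(t_pos);
        free(prev);
        free(curr);
        return distance;
    }

The fast/slow decision is made once, outside the nested loops, so the
common all-single-byte case pays almost nothing for multi-byte support.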
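[Editor's note: Alexander's quoted code does not survive in this copy
of the message (it followed "like the following" in the original mail).
The technique he names, replacing memcmp() with a tiny inline byte
loop, looks roughly like this hypothetical sketch:]

    /* Compare two characters of equal byte length; callers have
     * already checked that the lengths match. For a bounded len of
     * one to four bytes the compiler can fully unroll this, avoiding
     * memcmp()'s call overhead and generality. */
    static inline int
    char_same(const char *s, const char *t, int len)
    {
        while (len-- > 0)
        {
            if (*s++ != *t++)
                return 0;
        }
        return 1;
    }

In the slow path of the sketch above, the cost expression would then
read (s_len == t_len && char_same(s_pos[i - 1], t_pos[j - 1], s_len))
? 0 : 1, which is the kind of "more freedom for optimization" Alexander
is referring to.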
Attachments