Re: Performance problems testing with Spamassassin 3.1.0
From | Matthew Schumacher |
---|---|
Subject | Re: Performance problems testing with Spamassassin 3.1.0 |
Date | |
Msg-id | 42EBEBE9.4020504@aptalaska.net |
In response to | Re: Performance problems testing with Spamassassin 3.1.0 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Performance problems testing with Spamassassin 3.1.0 |
List | pgsql-performance |
Tom Lane wrote:
> I looked into this a bit. It seems that the problem when you wrap the
> entire insertion series into one transaction is associated with the fact
> that the test does so many successive updates of the single row in
> bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up
> 49383 dead versions of the one row.) This is bad enough when it's in
> separate transactions, but when it's in one transaction, none of those
> dead row versions can be marked "fully dead" yet --- so for every update
> of the row, the unique-key check has to visit every dead version to make
> sure it's dead in the context of the current transaction. This makes
> the process O(N^2) in the number of updates per transaction. Which is
> bad enough if you just want to do one transaction per message, but it's
> intolerable if you try to wrap the whole bulk-load scenario into one
> transaction.
>
> I'm not sure that we can do anything to make this a lot smarter, but
> in any case, the real problem is to not do quite so many updates of
> bayes_vars.
>
> How constrained are you as to the format of the SQL generated by
> SpamAssassin? In particular, could you convert the commands generated
> for a single message into a single statement? I experimented with
> passing all the tokens for a given message as a single bytea array,
> as in the attached, and got almost a factor of 4 runtime reduction
> on your test case.
>
> BTW, it's possible that this is all just a startup-transient problem:
> once the database has been reasonably well populated, one would expect
> new tokens to be added infrequently, and so the number of updates to
> bayes_vars ought to drop off.
>
> regards, tom lane
>

The SpamAssassin bayes code calls the _put_token method in the storage
module in a loop. This means that the storage module isn't called once per
message, but once per token.

I'll look into modifying it so that the bayes code passes a hash of tokens
to the storage module, which can then loop over them itself or, in the case
of the pgsql module, pass an array of tokens to a procedure where we loop
and use temp tables to make this much more efficient.

I don't have much time this weekend to toss at this, but will be looking at
it on Monday.

Thanks,
schu
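
To make the difference between the two calling patterns concrete, here is a minimal sketch in Python. It is not SpamAssassin's code and not the procedure attached upthread: the two-column tables, the helper names make_db, put_token and put_tokens, and the use of an in-memory SQLite database in place of PostgreSQL are all assumptions made for illustration. It only shows that a per-message call can fold all the counter changes into a single update of the one bayes_vars row.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real SpamAssassin tables
# have more columns. SQLite stands in for PostgreSQL so the sketch runs
# without a server.
def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token (token BLOB PRIMARY KEY, "
        "spam_count INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE bayes_vars (id INTEGER PRIMARY KEY, "
        "token_count INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO bayes_vars (id, token_count) VALUES (1, 0)")
    return conn

def _bump_token(conn, token):
    """Increment a token's count; return True if the token was new."""
    cur = conn.execute(
        "UPDATE bayes_token SET spam_count = spam_count + 1 WHERE token = ?",
        (token,),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO bayes_token (token, spam_count) VALUES (?, 1)",
            (token,),
        )
        return True
    return False

def put_token(conn, token):
    """Per-token path: every new token updates the single bayes_vars row.

    Returns the number of bayes_vars updates issued (0 or 1).
    """
    if _bump_token(conn, token):
        conn.execute(
            "UPDATE bayes_vars SET token_count = token_count + 1 WHERE id = 1"
        )
        return 1
    return 0

def put_tokens(conn, tokens):
    """Per-message path: count new tokens first, then update bayes_vars once.

    Returns the number of bayes_vars updates issued (0 or 1).
    """
    new_tokens = sum(1 for token in tokens if _bump_token(conn, token))
    if new_tokens:
        conn.execute(
            "UPDATE bayes_vars SET token_count = token_count + ? WHERE id = 1",
            (new_tokens,),
        )
        return 1
    return 0
```

With the per-token path, a message containing N previously unseen tokens leaves N dead versions of the bayes_vars row behind; with the per-message path it leaves one, which is the reduction the quoted analysis is asking for.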