Re: Performance problems testing with Spamassassin
From | Matthew Schumacher |
---|---|
Subject | Re: Performance problems testing with Spamassassin |
Date | |
Msg-id | 42EB3E63.6040300@aptalaska.net |
In reply to | Re: Performance problems testing with Spamassassin (Karim Nassar <karim.nassar@acm.org>) |
Responses | Re: Performance problems testing with Spamassassin |
List | pgsql-performance |
Karim Nassar wrote:
>
> kan4@slap-happy:~/k-bayesBenchmark$ time ./test.pl
> <-- snip db creation stuff -->
> 17:18:44 -- START
> 17:19:37 -- AFTER TEMP LOAD : loaded 120596 records
> 17:19:46 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token
> 17:19:50 -- AFTER bayes_vars UPDATE : updated 1 records
> 17:23:37 -- AFTER bayes_token UPDATE : updated 47537 records
> DONE
>
> real 5m4.551s
> user 0m29.442s
> sys 0m3.925s
>
>
> I am sure someone smarter could optimize further.
>
> Anyone with a super-spifty machine wanna see if there is an improvement
> here?
>

There is a great improvement in loading the data. While I didn't load it on my server, my test box shows significant gains.

It seems that the only thing your script does differently is separate the updates from the inserts, so that an expensive update isn't called when we want to insert. The other major difference is the 'IN' and 'NOT IN' syntax, which looks to be much faster than trying everything as an update before inserting.

While these optimizations seem to make a huge difference in loading the token data, the real-life scenario is a little different. You see, the database keeps track of the number of times each token was found in ham or spam, so that when we see a new message we can parse it into tokens, then compare them with the database to see how likely the message is to be spam, based on the statistics of the tokens we have already learned. Since we would want to commit this data after each message, the number of tokens processed at one time would probably only be a few hundred, most of which would be updates once we have trained on a few thousand emails.

I apologize if my crude benchmark was misleading. It was meant to simulate the sheer number of inserts/updates the database may go through, in an environment that didn't require people to load spamassassin and start training on spam.
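For readers following along, here is a rough sketch of the update/insert split described above, applied to one message's worth of tokens. It is not the actual test.pl or SpamAssassin code: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the simplified `bayes_token` schema, the `batch` temp table, and the `learn_batch` helper are all assumptions made for illustration.

```python
import sqlite3


def learn_batch(conn, tokens, is_spam):
    """Apply one message's tokens: update known tokens, insert new ones.

    Hypothetical helper; returns (rows_updated, rows_inserted).
    """
    # Column name comes from a fixed pair, never from user input.
    col = "spam_count" if is_spam else "ham_count"
    cur = conn.cursor()

    # Stage the batch in a temp table so both statements are set-based.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS batch (token TEXT PRIMARY KEY)")
    cur.execute("DELETE FROM batch")
    cur.executemany(
        "INSERT OR IGNORE INTO batch (token) VALUES (?)",
        [(t,) for t in tokens],
    )

    # Known tokens: a single UPDATE restricted with IN.
    cur.execute(
        f"UPDATE bayes_token SET {col} = {col} + 1 "
        "WHERE token IN (SELECT token FROM batch)"
    )
    updated = cur.rowcount

    # New tokens: a single INSERT restricted with NOT IN.
    cur.execute(
        f"INSERT INTO bayes_token (token, {col}) "
        "SELECT token, 1 FROM batch "
        "WHERE token NOT IN (SELECT token FROM bayes_token)"
    )
    inserted = cur.rowcount

    # Commit once per message, as in the real-life scenario.
    conn.commit()
    return updated, inserted


conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: one row per token with two counters.
conn.execute(
    "CREATE TABLE bayes_token ("
    " token TEXT PRIMARY KEY,"
    " spam_count INTEGER NOT NULL DEFAULT 0,"
    " ham_count INTEGER NOT NULL DEFAULT 0)"
)

first = learn_batch(conn, ["viagra", "free", "offer"], is_spam=True)
second = learn_batch(conn, ["free", "meeting"], is_spam=False)
print(first, second)
```

The point of the sketch is only that each message costs two set-based statements plus one commit, no matter how many of its tokens are new. Whether that still pays off at a few hundred tokens per batch on PostgreSQL is exactly what would need measuring.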
I'll do some more testing on Monday. Perhaps grouping even 200 tokens at a time using your method will yield significant gains, but probably not as dramatic as with my loading benchmark. I'll post more when I have a chance to look at this in more depth.

Thanks,
schu