Using tsearch2 in a Bayesian filter
From | Alban Hertroys |
---|---|
Subject | Using tsearch2 in a Bayesian filter |
Date | |
Msg-id | A2D9C6F9-4394-4871-A882-3348D45CCBFC@solfertje.student.utwente.nl |
List | pgsql-general |
Hi all,

In my spare time I've started on a general-purpose Bayesian filter based on the now built-in tsearch2 functionality. The ability to stem words from a message into lexemes, the removal of stop words, and GiST indexes look promising enough to attempt this. However, my experience with tsearch is somewhat limited, so I have a few questions...

The messages entering the filter will be in different languages and encodings. For example, I get a lot of Cyrillic spam these days, while I get a lot of English messages and a few in Dutch. The spam especially is likely to lie about its encoding. Some messages will be plain text, but many will be HTML.

- Is it possible to stem words from that wide a variety of content?
- If so, what approach would be best?
- Do I need to strip out the HTML tags or can they serve as lexemes themselves?

Next, to determine the probability of a lexeme being of a certain classification (for example spam or not spam), I need to be able to count the number of occurrences of that lexeme in a text. I can't store a probability, as the numbers aren't fixed[*] (I was hoping to abuse score() here, but that's probably a no-op). I haven't found any tsearch functions to determine the number of occurrences of each lexeme in a text. Ideally I'd have a result set of (lexeme, number of occurrences) tuples, so that I can use that directly in a query.

- How do I determine the number of occurrences of each lexeme in a text?

Thanks for your time.

[*] As more messages enter the system, there will be more occurrences of lexemes in messages and in classifications. If I start out with one lexeme occurring once in a single message, the chance that lexeme is in a message is 1. As soon as another message arrives not containing that lexeme, the chance is 0.5. The numbers of messages and of lexeme occurrences in messages and in classifications keep changing, so I will need to store the numbers the probability was based on (I might still decide to add a column with the probability calculated from those numbers for speed, of course).

Regards,
Alban Hertroys

--
If you can't see the forest for the trees, cut the trees and you'll see there is no forest.
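
The bookkeeping described in the footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not tsearch functionality: it assumes each message has already been reduced to a list of lexemes (for example by the stemming step), and the class `LexemeStats` and its method names are hypothetical. It keeps raw counts only and derives the probability on demand, so the figure moves as new messages arrive.

```python
from collections import Counter


class LexemeStats:
    """Hypothetical helper: stores raw counts, never a fixed probability."""

    def __init__(self):
        # classification -> number of messages seen
        self.messages = Counter()
        # classification -> Counter(lexeme -> number of messages containing it)
        self.containing = {}
        # classification -> Counter(lexeme -> total number of occurrences)
        self.occurrences = {}

    def add_message(self, lexemes, classification):
        """Record one message; return its (lexeme, occurrences) counts."""
        counts = Counter(lexemes)
        self.messages[classification] += 1
        # Each distinct lexeme counts once per message here...
        self.containing.setdefault(classification, Counter()).update(counts.keys())
        # ...and once per occurrence here.
        self.occurrences.setdefault(classification, Counter()).update(counts)
        return counts

    def message_fraction(self, lexeme):
        """Fraction of all messages seen so far that contain the lexeme."""
        total = sum(self.messages.values())
        if total == 0:
            return 0.0
        hits = sum(c[lexeme] for c in self.containing.values())
        return hits / total


stats = LexemeStats()
per_message = stats.add_message(["cheap", "pill", "cheap"], "spam")
print(sorted(per_message.items()))      # [('cheap', 2), ('pill', 1)]
print(stats.message_fraction("cheap"))  # 1.0 after a single message
stats.add_message(["meet", "tomorrow"], "ham")
print(stats.message_fraction("cheap"))  # 0.5 once a message without it arrives
```

The two printed fractions reproduce the 1 and 0.5 from the footnote. A cached probability column, as mentioned there, would simply store the result of `message_fraction` and be refreshed whenever the counts change.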