我目前正试图通过分析我积累的语料库来生成垃圾邮件过滤器.
我正在使用维基百科条目http://en.wikipedia.org/wiki/Bayesian_spam_filtering来开发我的分类代码.
我已经实现了代码来计算消息是垃圾邮件的概率,因为它包含一个特定的单词,通过从wiki实现以下公式:

我的PHP代码:
public function pSpaminess($word)
{
$ps = $this->pContentIsSpam();
$ph = $this->pContentIsHam();
$pws = $this->pWordInSpam($word);
$pwh = $this->pWordInHam($word);
$psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
return $psw;
}
Run Code Online (Sandbox Code Playgroud)
根据组合个体概率部分,我实现了代码,以组合测试消息中所有唯一单词的概率来确定垃圾邮件.
从维基公式:

我的PHP代码:
public function predict($content)
{
$words = $this->tokenize($content);
$pProducts = 1;
$pSums = 1;
foreach($words as $word)
{
$p = $this->pSpaminess($word);
echo "$word: $p\n";
$pProducts *= $p;
$pSums *= (1 - $p);
}
return $pProducts / ($pProducts + $pSums);
} …Run Code Online (Sandbox Code Playgroud)