N-grams with a Naive Bayes classifier

Aik*_*kin 8 python nltk n-gram

I'm new to Python and need help! I'm practicing text classification with Python NLTK. Here is the code sample I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

I tried this:

from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file ('positive.txt', 'rt') as f:
   for line in f.readlines():
       train_samples[line]='pos'

with file ('negative.txt', 'rt') as d:
   for line in d.readlines():
       train_samples[line]='neg'

f=open("test.txt", "r")
test_samples=f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs


def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item,counts in labeled_features.items():
        for label in ['neg','pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist


def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg','pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0],item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label,fname] = probdist
    return feature_probdist



labeled_features = get_labeled_features(train_samples)

label_probdist = get_label_probdist(labeled_features)

feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))

But I get this error. Why?

    Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'

use*_*743 15

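Bigram feature vectors follow exactly the same principle as unigram feature vectors. So, just as in the tutorial you mentioned, you have to check whether each bigram feature is present in any of the documents you will use. As for the bigram features and how to extract them, I have written the code for that. You can simply adapt it to change the variable "tweets" in the tutorial.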

import nltk
text = "Hi, I want to get the bigram list of this string"
# nltk.bigrams yields each pair of adjacent tokens as a tuple
for item in nltk.bigrams(text.split()):
    print(' '.join(item))

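Instead of printing them, you can simply append them to the "tweets" list and you are good to go. I hope this helps; if you still have problems, let me know.

Note that in applications such as sentiment analysis, some researchers tokenize the words and remove punctuation, while others do not. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, but an SVM will have lower accuracy. You may need to experiment with this and decide which works better on your dataset.

Edit 1: There is a book called "Natural Language Processing with Python" that I can recommend. It contains examples of bigrams as well as some exercises. However, I think you can solve this even without it. The idea behind choosing bigrams as features is that we want to know the probability that word A occurs in our corpus followed by word B. So, for example, in the sentence "I drive a truck", the word unigram features would be each of those four words, while the word bigram features would be: ["I drive", "drive a", "a truck"]. Now you want to use those 3 as your features. The function below puts all the bigrams of a string into a list named bigramFeatureVector.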

def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    # a list (not a dict), so that .append works
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

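Note that you have to write your own removePunctuation function. The output of the function above is the bigram feature vector. You then treat it in exactly the same way as the unigram feature vectors in the tutorial you mentioned.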
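For reference, here is one way removePunctuation could look, together with the step of turning the bigram list into the dict that NLTK classifiers take as a featureset. This is only a sketch under a few assumptions: it targets Python 3, "punctuation" is taken to mean the ASCII characters in string.punctuation, zip(tokens, tokens[1:]) stands in for nltk.bigrams so the snippet runs without NLTK, and bigramFeatures is a hypothetical helper name that is not part of the tutorial or of NLTK.

```python
import string


def removePunctuation(text):
    """Strip ASCII punctuation (one possible definition, not the only one)."""
    return text.translate(str.maketrans('', '', string.punctuation))


def bigramReturner(tweetString):
    """Return the bigrams of a string as a list of 'word1 word2' strings."""
    tokens = removePunctuation(tweetString.lower()).split()
    # zip(tokens, tokens[1:]) pairs each token with the one after it,
    # the same adjacent pairs that nltk.bigrams(tokens) produces.
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]


def bigramFeatures(tweetString):
    """Hypothetical helper: map each bigram to True as a dict featureset."""
    return {bigram: True for bigram in bigramReturner(tweetString)}


print(bigramReturner("I drive a truck."))
# ['i drive', 'drive a', 'a truck']
```

The key naming in bigramFeatures is an assumption; whatever scheme you pick, the keys passed to classify() have to match the feature names used when the classifier was trained.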