nltk NaiveBayesClassifier training for sentiment analysis

stu*_*001 22 python nlp nltk sentiment-analysis textblob

I am training the NaiveBayesClassifier in Python using sentences, and it gives me the error below. I don't understand what the error is; any help would be great.

I have tried many other input formats, but the error persists. The code is as follows:

from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob
train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg') ]

test = [('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ]
classifier = nltk.NaiveBayesClassifier.train(train)

I have included the traceback below.

Traceback (most recent call last):
  File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
    text = re.sub(r'^\"', r'``', text)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

πόδ*_*κύς 39

You need to change your data structure. Here is your train list as it currently stands:

>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

The problem is that the first element of each tuple should be a dictionary of features. So I have changed your list into a data structure that the classifier can work with:

>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

Your data should now be structured like this:

>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]

Note that the first element of each tuple is now a dictionary. Now that your data is in place and the first element of each tuple is a dictionary, you can train the classifier like so:

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0

If you want to use the classifier, you can do it like this. First, you start with a test sentence:

>>> test_sentence = "This is the best band I've ever heard!"

Then you tokenize the sentence and figure out which words the sentence shares with all_words. Those constitute the sentence's features.

>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features will now look like this:

>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}

Then you simply classify those features:

>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above

This test sentence appears to be positive.
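
If you also want to see how the classifier does on the test list from the question, a minimal sketch (reusing all_words, word_tokenize and the trained classifier from above; test_t is just an illustrative name) is to featurize the test sentences the same way and score them with nltk.classify.accuracy:

>>> test = [('The beer was good.', 'pos'),
...         ('I do not enjoy my job', 'neg'),
...         ("I ain't feeling dandy today.", 'neg'),
...         ("I feel amazing!", 'pos'),
...         ('Gary is a friend of mine.', 'pos'),
...         ("I can't believe I'm doing this.", 'neg')]
>>> test_t = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in test]
>>> print nltk.classify.accuracy(classifier, test_t)  # a score between 0.0 and 1.0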

  • Strings _are_ hashable and dictionaries _aren't_. This answer has it completely backwards. Just try `hash('abc')` and `hash({1:2})` in a console. The final structure will probably work, but the _why_ given doesn't make any sense. (3)

alv*_*vas 20

@275365's tutorial on the data structure for NLTK's Bayes classifier is great. From a higher level, we can look at it as follows.

We have input sentences with sentiment tags:

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

Let's consider our feature set to be individual words, so we extract a list of all possible words from the training data (let's call it the vocabulary), like this:

from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

Basically, vocabulary here is the same as @275365's all_words:

>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print vocabulary == all_words
True

From each data point (i.e. each sentence and its pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) exists or not.

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print {i:True for i in vocabulary if i in sentence}
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}

But we also want to tell the classifier which words are absent from the sentence but present in the vocabulary, so for each data point we list out all possible words in the vocabulary and say whether each word exists or not:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x =  {i:True for i in vocabulary if i in sentence}
>>> y =  {i:False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print x
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

But since that loops through the vocabulary twice, it is more efficient to do this:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:(i in sentence) for i in vocabulary}
>>> print x
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

So for each sentence, we want to tell the classifier which words exist and which do not, and also give it the pos/neg tag. We can call that a feature_set: a list of tuples made up of an x (as shown above) and its tag.

>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
>>> print feature_set
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]

Then we feed these features and tags in the feature_set to the classifier to train it:

from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)

Now you have a trained classifier, and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of its words are in the vocabulary that the classifier was trained on:

>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

NOTE: As you can see from the step above, the naive Bayes classifier cannot handle out-of-vocabulary words; the foobar token simply disappears after you featurize the sentence.
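
For instance, you can confirm that foobar is silently dropped; it is neither in the vocabulary nor among the feature keys:

>>> 'foobar' in vocabulary
False
>>> 'foobar' in featurized_test_sentence
False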

Then you feed the featurized test sentence into the classifier and ask it to classify:

>>> classifier.classify(featurized_test_sentence)
'pos'

Hopefully this gives a clearer picture of how to feed data into NLTK's naive Bayes classifier for sentiment analysis. Here is the full code without the comments and the walkthrough:

from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

classifier = nbc.train(feature_set)

test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence =  {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

print "test_sent:",test_sentence
print "tag:",classifier.classify(featurized_test_sentence)


Ste*_*e L 5

You appear to be trying to use TextBlob but are training the NLTK NaiveBayesClassifier, which, as the other answers point out, must be passed a dictionary of features.

TextBlob has a default feature extractor that indicates which words in the training set are contained in the document (as shown in the other answers). Therefore, TextBlob lets you pass in your data as is.

from textblob.classifiers import NaiveBayesClassifier

train = [('This is an amazing place!', 'pos'),
        ('I feel very good about these beers.', 'pos'),
        ('This is my best work.', 'pos'),
        ("What an awesome view", 'pos'),
        ('I do not like this restaurant', 'neg'),
        ('I am tired of this stuff.', 'neg'),
        ("I can't deal with this", 'neg'),
        ('He is my sworn enemy!', 'neg'),
        ('My boss is horrible.', 'neg') ] 
test = [
        ('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ] 


classifier = NaiveBayesClassifier(train)  # Pass in data as is
# When classifying text, features are extracted automatically
classifier.classify("This is an amazing library!")  # => 'pos'

Of course, the simple default extractor is not appropriate for all problems. If you would like to specify how features are extracted, you just write a function that takes a string of text as input and outputs the dictionary of features, then pass it to the classifier as shown below.
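
For instance, a minimal sketch of such an extractor (the name my_extractor_func and the particular features below are purely illustrative) could be:

def my_extractor_func(text):
    # Illustrative features only: flag a couple of hand-picked words
    # and whether the sentence ends with an exclamation mark.
    tokens = text.lower().split()
    return {
        'contains(amazing)': 'amazing' in tokens,
        'contains(horrible)': 'horrible' in tokens,
        'ends_with_exclamation': text.strip().endswith('!'),
    }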

classifier = NaiveBayesClassifier(train, feature_extractor=my_extractor_func)

I suggest you check out the short TextBlob classifier tutorial here: http://textblob.readthedocs.org/en/latest/classifiers.html