Tags: python, sentiment-analysis, scikit-learn
Link: https://stackoverflow.com/questions/18154278/is-there-a-maximum-size-for-the-nltk-naive-bayes-classifer
I'm having trouble implementing a scikit-learn machine learning algorithm in my code. One of the scikit-learn authors helped me on the question I linked above, but I can't quite get it working, and since my original question was about a different issue I thought it best to open a new one.
This code takes in tweets and reads their text and sentiment into a dictionary. It then parses each line of text, appending the text to one list and its sentiment to another (as the author suggested in the question linked above).
However, despite using the code from that link and looking up the API as best I can, I think I'm missing something. Running the code below first gives a stream of colon-separated output, like this:
(0, 299) 0.270522159585
(0, 271) 0.32340892262
(0, 266) 0.361182814311
: :
(48, 123) 0.240644787937
Followed by:
['negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', etc]
And then:
ValueError: empty vocabulary; perhaps the documents only contain stop words
Am I assigning the classifier the wrong way? Here is my code:
import csv
import string
import HTMLParser

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

test_file = 'RawTweetDataset/SmallSample.csv'
#test_file = 'RawTweetDataset/Dataset.csv'
sample_tweets = 'SampleTweets/FlumeData2.txt'
csv_file = csv.DictReader(open(test_file, 'rb'), delimiter=',', quotechar='"')

# Map each tweet's text to its sentiment label
tweetsDict = {}
for line in csv_file:
    tweetsDict.update({(line['SentimentText'], line['Sentiment'])})

tweets = []
labels = []
shortenedText = ""
for (text, sentiment) in tweetsDict.items():
    # Unescape HTML entities, then strip punctuation, URLs and @mentions
    text = HTMLParser.HTMLParser().unescape(text.decode("cp1252", "ignore"))
    exclude = set(string.punctuation)
    for punct in string.punctuation:
        text = text.replace(punct, "")
    cleanedText = [e.lower() for e in text.split() if not e.startswith(('http', '@'))]
    shortenedText = [e.strip() for e in cleanedText if e not in exclude]
    text = ' '.join(ch for ch in shortenedText if ch not in exclude)
    tweets.append(text.encode("utf-8", "ignore"))
    labels.append(sentiment)

vectorizer = TfidfVectorizer(input='content')
X = vectorizer.fit_transform(tweets)
y = labels
classifier = MultinomialNB().fit(X, y)

X_test = vectorizer.fit_transform(sample_tweets)
y_pred = classifier.predict(X_test)
Update: current code:
import glob

all_files = glob.glob(tweet location)
for filename in all_files:
    with open(filename, 'r') as file:
        for line in file.readlines():
            X_test = vectorizer.transform([line])
            y_pred = classifier.predict(X_test)
            print line
            print y_pred
This always produces output like:
happy bday trish
['negative'] << Never changes, always negative
The problem is here:
X_test = vectorizer.fit_transform(sample_tweets)
fit_transform is meant to be called on the training set, not the test set. On the test set, call transform.
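As a minimal sketch of that pattern (reusing the names from your code, and assuming list_of_sample_tweets is a Python list holding the tweets read from your sample file): fit the vectorizer once on the training tweets, then only transform the new ones.

vectorizer = TfidfVectorizer(input='content')
X_train = vectorizer.fit_transform(tweets)            # learn vocabulary and IDF weights from the training tweets
classifier = MultinomialNB().fit(X_train, labels)

X_test = vectorizer.transform(list_of_sample_tweets)  # map new tweets onto the already-learned vocabulary
y_pred = classifier.predict(X_test)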
Also, sample_tweets is a file name. You should open it and read the tweets from it before passing them to the vectorizer. If you do that, you should then be able to do something like
for tweet, sentiment in zip(list_of_sample_tweets, y_pred):
    print("Tweet: %s" % tweet)
    print("Sentiment: %s" % sentiment)
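Putting that together, here is one way it could look (a sketch only, assuming each line of FlumeData2.txt is a single tweet):

with open(sample_tweets, 'r') as f:                   # sample_tweets is a file name, so open and read it
    list_of_sample_tweets = [line.strip() for line in f if line.strip()]

X_test = vectorizer.transform(list_of_sample_tweets)  # transform, not fit_transform
y_pred = classifier.predict(X_test)

for tweet, sentiment in zip(list_of_sample_tweets, y_pred):
    print("Tweet: %s" % tweet)
    print("Sentiment: %s" % sentiment)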