Posts by use*_*184

Classification using the movie review corpus in NLTK/Python

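I am trying to do some classification along the lines of chapter 6 of the NLTK book. The book seems to skip the step of creating the categories, and I am not sure what I am doing wrong. My script is below, followed by the response. My problem stems mainly from the first part: creating categories based on directory names. Some other questions here use file names (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories that I can dump files into.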

from nltk.corpus import movie_reviews

reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

all_words=nltk.FreqDist(
    w.lower() 
    for w in movie_reviews.words() 
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in  string.punctuation)
word_features = all_words.keys()[:100]

def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for …
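A minimal sketch of how directory-based categories could be derived, using only the standard `re` module so it runs without the corpus. It assumes NLTK applies `cat_pattern` to each file id relative to the corpus root (for example `pos/cv000_29590.txt`) and takes the first captured group as the category; under that assumption the pattern in the script above cannot match, because it expects a leading `/`. The names `DIR_CAT_PATTERN`, `DIR_FILEIDS_PATTERN`, and `category_of`, as well as the sample file ids, are illustrative and not taken from the question.

```python
import re

# Pattern from the question: it requires a leading "/" that relative file ids lack.
QUESTION_CAT_PATTERN = r'/(\w+)/.txt'
# Assumed alternative: capture the directory name at the start of the file id.
DIR_CAT_PATTERN = r'(\w+)/.*\.txt'
# Assumed file-id pattern: any .txt file one directory below the corpus root.
DIR_FILEIDS_PATTERN = r'\w+/.*\.txt'


def category_of(fileid, cat_pattern):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.match(cat_pattern, fileid)
    return match.group(1) if match else None


# Illustrative file ids in the "category/filename.txt" layout.
sample_ids = ['pos/cv000_29590.txt', 'neg/cv000_29416.txt']
for fileid in sample_ids:
    print(fileid,
          category_of(fileid, QUESTION_CAT_PATTERN),
          category_of(fileid, DIR_CAT_PATTERN))

# With NLTK installed, the same patterns would be passed along these lines:
# from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# reviews = CategorizedPlaintextCorpusReader(
#     './nltk_data/corpora/movie_reviews',
#     DIR_FILEIDS_PATTERN,
#     cat_pattern=DIR_CAT_PATTERN)
```

With this layout, dropping a new file into `pos/` or `neg/` would be enough to assign its category, with no need to encode the label in the file name.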

python nlp corpus nltk sentiment-analysis

Score: 13
Answers: 1
Views: 20k

NLTK stopword removal problem

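I am trying to do document classification as described in chapter 6 of the NLTK book, and I am having trouble removing stopwords. When I add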

all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))

it returns

Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'

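My guess is that the stopword code changes the type of the object bound to 'all_words', so that its .keys() method no longer works. How can I remove the stopwords before calling keys() without changing the type? The full code is below: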

import nltk 
from nltk.corpus import PlaintextCorpusReader

corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words=nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]

def document_features(document): # [_document-classify-extractor]
    document_words = set(document) # [_document-classify-set] …
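One way to keep the counting object intact is to filter stopwords while the counts are being built, instead of wrapping the finished distribution in a generator expression. The sketch below uses `collections.Counter` as a stand-in for `nltk.FreqDist` and a tiny hard-coded stopword set in place of `nltk.corpus.stopwords.words('english')`; both substitutions, and the helper name `top_words`, are assumptions made so the example runs without NLTK data. It also asks for the most frequent words explicitly through `most_common` rather than slicing `keys()`.

```python
from collections import Counter

# Tiny illustrative stopword set; the question uses
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS = frozenset({'the', 'a', 'of', 'and', 'was'})


def top_words(tokens, stopwords, n):
    """Count lowercased tokens, skipping stopwords, and return the n most frequent."""
    counts = Counter(
        w.lower() for w in tokens if w.lower() not in stopwords
    )
    return [word for word, _ in counts.most_common(n)]


tokens = ['The', 'plot', 'of', 'the', 'film', 'was', 'a',
          'plot', 'twist', 'and', 'plot', 'film']
word_features = top_words(tokens, STOPWORDS, 2)
print(word_features)
```

Building the stopword collection once as a set, rather than calling the stopword list for every token, also avoids repeated list scans on a large corpus.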

python nltk

Score: 5
Answers: 1
Views: 1400

Tag statistics

nltk ×2

python ×2

corpus ×1

nlp ×1

sentiment-analysis ×1