Identify words that appear in less than 1% of the corpus documents

ash*_*nty 2 python counter nlp nltk tf-idf

I have a corpus of customer reviews and want to identify rare words, which for my purposes means words that appear in less than 1% of the corpus documents.

I already have a working solution, but it is much too slow for my script:

# Review data is a nested list of reviews, each represented as a bag of words
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2'], ..] 

# Save all words of the corpus in a set
all_words = {w for doc in doc_clean for w in doc}

# Initialize a list for the collection of rare words
rare_words = []

# Loop through all_words to identify rare words
for word in all_words:

    # Count in how many reviews the word appears
    counts = sum(word in set(review) for review in doc_clean)

    # Add word to rare_words if it appears in less than 1% of the reviews
    if counts / len(doc_clean) < 0.01:
        rare_words.append(word)

Does anyone know of a faster implementation? Iterating over every word for each individual review seems very time-consuming.

Thanks in advance and best regards, Markus

DYZ*_*DYZ 5

This may not be the most efficient solution, but it is easy to understand and maintain, and I use it often myself. I use Counter and Pandas:

import pandas as pd
from collections import Counter

Apply Counter to each document and build a term-frequency matrix:

df = pd.DataFrame(list(map(Counter, doc_clean)))
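
For two example reviews like those in the question, the frame has one row per document and one column per word. A quick way to inspect it (shown for a hypothetical two-review sample; exact formatting may vary by pandas version):

# Hypothetical sample based on the question's data
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2']]
df = pd.DataFrame(list(map(Counter, doc_clean)))
print(df)
#    This  is  review    1    2
# 0     1   1       1  1.0  NaN
# 1     1   1       1  NaN  1.0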

Some fields in the matrix are undefined (NaN); they correspond to words that do not appear in a particular document. Count in how many documents each word occurs:

counts = df.notnull().sum()
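
Applied to the two-review sample above, this yields one document count per word:

print(counts)
# This      2
# is        2
# review    2
# 1         1
# 2         1
# dtype: int64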

Now, select the words that occur infrequently:

rare_words = counts[counts < 0.01 * len(doc_clean)].index.tolist()
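
If the corpus is too large for a dense DataFrame to fit in memory, the document frequencies can also be computed in a single pass with collections.Counter alone; a minimal sketch, assuming the same doc_clean structure as in the question:

from collections import Counter

# Count each word at most once per review (document frequency)
doc_freq = Counter(word for review in doc_clean for word in set(review))

# Keep words that appear in less than 1% of the reviews
threshold = 0.01 * len(doc_clean)
rare_words = [word for word, n_docs in doc_freq.items() if n_docs < threshold]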