Posts by ash*_*nty

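How to efficiently count bigrams across multiple documents in Python

I have a set of text documents and want to count the bigrams across all of them.

First, I create a list in which each element is itself a list holding the words of one particular document: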

print(doc_clean)
# [['This', 'is', 'the', 'first', 'doc'], ['And', 'this', 'is', 'the', 'second'], ..]


Then I extract the bigrams document by document and store them in a single list:

bigrams = []
for doc in doc_clean:
    bigrams.extend([(doc[i-1], doc[i]) 
                   for i in range(1, len(doc))])
print(bigrams)
# [('This', 'is'), ('is', 'the'), ..]


Now I want to count the frequency of each unique bigram:

bigrams_freq = [(b, bigrams.count(b)) 
                for b in set(bigrams)]


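In general this approach works, but it is far too slow. The list of bigrams is quite large, with about 5 million entries in total and about 300k unique bigrams. On my laptop, the current approach takes too long for the analysis.

Thanks for your help!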
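
One possible direction, offered as a sketch rather than a definitive answer: `bigrams.count(b)` rescans the whole list once per unique bigram, so the work grows with the product of the list length and the number of unique bigrams. Counting in a single pass with `collections.Counter` from the standard library avoids the repeated scans. The helper name `count_bigrams` and the two-document sample corpus are illustrative assumptions, not part of the original post.

```python
from collections import Counter


def count_bigrams(docs):
    """Count bigrams over all documents in one pass (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        # zip pairs each word with its successor, so bigrams never span documents
        counts.update(zip(doc, doc[1:]))
    return counts


# Illustrative sample data, mirroring the shape of doc_clean in the question
doc_clean = [['This', 'is', 'the', 'first', 'doc'],
             ['And', 'this', 'is', 'the', 'second']]

bigrams_freq = count_bigrams(doc_clean)
print(bigrams_freq[('is', 'the')])  # 2
```

If the `(bigram, count)` pair list from the question is needed, `list(bigrams_freq.items())` yields that shape, and `bigrams_freq.most_common(10)` returns the ten most frequent bigrams.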

python nlp nltk

ash*_*nty

lucky-day

3
votes

1
answer

2915
views

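Identify words that appear in less than 1% of the corpus documents

I have a corpus of customer reviews and want to identify rare words, which for me are words that appear in less than 1% of the corpus documents.

I already have a working solution, but it is too slow for my script: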

# Review data is a nested list of reviews, each represented as a bag of words
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2'], ..] 

# Save all words of the corpus in a set
all_words = set([w for doc in doc_clean for w in doc])

# Initialize a list for the collection of rare words
rare_words = []

# Loop through all_words to identify rare words
for word in all_words:

    # Count in how many …

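
The loop in the excerpt above is cut off, so the following is only a sketch of one way to reach the stated goal (words that occur in fewer than 1% of documents), not a reconstruction of the original code. It counts each word once per document with `collections.Counter` in a single pass over the corpus. The function name `find_rare_words`, the `threshold` parameter, and the generated sample reviews are assumptions made for illustration.

```python
from collections import Counter


def find_rare_words(docs, threshold=0.01):
    """Return words occurring in fewer than `threshold` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(doc) ensures each word is counted at most once per document
        doc_freq.update(set(doc))
    cutoff = threshold * len(docs)
    return {word for word, n_docs in doc_freq.items() if n_docs < cutoff}


# Illustrative corpus: 'This' is in every review, each number in only one
reviews = [['This', 'is', 'review', str(i)] for i in range(200)]
rare_words = find_rare_words(reviews)
print('7' in rare_words, 'This' in rare_words)  # True False
```

Returning a set keeps later membership checks such as `word in rare_words` fast when filtering the corpus.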

python counter nlp nltk tf-idf

ash*_*nty

2018-06-25

2
votes

1
answer

42
views

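Cannot slice a list after modifying __getitem__

For a specific use case, I want to define a list class that returns 0 if the index is negative or out of range.

My current approach already serves that purpose: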

class mlist(list):
    def __getitem__(self, n):
        if (len(self)<=n) or (n<0):
            return 0
        return super(mlist, self).__getitem__(n)

l = mlist([1,2,3,4])
l[-2]
>>> 0
l[10]
>>> 0


Unfortunately, it leads to unwanted behavior when slicing the list:

l[0:2]
>>> TypeError: '<=' not supported between instances of 'int' and 'slice'


Is there a way to fix this?
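
A minimal sketch of one possible fix, under the assumption that slices should keep standard list behavior: the `TypeError` arises because `l[0:2]` passes a `slice` object to `__getitem__`, which then gets compared with an integer. Checking for `slice` first and delegating to `list` sidesteps the comparison. Wrapping the result in `mlist` is a design choice made here, not something the question asks for.

```python
class mlist(list):
    """List that returns 0 for negative or out-of-range integer indices."""

    def __getitem__(self, n):
        if isinstance(n, slice):
            # Delegate slices to list and wrap the result so it stays an mlist
            return mlist(super().__getitem__(n))
        if n < 0 or n >= len(self):
            return 0
        return super().__getitem__(n)


l = mlist([1, 2, 3, 4])
print(l[-2], l[10], l[1])  # 0 0 2
print(l[0:2])              # [1, 2]
```

Note that negative slice bounds such as `l[-2:]` still follow the usual list semantics in this sketch; only single integer indices get the zero fallback.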

python

ash*_*nty

2020-02-20

1
vote

1
answer

62
views

Tag statistics

python ×3

nlp ×2

nltk ×2

counter ×1

tf-idf ×1
