Counting phrase frequency in Python 3.3.2

Question

Rau*_*aul 6 python frequency count phrase python-3.x

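I have been looking through various sources online and have tried several approaches, but I can only find how to count the frequency of unique words, not unique phrases. My code so far is as follows: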

import collections
import re

wanted = {'inflation', 'gold', 'bank'}
cnt = collections.Counter()
# Read the whole file, lowercase it, and split it into words.
with open('02.2003.BenBernanke.txt') as infile:
    words = re.findall(r'\w+', infile.read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

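If possible, I would also like to count how many times the phrases "central bank" and "high inflation" are used in this text. I appreciate any suggestions or guidance you can give.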

Answer 1

ins*_*get 2

First, this is how I would build the cnt you already have (to reduce memory overhead):

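import collections
import re

def findWords(filepath):
  # Yield lowercase words line by line, so the whole file is never held in memory.
  with open(filepath) as infile:
    for line in infile:
      words = re.findall(r'\w+', line.lower())
      yield from words

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))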

Now, for your question about phrases:

from itertools import tee

phrases = {'central bank', 'high inflation'}
# Two independent iterators over the same word stream; the second runs one word ahead.
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    cnt[phrase] += 1
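
The loop above only matches two-word phrases. As a rough sketch of how the same sliding-window idea might extend to phrases of any length, the following counts word n-grams over an in-memory string. The function name `count_phrases` and the sample sentence are assumptions made for illustration and are not part of the original answer; unlike the generator version, this sketch holds the full word list in memory.

```python
import collections
import re


def count_phrases(text, phrases):
    """Count occurrences of each phrase (space-separated words) in text."""
    words = re.findall(r'\w+', text.lower())
    # Represent every target phrase as a tuple of lowercase words.
    targets = {tuple(p.lower().split()) for p in phrases}
    lengths = {len(t) for t in targets}
    counts = collections.Counter()
    for n in lengths:
        # Slide a window of n consecutive words across the text.
        for i in range(len(words) - n + 1):
            window = tuple(words[i:i + n])
            if window in targets:
                counts[' '.join(window)] += 1
    return counts


# Hypothetical sample text, used only to exercise the function.
sample = "The central bank fears high inflation. A central bank acts."
print(count_phrases(sample, ['central bank', 'high inflation', 'inflation']))
```

Because single words are just phrases of length one, the same call can cover both the word counts and the phrase counts from the question.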

Hope this helps

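This code does not produce the result the OP wants. Try your code with "central bank high inflation" as the file contents. You may need to use something like `itertools.tee`. See the "pairwise" recipe in [`itertools Recipes`](http://docs.python.org/2/library/itertools.html#recipes). (2 upvotes)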
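
For reference, here is a minimal sketch of the "pairwise" recipe from the itertools documentation, applied to the test input suggested in the comment. The in-memory `words` list stands in for the file contents and is an assumption for illustration.

```python
import collections
from itertools import tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ... (the itertools 'pairwise' recipe)."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)


# Stand-in for the words read from a file containing this one line.
words = "central bank high inflation".split()
phrases = {'central bank', 'high inflation'}

cnt = collections.Counter()
for w1, w2 in pairwise(words):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1

print(cnt)
```

Each word is paired with its successor, so the overlapping pair ("bank", "high") is examined and ignored, while both target phrases are counted once.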

Archived: 11 years, 10 months ago
Views: 3535
Last recorded: 5 years, 11 months ago