Create a dictionary for each word in a file and count the frequency of the words that follow it

Kri*_*tie 9 python counter dictionary nltk n-gram

I'm struggling with a tricky problem and getting lost.

Here is what I'm supposed to do:

INPUT: file
OUTPUT: dictionary

Return a dictionary whose keys are all the words in the file (broken by
whitespace). The value for each word is a dictionary containing each word
that can follow the key and a count for the number of times it follows it.

You should lowercase everything.
Use strip and string.punctuation to strip the punctuation from the words.

Example:
>>> #example.txt is a file containing: "The cat chased the dog."
>>> with open('../data/example.txt') as f:
...     word_counts(f)
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}

Here is what I have so far, just trying to at least pull out the right words:

def word_counts(f):
    i = 0
    orgwordlist = f.split()
    for word in orgwordlist:
        if i<len(orgwordlist)-1:
            print orgwordlist[i]
            print orgwordlist[i+1]

with open('../data/example.txt') as f:
    word_counts(f)

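I think I need to use the .count method somehow and eventually zip some dictionaries together, but I'm not sure how to count the second words for each first word.

I know I'm nowhere near solving the problem, but I'm trying to take it one step at a time. Any help is appreciated, even just a hint pointing in the right direction.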

Wil*_*sem 7

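We can solve this in two passes:

  1. In the first pass, we construct a Counter and count the tuples of two consecutive words using zip(..); and
  2. then we convert that Counter into a dictionary of dictionaries.

This results in the following code: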

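from collections import Counter, defaultdict

def word_counts(f):
    # Lowercase the whole file and split it into whitespace-separated words.
    st = f.read().lower().split()
    # First pass: count every pair of consecutive words.
    ctr = Counter(zip(st, st[1:]))
    # Second pass: regroup the pair counts into a dictionary of dictionaries.
    dc = defaultdict(dict)
    for (k1, k2), v in ctr.items():
        dc[k1][k2] = v
    return dict(dc)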
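Note that the function above does not strip punctuation, so on the question's example file the last word would come out as 'dog.' rather than 'dog'. A minimal sketch of the same two-pass idea with the cleanup step added might look like the following. The name word_counts_clean is hypothetical, io.StringIO stands in for the real file, and dropping tokens that consist only of punctuation is my assumption, since the task does not say what to do with them.

```python
import io
import string
from collections import Counter, defaultdict

def word_counts_clean(f):
    # Lowercase, split on whitespace, then strip punctuation from both ends of each token.
    words = [w.strip(string.punctuation) for w in f.read().lower().split()]
    # Assumption: tokens that were pure punctuation (now empty) are dropped.
    words = [w for w in words if w]
    # First pass: count every pair of consecutive words.
    pair_counts = Counter(zip(words, words[1:]))
    # Second pass: regroup the pair counts by their first word.
    result = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        result[first][second] = n
    return dict(result)

# io.StringIO is used here in place of open('../data/example.txt').
example = io.StringIO("The cat chased the dog.")
print(word_counts_clean(example))
```

Because strip only trims the ends of each token, punctuation inside a word (for example the apostrophe in "don't") is left alone.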


jua*_*aga 5

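We can do this in one pass:

  1. Use a defaultdict as the counter.
  2. Iterate over the bigrams, counting in place.

So... for brevity, we'll leave out the normalization and cleaning: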

>>> from collections import defaultdict
>>> counter = defaultdict(lambda: defaultdict(int))
>>> s = 'the dog chased the cat'
>>> tokens = s.split()
>>> from itertools import islice
>>> for a, b in zip(tokens, islice(tokens, 1, None)):
...     counter[a][b] += 1
...
>>> counter
defaultdict(<function <lambda> at 0x102078950>, {'the': defaultdict(<class 'int'>, {'cat': 1, 'dog': 1}), 'dog': defaultdict(<class 'int'>, {'chased': 1}), 'chased': defaultdict(<class 'int'>, {'the': 1})})

More readable output:

>>> {k:dict(v) for k,v in counter.items()}
{'the': {'cat': 1, 'dog': 1}, 'dog': {'chased': 1}, 'chased': {'the': 1}}
>>>
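For completeness, here is a rough sketch of how the one-pass idea could be wrapped into a function that takes a file object and includes the normalization and cleaning that were left out above. The name word_counts_one_pass is hypothetical. It reads the file line by line and remembers the previous word instead of building a token list, which is one possible design and not the only one. As an assumption, a word at the end of one line is treated as being followed by the first word of the next line, which matches what splitting the whole file on whitespace would give.

```python
import io
import string
from collections import defaultdict

def word_counts_one_pass(f):
    # Nested defaultdict: counter[first][second] starts at 0 automatically.
    counter = defaultdict(lambda: defaultdict(int))
    prev = None  # the previous cleaned word, carried across line boundaries
    for line in f:
        for raw in line.lower().split():
            word = raw.strip(string.punctuation)
            if not word:
                # Assumption: skip tokens that were only punctuation.
                continue
            if prev is not None:
                counter[prev][word] += 1
            prev = word
    # Convert the nested defaultdicts to plain dicts for readable output.
    return {k: dict(v) for k, v in counter.items()}

# io.StringIO stands in for a real file opened with open(...).
sample = io.StringIO("The cat chased the dog.\nThe dog ran.")
print(word_counts_one_pass(sample))
```

Since only the previous word is kept in memory, this variant never holds the full word list, which may matter for large files.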