我想用python计算文件中所有bigrams(一对相邻单词)的出现次数.在这里,我正在处理非常大的文件,所以我正在寻找一种有效的方法.我尝试在文件内容上使用带有正则表达式"\ w +\s\w +"的count方法,但它没有被证明是有效的.
例如,假设我要计算文件a.txt中的双字母数,其中包含以下内容:
"the quick person did not realize his speed and the quick person bumped "
Run Code Online (Sandbox Code Playgroud)
对于上面的文件,bigram集和它们的计数将是:
(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did, not) = 1
(not, realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person, bumped) = 1
Run Code Online (Sandbox Code Playgroud)
我在Python中遇到了一个Counter对象的例子,它用于计算unigrams(单个单词).它还使用正则表达式方法.
这个例子是这样的:
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall('\w+', open('a.txt').read())
>>> print Counter(words)
Run Code Online (Sandbox Code Playgroud)
上面代码的输出是:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
('realize', 1), ('his', 1), ('speed', 1), ('bumped', 1)]
Run Code Online (Sandbox Code Playgroud)
我想知道是否可以使用Counter对象来获取bigrams的数量.除了Counter对象或正则表达式之外的任何方法也将受到赞赏.
Abh*_*kar 46
一些itertools
魔力:
>>> import re
>>> from itertools import islice, izip
>>> words = re.findall("\w+",
"the quick person did not realize his speed and the quick person bumped")
>>> print Counter(izip(words, islice(words, 1, None)))
Run Code Online (Sandbox Code Playgroud)
输出:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1,
('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1,
('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1,
('realize', 'his'): 1})
Run Code Online (Sandbox Code Playgroud)
奖金
获取任何n-gram的频率:
from itertools import tee, islice
def ngrams(lst, n):
tlst = lst
while True:
a, b = tee(tlst)
l = tuple(islice(a, n))
if len(l) == n:
yield l
next(b)
tlst = b
else:
break
>>> Counter(ngrams(words, 3))
Run Code Online (Sandbox Code Playgroud)
输出:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1,
('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1,
('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1,
('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1,
('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
Run Code Online (Sandbox Code Playgroud)
这也适用于懒惰的迭代和生成器.因此,您可以编写一个生成器,该生成器逐行读取文件,生成单词,并将其传递ngarms
给懒惰地使用,而无需读取内存中的整个文件.
st0*_*0le 11
怎么样zip()
?
import re
from collections import Counter
words = re.findall('\w+', open('a.txt').read())
print(Counter(zip(words,words[1:])))
Run Code Online (Sandbox Code Playgroud)
您可以简单地Counter
用于任何 n_gram,如下所示:
from collections import Counter
from nltk.util import ngrams
text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
('did', 'not'): 1,
('his', 'speed'): 1,
('not', 'realize'): 1,
('person', 'bumped'): 1,
('person', 'did'): 1,
('quick', 'person'): 2,
('realize', 'his'): 1,
('speed', 'and'): 1,
('the', 'quick'): 2})
Run Code Online (Sandbox Code Playgroud)
对于 3 克,只需将 更改n_gram
为 3:
n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
('did', 'not', 'realize'): 1,
('his', 'speed', 'and'): 1,
('not', 'realize', 'his'): 1,
('person', 'did', 'not'): 1,
('quick', 'person', 'bumped'): 1,
('quick', 'person', 'did'): 1,
('realize', 'his', 'speed'): 1,
('speed', 'and', 'the'): 1,
('the', 'quick', 'person'): 2})
Run Code Online (Sandbox Code Playgroud)
从 开始Python 3.10
,新pairwise
函数提供了一种滑动连续元素对的方法,这样您的用例就变成了:
from itertools import pairwise
import re
from collections import Counter
# text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall('\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})
Run Code Online (Sandbox Code Playgroud)
中间结果的详细信息:
re.findall('\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
pairwise(re.findall('\w+', text))
# [('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...]
Run Code Online (Sandbox Code Playgroud)