Counting bigrams (pairs of two words) in a file using Python

Question

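I want to use Python to count the occurrences of all bigrams (pairs of adjacent words) in a file. I am dealing with very large files, so I am looking for an efficient approach. I tried using the count method with the regular expression "\w+\s\w+" on the file contents, but it did not prove efficient.

For example, suppose I want to count the bigrams in a file a.txt that contains the following: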

"the quick person did not realize his speed and the quick person bumped "

For the file above, the set of bigrams and their counts would be:

(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1

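I came across an example in Python of a Counter object, which is used to count unigrams (single words). It also uses the regular-expression approach.

The example goes like this: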

>>> # Find the ten most common words in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common(10))

The output of the code above is:

[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
 ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]

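I would like to know whether a Counter object can be used to get the bigram counts. Any approach other than a Counter object or regular expressions would also be appreciated.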

Answer 1

Abh*_*kar 46

Some itertools magic:

>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
...    "the quick person did not realize his speed and the quick person bumped")
>>> # zip is lazy in Python 3 (in Python 2 this was itertools.izip)
>>> print(Counter(zip(words, islice(words, 1, None))))

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
  ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
  ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
  ('realize', 'his'): 1})

Bonus

Get the frequency of any n-gram:

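from collections import Counter
from itertools import tee, islice

def ngrams(lst, n):
  # Works on any iterable, including generators.
  tlst = lst
  while True:
    # a is consumed to build the next n-gram; b keeps the window's start
    a, b = tee(tlst)
    l = tuple(islice(a, n))
    if len(l) == n:
      yield l
      next(b)  # slide the window forward by one item
      tlst = b
    else:
      break  # fewer than n items left

>>> Counter(ngrams(words, 3))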

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
  ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
  ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
  ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
  ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

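This also works with lazy iterables and generators. So you can write a generator that reads a file line by line, yields words, and pass it to ngrams to be consumed lazily, without reading the whole file into memory.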
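
A minimal sketch of such a lazy pipeline, assuming a UTF-8 text file. The helper names iter_words, sliding_ngrams and count_ngrams are hypothetical, and the sliding window here uses a collections.deque instead of the tee-based ngrams above, so only n words are held at a time:

```python
import re
from collections import Counter, deque


def iter_words(path, encoding="utf-8"):
    """Yield words one at a time, reading the file line by line."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            yield from re.findall(r"\w+", line)


def sliding_ngrams(iterable, n):
    """Yield successive n-tuples from any iterable using a fixed-size window."""
    window = deque(maxlen=n)
    for item in iterable:
        window.append(item)  # the oldest item drops out once the window is full
        if len(window) == n:
            yield tuple(window)


def count_ngrams(path, n=2):
    """Count n-grams in a file without loading the whole file at once."""
    return Counter(sliding_ngrams(iter_words(path), n))
```

Because the words flow through as one continuous stream, n-grams that span a line break are counted as well. The Counter itself still grows with the number of distinct n-grams.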

Answer 2

st0*_*0le 11

How about zip()?

import re
from collections import Counter
words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words, words[1:])))
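
The same zip() idea extends to any n by zipping n shifted slices of the word list. A small sketch, where the function name zip_ngrams is made up for illustration; each slice copies the list, so this suits word lists that fit comfortably in memory:

```python
from collections import Counter


def zip_ngrams(words, n):
    """Return an iterator of n-tuples of adjacent items from a list."""
    # Zip words[0:], words[1:], ..., words[n-1:] together.
    # zip stops at the shortest slice, so no partial tuples appear.
    return zip(*(words[i:] for i in range(n)))


words = "the quick person did not realize his speed and the quick person bumped".split()
trigram_counts = Counter(zip_ngrams(words, 3))
print(trigram_counts.most_common(1))  # [(('the', 'quick', 'person'), 2)]
```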

Answer 3

Kri*_*673 5

You can simply use Counter for any n-gram, like so:

from collections import Counter
from nltk.util import ngrams 

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
         ('did', 'not'): 1,
         ('his', 'speed'): 1,
         ('not', 'realize'): 1,
         ('person', 'bumped'): 1,
         ('person', 'did'): 1,
         ('quick', 'person'): 2,
         ('realize', 'his'): 1,
         ('speed', 'and'): 1,
         ('the', 'quick'): 2})

For 3-grams, just change n_gram to 3:

n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
         ('did', 'not', 'realize'): 1,
         ('his', 'speed', 'and'): 1,
         ('not', 'realize', 'his'): 1,
         ('person', 'did', 'not'): 1,
         ('quick', 'person', 'bumped'): 1,
         ('quick', 'person', 'did'): 1,
         ('realize', 'his', 'speed'): 1,
         ('speed', 'and', 'the'): 1,
         ('the', 'quick', 'person'): 2})

Answer 4

Xav*_*hot 5

Starting with Python 3.10, the new pairwise function provides a way to slide through pairs of consecutive elements, so your use case becomes:

from itertools import pairwise
import re
from collections import Counter

text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall(r'\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})

Details of the intermediate results:

re.findall(r'\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
list(pairwise(re.findall(r'\w+', text)))  # pairwise itself returns a lazy iterator
# [('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...]
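
On Python versions before 3.10, a fallback along the lines of the pairwise recipe from the itertools documentation can stand in. This is only a sketch, and the name pairwise_compat is chosen here to avoid shadowing itertools.pairwise:

```python
import re
from collections import Counter
from itertools import tee


def pairwise_compat(iterable):
    """Return an iterator of overlapping pairs: (s0, s1), (s1, s2), ..."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one element
    return zip(first, second)


text = "the quick person did not realize his speed and the quick person bumped "
bigram_counts = Counter(pairwise_compat(re.findall(r"\w+", text)))
print(bigram_counts[("the", "quick")])  # 2
```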

Archived:	13 years, 4 months ago
Views:	23631
Last activity:	6 years, 11 months ago