Counting bigrams real fast (with or without multiprocessing) - python

alv*_*vas 8 python optimization counter mapreduce n-gram

Given big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times).

According to "Fast/Optimize N-gram implementations in python", extracting bigrams like this is the most efficient approach:

_bigrams = zip(*[text[i:] for i in range(2)])

Since I'm using Python 3, the zip iterator won't be evaluated until I materialize it with list(_bigrams) or some other function that consumes it.
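To make the laziness concrete, here is a minimal sketch on a short made-up string (the sample string is an assumption for illustration, not taken from big.txt). In Python 3 the slices are still built eagerly; only the pairing done by zip is deferred until something such as list() or Counter() consumes it.

```python
from collections import Counter

text = "abcab"  # hypothetical sample; the question uses big.txt

# The slices text[0:] and text[1:] are created immediately,
# but zip() only yields pairs when it is iterated.
_bigrams = zip(*[text[i:] for i in range(2)])

# Counter consumes the iterator directly, so no intermediate list is needed.
counts = Counter(_bigrams)
print(counts.most_common(2))

# The iterator is now exhausted; a second pass needs a fresh zip().
leftover = list(_bigrams)
print(leftover)
```

Because the iterator can only be consumed once, each counting pass has to rebuild it from the current text.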

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    # Do some manipulation to text and repeat the counting.
    # manipulate() stands for that step; it is not shown here.
    text = manipulate(text, top100)

But that takes a little over 1 second per iteration, so 100,000 iterations would take far too long.
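For reference, a per-iteration figure like this can be measured with a small helper. The sketch below is only illustrative: time_bigram_count is a hypothetical name, the sample string is a stand-in for big.txt, and zip(text, text[1:]) is used as an equivalent spelling of the zip expression above.

```python
import time
from collections import Counter

def time_bigram_count(text, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` counting passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        Counter(zip(text, text[1:])).most_common(100)
        best = min(best, time.perf_counter() - start)
    return best

sample = "the quick brown fox " * 1000  # hypothetical stand-in for big.txt
elapsed = time_bigram_count(sample)
print("best of 3: %.4f s per pass" % elapsed)
print("projected for 100,000 passes: %.1f s" % (elapsed * 100000))
```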

I've also tried sklearn's CountVectorizer, but the time to extract, count and get the top100 bigrams is comparable to native Python.

Then I experimented with multiprocessing, using a slight modification of "Python multiprocessing and a shared counter" and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:

from multiprocessing import Process, Manager, Lock

import time

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter,_in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

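if __name__ == '__main__':
    # The guard matters on platforms that spawn (rather than fork) child processes.
    inputs = [1,1,1,1,2,2,3,4,4,5,2,2,3,1,2]
    print(multiproc_count(inputs))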

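But using MultiProcCounter for the bigram counting takes even longer than 1 second per iteration. I have no idea why that is the case; with the dummy list of ints, multiproc_count works perfectly.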
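One likely reason (an assumption, not something measured in the question): multiproc_count starts one Process per input item, and every increment goes through a lock and a Manager proxy, which means a round trip to another process. For millions of bigrams that overhead would swamp the counting itself. A common alternative is to split the text into a few large chunks, count each chunk with an ordinary Counter, and merge the partial results. The sketch below illustrates the idea; chunk_spans, count_chunk and chunked_bigram_counts are hypothetical names, and the mapper parameter defaults to the built-in map so the logic can be checked without starting any processes.

```python
from collections import Counter

def chunk_spans(n, n_chunks):
    """Yield (start, end) index pairs that cover range(n) in about n_chunks pieces."""
    size = max(1, -(-n // n_chunks))  # ceiling division
    for start in range(0, n, size):
        yield start, min(n, start + size)

def count_chunk(chunk):
    """Count the bigrams of one chunk with a plain, process-local Counter."""
    return Counter(zip(chunk, chunk[1:]))

def chunked_bigram_counts(text, n_chunks=4, mapper=map):
    # Each chunk carries one extra trailing character so that the bigram
    # straddling a chunk boundary is counted exactly once.
    chunks = [text[s:e + 1] for s, e in chunk_spans(len(text), n_chunks)]
    total = Counter()
    for partial in mapper(count_chunk, chunks):
        total.update(partial)
    return total

sample = "abracadabra" * 3  # hypothetical sample text
counts = chunked_bigram_counts(sample, n_chunks=4)
print(counts.most_common(3))
```

To run the chunks in parallel, the map method of a multiprocessing.Pool could be passed as mapper, for example `with Pool(4) as pool: chunked_bigram_counts(text, 4, mapper=pool.map)` inside an `if __name__ == '__main__':` guard. Whether that beats a single-process Counter depends on the cost of pickling the chunks and the partial counters, so it would need to be timed on big.txt.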

This is what I tried:

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)

Is there any way to count bigrams really fast in Python?

小智 0

My suggestion:

# Sample text; the output below corresponds to this single string
Text = "The Project Gutenberg EBook of The Adventures of Sherlock Holmes"

# Counters: a 128 x 128 table, so the text must be 7-bit ASCII
Counts = [[0 for x in range(128)] for y in range(128)]

# Perform the counting
R = ord(Text[0])
for i in range(1, len(Text)):
    L, R = R, ord(Text[i])
    Counts[L][R] += 1

# Output the results (letter pairs only: A-Z and a-z)
for i in range(ord('A'), ord('{')):
    if i < ord('[') or i >= ord('a'):
        for j in range(ord('A'), ord('{')):
            if (j < ord('[') or j >= ord('a')) and Counts[i][j] > 0:
                print(chr(i) + chr(j), Counts[i][j])


Ad 1
Bo 1
EB 1
Gu 1
Ho 1
Pr 1
Sh 1
Th 2
be 1
ck 1
ct 1
dv 1
ec 1
en 2
er 2
es 2
he 3
je 1
lm 1
lo 1
me 1
nb 1
nt 1
oc 1
of 2
oj 1
ok 1
ol 1
oo 1
re 1
rg 1
rl 1
ro 1
te 1
tu 1
ur 1
ut 1
ve 1

This version is case-sensitive; it is probably best to lowercase the whole text first.
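A minimal sketch of that lowercasing step, applied to the same table-based counting. The function name count_letter_bigrams is hypothetical, and skipping characters outside 7-bit ASCII is an assumption made here so that a text such as big.txt cannot index outside the 128 x 128 table.

```python
def count_letter_bigrams(text):
    """Count lowercase letter bigrams using a 128 x 128 table of ints."""
    counts = [[0] * 128 for _ in range(128)]
    prev = None
    for ch in text.lower():
        code = ord(ch)
        if code >= 128:  # assumption: ignore non-ASCII characters
            prev = None
            continue
        if prev is not None:
            counts[prev][code] += 1
        prev = code
    # Report only the pairs where both characters are letters a-z.
    letters = range(ord("a"), ord("z") + 1)
    return {chr(i) + chr(j): counts[i][j]
            for i in letters for j in letters if counts[i][j] > 0}

result = count_letter_bigrams("The Project Gutenberg EBook of The Adventures")
print(result["th"], result["he"], result["eb"])
```

After lowercasing, pairs such as "Th" and "EB" from the output above are folded into "th" and "eb".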