Zen*_*ega 2 python memory string
Based on advice I received on this forum, I am using the following code (example) to count occurrences of phrases in a text.
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey', 'turnpike. when i took exit 39 there was no', 'lake. i drove my car on muddy roads which turned my red', 'car into brown. driving on newjersey turnpike can be confusing.']
text = " ".join(lines)
counts = {phrase: text.count(phrase) for phrase in phrase_words}  # named counts to avoid shadowing the built-in dict
The desired output, which is also what the sample code produces, is:
{'newjersey turnpike': 2, 'red car': 2, 'no lake': 1}
This code works well on a text file smaller than 300MB. When I used a text file of 500MB+, I received the following memory error:
y=' '.join(lines)
MemoryError
How can I overcome this? Thanks for your help!
This algorithm needs only two lines in memory at a time. It assumes that no phrase spans three lines:
from itertools import tee  # on Python 2, also import izip and use it in place of zip
from collections import defaultdict

def pairwise(iterable):  # recipe from itertools docs
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

d = defaultdict(int)
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']

for line1, line2 in pairwise(lines):
    both_lines = ' '.join((line1, line2))
    for phrase in phrase_words:
        # counts phrases in first line and those that span to the next
        d[phrase] += both_lines.count(phrase) - line2.count(phrase)

# the last line is never the first element of a pair, so search it separately
for phrase in phrase_words:
    d[phrase] += line2.count(phrase)
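To use the same idea on a large file, the lines can be streamed instead of held in a list. The following is a minimal sketch, not a definitive implementation. The function name `count_phrases` is hypothetical and introduced only for illustration. It keeps the same assumption that no phrase spans three lines, and it replaces the `pairwise` helper with a remembered previous line. The `io.StringIO` object stands in for a real file here. With an actual file you would pass the open file object, which yields one line at a time.

```python
import io
from collections import defaultdict


def count_phrases(lines, phrases):
    """Count phrases in an iterable of lines, keeping two lines in memory.

    Assumption: no phrase spans more than two consecutive lines.
    """
    counts = defaultdict(int)
    prev = None
    for raw in lines:
        line = raw.rstrip("\n")  # file objects yield lines with a trailing newline
        if prev is not None:
            both = prev + " " + line
            for phrase in phrases:
                # matches inside prev, plus matches that straddle the line break
                counts[phrase] += both.count(phrase) - line.count(phrase)
        prev = line
    if prev is not None:
        # the final line is never the first of a pair, so count it separately
        for phrase in phrases:
            counts[phrase] += prev.count(phrase)
    return dict(counts)


phrase_words = ['red car', 'no lake', 'newjersey turnpike']
sample = io.StringIO(
    "i have a red car which i drove on newjersey\n"
    "turnpike. when i took exit 39 there was no\n"
    "lake. i drove my car on muddy roads which turned my red\n"
    "car into brown. driving on newjersey turnpike can be confusing.\n"
)
result = count_phrases(sample, phrase_words)
print(result)  # {'red car': 2, 'no lake': 1, 'newjersey turnpike': 2}
```

Because only the current and previous line are ever joined, memory use depends on the longest line rather than on the size of the file. An input with a single line is handled by the final loop, and an empty input returns an empty dict.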