Eng*_*rad — python, memory-optimization, text, list
I'm running into speed problems with a very large list. I have a file full of misspellings and very strange words, and I'm trying to use difflib to find the closest match for each one in a 650,000-word dictionary file I have. The approach below works well but is very slow, and I'd like to know whether there's a better way to go about this. Here's the code:
from difflib import SequenceMatcher

headwordList = []  # This is a list of 650,000 words

sentenceList = []
openFile = open("sentences.txt", "r")
for line in openFile:
    sentenceList.append(line.strip())

count = 0
for y in sentenceList:
    if y not in headwordList:
        percentage = 0
        for x in headwordList:
            m = SequenceMatcher(None, y.lower(), x)
            if m.ratio() > percentage:
                percentage = m.ratio()
                word = x
        if percentage > 0.86:
            sentenceList[count] = word
    count = count + 1
Thanks for your help — software engineering isn't even my strong suit. Much appreciated.
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file as efficiently as possible.
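In case the linked answer is unavailable: the usual memory-efficient pattern is to iterate over the file object directly, so Python reads one line at a time instead of loading the whole file. A minimal sketch (the small sample file written first is just demo setup standing in for the real sentences.txt):

```python
# Demo setup: write a tiny sample file standing in for the
# large sentences.txt from the question.
with open("sentences.txt", "w") as f:
    f.write("helo\nwrold\n")

# Iterate over the file object directly: the file is read line by
# line, so it never has to fit in memory all at once.
with open("sentences.txt", "r") as f:
    sentenceList = [line.strip() for line in f]
```

The `with` block also closes the file automatically, which the original code omits.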
2) Change your code from
for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lowercase 650,000 times — once per dictionary word. No need for that.
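Not part of the original answer, but worth knowing: difflib ships a helper, `get_close_matches`, that performs this exact scan for you and prunes candidates with the cheaper `quick_ratio` bounds before computing the full ratio, so the hand-written inner loop can collapse to one call. A minimal sketch, with toy lists standing in for the 650,000-word `headwordList` and the real `sentenceList`:

```python
from difflib import get_close_matches

# Toy stand-ins for the question's 650,000-word dictionary
# and the sentence list read from sentences.txt.
headwordList = ["apple", "banana", "cherry", "grape"]
sentenceList = ["bananna", "apple", "chery"]

for i, y in enumerate(sentenceList):
    if y not in headwordList:
        # n=1 asks for the single best match; cutoff=0.86 mirrors
        # the threshold in the question. Returns [] if no candidate
        # reaches the cutoff.
        matches = get_close_matches(y.lower(), headwordList, n=1, cutoff=0.86)
        if matches:
            sentenceList[i] = matches[0]
```

Converting `headwordList` to a `set` for the `y not in headwordList` membership test would also help, since that check is linear over a plain list.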
Views: 4938