I have to loop over a large file of about 2 million lines that looks like this:
P61981 1433G_HUMAN
P61982 1433G_MOUSE
Q5RC20 1433G_PONAB
P61983 1433G_RAT
P68253 1433G_SHEEP
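At the moment I have the following function. It takes each entry in a list and, if the entry occurs in a line of the large file, keeps that line. It is slow, though (~10 minutes), probably because of the nested loops. Can you suggest an optimization?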
up = "database.txt"
def mplist(somelist):
    newlist = []
    with open(up) as U:
        for row in U:
            for i in somelist:
                if i in row:
                    newlist.append(row)
    return newlist
Example somelist:
somelist = [
'P68250',
'P31946',
'Q4R572',
'Q9CQV8',
'A4K2U9',
'P35213',
'P68251'
]
If your somelist contains only values found in the first column, split each line and test just the first value against a set rather than a list:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        return [line for line in U if line.split(None, 1)[0] in someset]
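A membership test against a set is an O(1) constant-time operation on average (independent of the size of the set), whereas a test against a list has to scan its elements one by one.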
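To make the difference concrete, here is a rough timing sketch. The IDs are made up for illustration and are not taken from the real database, and the absolute numbers will vary by machine; only the gap between the two lookups matters.

```python
import timeit

# Hypothetical accession-like IDs, generated only for this timing sketch.
ids = ["ID%06d" % n for n in range(20000)]
id_list = list(ids)
id_set = set(ids)
probe = "ID019999"  # last element: the worst case for a list scan

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in id_list, number=200)
set_time = timeit.timeit(lambda: probe in id_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The list lookup walks through all 20000 entries before it finds the probe, while the set lookup hashes the probe and checks a single bucket in the typical case.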
Demo:
>>> up = '/tmp/database.txt'
>>> open(up, 'w').write('''\
... P61981 1433G_HUMAN
... P61982 1433G_MOUSE
... Q5RC20 1433G_PONAB
... P61983 1433G_RAT
... P68253 1433G_SHEEP
... ''')
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as U:
...         return [line for line in U if line.split(None, 1)[0] in someset]
...
>>> mplist(['P61981', 'Q5RC20'])
['P61981 1433G_HUMAN\n', 'Q5RC20 1433G_PONAB\n']
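One caveat: `line.split(None, 1)[0]` raises an IndexError on an empty or whitespace-only line, because the split returns an empty list. If the real file may contain such lines, a variant along these lines could skip them. This is only a sketch; the function name `mplist_skip_blank`, its `path` parameter, and the temporary sample file are assumptions made for illustration.

```python
import os
import tempfile

def mplist_skip_blank(path, somelist):
    """Return lines whose first column is in somelist, ignoring blank lines."""
    someset = set(somelist)
    matches = []
    with open(path) as handle:
        for line in handle:
            fields = line.split(None, 1)
            # A blank line splits to [], so check before indexing into it.
            if fields and fields[0] in someset:
                matches.append(line)
    return matches

# Small throwaway file with a blank line in the middle.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("P61981 1433G_HUMAN\n\nQ5RC20 1433G_PONAB\n")
    sample_path = tmp.name

result = mplist_skip_blank(sample_path, ["P61981", "Q5RC20"])
os.remove(sample_path)
print(result)
```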
You may want to return a generator instead, so that the lines are only filtered and no list is built in memory:
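def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        # Yield inside the with block so the file is still open
        # while the caller is consuming the generator.
        for line in U:
            if line.split(None, 1)[0] in someset:
                yield line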
You can loop over this result, but you cannot index it:
for match in mplist(somelist):
    # do something with match
    pass
This way, the matching entries never all have to be held in memory at once.
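As an illustration of consuming matches lazily, the sketch below takes only the first two matches and then stops. The generator name `iter_matches`, its `path` parameter, and the temporary sample file are assumptions introduced for this example, not part of the original function.

```python
import os
import tempfile
from itertools import islice

def iter_matches(path, ids):
    """Yield lines from path whose first column is in ids, one at a time."""
    wanted = set(ids)
    with open(path) as handle:
        for line in handle:
            fields = line.split(None, 1)
            if fields and fields[0] in wanted:
                yield line

# Small throwaway file standing in for the real database.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(
        "P61981 1433G_HUMAN\n"
        "P61982 1433G_MOUSE\n"
        "Q5RC20 1433G_PONAB\n"
        "P61983 1433G_RAT\n"
    )
    sample_path = tmp.name

matches = iter_matches(sample_path, ["P61981", "Q5RC20", "P61983"])
first_two = list(islice(matches, 2))  # stops iterating after the second match
matches.close()  # exits the with block inside the generator, closing the file
os.remove(sample_path)
print(first_two)
```

Because the generator is closed after two matches, the third matching line is never yielded, and no full list of matches is ever built.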