Python: looping over 2 million rows

Joh*_*aph 0 python loops

I have to loop over a large file with 2 million rows that looks like this:

P61981  1433G_HUMAN
P61982  1433G_MOUSE
Q5RC20  1433G_PONAB
P61983  1433G_RAT
P68253  1433G_SHEEP

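Currently I have the following function. It takes each entry in a list and, if the entry appears in this large file, collects the row in which it occurs. It is slow, though (~10 minutes), probably because of the looping scheme. Can you suggest an optimization?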

up = "database.txt"

def mplist(somelist):
    newlist = []
    with open(up) as U:
        for row in U:
            for i in somelist:
                if i in row:
                    newlist.append(row)
    return newlist

Example somelist:

somelist = [
    'P68250',
    'P31946',
    'Q4R572',
    'Q9CQV8',
    'A4K2U9',
    'P35213',
    'P68251'
]

Mar*_*ers 6

If your somelist only contains values found in the first column, split the line and test only the first value against a set, not a list:

def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        return [line for line in U if line.split(None, 1)[0] in someset]

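Testing membership in a set is an O(1) constant-time operation on average (independent of the size of the set), whereas testing against a list has to scan the list entry by entry.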
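To make the difference concrete, here is a minimal sketch that runs both approaches on the five sample rows from the question. The names `SAMPLE`, `filter_nested` and `filter_with_set` are hypothetical, introduced only for this illustration, and `io.StringIO` stands in for the real file. The nested version does one substring scan per (row, entry) pair, while the set version does one hash lookup per row.

```python
import io

# Hypothetical in-memory stand-in for database.txt (rows taken from the question)
SAMPLE = (
    "P61981  1433G_HUMAN\n"
    "P61982  1433G_MOUSE\n"
    "Q5RC20  1433G_PONAB\n"
    "P61983  1433G_RAT\n"
    "P68253  1433G_SHEEP\n"
)

def filter_nested(lines, wanted):
    # Approach from the question: one substring scan per (row, entry) pair
    result = []
    for row in lines:
        for entry in wanted:
            if entry in row:
                result.append(row)
    return result

def filter_with_set(lines, wanted):
    # Approach from the answer: one set lookup per row, on the first column only
    wanted_set = set(wanted)
    return [row for row in lines if row.split(None, 1)[0] in wanted_set]

wanted = ["P61981", "Q5RC20", "P68250"]
nested = filter_nested(io.StringIO(SAMPLE), wanted)
fast = filter_with_set(io.StringIO(SAMPLE), wanted)
```

The two are not strictly equivalent: the nested version matches an entry anywhere in the row and can append the same row more than once, while the set version only compares the first column exactly. On this sample both return the same two rows.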

Demo:

>>> up = '/tmp/database.txt'
>>> open(up, 'w').write('''\
... P61981  1433G_HUMAN
... P61982  1433G_MOUSE
... Q5RC20  1433G_PONAB
... P61983  1433G_RAT
... P68253  1433G_SHEEP
... ''')
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as U:
...         return [line for line in U if line.split(None, 1)[0] in someset]
... 
>>> mplist(['P61981', 'Q5RC20'])
['P61981  1433G_HUMAN\n', 'Q5RC20  1433G_PONAB\n']

You may want to return a generator instead, so that the function only filters rather than building a list in memory:

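def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        for line in U:
            # yield (rather than returning a generator expression) keeps
            # the file open until the caller has finished iterating
            if line.split(None, 1)[0] in someset:
                yield line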

You can loop over this result, but you cannot index it:

for match in mplist(somelist):
    pass  # do something with match

This way, the matching entries never all have to be held in memory at once.
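As a rough, self-contained sketch of the lazy version in use: the function name `iter_matches`, the temporary file, and the blank-line guard are assumptions added for illustration and are not part of the original answer. The guard is there because `line.split(None, 1)[0]` raises `IndexError` on an empty line. The file stays open only while the generator is being consumed.

```python
import os
import tempfile

def iter_matches(path, wanted):
    """Yield rows whose first column is in `wanted`, one at a time."""
    wanted_set = set(wanted)
    with open(path) as handle:
        for line in handle:
            fields = line.split(None, 1)
            # Skip blank lines, which have no first column
            if fields and fields[0] in wanted_set:
                yield line

# Hypothetical sample file, including one blank line
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("P61981  1433G_HUMAN\n\nQ5RC20  1433G_PONAB\nP61983  1433G_RAT\n")
    sample_path = tmp.name

# Consume everything (this does build a list, for demonstration only)
all_matches = list(iter_matches(sample_path, ["P61981", "P61983"]))

# Or stop early: take the first match, then close the generator,
# which exits the with block and closes the file
gen = iter_matches(sample_path, ["P61981", "P61983"])
first_match = next(gen)
gen.close()

os.remove(sample_path)
```

If you need indexing or more than one pass over the results, wrap the call in `list(...)` and accept the memory cost.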