Fastest way to read a big file (> 8GB), dump the data into a dictionary, and load it back again

Pau*_*l85 7 python file-access large-files python-2.7

I am working with a large protein sequence (FASTA) file (> 8GB), and my idea is to create a dictionary in which the keys and values are the protein IDs and the sequences, respectively.

For now I build the dictionary, dump it with pickle, and load it back with cPickle (I read that pickle is faster for dumping the data and cPickle is faster for loading it). But the main problem here is time: building the dictionary and dumping it takes far too much time and memory (the PC has 8GB of RAM).
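As a rough sketch of that dump/load pattern (the file name and the small placeholder dictionary below are just assumptions for illustration), the usual Python 2 idiom is to fall back to the pure-Python pickle when cPickle is unavailable and to dump with the highest protocol:

try:
    import cPickle as pickle   # C implementation, much faster under Python 2
except ImportError:
    import pickle              # pure-Python fallback

data = {'P12345': 'MKT...'}    # placeholder dictionary for illustration
with open('example.obj', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)  # binary protocol: fastest and smallest
with open('example.obj', 'rb') as f:
    data = pickle.load(f)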

Is there any faster option available for handling large files in Python?

Here is my Python code for creating the dictionary and dumping the data:

from Bio import SeqIO
import pickle, sys

fastaSeq = {}
with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id            # FASTA record id
        uniID = header.split('_')[1]      # keep only the accession part of the id
        seqs = str(seq_record.seq)
        fastaSeq[uniID] = seqs            # protein id -> sequence

f = open('uniref90.obj', 'wb')
pickle.dump(fastaSeq, f, pickle.HIGHEST_PROTOCOL)
f.close()
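For reference, assuming UniRef90-style record IDs (the accession below is only an example), the split works like this:

header = "UniRef90_A0A024R161"   # typical seq_record.id for a UniRef90 entry
uniID = header.split('_')[1]     # -> "A0A024R161"
# fastaSeq then maps "A0A024R161" to its amino-acid sequence string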

Loading the dictionary in a separate Python program and doing some task with it:

import cPickle as pickle

seq_dict = pickle.load(open("uniref90.obj", "rb"))
for skey in seq_dict.keys():
    pass  # do something with skey and seq_dict[skey]

Jak*_*yer 6

Databases are your friend, my son. With SQLite the data lives on disk in an indexed table, so individual sequences can be looked up by ID without holding the whole 8GB dictionary in RAM.

import sqlite3
from Bio import SeqIO

db = sqlite3.connect("./db")

c = db.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS map (k text unique, v text)''')
db.commit()


def keys(db):
    cursor = db.cursor()
    return cursor.execute("""SELECT k FROM map""").fetchall()


def get(key, db, default=None):
    cursor = db.cursor()
    result = cursor.execute("""SELECT v FROM map WHERE k = ?""", (key,)).fetchone()
    if result is None:
        return default
    return result[0]


def save(key, value, db):
    cursor = db.cursor()
    cursor.execute("""INSERT INTO map VALUES (?,?)""", (key, value))
    db.commit()  # committing once per record is simple but slow; batching commits speeds up bulk loads


with open('uniref90.fasta') as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        header = seq_record.id
        uniID = header.split('_')[1]
        seqs = str(seq_record.seq)
        save(uniID, seqs, db)
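A minimal usage sketch for the helpers above (the accession string is a made-up example; in a separate lookup program you would reopen the database first):

db = sqlite3.connect("./db")

print len(keys(db))             # how many sequences were stored
seq = get("A0A024R161", db)     # fetch one sequence by protein id
if seq is not None:
    print seq[:60]              # first 60 residues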