Memory error using openpyxl with large Excel files

Dav*_*uez 8 python csv openpyxl

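I wrote a script that has to read a lot of Excel files from a folder (about 10,000). The script loads each Excel file (some of them have more than 2,000 rows) and reads one column to count the number of rows (checking the contents). If the row count does not equal a given number, it writes a warning to a log.

The problem appears once the script has read more than 1,000 Excel files. At that point it throws a memory error, and I don't know where the problem is. Before that, the script reads two CSV files with 14,000 rows and stores them in lists. These lists hold the identifier of each Excel file and its expected row count. If that count does not equal the row count of the Excel file, a warning is written. Could reading these lists be the problem?

I'm using openpyxl to load the workbooks. Do I need to close them before opening the next one?

This is my code: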

# -*- coding: utf-8 -*-

import os
from openpyxl import Workbook
import glob
import time
import csv
import datetime
from time import gmtime,strftime
from openpyxl import load_workbook

folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0


cond_numbers = []
with open('Conditions.csv','rb') as csv_name:           # Open the CSV file that holds the equivalences
    csv_read = csv.reader(csv_name,delimiter='\t')

    for reads in csv_read:
        cond_numbers.append(reads)

flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
    csv_read = csv.reader(vuelo_TF,delimiter=';')

    for reads in csv_read:
        flight_TF.append(reads)


excel_files = glob.glob('*.xlsx')

for excel in excel_files:
    print "Leyendo excel: "+excel

    wb = load_workbook(excel)
    ws = wb.get_sheet_by_name('Control System')
    flight = ws.cell('A7').value
    typical_flight = ws.cell('B7').value
    a = 0

    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1


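        # Check for an empty cell before converting, otherwise int() fails first
        raw_flight = ws.cell(row=row,column=0).value
        if raw_flight is None or raw_flight == '':
            break

        value_flight = int(raw_flight)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)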

        if value_flight != flight:
            flight_error = 1                # Not all flight numbers within the flight are equal

        if value_TF != typical_flight:
            typical_flight_error = 2            # Not all typical flights within the flight are equal

        if value_SP != 100:
            SP_error = 1



    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1         # The number of conditions does not match the expected one

    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1        # The flight does not match its respective typical flight

    # Each block appends one timestamped warning to log.txt and resets its flag.
    # 'a' is the append mode; the timestamp is kept in its own name so the
    # imported time module is not shadowed.
    if flight_error == 1:
        # Flight numbers within the flight do not match
        today = datetime.datetime.today()
        timestamp = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = timestamp+':  Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0

    if condition_error == 1:
        # Number of conditions differs from the expected count
        today = datetime.datetime.today()
        timestamp = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = timestamp+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0

    if typical_flight_error == 1:
        # Flight does not match its respective typical flight
        today = datetime.datetime.today()
        timestamp = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = timestamp+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if typical_flight_error == 2:
        # Typical flights within the flight are not all equal
        today = datetime.datetime.today()
        timestamp = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = timestamp+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if SP_error == 1:
        # Some Step Percentage in the flight is not 100
        today = datetime.datetime.today()
        timestamp = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','a')
        message = timestamp+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0

    conditions = 0

The if statements at the end check the flags and write the warnings to the log.

I'm using Windows XP with 8 GB RAM and an Intel Xeon W3505 (two cores, 2.53 GHz).
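The five logging blocks differ only in their message, so they could be folded into one helper. The following is a minimal sketch, assuming Python 3; the name `log_warning` and its `log_path` parameter are hypothetical and not part of the original script.

```python
import datetime


def log_warning(message, flight, log_path='log.txt'):
    """Append one timestamped warning line for a flight to the log file.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not part of the original script.
    """
    timestamp = datetime.datetime.today().strftime('%Y-%m-%d %H.%M.%S')
    line = '{}: flight {}: {}\n'.format(timestamp, flight, message)
    # 'a' opens the file for appending and creates it if it does not exist.
    with open(log_path, 'a', encoding='utf-8') as log:
        log.write(line)
    return line


# Example call (commented out so that importing this sketch writes nothing):
# log_warning('flight numbers do not match', 1234)
```

The `with` block closes the file even if the write raises, which the manual `open`/`close` pairs do not guarantee.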

anu*_*gal 10

openpyxl's default implementation keeps every accessed cell in memory. I suggest you use the optimized reader instead (link: https://openpyxl.readthedocs.org/en/latest/optimized.html).

In code:

wb = load_workbook(file_path, use_iterators = True)

Pass use_iterators = True when loading the workbook. Then access the sheet and cells like this:

for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value

This reduced the memory footprint to 5-10%.

Update: the use_iterators = True option was removed in version 2.4.0. Newer versions introduced openpyxl.writer.write_only.WriteOnlyWorksheet for dumping large amounts of data.

from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])

# save the file
wb.save('new_big_file.xlsx') 
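
I haven't tested the code above; it was just copied from the linked page.

Thanks to @SdaliM for the information.

  • This option no longer seems to exist (openpyxl 2.4.1). The link you provided doesn't mention any such option. Do you perhaps know the replacement? (3 upvotes)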

Dmi*_*sov 6

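With recent versions of openpyxl, you have to load and read a huge source workbook with the read_only=True argument, and create/write a huge destination workbook in write_only=True mode: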

https://openpyxl.readthedocs.io/en/latest/optimized.html
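As a rough illustration of the two modes mentioned above, here is a minimal sketch, assuming a recent openpyxl (2.6 or later, for `values_only`) and Python 3. The helper name `count_data_rows`, the demo file, and its contents are hypothetical; only the sheet name 'Control System' comes from the question.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def count_data_rows(path, sheet_name, first_row=1):
    """Count consecutive rows whose first cell is non-empty, streaming the sheet.

    Hypothetical helper written for illustration only.
    """
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb[sheet_name]
        count = 0
        # values_only=True yields plain tuples of values instead of cell objects.
        for row in ws.iter_rows(min_row=first_row, max_col=1, values_only=True):
            if not row or row[0] is None or row[0] == '':
                break
            count += 1
        return count
    finally:
        # A read-only workbook keeps the file open until it is closed explicitly.
        wb.close()


# Build a small demo file in write-only mode so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
out = Workbook(write_only=True)
sheet = out.create_sheet('Control System')
for i in range(50):
    sheet.append([1234, 'TF-A', i, 0, 100])
out.save(demo_path)

row_count = count_data_rows(demo_path, 'Control System')
print(row_count)
```

Closing each workbook inside the loop over files matters here, since read-only mode holds the file handle open for lazy reading.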

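  • What these don't address is my need to _update_ a large workbook with a lot of additional data. I can't set it to read-only or write-only (which I believe only lets you _create_ a new workbook, not update one). (4 upvotes)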