Speeding up loading Excel data into Pandas

Jki*_*nd9 11 python excel performance python-3.x pandas

I have a really simple bit of code, where I have a group of file names and I need to open each one and extract some data to later manipulate.

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    try:
        wb_loop = load_workbook(file_name, read_only=True, data_only=True)
        ws_loop = wb_loop["SHEET1"]
        df = pd.DataFrame(ws_loop.values)
        print("Opening Workbook:         ", time.perf_counter() - test_time)

        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        print("Data Manipulation:         ", time.perf_counter() - test_time)

I've tried a few different modules to read in Excel files, including using pandas.read_excel() directly, and the approach above is the fastest so far: opening the workbook takes 1.5-2 s, and the NumPy stacking takes about 0.03 s.

I think allocating the data to a third dimension in the array based on an index would probably be quicker, but I'm more focused on speeding up the time it takes to load the spreadsheets. Any suggestions?
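For what it's worth, the "third dimension" idea mentioned above could look roughly like the sketch below. It is only an illustration: the block shape (40 rows by 14 columns) is inferred from the slice `df.loc[4:43, :13]`, the helper name `stack_blocks` is made up, and the stand-in data replaces the real workbook contents. Since the stacking reportedly takes only about 0.03 s, any gain from this is likely to be small compared with the load time.

```python
import numpy as np

# Assumed shape, inferred from df.loc[4:43, :13] in the question:
# .loc slicing is inclusive, so that is 40 rows by 14 columns per workbook.
ROWS, COLS = 40, 14


def stack_blocks(blocks, dtype=float):
    """Copy each (ROWS, COLS) block into one preallocated 3-D array."""
    out = np.empty((len(blocks), ROWS, COLS), dtype=dtype)
    for i, block in enumerate(blocks):
        out[i] = block  # written in place, so the array is never reallocated
    return out


# Stand-in data; in the real loop each block would be df.loc[4:43, :13].values.
blocks = [np.full((ROWS, COLS), float(i)) for i in range(3)]
stacked = stack_blocks(blocks)

# Same 2-D layout that repeated np.vstack calls would have produced.
flat = stacked.reshape(-1, COLS)
print(stacked.shape, flat.shape)  # (3, 40, 14) (120, 14)
```

If the sheets contain mixed text and numbers, passing `dtype=object` would probably be needed instead of the default `float`.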

Edit: I also created a multithreaded pool to try to speed this up, but for some reason it started using 15 GB of RAM and crashed my computer.

Edit 2:

The fastest approach turned out to be xlrd, as per the accepted answer's recommendation. I also realised that it was quicker to delete the workbook at the end of each loop iteration. The final code looks like this:

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    try:
        wb_loop = xlrd.open_workbook(file_name, on_demand=True)
        ws_loop = wb_loop.sheet_by_name("Sheet1")
        print("Opening Workbook:         ", time.perf_counter() - test_time)

        df = pd.DataFrame([ws_loop.row_values(n) for n in range(ws_loop.nrows)])

        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        del wb_loop

        print("Data Manipulation:         ", time.perf_counter() - test_time)

    except Exception:
        pass  # skip workbooks that fail to load
    counter += 1
    print("%s %% Done" % (counter * 100 / len(unique_file_names)))

wb_new = xlwt.Workbook()
ws_new = wb_new.add_sheet("Test")
# xlwt's Worksheet.write takes (row, column, value), so write cell by cell
for r, row in enumerate(newarray):
    for c, value in enumerate(row):
        ws_new.write(r, c, value)
wb_new.save(r"C:Libraries/Documents/NewOutput.xls")


This gives an average time per loop of 1.6-1.8 s. Thanks for everyone's help.

Dav*_*vid 2

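Here is a quick benchmark (extending this one). Apparently, for the test .xlsx file, using xlrd directly is slightly faster than pandas. If .csv files are available, reading them is definitely much faster, but converting to .csv with LibreOffice first is considerably slower: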

pd_base      1.96  [in seconds]
pd_float     2.03
pd_object    2.01  [see cs95's comment to your question]
pd_xlrd      1.95
pyxl_base    2.15
xlrd_base    1.79
csv_ready    0.17
csv_convert  18.72

Here is the code:

import subprocess
import time

import numpy as np
import openpyxl
import pandas as pd
import xlrd  # note: xlrd >= 2.0 reads only .xls, so the xlrd cases need xlrd < 2.0

file = 'test.xlsx'
df = pd.DataFrame([[i+j for i in range(50)] for j in range(100)])
df.to_excel(file, index=False)
df.to_csv(file.replace('.xlsx', '.csv'), index=False)

def pd_base():
    df = pd.read_excel(file)
def pd_float():
    # originally dtype=np.int, an alias for int that was removed in NumPy 1.24
    df = pd.read_excel(file, dtype=int)
def pd_object():
    df = pd.read_excel(file, sheet_name="Sheet1", dtype=object)
def pd_xlrd():
    df = pd.read_excel(file, engine='xlrd')
def pyxl_base():
    wb = openpyxl.load_workbook(file, read_only=True, keep_links=False, data_only=True)
    sh = wb.active
    df = pd.DataFrame(sh.values)
def xlrd_base():
    wb = xlrd.open_workbook(file)
    sh = wb.sheet_by_index(0)
    df = pd.DataFrame([sh.row_values(n) for n in range(sh.nrows)])
def csv_ready():
    df = pd.read_csv(file.replace('.xlsx', '.csv'))
def csv_convert():
    # convert with LibreOffice first, then read the resulting .csv
    out = subprocess.check_output(['libreoffice --headless --convert-to csv test.xlsx'], shell=True, stderr=subprocess.STDOUT)
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def measure(func, nums=50):
    # total wall-clock time for `nums` consecutive calls
    temp = time.time()
    for num in range(nums):
        func()
    diff = time.time() - temp
    print(func.__name__, '%.2f' % diff)

for func in [pd_base, pd_float, pd_object, pd_xlrd, pyxl_base, xlrd_base, csv_ready, csv_convert]:
    measure(func)
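As a side note, the `measure` helper could also be built on the standard-library `timeit` module, which uses a high-resolution clock and disables garbage collection while timing. The following is only a sketch of that variant: the `repeat` parameter and the `dummy_reader` stand-in are hypothetical and not part of the benchmark above.

```python
import timeit


def measure(func, nums=50, repeat=3):
    """Return the best total time in seconds for `nums` calls of `func`."""
    timer = timeit.Timer(func)
    # repeat() returns one total per round; the minimum is the least noisy
    return min(timer.repeat(repeat=repeat, number=nums))


def dummy_reader():
    # Hypothetical stand-in for one of the reader functions (e.g. xlrd_base)
    return sum(range(1000))


elapsed = measure(dummy_reader, nums=50)
print(dummy_reader.__name__, "%.4f" % elapsed)
```

Taking the minimum over a few rounds tends to reduce the influence of other processes on the result, which matters when the differences between readers are as small as 1.79 versus 1.96 seconds.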