I have a really simple bit of code: I have a group of file names, and I need to open each one and extract some data to manipulate later.
import time

import numpy as np
import pandas as pd
from openpyxl import load_workbook

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.clock()
    try:
        wb_loop = load_workbook(file_name, read_only=True, data_only=True)
        ws_loop = wb_loop["SHEET1"]
        df = pd.DataFrame(ws_loop.values)
        print("Opening Workbook: ", time.clock() - test_time)
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        print("Data Manipulation: ", time.clock() - test_time)
    except Exception:
        pass
So I've tried a few different modules for reading Excel files, including pandas.read_excel() directly, and this is the fastest approach so far: opening the workbook takes 1.5-2 s, and the numpy stacking takes about 0.03 s.
I think allocating the data to a third dimension of the array, indexed by file, would probably be quicker, but I'm more focused on speeding up the time to load the spreadsheets. Any suggestions? A sketch of the preallocation idea is below.
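For what it's worth, here is a minimal sketch of that preallocation idea, assuming every sheet yields the same numeric 40x14 slice and reusing the loop variables from the snippet above (names are illustrative, not a tested drop-in):

import numpy as np

n_files = len(unique_file_names) - 1
# One 40x14 slab per file, filled in place. np.vstack instead reallocates
# and copies the whole accumulated array on every iteration.
all_data = np.empty((n_files, 40, 14))

for i, file in enumerate(unique_file_names[1:]):
    # ... load df exactly as in the loop above ...
    all_data[i] = df.loc[4:43, :13].values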
Edit: I also created a multithreaded pool to try to speed this up, but for some reason it started using 15 GB of RAM and crashed my computer.
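If the parallel route is revisited, the usual fix for that blow-up is a process pool with a small fixed worker count, where each worker returns only the extracted slice rather than keeping whole workbooks alive. A hedged sketch, assuming the xlrd loader from Edit 2 below and the rootdir/unique_file_names variables from above (extract and max_workers=4 are illustrative choices):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd
import xlrd

def extract(file_name):
    # Each worker opens one workbook and returns just the 40x14 slice,
    # so peak memory is bounded by max_workers, not by the file count.
    wb = xlrd.open_workbook(file_name, on_demand=True)
    sh = wb.sheet_by_name("Sheet1")
    df = pd.DataFrame([sh.row_values(n) for n in range(sh.nrows)])
    return df.loc[4:43, :13].values

if __name__ == "__main__":
    paths = [rootdir + "/" + str(f) for f in unique_file_names[1:]]
    with ProcessPoolExecutor(max_workers=4) as pool:
        newarray = np.vstack(list(pool.map(extract, paths)))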
Edit 2:
The fastest approach turned out to be xlrd, as per the accepted answer's recommendation. I also realised it was quicker to delete the workbook at the end of the loop. The final code looks like:
import time

import numpy as np
import pandas as pd
import xlrd
import xlwt

counter = 0
for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.clock()
    try:
        wb_loop = xlrd.open_workbook(file_name, on_demand=True)
        ws_loop = wb_loop.sheet_by_name("Sheet1")
        print("Opening Workbook: ", time.clock() - test_time)
        df = pd.DataFrame([ws_loop.row_values(n) for n in range(ws_loop.nrows)])
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        del wb_loop  # releasing the workbook here proved faster than waiting for GC
        print("Data Manipulation: ", time.clock() - test_time)
    except Exception:
        pass
    counter += 1
    print("%s %% Done" % (counter * 100 / len(unique_file_names)))

wb_new = xlwt.Workbook()
ws_new = wb_new.add_sheet("Test")
# xlwt writes one cell at a time: write(row, col, value)
for r, row in enumerate(newarray):
    for c, value in enumerate(row):
        ws_new.write(r, c, value)
wb_new.save(r"C:/Libraries/Documents/NewOutput.xls")
This gives an average time per loop of 1.6-1.8 s. Thanks for everyone's help.
Here's a quick benchmark (extending this one). Apparently, using xlrd directly is slightly faster than pandas for the test .xlsx file. If .csv files are available, reading them is definitely much faster, but converting an .xlsx with LibreOffice is much slower:
pd_base       1.96   [in seconds]
pd_float      2.03
pd_object     2.01   [see cs95's comment to your question]
pd_xlrd       1.95
pyxl_base     2.15
xlrd_base     1.79
csv_ready     0.17
csv_convert  18.72

Here's the code:
import time
import subprocess

import pandas as pd
import openpyxl
import xlrd

file = 'test.xlsx'
df = pd.DataFrame([[i + j for i in range(50)] for j in range(100)])
df.to_excel(file, index=False)
df.to_csv(file.replace('.xlsx', '.csv'), index=False)

def pd_base():
    df = pd.read_excel(file)

def pd_float():
    df = pd.read_excel(file, dtype=int)

def pd_object():
    df = pd.read_excel(file, sheet_name="Sheet1", dtype=object)

def pd_xlrd():
    df = pd.read_excel(file, engine='xlrd')

def pyxl_base():
    wb = openpyxl.load_workbook(file, read_only=True, keep_links=False, data_only=True)
    sh = wb.active
    df = pd.DataFrame(sh.values)

def xlrd_base():
    wb = xlrd.open_workbook(file)
    sh = wb.sheet_by_index(0)
    df = pd.DataFrame([sh.row_values(n) for n in range(sh.nrows)])

def csv_ready():
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def csv_convert():
    out = subprocess.check_output(['libreoffice --headless --convert-to csv test.xlsx'],
                                  shell=True, stderr=subprocess.STDOUT)
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def measure(func, nums=50):
    temp = time.time()
    for num in range(nums):
        func()
    diff = time.time() - temp
    print(func.__name__, '%.2f' % diff)

for func in [pd_base, pd_float, pd_object, pd_xlrd, pyxl_base, xlrd_base, csv_ready, csv_convert]:
    measure(func)