Loading input files in parallel into Pandas DataFrames

MMS*_*MMS 5 python pandas anaconda

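I have a requirement: I have three input files, and I need to load them into Pandas DataFrames before merging two of the files into a single DataFrame.

The file extensions change from run to run: a file might be .txt one time and .xlsx or .csv another.

How can I run this process in parallel to cut the waiting/loading time?

Here is my current code: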

from time import time # to measure the time taken to run the code
start_time = time()

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"

import pandas as pd # to work with the data frames
Primary_df = pd.read_excel (Primary_File)
Secondary_1_df = pd.read_csv (Secondary_File_1)
Secondary_2_df = pd.read_csv (Secondary_File_2)

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)

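Loading my Primary_df and Secondary_df takes about 20 minutes, so I am looking for an efficient solution, possibly using parallel processing, to save time. I timed the read operations, and they account for most of it: about 18 minutes 45 seconds.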
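A per-read measurement like the one mentioned above could be taken with a small helper such as this. It is only a sketch: the name `timed` is hypothetical, and it simply wraps any callable and reports how long the call took.

```python
from time import perf_counter


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed time, and return both."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed
```

For example, `Primary_df, seconds = timed("primary", pd.read_excel, Primary_File)` would show how much of the total the Excel read accounts for.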

Hardware configuration: Intel i5 processor, 16 GB RAM, 64-bit OS

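Why this question qualifies for a bounty: I am looking for working code with detailed steps that, in an Anaconda environment, loads my input files in parallel and stores each of them in its own pandas DataFrame. This should ultimately save time.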

Cez*_*ulc 7

Try this:

from time import time 
import pandas as pd
from multiprocessing.pool import ThreadPool


start_time = time()

pool = ThreadPool(processes=3)

Primary_File = "//ServerA/Testing Folder File Open/Report.xlsx"
Secondary_File_1 = "//ServerA/Testing Folder File Open/Report2.csv"
Secondary_File_2 = "//ServerA/Testing Folder File Open/Report2.csv"


# Define a function for the thread
def import_xlsx(file_name):
    df_xlsx = pd.read_excel(file_name)
    # print(df_xlsx.head())
    return df_xlsx


def import_csv(file_name):
    df_csv = pd.read_csv(file_name)
    # print(df_csv.head())
    return df_csv

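# Submit all three reads first so that they run concurrently
primary_result = pool.apply_async(import_xlsx, (Primary_File, ))
secondary_1_result = pool.apply_async(import_csv, (Secondary_File_1, ))
secondary_2_result = pool.apply_async(import_csv, (Secondary_File_2, ))

# Only now wait for the results; calling .get() right after each
# apply_async would finish one read before the next one starts
Primary_df = primary_result.get()
Secondary_1_df = secondary_1_result.get()
Secondary_2_df = secondary_2_result.get()

pool.close()
pool.join()

Secondary_df = Secondary_1_df.merge(Secondary_2_df, how='inner', on=['ID'])
end_time = time()

print(end_time - start_time)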
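Since the question says the file extensions change between runs, the reader could also be picked from the extension instead of being hard-coded per file. The following is a minimal sketch of that idea, not a tested solution for the asker's data: `READERS`, `load_file`, and `load_files` are hypothetical names, and treating .txt files as tab-delimited is an assumption. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library in place of `multiprocessing.pool.ThreadPool`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# Assumption: .txt files are tab-delimited; adjust sep to match the data.
READERS = {
    ".csv": pd.read_csv,
    ".txt": lambda path: pd.read_csv(path, sep="\t"),
    ".xlsx": pd.read_excel,
}


def load_file(path):
    """Read one file with the reader that matches its extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
    return reader(path)


def load_files(paths, max_workers=3):
    """Read all paths concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(load_file, paths))
```

With this in place, the three reads become `Primary_df, Secondary_1_df, Secondary_2_df = load_files([Primary_File, Secondary_File_1, Secondary_File_2])`. Threads help most when the time goes into waiting on the network share; if parsing dominates, the gain may be smaller, so it is worth timing both versions.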