Can I run pd.df.to_csv in a different thread?

Question

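I have a fairly large pandas DataFrame, and I want to select some rows based on a condition.

The problem is that saving to CSV is separate from the overall flow of the program and takes quite a lot of time.

Is it possible to split off a thread, so that the main thread moves on with the selected rows while the unselected rows are saved as a CSV in another thread?

For example...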

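# Sketch of the intended flow (as written, everything runs in one thread)

import pandas as pd

df = pd.DataFrame({"col1": range(10000), "col2": [x**2 for x in range(10000)]})

df_selected = df[df.apply(lambda x: x.col1 % 3 == 0, axis=1)]
df_unselected = df[df.apply(lambda x: x.col1 % 3 != 0, axis=1)]


def Other_thread_save_to_csv(df: pd.DataFrame):
    # This function is the last one to use df_unselected.
    # Goal: run it in another thread. The file name is a placeholder.
    df.to_csv("unselected.csv", index=False)


def all_other_handlings(df: pd.DataFrame):
    # Placeholder for the rest of the program, which only needs df_selected.
    ...


Other_thread_save_to_csv(df_unselected)

all_other_handlings(df_selected)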


Answer 1

RAJ*_*TEL 5

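Yes. Python's threading and multiprocessing facilities are handy for concurrent tasks such as saving a DataFrame to CSV while other work continues.

There are a few things to consider when using threading and multiprocessing in Python:

Global Interpreter Lock (GIL): because of the GIL, threads may not always speed up CPU-bound tasks. For I/O-bound tasks such as writing files, they still work well.
Use multiprocessing for heavy CPU work: if your other DataFrame tasks are CPU-bound, multiprocessing is a better choice than threading.

The last one is thread safety: while you are writing the DataFrame to CSV, you must make sure that no other thread is modifying it.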
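The thread-safety point can be sketched as follows. This is only an illustration under stated assumptions: the writer thread is handed its own copy of the frame, so an in-place change made by the main thread while the write is in flight cannot end up in the file. The names `snapshot`, `writer`, and `out_path`, and the temporary output directory, are hypothetical and not part of the original answer.

```python
import os
import tempfile
import threading

import pandas as pd


def save_to_csv(df: pd.DataFrame, filename: str) -> None:
    df.to_csv(filename, index=False)


df = pd.DataFrame({"col1": range(10), "col2": [x**2 for x in range(10)]})
df_unselected = df[df["col1"] % 3 != 0].copy()

# Give the writer thread a private copy (assumption: the extra memory is acceptable).
snapshot = df_unselected.copy()
out_path = os.path.join(tempfile.mkdtemp(), "unselected_rows.csv")  # hypothetical path

writer = threading.Thread(target=save_to_csv, args=(snapshot, out_path))
writer.start()

# The main thread modifies its own frame in place; the snapshot is unaffected.
df_unselected["col2"] = 0

writer.join()
written = pd.read_csv(out_path)
```

If the main thread never touches `df_unselected` again, as in the question, the copy is unnecessary and the frame can be passed to the thread directly.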

# This is pseudo code

import pandas as pd
import threading

def save_to_csv(df, filename):
    df.to_csv(filename, index=False)

df = pd.DataFrame({"col1": [x for x in range(10000)], "col2": [x**2 for x in range(10000)]})

df_selected = df[df["col1"] % 3 == 0]
df_unselected = df[df["col1"] % 3 != 0]

# Initiating a thread to save a portion of DataFrame
thread = threading.Thread(target=save_to_csv, args=(df_unselected, 'unselected_rows.csv'))
thread.start()

# Continue other tasks with the main thread
# additional_operations(df_selected)

# Optionally, wait for the thread to complete
thread.join()


The save_to_csv function runs on a separate thread, which lets your program keep processing df_selected while df_unselected is saved in the background.
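One possible variation, not part of the original answer, is to use `concurrent.futures.ThreadPoolExecutor` from the standard library instead of a bare `threading.Thread`. The sketch below assumes a single background worker is enough. The `out_path` location and the `selected_total` computation are hypothetical stand-ins for the real output file and the real main-thread work.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def save_to_csv(df: pd.DataFrame, filename: str) -> str:
    df.to_csv(filename, index=False)
    return filename


df = pd.DataFrame({"col1": range(10000), "col2": [x**2 for x in range(10000)]})
df_selected = df[df["col1"] % 3 == 0]
df_unselected = df[df["col1"] % 3 != 0]

out_path = os.path.join(tempfile.mkdtemp(), "unselected_rows.csv")  # hypothetical path

with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the CSV write in the background.
    future = pool.submit(save_to_csv, df_unselected, out_path)

    # Stand-in for the main thread's own work on the selected rows.
    selected_total = int(df_selected["col1"].sum())

    # result() blocks until the write finishes and re-raises any exception
    # that occurred inside save_to_csv.
    saved_path = future.result()
```

The practical difference from the plain thread is error handling: an exception raised inside a `threading.Thread` target does not propagate to the caller of `join()`, whereas `future.result()` re-raises it in the main thread.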
