I found that initializing a pandas Series object from a list of DataFrames is very slow. Take the following code, for example:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slowly and takes almost ~10GB of extra memory. Why?
# It is even much, much slower than constructing the original list `l`.
s = pd.Series(l)
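To make the slowdown easier to reproduce without 8GB of RAM, here is a scaled-down measurement sketch. The sizes (50 frames of 200 x 200) and the names `frames`, `t_list`, `t_series` are my own choices for illustration, not part of the code above, and the memory figure is only what `tracemalloc` can see, so treat the numbers as rough indications rather than a precise benchmark.

```python
import time
import tracemalloc

import numpy as np
import pandas as pd

# Scaled-down sizes, assumed for illustration (the code above uses 1000 x 1000 x 1000).
n_frames, n_rows, n_cols = 50, 200, 200

tracemalloc.start()

# Time the construction of the list of DataFrames.
t0 = time.perf_counter()
frames = [pd.DataFrame(np.zeros((n_rows, n_cols))) for _ in range(n_frames)]
t_list = time.perf_counter() - t0
mem_after_list, _ = tracemalloc.get_traced_memory()

# Time the construction of the Series from that list.
t0 = time.perf_counter()
s = pd.Series(frames)
t_series = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

# Peak traced memory above the level measured right after the list was built.
extra_peak_mb = (peak - mem_after_list) / 1e6

print(f"list construction:   {t_list:.4f} s")
print(f"Series construction: {t_series:.4f} s")
print(f"peak traced memory above list baseline: {extra_peak_mb:.1f} MB")
```

Increasing `n_frames`, `n_rows` and `n_cols` step by step should show how the gap between the two timings grows with the input size.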
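At first I thought the Series initialization was accidentally deep-copying the DataFrames, which would explain the slowness, but it turns out that it only copies references, the way `=` normally does in Python.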
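One way to check the by-reference behaviour is to compare object identity between the list elements and the Series elements. This is a minimal sketch on tiny frames; the names `frames` and `same_objects` are made up for the example, and the result may depend on the pandas version in use.

```python
import numpy as np
import pandas as pd

# A few tiny DataFrames, so the check runs instantly.
frames = [pd.DataFrame(np.zeros((3, 3))) for _ in range(4)]
s = pd.Series(frames)

# If the Series only stores references, every element is the very same
# Python object as the corresponding list element.
same_objects = all(s.iloc[i] is frames[i] for i in range(len(frames)))
print("elements are the same objects:", same_objects)

# A change made through the list is then visible through the Series as well.
frames[0].iloc[0, 0] = 1.0
print("value seen through the Series:", s.iloc[0].iloc[0, 0])
```

If the first line prints True, no deep copy of the DataFrames survives in the resulting Series, so the time and memory must be spent somewhere else during construction.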
On the other hand, if I just create a Series and shallow-copy the elements into it manually (in a for loop), it is fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), …