Most Pythonic way to create many new columns in Pandas

Question

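I have a large DataFrame df (~100 columns and ~7 million rows), and I need to create ~50 new variables/columns that are simple transformations of the current variables. One way is to use many .apply statements (I'm just using transform* as a placeholder for simple transformations such as max or squaring):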

df['new_var1'] = df['old_var1'].apply(lambda x : transform1(x))
...
df['new_var50'] = df['old_var50'].apply(lambda x : transform50(x))

Another way is to first create a dictionary:

transform_dict = {
'new_var1' : lambda row : transform1(row),
...,
'new_var50' : lambda row : transform50(row)
}

and then write a single .apply combined with pd.concat:

df = pd.concat([df, 
   df.apply(lambda r: pd.Series({var : transform_dict[var](r) for var in transform_dict.keys()}), axis=1)], axis=1)

Is one method preferred over the other, either in terms of how "Pythonic" it is, or in terms of efficiency, scalability, and flexibility?

Answer 1

Ste*_*fan 3

Starting with:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 100)))

Adding the columns individually:

def cols_via_apply(df):
    for i in range(100, 150):
        df[i] = df[i-100].apply(lambda x: x * i)
    return df  

%timeit cols_via_apply(df)

10 loops, best of 3: 29.6 ms per loop

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None
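
The per-column loop above can also be driven by a dictionary like the one in the question, which keeps the list of transformations in one place. The following is only a minimal sketch under assumptions: `square`, `clip_at_half`, `add_columns`, and the column names are hypothetical stand-ins for the question's `transform1` ... `transform50`, not part of the original post.

```python
import pandas as pd

# Hypothetical stand-ins for the question's transform1 ... transform50.
def square(x):
    return x ** 2

def clip_at_half(x):
    return max(x, 0.5)

# new column name -> (source column, element-wise function); names are illustrative.
transforms = {
    "new_var1": ("old_var1", square),
    "new_var2": ("old_var2", clip_at_half),
}

def add_columns(df, transforms):
    """Return a copy of df with one new column per entry in transforms."""
    out = df.copy()
    for new_col, (old_col, func) in transforms.items():
        # Series.apply calls func once per element of a single column.
        out[new_col] = out[old_col].apply(func)
    return out

df = pd.DataFrame({"old_var1": [0.1, 0.5, 0.9], "old_var2": [0.2, 0.4, 0.8]})
result = add_columns(df, transforms)
print(result)
```

Each entry is still applied one column at a time, so adding or removing a transformation only means editing the dictionary.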

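Adding the columns one at a time seems to be quite a bit more efficient than using pd.concat, presumably because the latter involves a loop over the rows of the DataFrame. So the difference will get worse as the DataFrame gets longer: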

def cols_via_concat(df):
    df = pd.concat([df, df.apply(lambda row: pd.Series({i : i * row[i-100] for i in range(100, 150)}), axis=1)], axis=1)
    return df


%timeit cols_via_concat(df)

1 loops, best of 3: 450 ms per loop

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None
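
The `%timeit` lines above only work inside IPython. As a rough sketch of how the same comparison might be reproduced in a plain script, the version below uses the standard-library `timeit` module. The smaller 200-row frame, the `base` variable, and the `.copy()` inside `cols_via_apply` are assumptions made here so that repeated runs start from the same input; absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def cols_via_apply(df):
    df = df.copy()  # work on a copy so every timing run starts from the same frame
    for i in range(100, 150):
        # One Series.apply call per new column.
        df[i] = df[i - 100].apply(lambda x: x * i)
    return df

def cols_via_concat(df):
    # DataFrame.apply with axis=1 builds one pd.Series per row.
    new_cols = df.apply(
        lambda row: pd.Series({i: i * row[i - 100] for i in range(100, 150)}),
        axis=1,
    )
    return pd.concat([df, new_cols], axis=1)

base = pd.DataFrame(np.random.random((200, 100)))

by_column = cols_via_apply(base)
by_row = cols_via_concat(base)
print("same values:", np.allclose(by_column.to_numpy(), by_row.to_numpy()))

t_column = timeit.timeit(lambda: cols_via_apply(base), number=3)
t_row = timeit.timeit(lambda: cols_via_concat(base), number=3)
print(f"column-wise: {t_column:.3f} s, row-wise: {t_row:.3f} s")
```

Increasing the number of rows in `base` is a simple way to check how each approach scales with the length of the DataFrame.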

Archived: 9 years, 9 months ago
Views: 870
Last activity: 9 years, 9 months ago