War*_*ren 4 python lambda pandas
I have a DataFrame, and I am trying to set every value in each column to that column's sum.
import pandas as pd
x = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], index=[1, 2, 3, 4, 5], columns=['a', 'b'])
x
   a   b
1  1   2
2  3   4
3  5   6
4  7   8
5  9  10
The output should be:
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
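I would like to use x.apply(f, axis=0), but I don't know how to define a lambda function that turns a column into the sum of all of its values. The following line raises SyntaxError: can't assign to lambda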
f = lambda x : x[:]= x.sum()
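The SyntaxError arises because a lambda body has to be a single expression, while `x[:] = ...` is an assignment statement. A minimal sketch of one way to stay with `apply`: return a new Series that repeats the column sum over the column's own index. The names `col` and `result` are only illustrative.

```python
import pandas as pd

x = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                 index=[1, 2, 3, 4, 5], columns=['a', 'b'])

# Assignment is a statement, so the original line does not even compile.
try:
    compile("f = lambda x : x[:]= x.sum()", "<example>", "exec")
    raised = False
except SyntaxError:
    raised = True

# Instead of assigning in place, return a new Series with the same index
# whose every entry is the column sum.
f = lambda col: pd.Series(col.sum(), index=col.index)
result = x.apply(f, axis=0)
print(result)
```

Because the returned Series shares the column's index, `apply` reassembles the pieces into a frame with the original row labels.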
for col in df:
    df[col] = df[col].sum()
Or a slower solution that does not use a loop...
df = pd.DataFrame([df.sum()] * len(df))
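One detail about the loop-free version: the Series returned by `df.sum()` has no name, so the rebuilt frame gets a default 0..n-1 index rather than the original row labels. A small sketch, assuming the original labels should be kept, passes `index=df.index` explicitly. Here `df` is rebuilt from the question's data, and `plain` and `labelled` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
                  index=[1, 2, 3, 4, 5], columns=['a', 'b'])

# Loop-free version as written: the row labels are replaced by 0..4.
plain = pd.DataFrame([df.sum()] * len(df))

# Same construction, but reusing the original row labels.
labelled = pd.DataFrame([df.sum()] * len(df), index=df.index)

print(plain.index.tolist())
print(labelled)
```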
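Timings
@jezrael Thanks for the timings. The run below repeats them on a larger DataFrame and also includes the for loop. Most of the time is spent creating the DataFrame rather than computing the sums, so the most efficient way to do this appears to be the method from @ayhan, which assigns the sums directly to the underlying values: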
import numpy as np
import pandas as pd
from string import ascii_letters
df = pd.DataFrame(np.random.randn(10000, 52), columns=list(ascii_letters))
# A baseline timing figure to determine sum of each column.
%timeit df.sum()
1000 loops, best of 3: 1.47 ms per loop
# Solution 1 from @Alexander
%%timeit
for col in df:
    df[col] = df[col].sum()
100 loops, best of 3: 21.3 ms per loop
# Solution 2 from @Alexander (without `for loop`, but much slower)
%timeit df2 = pd.DataFrame([df.sum()] * len(df))
1 loops, best of 3: 270 ms per loop
# Solution from @PiRSquared
%timeit df.stack().groupby(level=1).transform('sum').unstack()
10 loops, best of 3: 159 ms per loop
# Solution 1 from @Jezrael
%timeit (pd.DataFrame(np.tile(df.sum().values, (len(df.index),1)), columns=df.columns, index=df.index))
100 loops, best of 3: 2.32 ms per loop
# Solution 2 from @Jezrael
%%timeit
df2 = pd.DataFrame(df.sum().values[np.newaxis,:].repeat(len(df.index), axis=0),
                   columns=df.columns,
                   index=df.index)
100 loops, best of 3: 2.3 ms per loop
# Solution from @ayhan
%time df.values[:] = df.values.sum(0)
CPU times: user 1.54 ms, sys: 485 µs, total: 2.02 ms
Wall time: 1.36 ms # <<<< FASTEST
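The `%timeit` magics above only work inside IPython. As a rough sketch, the comparison can be re-run in a plain script with the standard `timeit` module. The helper names (`loop_assign`, `tile_sums`, `repeat_sums`) and the smaller 1000-row frame are assumptions made for illustration. The in-place `df.values[:]` variant is left out because whether `.values` hands back a writable view depends on the pandas version and on the frame having a single dtype.

```python
import timeit
from string import ascii_letters

import numpy as np
import pandas as pd

# Smaller than the frame used above so the script finishes quickly.
df = pd.DataFrame(np.random.randn(1000, 52), columns=list(ascii_letters))

def loop_assign(frame):
    # Works on a copy so repeated timing runs start from the same data;
    # the copy itself adds a little overhead to this measurement.
    out = frame.copy()
    for col in out:
        out[col] = out[col].sum()
    return out

def tile_sums(frame):
    return pd.DataFrame(np.tile(frame.sum().values, (len(frame.index), 1)),
                        columns=frame.columns, index=frame.index)

def repeat_sums(frame):
    sums = frame.sum().values[np.newaxis, :]
    return pd.DataFrame(sums.repeat(len(frame.index), axis=0),
                        columns=frame.columns, index=frame.index)

candidates = {"loop": loop_assign, "tile": tile_sums, "repeat": repeat_sums}

# Compute each result once so the approaches can be checked against each other.
results = {name: fn(df) for name, fn in candidates.items()}

# Best of 3 repeats, 5 calls each, reported per call.
timings = {
    name: min(timeit.repeat(lambda fn=fn: fn(df), number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Absolute numbers will differ from those quoted above, since they depend on the machine and on the pandas and NumPy versions.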
Another, faster NumPy solution uses numpy.tile:
print (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)),
                    columns=x.columns,
                    index=x.index))
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
Another solution uses numpy.repeat:
h = pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),
                 columns=x.columns,
                 index=x.index)
print (h)
    a   b
1  25  30
2  25  30
3  25  30
4  25  30
5  25  30
In [431]: %timeit df = pd.DataFrame([x.sum()] * len(x))
1000 loops, best of 3: 786 µs per loop
In [432]: %timeit (pd.DataFrame(np.tile(x.sum().values, (len(x.index),1)), columns=x.columns, index=x.index))
1000 loops, best of 3: 192 µs per loop
In [460]: %timeit pd.DataFrame(x.sum().values[np.newaxis,:].repeat(len(x.index), axis=0),columns=x.columns, index=x.index)
The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 184 µs per loop