her*_*lla 2 python dataframe python-2.7 pandas
My data looks like this:
              Close  a    b    c    d    e      Time
2015-12-03  2051.25  5    4    3    1    1  05:00:00
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00
I need to count "horizontally", for each row, the values in columns ['a'] through ['e'] that are not NaN. The result would look like this:
df['Count'] = .....
df
              Close  a    b    c    d    e      Time  Count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
Thanks
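You can select the columns of interest from your df and call count, passing axis=1 so that non-NaN values are counted across each row: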
In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df
Out[24]:
              Close  a    b    c    d    e      Time  count
2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
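For anyone who wants to try the row-wise count without the original data, here is a minimal self-contained sketch. The frame is rebuilt by hand from the values shown in the question; the DatetimeIndex and the string-typed Time column are assumptions, since the post does not show how df was created.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample frame (values copied from the post).
# Assumption: the dates form the index and Time is stored as plain strings.
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

cols = list("abcde")  # same as ['a', 'b', 'c', 'd', 'e']

# count(axis=1) counts the non-NaN cells in each row of the selected columns.
df["count"] = df[cols].count(axis=1)
print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```

Restricting the selection to cols first keeps Close and Time out of the count, which is why they do not inflate the totals.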
Timings
In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop
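apply is the slowest, which is no surprise. The drop version is marginally faster, but semantically I prefer passing the list of columns of interest and calling count, for readability.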
Hmm, I keep getting varying timings now:
In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop
More timings
In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop
It seems that testing notnull and summing the result (notnull produces a boolean mask, so the sum is a count) is faster on this dataset.
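To make the boolean-mask idea concrete, here is a small sketch on a toy frame (the frame small and its values are made up for illustration). It shows that summing the notnull mask across each row gives the same numbers as count(axis=1).

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, just large enough to show the mask.
small = pd.DataFrame({"a": [5, 5], "b": [4, np.nan], "c": [np.nan, np.nan]})

mask = small.notnull()           # boolean frame: True where a value is present
via_sum = mask.sum(axis=1)       # True is summed as 1, so this is a per-row count
via_count = small.count(axis=1)  # the direct equivalent

print(mask)
print(via_sum.tolist())    # [2, 1]
print(via_count.tolist())  # [2, 1]
```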
On a 50k-row df, the last method is slightly faster:
In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1)
1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
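The %timeit runs above only work inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. The post does not say how the 50k-row frame was built, so the random frame here (named big, with about 30% NaN) is purely an assumption, and the numbers it prints will differ by machine and pandas version. The apply variant is left out because it took seconds per run in the timings above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 50_000
cols = list("abcde")

# Hypothetical test frame: random values with roughly 30% of cells set to NaN.
values = rng.random((n_rows, len(cols)))
values[rng.random(values.shape) < 0.3] = np.nan
big = pd.DataFrame(values, columns=cols)
big["Close"] = rng.random(n_rows)
big["Time"] = "05:00:00"

# The three vectorized approaches compared in the answer.
candidates = {
    "drop + count": lambda: big.drop(["Close", "Time"], axis=1).count(axis=1),
    "select + count": lambda: big[cols].count(axis=1),
    "notnull + sum": lambda: big[cols].notnull().sum(axis=1),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    timings[name] = best * 1e3
    print(f"{name:>15}: {timings[name]:.2f} ms per call")
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reported in the transcripts.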