MYG*_*YGz 34 python python-2.7 pandas
I have a DataFrame like this:
CreationDate
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik]
I compute the length of each list in the CreationDate column and create a new Length column like this:
df['Length'] = df.CreationDate.apply(lambda x: len(x))
This gives me:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
Is there a more Pythonic way to do this?
ayh*_*han 50
You can also use the str accessor for some list operations. In this example,
df['CreationDate'].str.len()
returns the length of each list. See the documentation for str.len.
df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
For these operations, plain Python is usually faster, although pandas handles NaNs. Here are the timings:
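import random
import string
import pandas as pd

# One million lists, each holding 1 to 20 random letters
ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20)) for _ in range(10**6)])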
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
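To illustrate the remark about NaNs, here is a minimal sketch. The three-row Series is made-up sample data, not taken from the question. It assumes a missing value shows up as a float NaN, in which case the str accessor propagates it while a plain len() call raises TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle row is missing instead of holding a list.
ser = pd.Series([['ubuntu', 'nat'], np.nan, ['squid']])

# The str accessor skips the missing value and returns NaN for that row.
lengths = ser.str.len()
print(lengths)

# Calling len() on every element fails, because NaN is a float.
raised = False
try:
    ser.map(len)
except TypeError as exc:
    raised = True
    print('map(len) failed:', exc)

# Series.map can be told to leave missing values untouched.
ignored = ser.map(len, na_action='ignore')
print(ignored)
```

Because of the NaN, the resulting lengths come back as floats rather than integers.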
pandas.Series.map(len) and pandas.Series.apply(len) take about the same time to run, and both are slightly faster than pandas.Series.str.len().
import pandas as pd
data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']
df = pd.DataFrame(data, index)
# create Length column
df['Length'] = df.os.map(len)
# display(df)
os Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
Timings with %timeit:
import pandas as pd
import random
import string
random.seed(365)
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
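%timeit is an IPython magic, so the timings above cannot be reproduced as-is in a plain Python script. The following is a rough sketch using the standard-library timeit module instead. The `candidates` dictionary and the smaller series size (10**4 rather than 10**6) are choices made here for illustration, and absolute numbers will vary by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)
# Smaller than the 10**6 rows used above so the sketch finishes quickly.
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20))
                 for _ in range(10**4)])

# Hypothetical registry of the approaches discussed in this thread.
candidates = {
    'str.len': lambda: ser.str.len(),
    'map(len)': lambda: ser.map(len),
    'apply(len)': lambda: ser.apply(len),
    'list comprehension': lambda: pd.Series([len(x) for x in ser],
                                            index=ser.index),
}

# Check that every approach returns the same lengths before timing them.
expected = ser.map(len)
for name, func in candidates.items():
    assert func().tolist() == expected.tolist(), name

# Report the best of three repeats, averaged over five calls each.
for name, func in candidates.items():
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{name:>20}: {seconds * 1e3:.2f} ms per call')
```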