Kla*_*sos 7 python dataframe pandas categorical-data
My DataFrame has one column:
import pandas as pd
values = [1, 1, 4, 5, 6, 6, 30, 20, 80, 90]  # avoid shadowing the built-in `list`
df = pd.DataFrame({'col1': values})
How can I add a column 'col2' that holds categorical information derived from col1:
if col1 > 0 and col1 <= 10 then col2 = 'xxx'
if col1 > 10 and col1 <= 50 then col2 = 'yyy'
if col1 > 50 then col2 = 'zzz'
You can use pd.cut as follows:
df['col2'] = pd.cut(df['col1'], bins=[0, 10, 50, float('Inf')], labels=['xxx', 'yyy', 'zzz'])
Output:
   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz
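A small sketch of how pd.cut treats the bin edges may help here. By default the intervals are closed on the right, so bins=[0, 10, 50, inf] means (0, 10], (10, 50], (50, inf), which matches the rules in the question. Values that fall outside every bin come back as NaN. The sample values below are hypothetical, picked only to sit on or outside the edges.

```python
import pandas as pd

# Hypothetical values placed on or outside the bin edges.
s = pd.Series([0, 1, 10, 11, 50, 51, -5])

# Default right=True: each bin includes its right edge and excludes its left edge.
binned = pd.cut(s, bins=[0, 10, 50, float('inf')], labels=['xxx', 'yyy', 'zzz'])

print(binned.tolist())  # 0 and -5 fall outside every bin and become NaN
print(binned.dtype)     # the result is a categorical Series
```

If non-positive values can occur in col1, they would need to be handled separately, since pd.cut leaves them unlabeled.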
You can first create a new column col2, then update its values based on the conditions:
df['col2'] = 'zzz'  # default label for everything not matched below
df.loc[(df['col1'] > 0) & (df['col1'] <= 10), 'col2'] = 'xxx'
df.loc[(df['col1'] > 10) & (df['col1'] <= 50), 'col2'] = 'yyy'
print(df)
Output:
   col1 col2
0     1  xxx
1     1  xxx
2     4  xxx
3     5  xxx
4     6  xxx
5     6  xxx
6    30  yyy
7    20  yyy
8    80  zzz
9    90  zzz
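One caveat with the masking approach: because 'zzz' is the starting value, any row where col1 <= 0 also ends up as 'zzz', although the rules in the question do not cover that case. A possible variant, sketched below on hypothetical edge-case values, starts from a missing label and assigns all three categories explicitly. Whether that is preferable depends on the data, so treat it as an assumption rather than part of the original answer.

```python
import pandas as pd

# Hypothetical values, including some the question's rules do not cover.
df = pd.DataFrame({'col1': [-5, 0, 1, 10, 11, 50, 51]})

df['col2'] = None  # start with a missing label instead of a default category
df.loc[(df['col1'] > 0) & (df['col1'] <= 10), 'col2'] = 'xxx'
df.loc[(df['col1'] > 10) & (df['col1'] <= 50), 'col2'] = 'yyy'
df.loc[df['col1'] > 50, 'col2'] = 'zzz'

print(df)  # rows with col1 <= 0 keep a missing label
```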
Alternatively, you can apply a function to the column col1:
def func(x):
    if 0 < x <= 10:
        return 'xxx'
    elif 10 < x <= 50:
        return 'yyy'
    return 'zzz'

df['col2'] = df['col1'].apply(func)
This produces the same output.
The apply approach should be preferred in this case because it is much faster:
%timeit run() # packaged to run the first approach
# 100 loops, best of 3: 3.28 ms per loop
%timeit df['col2'] = df['col1'].apply(func)
# 10000 loops, best of 3: 187 µs per loop
However, when the DataFrame is large, the built-in vectorized operations (i.e. the masking approach) are likely to be faster.
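To illustrate the vectorized route on a larger frame, here is a minimal sketch using numpy.select, which is a swapped-in technique and not something the answer above uses. It evaluates the same boolean masks in one call. The frame size, the random data, and the name `big` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical large frame; size and value range are arbitrary.
rng = np.random.default_rng(0)
big = pd.DataFrame({'col1': rng.integers(1, 100, size=100_000)})

# Same conditions as the masking approach, evaluated column-wise.
conditions = [
    (big['col1'] > 0) & (big['col1'] <= 10),
    (big['col1'] > 10) & (big['col1'] <= 50),
]
choices = ['xxx', 'yyy']

# Rows matching neither condition fall back to the default label.
big['col2'] = np.select(conditions, choices, default='zzz')
```

Timing this against `big['col1'].apply(func)` with `%timeit` on your own data is the reliable way to see where the crossover lies; no timings are claimed here.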