Conditional data selection using text string data in a pandas DataFrame

Question

Col*_*son 6 python string numpy dataframe pandas

I have looked around but can't seem to find an answer to the following question.

I have a pandas DataFrame similar to this one (call it "df"):

        Type              Set
    1   theGreen          Z
    2   andGreen          Z           
    3   yellowRed         X
    4   roadRed           Y

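I would like to add another column to the DataFrame (or generate a Series) with the same length as the DataFrame (i.e., the same number of records/rows) that assigns a numeric code of 1 if Type contains the string "Green", and 0 otherwise.

Essentially, I'm trying to find a way to do this: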

   df['color'] = np.where(df['Type'] == 'Green', 1, 0)

Except that instead of the usual numpy operators (<, >, ==, !=, etc.), I need a way of saying "in" or "contains". Is this possible? Any and all help is appreciated!

Answer 1

jez*_*ael 7

Use str.contains:

df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0
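
As a hedged side note, str.contains interprets its pattern as a regular expression by default and accepts an na argument that controls what missing values map to. The sketch below assumes a frame like the one in the question with one extra NaN row added purely for illustration; the column name color_alt is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; the fifth row with NaN is an added assumption.
df = pd.DataFrame(
    {
        "Type": ["theGreen", "andGreen", "yellowRed", "roadRed", np.nan],
        "Set": ["Z", "Z", "X", "Y", "Y"],
    },
    index=[1, 2, 3, 4, 5],
)

# regex=False treats "Green" as a literal substring;
# na=False makes missing values count as "does not contain".
mask = df["Type"].str.contains("Green", regex=False, na=False)

df["color"] = np.where(mask, 1, 0)

# Alternative without NumPy: cast the boolean mask to integers.
df["color_alt"] = mask.astype(int)

print(df)
```

Passing na=False keeps the mask purely boolean, so the missing row is coded as 0 rather than depending on how NaN is treated downstream.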

Another solution is apply:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

The second solution is faster, but it does not work if column Type contains NaN; in that case it raises an error:

TypeError: argument of type 'float' is not iterable

Timings:
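
One possible way around that error, offered as a sketch rather than as part of the original answer, is to check that each value is a string before testing for the substring. The helper name has_green is hypothetical, and the sample frame with a NaN row is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the NaN entry is an added assumption.
df = pd.DataFrame({"Type": ["theGreen", np.nan, "roadRed"]})


def has_green(value):
    # Hypothetical helper: non-string values (such as NaN, a float) yield False
    # instead of raising TypeError on the "in" test.
    return isinstance(value, str) and "Green" in value


df["color"] = np.where(df["Type"].apply(has_green), 1, 0)
print(df)
```

The isinstance check short-circuits, so the "in" test only ever runs on actual strings.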

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)  

In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop

In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
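
The numbers above come from the answerer's machine and library versions, so they may not carry over to other setups. A minimal sketch for re-measuring outside IPython with the standard timeit module follows; the frame size, the function names, and the extra regex=False variant are assumptions added here, not part of the original answer.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
     "Set": ["Z", "Z", "X", "Y"]}
)
# 40,000 rows; smaller than in the answer so the sketch runs quickly.
big = pd.concat([base] * 10_000, ignore_index=True)


def with_apply():
    return np.where(big["Type"].apply(lambda x: "Green" in x), 1, 0)


def with_contains():
    return np.where(big["Type"].str.contains("Green"), 1, 0)


def with_contains_literal():
    # Added variant: literal substring match instead of a regular expression.
    return np.where(big["Type"].str.contains("Green", regex=False), 1, 0)


for func in (with_apply, with_contains, with_contains_literal):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

No expected timings are given here on purpose, since the relative ordering can change with the pandas version and the column's dtype.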
