Using np.where to convert column values to NaN

Question

Joh*_*tud 3 python numpy python-3.x pandas

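I can't figure out how to use the index results from np.where in a for loop. I want to use this for loop to change only the column values at the indices that np.where returns.

Here is a hypothetical example: I want to find the index positions of certain problems or anomalies in my dataset, get their locations with np.where, and then run a loop over the DataFrame to recode them as NaN while leaving every other index unchanged.

Here is my simple attempt so far: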

import pandas as pd
import numpy as np

# import iris
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/iris.csv')

# conditional np.where -- hypothetical problem data
find_error = np.where((df['petal_length'] == 1.6) & 
                  (df['petal_width'] == 0.2))

# loop over column to change error into NA
for i in enumerate(find_error):
    df = df['species'].replace({'setosa': np.nan})

# df[i] is a problem but I cannot figure out how to get around this or an alternative


Answer 1

cs9*_*s95 5

You can assign to the column directly, using a boolean mask with df.loc:

m = (df['petal_length'] == 1.6) & (df['petal_width'] == 0.2)
df.loc[m, 'species'] = np.nan

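To make the mask assignment concrete, here is a minimal sketch on a small made-up frame. The values are invented for illustration and stand in for the iris CSV, so no download is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data (values invented for illustration).
df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 1.6, 4.7],
    "petal_width": [0.2, 0.2, 0.4, 1.4],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

# Boolean mask: True only on rows where both conditions hold.
m = (df["petal_length"] == 1.6) & (df["petal_width"] == 0.2)

# Set 'species' to NaN on the masked rows; all other rows are left untouched.
df.loc[m, "species"] = np.nan

print(df)
```

Only the second row satisfies both conditions, so only its species value becomes missing.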

Alternatively, fix your code by using the three-argument form of np.where, which picks a value for each row:

df['species'] = np.where(m, np.nan, df['species'])

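As a rough illustration of the three-argument form, np.where(condition, x, y) takes x where the condition is True and y elsewhere. The small Series below are hypothetical examples, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
species = pd.Series(["setosa", "setosa", "versicolor"])
m = pd.Series([False, True, False])

# Where m is True take NaN, otherwise keep the original value.
result = np.where(m, np.nan, species)

# np.where returns a NumPy array rather than a Series, which is why the
# result is assigned back to the column in the line above.
print(type(result).__name__)
print(result)
```

Because the result is a plain array, the original index is not carried along. Assigning it back to a column of the same length lines it up by position.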

Alternatively, use Series.mask, which replaces values with NaN wherever the condition is True:

df['species'] = df['species'].mask(m)

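A short sketch of how Series.mask behaves, again on made-up data. Series.where is shown alongside it because it is the mirror image: mask blanks out values where the condition is True, where blanks them out where it is False.

```python
import pandas as pd

# Hypothetical example data.
species = pd.Series(["setosa", "setosa", "versicolor"])
m = pd.Series([False, True, False])

# mask: replace values where m is True (with NaN by default).
masked = species.mask(m)

# where: keep values where the condition is True, so negate m for the same result.
kept = species.where(~m)

print(masked)
print(kept)
```

Both calls return a new Series and leave species unchanged, so the result has to be assigned back to the column.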

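@JohnStud Loops are useful in some situations, but they are generally not recommended for numeric data (especially when a vectorized method exists). Loops are a good fit for string/regex operations. I have a detailed write-up here: [For loops with pandas - When should I care?](/sf/ask/3781973961/) (2 upvotes)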
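If the positions from the single-argument np.where are still wanted, as in the question, they can be used directly without a loop. This is a sketch on made-up data, not code from the answer. With one argument, np.where returns a tuple holding one array of positions per dimension, so looping over the result visits that tuple once rather than once per matching row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris data (values invented for illustration).
df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 1.6, 4.7],
    "petal_width": [0.2, 0.2, 0.4, 1.4],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

cond = (df["petal_length"] == 1.6) & (df["petal_width"] == 0.2)

# Single-argument np.where returns a tuple; element 0 holds the row positions.
positions = np.where(cond.to_numpy())[0]

# iloc is positional, so look up the integer position of the column too.
col = df.columns.get_loc("species")
df.iloc[positions, col] = np.nan

print(positions)
print(df)
```

Using iloc keeps everything positional, so this also works when the DataFrame index is not the default 0..n-1 range.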
