Preserving NaNs with pandas DataFrame inequality

Question

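I have a pandas.DataFrame object with about 100 columns and 200,000 rows of data. I am trying to convert it to a bool DataFrame where True means the value is at or above a threshold, False means it is below it, and NaN values are preserved.

Without handling NaN values, this takes about 60 ms to run: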

df >= threshold

But when I try to handle the NaNs, the approach below works but is very slow (20 seconds).

def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x
df.apply(lambda x: x.apply(lambda x: func(x)))

Is there a faster way?

Answer 1

oce*_*paf 6

You can do:

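import numpy as np

new_df = df >= threshold
# Restore NaN where the original value was missing.
# (np.nan replaces np.NaN, an alias removed in NumPy 2.0.)
new_df[df.isnull()] = np.nan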

However, this is not the same result you would get from the apply method. Here, your mask has float dtype containing NaN, 0.0, and 1.0. With the apply solution, you get object dtype containing NaN, False, and True.
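The float-typed result described above can be reproduced explicitly. This is only a sketch: the sample frame and threshold are made up for illustration, and the explicit `astype(float)` is my own addition, since the dtype produced by assigning NaN into a boolean frame may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, 2.0, np.nan], "b": [np.nan, 1.0, 3.0]})
threshold = 1.0

# Compare, cast the booleans to 0.0/1.0, then keep values only where
# the original frame was not NaN (other positions become NaN).
float_mask = (df >= threshold).astype(float).where(df.notna())

print(float_mask)
print(float_mask.dtypes)
```

Every column of `float_mask` is float64, so it cannot be used directly as a boolean indexer.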

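Neither should be used as a mask, because you may not get what you want. IEEE 754 specifies that ordering comparisons involving NaN (such as >= and <) must evaluate to False, and the apply method implicitly violates this by returning NaN!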
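A quick way to see the IEEE behaviour mentioned above is to compare a Series containing NaN in both directions. The data and threshold here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the middle entry is missing.
s = pd.Series([0.5, np.nan, 2.0])
threshold = 1.0

ge = s >= threshold  # NaN position compares as False
lt = s < threshold   # NaN position compares as False here too

# The NaN entry is False in both masks, so neither one alone
# distinguishes "below threshold" from "missing".
print(pd.DataFrame({"value": s, ">=": ge, "<": lt}))
```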

Your best option is to track the NaNs separately, and df.isnull() is quite fast when bottleneck is installed.
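Tracking the NaNs separately might look like the sketch below. The sample frame, the threshold, and the variable names `above`, `missing`, and `below` are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, 2.0, np.nan], "b": [np.nan, 1.0, 3.0]})
threshold = 1.0

above = df >= threshold      # pure bool mask; NaN positions are False
missing = df.isnull()        # pure bool mask of the NaN positions
below = ~above & ~missing    # below the threshold and not missing

# Both masks stay boolean, so they can be combined or used for indexing.
print(df[above & ~missing])
```

Keeping two boolean frames avoids the float and object dtypes entirely, at the cost of carrying a second mask around.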
