Is there a faster method than pandas fillna()?

Poe*_*dit 6 python missing-data pandas

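Pandas fillna() is very slow, especially when the DataFrame contains a lot of missing data.

Is there a faster way to do this?

(I know it would help if I simply dropped some of the rows and/or columns that contain NA.)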

jez*_*ael 7

I ran a quick test:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 60000
    # 60000 x 20 object frame: about 70% 'a', 30% missing (None)
    df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))

    In [333]: %timeit df.fillna('b')
    93.5 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    In [337]: %timeit df[df.isna()] = 'b'
    122 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
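For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard `timeit` module. The function names `with_fillna` and `with_mask_assignment` are made up for this example. Because mask assignment mutates its target, that variant works on a fresh copy each call, so its timing also includes the cost of the copy and will not match the `%timeit` figures exactly.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 60000
df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))


def with_fillna():
    # fillna returns a new frame and leaves df unchanged
    return df.fillna('b')


def with_mask_assignment():
    # Mask assignment mutates its target, so work on a copy each call;
    # the copy cost is therefore included in this timing
    out = df.copy()
    out[out.isna()] = 'b'
    return out


for func in (with_fillna, with_mask_assignment):
    # Best of 3 repeats, 3 calls per repeat, reported per call
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1e3:.1f} ms per call")
```

Absolute numbers depend heavily on the machine and the pandas version, so only the relative ordering is worth reading from the output.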

A slightly modified solution (though it feels a bit hacky to me):

    # pandas below 0.24
    In [335]: %timeit df.values[df.isna()] = 'b'
    56.7 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    # pandas 0.24+
    In [339]: %timeit df.to_numpy()[df.isna()] = 'b'
    56.5 ms ± 951 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
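The in-place trick only works when `df.values` / `df.to_numpy()` hands back a view of the frame's own data. As far as I know that holds for a single-dtype frame like the one in this test, but it is not guaranteed for mixed dtypes, and with Copy-on-Write enabled the returned array is read-only. A less hacky variant, sketched below, fills an explicit copy of the array and builds a new DataFrame from it. The helper name `fillna_via_numpy` is hypothetical, not a pandas function, and it assumes all columns share one dtype; it has not been benchmarked here.

```python
import numpy as np
import pandas as pd


def fillna_via_numpy(df: pd.DataFrame, value) -> pd.DataFrame:
    """Return a new frame with missing cells replaced by value.

    Hypothetical helper, not part of pandas. Assumes every column
    has the same dtype (for example object), as in the test above.
    """
    # Explicit copy, so the original frame is never modified
    arr = df.to_numpy(copy=True)
    # Boolean mask of the missing cells, as computed by pandas
    mask = df.isna().to_numpy()
    arr[mask] = value
    return pd.DataFrame(arr, index=df.index, columns=df.columns)


np.random.seed(123)
df = pd.DataFrame(np.random.choice(['a', None], size=(1000, 20), p=(.7, .3)))
filled = fillna_via_numpy(df, 'b')
```

The result should match `df.fillna('b')` for this kind of frame, while `df` itself keeps its missing values.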