I have a df that contains some NaN values. For example, here is the df:
import numpy as np
import pandas as pd
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = np.nan
data[6,0] = np.nan
data[5,1] = np.nan
data[7,1] = np.nan
data[1,2] = np.nan
data[8,2] = np.nan
data[6,2] = np.nan
df = pd.DataFrame(data)
df
This is the output of the code above:
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 NaN
2 0.670749 0.825853 0.136707
3 NaN 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 NaN 0.171941
6 NaN 0.274074 NaN
7 0.940030 NaN 0.336112
8 0.175410 0.372832 NaN
9 0.252426 0.795663 0.015255
What I want is for each NaN to be filled with the mean of the values directly above and below it, like this:
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = (data[2,0] + data[4,0])/2
data[6,0] = (data[5,0] + data[7,0])/2
data[5,1] = (data[4,1] + data[6,1])/2
data[7,1] = (data[6,1] + data[8,1])/2
data[1,2] = (data[0,2] + data[2,2])/2
data[8,2] = (data[7,2] + data[9,2])/2
data[6,2] = (data[5,2] + data[7,2])/2
df = pd.DataFrame(data)
df
The result of the code above is:
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.280612
2 0.670749 0.825853 0.136707
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
How can I do this automatically in Python?
I think DataFrame.interpolate should help here:
df1 = df.interpolate()
print(df1)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.280612
2 0.670749 0.825853 0.136707
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
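The match with the desired output is not a coincidence: for an isolated NaN, linear interpolation lands exactly halfway between the value above and the value below. The following is a minimal sketch of that equivalence, assuming a small made-up frame (`demo` and its values are illustrative, not the data from the question) and comparing against an explicit `shift`-based neighbor mean:

```python
import numpy as np
import pandas as pd

# Made-up frame with isolated (non-consecutive) NaNs, for illustration only.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0],
                     "b": [2.0, 4.0, np.nan, 10.0]})

# Default linear interpolation down each column.
by_interpolate = demo.interpolate()

# Explicit mean of the row above and the row below,
# applied only where the original value is missing.
neighbor_mean = (demo.shift(1) + demo.shift(-1)) / 2
by_shift = demo.fillna(neighbor_mean)

print(by_interpolate)
print(np.allclose(by_interpolate, by_shift))  # True
```

The `shift`-based version only works when both direct neighbors are present, which is why it is shown here as a cross-check rather than as a general solution.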
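If there are several consecutive NaNs, interpolate does not replace them with the mean; it spaces the filled values linearly between the surrounding valid values: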
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = np.nan
data[6,0] = np.nan
data[5,1] = np.nan
data[7,1] = np.nan
data[1,2] = np.nan
data[2,2] = np.nan
data[8,2] = np.nan
data[6,2] = np.nan
df = pd.DataFrame(data)
print(df)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 NaN
2 0.670749 0.825853 NaN
3 NaN 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 NaN 0.171941
6 NaN 0.274074 NaN
7 0.940030 NaN 0.336112
8 0.175410 0.372832 NaN
9 0.252426 0.795663 0.015255
df1 = df.interpolate()
print(df1)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.352746
2 0.670749 0.825853 0.280974
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
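The difference is easier to see on a tiny example. This sketch assumes a made-up Series `s` with a two-element gap and compares linear interpolation with the mean of the nearest valid values on either side of the gap:

```python
import numpy as np
import pandas as pd

# Made-up Series with two consecutive NaNs between 0.0 and 3.0.
s = pd.Series([0.0, np.nan, np.nan, 3.0])

# Linear interpolation: evenly spaced steps across the gap.
linear = s.interpolate()

# Mean of the last valid value before the gap and the first valid value after it.
gap_mean = (s.ffill() + s.bfill()) / 2

print(linear.tolist())    # [0.0, 1.0, 2.0, 3.0]
print(gap_mean.tolist())  # [0.0, 1.5, 1.5, 3.0]
```

Every NaN in the same gap receives the same value with the mean approach, whereas interpolation produces a ramp.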
Solution for the mean:
df2 = df.ffill().add(df.bfill()).div(2)
print(df2)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.316860
2 0.670749 0.825853 0.316860
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
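One caveat with the forward-fill/backward-fill mean: a NaN in the first or last row has a valid value on only one side, so the sum is NaN and the cell stays empty. The sketch below wraps the expression in a hypothetical helper, `fill_with_neighbor_mean` (a name introduced here for illustration), and shows one possible fallback that copies the nearest valid value. Whether that fallback is appropriate depends on the data, so treat it as an assumption rather than part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_with_neighbor_mean(frame):
    """Hypothetical helper: fill each NaN with the mean of the nearest
    valid values above and below it in the same column."""
    return frame.ffill().add(frame.bfill()).div(2)

# Made-up column with a leading, an interior, and a trailing NaN.
edge = pd.DataFrame({"x": [np.nan, 1.0, np.nan, 3.0, np.nan]})

filled = fill_with_neighbor_mean(edge)
print(filled["x"].tolist())   # [nan, 1.0, 2.0, 3.0, nan]

# Assumed fallback: copy the nearest valid value into the edge cells.
patched = filled.ffill().bfill()
print(patched["x"].tolist())  # [1.0, 1.0, 2.0, 3.0, 3.0]
```

With the data in the question no NaN sits in the first or last row, so the fallback changes nothing there.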