Using Imputer on some DataFrame columns in Python

Mau*_*ile 9 python missing-data scikit-learn imputation

I am learning how to use Imputer in Python.

Here is my code:

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df["price"])

df["price"]=imp.transform(df["price"])

However, this raises the following error: `ValueError: Length of values does not match length of index`

What is wrong with my code?

Thanks for your help.

fri*_*ist 14

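This is because Imputer expects two-dimensional input such as a DataFrame, not a one-dimensional Series. Selecting the column with double brackets, `df[["price"]]`, gives a one-column DataFrame. A possible solution is: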

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df[["price"]])
df["price"]=imp.transform(df[["price"]]).ravel()

# Or, fitting and transforming in a single call:
imp=Imputer(missing_values="NaN", strategy="mean" )
df["price"]=imp.fit_transform(df[["price"]]).ravel()
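
Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22; its replacement is `sklearn.impute.SimpleImputer`. Below is a minimal sketch of the same fix on a current release. It assumes a recent scikit-learn and reproduces only the `price` column of the question's `df`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Only the "price" column from the question is reproduced here.
df = pd.DataFrame({"price": [8, np.nan, 10, np.nan, 11, 7]})

# SimpleImputer expects 2-D input, so select with double brackets
# to get a one-column DataFrame rather than a Series.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns an array of shape (6, 1); ravel() flattens it.
df["price"] = imp.fit_transform(df[["price"]]).ravel()

print(df)
```

`SimpleImputer` takes `np.nan` rather than the string `"NaN"` for `missing_values`, and it has no `axis` parameter: it always imputes column by column.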

  • Why is `ravel()` needed here? Without it, it seems to return the correct type (2 upvotes)
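
On the `ravel()` question: for a one-column input, `transform` and `fit_transform` return a two-dimensional array of shape `(n_samples, 1)`, and `ravel()` flattens it to shape `(n_samples,)`. Many pandas versions also accept an `(n, 1)` array when assigning to a single column, which may be why leaving it out appears to work; flattening just makes the shape explicit. A small NumPy-only sketch of the difference, where `col` is a hand-written stand-in for the imputer's result rather than real output:

```python
import numpy as np

# Stand-in for what fit_transform returns for one column: shape (6, 1).
col = np.array([[8.0], [9.0], [10.0], [9.0], [11.0], [7.0]])
print(col.shape)   # (6, 1)

# ravel() flattens to one dimension, the same shape as a Series' values.
flat = col.ravel()
print(flat.shape)  # (6,)
```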

Rya*_*yan 2

I think you want to specify the imputer's axis and then transpose the array it returns:

import pandas as pd
import numpy as np

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean", axis=1) # specify the axis
q = imp.fit_transform(df["price"]).T # transpose the returned array

df["price"]=q
print(df) # print() with parentheses works on both Python 2 and 3