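I have a large pandas DataFrame named df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing the median, mean, or most frequent value is not an option either (so imputation with pandas and/or scikit-learn unfortunately does not do the trick).
I came across what looks like a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do: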
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missing values; I keep only the float columns
# (the builtin float replaces np.float, which newer NumPy releases no longer provide)
df_numeric = df.select_dtypes(include=[float])
# I now run fancyimpute KNN;
# it returns a np.array, which I store as a pandas DataFrame
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
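However, df_filled is a single vector rather than the filled data frame. How do I keep the data frame through the imputation?
I realized that fancyimpute needs a numpy array. I therefore converted df_numeric to an array using as_matrix(). …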
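One way to keep the data frame through imputation is to hand the imputer a plain array and then rebuild the frame from the original column names and index. The sketch below is only an illustration under stated assumptions: it swaps in scikit-learn's KNNImputer as a stand-in for fancyimpute's KNN (the excerpt above does not show which fancyimpute version is in use), and the example frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer  # stand-in for fancyimpute's KNN (assumption)

# Hypothetical example frame: three numeric columns with gaps, one text column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 5.0],
        "b": [2.0, np.nan, 6.0, 8.0, 10.0],
        "c": [0.5, 1.0, 1.5, np.nan, 2.5],
        "label": ["u", "v", "w", "x", "y"],
    },
    index=[10, 11, 12, 13, 14],
)

# Keep only the numeric columns; the imputer works on a 2-D float array.
df_numeric = df.select_dtypes(include="number")

# The imputer returns a bare ndarray, so column names and index are lost here.
filled = KNNImputer(n_neighbors=3).fit_transform(df_numeric.to_numpy())

# Rebuild the DataFrame with the original labels.
df_filled = pd.DataFrame(
    filled, columns=df_numeric.columns, index=df_numeric.index
)
```

The same wrapping step should apply to any imputer that returns an array of the same shape as its input; only the call that produces `filled` would change.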
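I am trying to install fancyimpute using pip install and conda install, and also by downloading the package and installing it manually, but all of these fail. With pip install, it gives me the following error: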
pip install fancyimpute
C:\Windows\system32>pip install fancyimpute
Processing c:\users\norah mahmoud\appdata\local\pip\cache\wheels\0e\65\31\fff6a8fa9d1df4c6204f5a9059340347d2085b971b67d3f0a0\fancyimpute-0.5.4-cp37-none-any.whl
Requirement already satisfied: keras>=2.0.0 in c:\users\norah mahmoud\appdata\local\programs\python\python37\lib\site-packages (from fancyimpute) (2.3.1)
Requirement already satisfied: tensorflow in c:\users\norah mahmoud\appdata\local\programs\python\python37\lib\site-packages (from fancyimpute) (2.0.0)
Requirement already satisfied: scipy in c:\users\norah mahmoud\appdata\local\programs\python\python37\lib\site-packages (from fancyimpute) (1.3.2)
Requirement already satisfied: numpy>=1.10 in c:\users\norah mahmoud\appdata\local\programs\python\python37\lib\site-packages (from fancyimpute) (1.17.4+mkl)
Requirement already satisfied: scikit-learn>=0.21.2 in c:\users\norah mahmoud\appdata\local\programs\python\python37\lib\site-packages (from fancyimpute) (0.21.3)
Collecting cvxpy>=1.0.6
Using cached https://files.pythonhosted.org/packages/d9/ed/90e0a13ad7ac4e7cdc2aeaefed26cebb4922f205bb778199268863fa2fbe/cvxpy-1.0.25.tar.gz
Requirement already satisfied: knnimpute in c:\users\norah …
I am trying to install fancyimpute on Anaconda (Python 3.6), Windows 10, 64-bit, and I get the following error:
Collecting fancyimpute
Requirement already satisfied: downhill in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: numpy>=1.10 in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: scikit-learn>=0.17.1 in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: theano in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: scipy in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: climate in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: knnimpute in c:\anaconda3\lib\site-packages (from fancyimpute)
Requirement already satisfied: six in c:\anaconda3\lib\site-packages (from fancyimpute)
Collecting cvxpy (from fancyimpute)
Using cached cvxpy-0.4.10-py3-none-any.whl
Requirement already satisfied: click in …
The Python package fancyimpute provides several methods for imputing missing values in Python. The documentation gives the following example:
from fancyimpute import IterativeImputer
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Model each feature with missing values as a function of other features, and
# use that estimate for imputation.
X_filled_ii = IterativeImputer().fit_transform(X_incomplete)
This works well when the imputation method is applied to a single dataset X. But what if a training/test split is needed? Once
X_train_filled = IterativeImputer().fit_transform(X_train_incomplete)
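has been called, how do I impute the test set and create X_test_filled? The test set needs to be imputed using information from the training set. I guess IterativeImputer() should return an object that can then be applied to X_test_incomplete. Is that possible?
Note that imputing the whole dataset and then splitting it into training and test sets is not correct.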
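A possible shape for this, sketched with scikit-learn's IterativeImputer rather than fancyimpute's, is to keep the fitted imputer object and call transform on the test set. Whether the installed fancyimpute version exposes the same fit/transform split is an assumption to check against its documentation. The synthetic data and the 10% missing rate below are made up purely for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): column 3 depends on column 0, so there is
# structure for the imputer to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Knock out roughly 10% of the entries at random.
X_incomplete = X.copy()
X_incomplete[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the test rows never influence the fitted imputer.
X_train_incomplete, X_test_incomplete = train_test_split(
    X_incomplete, test_size=0.25, random_state=0
)

# Fit on the training set only, then reuse the fitted object on the test set.
imputer = IterativeImputer(random_state=0)
X_train_filled = imputer.fit_transform(X_train_incomplete)
X_test_filled = imputer.transform(X_test_incomplete)
```

Splitting before fitting keeps the test rows out of the imputation model, which is the leakage concern raised in the question.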