Posts by Nnn*_*Nnn

Isolation Forest in Python

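I am currently using Isolation Forest in Python to detect outliers in a dataset, but I do not fully understand the example and explanation given in the scikit-learn documentation.

Is it possible to use Isolation Forest to detect outliers in a dataset with 258 rows and 10 columns?

Do I need a separate dataset to train the model? If so, does the training dataset have to be free of outliers?

Here is my code: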

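import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]  # two clusters, centred at (2, 2) and (-2, -2)
# X_test was not defined in the original post; it is generated here the same way as X_train
X_test = np.r_[0.3 * rng.randn(20, 2) + 2, 0.3 * rng.randn(20, 2) - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)  # 1 = inlier, -1 = outlier
y_pred_test = clf.predict(X_test)
print(len(y_pred_train))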


I tried loading my own dataset into X_train, but that does not seem to work.
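For reference, here is a minimal sketch of fitting Isolation Forest directly on one table of that size, without a separate training set. The data is a synthetic stand-in (258 rows, 10 numeric columns) and the shifted rows are an assumption made purely for illustration; a real dataset would replace `X`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real data: 258 rows x 10 numeric columns
rng = np.random.RandomState(42)
X = rng.randn(258, 10)
X[:5] += 6  # shift a few rows so there is something unusual in the data

clf = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
labels = clf.fit_predict(X)         # 1 = inlier, -1 = outlier
scores = clf.decision_function(X)   # lower values = more anomalous

print(labels.shape, np.unique(labels))
```

Isolation Forest is unsupervised, so fitting and predicting on the same array is a common way to use it; whether the flagged rows are meaningful outliers still has to be checked against the data.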

outliers python-3.x scikit-learn anomaly-detection

Nnn*_*Nnn

2019-02-18

6 votes

1 answer

5141 views

Isolation Forest: categorical data

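I am trying to detect anomalies in the breast cancer dataset using Isolation Forest from sklearn. The dataset is mixed (it contains categorical columns), and when I fit the model I get a ValueError.

Here is my dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/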
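As a point of comparison, the sketch below shows one common way to make string-valued columns numeric before fitting: one-hot encoding with `pd.get_dummies`. The small frame, its column names, and its values are hypothetical stand-ins, not the actual dataset, and whether one-hot encoding suits Isolation Forest for a given problem is a modelling judgment.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical toy frame with string-valued columns (illustrative values only)
df = pd.DataFrame({
    "age": ["40-49", "50-59", "40-49", "60-69", "30-39", "50-59"],
    "menopause": ["premeno", "ge40", "premeno", "ge40", "premeno", "lt40"],
    "deg_malig": [1, 2, 2, 3, 1, 2],
})

# One-hot encode the string columns so that every feature is numeric
X_encoded = pd.get_dummies(df, columns=["age", "menopause"], dtype=float)

clf = IsolationForest(random_state=42)
clf.fit(X_encoded)
pred = clf.predict(X_encoded)  # 1 = inlier, -1 = outlier

print(X_encoded.shape, pred)
```

Each category becomes its own 0/1 column, so the estimator only ever sees numbers.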

Here is my code:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# data_cancer: DataFrame holding the dataset (the loading code is not shown in the post)
X = data_cancer.drop(['Class'], axis=1)
y = data_cancer['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)
X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))

clf = IsolationForest()
clf.fit(X_train)


Here is the error I get: