Tag: scikit-learn

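Stochastic gradient boosting gives unpredictable results

I'm using the scikit-learn module for Python to implement stochastic gradient boosting. My dataset has 2700 instances and 1700 features (x) and contains binary data. My output vector is "y" and contains 0 or 1 (binary classification). My code is: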

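from sklearn.ensemble import GradientBoostingClassifier

# Newer scikit-learn releases spell this parameter learning_rate (formerly learn_rate).
gb = GradientBoostingClassifier(n_estimators=1000, learning_rate=1, subsample=0.5)
gb.fit(x, y)

print(gb.score(x, y))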

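On some runs it gives an accuracy of 1.0 (100%), and on others I get an accuracy of about 0.46 (46%). Any idea why there is such a huge gap in its performance?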
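
A minimal sketch of one source of run-to-run variation, not a diagnosis of the data above: with subsample below 1, each tree is fit on a random subset of rows, so two fits can differ unless random_state is fixed. The synthetic data, the seed and the smaller n_estimators are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary data (an assumption; stands in for the real x and y).
rng = np.random.RandomState(0)
x = rng.randint(0, 2, size=(200, 30))
y = rng.randint(0, 2, size=200)


def fit_and_score(seed):
    """Fit a stochastic gradient boosting model with a fixed seed; return training accuracy."""
    gb = GradientBoostingClassifier(
        n_estimators=20, learning_rate=1.0, subsample=0.5, random_state=seed
    )
    gb.fit(x, y)
    return gb.score(x, y)


# Same seed means the same row subsamples and therefore the same model.
score_a = fit_and_score(seed=42)
score_b = fit_and_score(seed=42)
print(score_a, score_b)
```

Fixing the seed only makes a run repeatable. It does not say whether a learning rate of 1 is a sensible setting for this dataset.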

python machine-learning scikits scikit-learn

los*_*_19

0 votes | 1 answer | 1536 views

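scikit-learn - explained_variance_score

I'm using scikit-learn to build a classifier that is trained and tested on samples with an SVM. Now I want to analyse the classifier and came across explained_variance_score, but I don't understand this score. For example, I get the classification report of the clf and it looks like this...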

             precision    recall  f1-score   support

        0.0       0.80      0.80      0.80        10
        1.0       0.80      0.80      0.80        10

avg / total       0.80      0.80      0.80        20


Not bad, but the EVS is only 0.2... and sometimes -0.X... How can that happen? Is it important to have a good EVS? Maybe someone can explain this to me...

Y_true and Y_pred:

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]

[ 1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  0.
  0.  0.]
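
A short sketch of where the 0.2 comes from, assuming the usual single-output definition of explained variance, 1 - Var(y_true - y_pred) / Var(y_true), which as far as I know is what sklearn.metrics.explained_variance_score reports. It is a regression metric, so on 0/1 labels it is much harsher than accuracy. The arrays are copied from the question and only NumPy is used.

```python
import numpy as np

# Labels copied from the Y_true and Y_pred arrays above.
y_true = np.array([1] * 10 + [0] * 10, dtype=float)
y_pred = np.array(
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=float
)

# Accuracy: fraction of labels predicted correctly.
accuracy = np.mean(y_true == y_pred)

# Explained variance: 1 - Var(y_true - y_pred) / Var(y_true).
residual = y_true - y_pred
evs = 1.0 - np.var(residual) / np.var(y_true)

print(accuracy)       # 0.8
print(round(evs, 3))  # 0.2
```

Four wrong labels out of twenty give a residual variance of 0.2 against a label variance of 0.25, so 80% accuracy and an EVS of 0.2 describe the same predictions.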


python classification machine-learning svm scikit-learn

Lin*_*nda

0 votes | 1 answer | 1079 views

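How to train an SVM in scikit-learn on training data stored in a CSV file

I have my training data in a CSV file, where the first element of each row is the result and the remaining elements form the feature vector.

I have been using Weka to train and test various algorithms on this training data. Now I want to reuse a trained model many times to test feature vectors that are not part of the training data, and I have no idea how to go about it. I think I could do this with scikit-learn. Please provide some help.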
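
One possible sketch of the workflow described above, assuming a purely numeric CSV with the label in the first column. The inline csv_text, the linear kernel and the sample values are assumptions made for illustration; a real file path would replace the io.StringIO object.

```python
import io

import numpy as np
from sklearn.svm import SVC

# Stand-in for the CSV file (an assumption): label first, features after.
csv_text = """0,1.0,2.0
0,1.5,1.8
1,5.0,8.0
1,6.0,9.0
"""

data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
y_train = data[:, 0]   # first column is the result
X_train = data[:, 1:]  # remaining columns form the feature vector

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# The fitted model can be reused for any number of unseen feature vectors.
new_vector = np.array([[5.5, 8.5]])
prediction = clf.predict(new_vector)
print(prediction)
```

To reuse the model across separate runs, the fitted clf could be saved and reloaded with joblib.dump and joblib.load.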

csv machine-learning svm weka scikit-learn

Pee*_*ush

0 votes | 1 answer | 3269 views

Trouble installing scikit-learn on Ubuntu

I just downloaded the scikit-learn package from https://pypi.python.org/pypi/scikit-learn/.
Before installing the package, I installed several dependencies with apt-get:

sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev


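After changing to the installation directory, I ran python setup.py install, but got the response error: could not create '/usr/local/lib/python2.7/dist-packages/sklearn': Permission denied

I found that the problem has to do with ATLAS and BLAS, but I'm not familiar with them, so I need some help solving it. Here are the details from my terminal: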

Appending sklearn.__check_build configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
Warning: Assuming default configuration (svm/tests/{setup_tests,setup}.py was not found)
Appending sklearn.svm.tests configuration to sklearn.svm
Ignoring attempt to set 'name' (from 'sklearn.svm' to 'sklearn.svm.tests')
/usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1423: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg …


python ubuntu blas atlas scikit-learn

Jun*_* HU

0 votes | 1 answer | 10k views

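How to force scikit-learn's DictVectorizer not to discard features?

I'm trying to use scikit-learn for a classification task. My code extracts features from the data and stores them in a dictionary like this: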

feature_dict['feature_name_1'] = feature_1
feature_dict['feature_name_2'] = feature_2


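When I split my data with sklearn.cross_validation in order to test it, everything works fine. The problem arises when the test data is a new set that is not part of the learning set (although it has exactly the same features for each sample). After fitting the classifier on the learning set, I get this error when I call clf.predict: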

ValueError: X has different number of features than during model fitting.


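I assume this has to do with the following (from the DictVectorizer documentation):

Named features not encountered during fit or fit_transform will be silently ignored.

Is DictVectorizer dropping some features? How can I disable or work around this behaviour?

Thanks

=== EDIT ===

The problem, as larsMans suggested, was that I was fitting the DictVectorizer twice.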
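
A minimal sketch of the pattern the edit describes: fit the vectorizer once on the learning set, then only transform new data. The feature dictionaries and their values are hypothetical stand-ins for the real extracted features.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dicts standing in for the real learning and test sets.
train_dicts = [
    {"feature_name_1": 1.0, "feature_name_2": "a"},
    {"feature_name_1": 2.0, "feature_name_2": "b"},
]
test_dicts = [
    {"feature_name_1": 3.0, "feature_name_2": "a"},
    {"feature_name_1": 4.0, "feature_name_2": "c"},  # "c" was never seen in fit
]

vectorizer = DictVectorizer()

# Fit once, on the learning set only.
X_train = vectorizer.fit_transform(train_dicts)

# Reuse the fitted vectorizer. Calling fit_transform again here can build a
# different column layout, which is what leads to the feature-count error.
X_test = vectorizer.transform(test_dicts)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same columns, so a classifier fitted on X_train can predict on X_test. The unseen value "c" is simply ignored.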

python classification scikit-learn

Wea*_*Fox

2013 11-06

0 votes | 1 answer | 1490 views

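Using Python generators in scikit-learn

I'd like to know whether, and how, a Python generator can be used as data input to the .fit() function of a scikit-learn classifier. Given the huge amount of data, this seems to make sense to me.

In particular, I'm about to implement a random forest approach.

Regards, K.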
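
A hedged sketch of one way to feed a generator to scikit-learn. RandomForestClassifier has no partial_fit method, so this swaps in SGDClassifier, which does support incremental training. The batch_generator function and its synthetic data are hypothetical stand-ins for a real data stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def batch_generator(n_batches=5, batch_size=20, n_features=4, seed=0):
    """Yield (X, y) batches of synthetic data (hypothetical stand-in for a data stream)."""
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] > 0).astype(int)  # label depends on the first feature
        yield X, y


clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

# Only one batch is held in memory at a time.
for X_batch, y_batch in batch_generator():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

predictions = clf.predict(np.array([[2.0, 0.0, 0.0, 0.0], [-2.0, 0.0, 0.0, 0.0]]))
print(predictions)
```

For a random forest specifically, the generator output would have to be collected into a single array before calling fit.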

python generator random-forest scikit-learn

Krn*_*Krn

0 votes | 1 answer | 2720 views

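Can sklearn's LabelBinarizer work like DictVectorizer?

I have a dataset that includes both numeric and categorical features, where the categorical features can contain a list of labels. For example: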

RecipeId   Ingredients    TimeToPrep
1          Flour, Milk    20
2          Milk           5
3          Unobtainium    100


If I only had one ingredient per recipe, DictVectorizer would handle the encoding into the appropriate dummy variables gracefully:

from sklearn.feature_extraction import DictVectorizer

RecipeData = [{'RecipeID': 1, 'Ingredients': 'Flour', 'TimeToPrep': 20},
              {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5},
              {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]
dc = DictVectorizer()
dc.fit_transform(RecipeData).toarray()

This gives the output:

array([[   1.,    0.,    0.,    1.,   20.],
       [   0.,    1.,    0.,    2.,    5.],
       [   0.,    0.,    1.,    3.,  100.]])


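The integer features are handled correctly, while the categorical labels are encoded as boolean features.

However, DictVectorizer cannot handle list-valued features and chokes on

RecipeData = [{'RecipeID': 1, 'Ingredients': ['Flour', 'Milk'], 'TimeToPrep': 20}, {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5}, {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]

LabelBinarizer handles this correctly, but the categorical variables have to be extracted and processed separately: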

from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
lb.fit_transform([('Flour','Milk'), ('Milk',), ('Unobtainium',)])
array([[1, 1, 0],
       [0, 1, 0],
       [0, 0, 1]])


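This is how I currently do it: extract the categorical features containing label lists from the mixed numeric/categorical input array, transform them with LabelBinarizer, then glue the numeric features back on.

Is there a more elegant way?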
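
One possible approach, sketched under the assumption that a pre-processing step is acceptable: expand each list-valued entry into one "key=label" indicator per label, so that DictVectorizer only ever sees numeric values. The expand_list_features helper is hypothetical, not part of scikit-learn.

```python
from sklearn.feature_extraction import DictVectorizer

recipe_data = [
    {"RecipeID": 1, "Ingredients": ["Flour", "Milk"], "TimeToPrep": 20},
    {"RecipeID": 2, "Ingredients": ["Milk"], "TimeToPrep": 5},
    {"RecipeID": 3, "Ingredients": ["Unobtainium"], "TimeToPrep": 100},
]


def expand_list_features(record):
    """Turn each list-valued entry into one 'key=label' indicator per label (hypothetical helper)."""
    expanded = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            for label in value:
                expanded[f"{key}={label}"] = 1
        else:
            expanded[key] = value
    return expanded


vectorizer = DictVectorizer()
encoded = vectorizer.fit_transform(
    [expand_list_features(r) for r in recipe_data]
).toarray()

print(vectorizer.feature_names_)
print(encoded)
```

The numeric and categorical columns come out of a single transformer, so no separate extraction and re-gluing step is needed.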

python machine-learning scikit-learn

Noa*_*men

2014 01-15

0 votes | 1 answer | 1028 views

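AttributeError: 'RandomOverSampler' object has no attribute 'fit_sample'

I'm trying to use RandomOverSampler from imblearn, but I get an error.

Looking at other posts, this seems to be a problem with older versions, but I checked my versions and found: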

sklearn.__version__
'0.24.1'

imblearn.__version__
'0.8.0'

Here is the code I'm trying to run:

from imblearn.over_sampling import RandomOverSampler

OS = RandomOverSampler(sampling_strategy='auto', random_state=0)
osx, osy = OS.fit_sample(X, y)

The error I get is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-a080b92fc7bc> in <module>
      2 
      3 OS = RandomOverSampler(sampling_strategy='auto', random_state=0)
----> 4 osx, osy = OS.fit_sample(X, y)

AttributeError: 'RandomOverSampler' object has no attribute 'fit_sample'


python scikit-learn imblearn

flb*_*rne

2021 03-18

0 votes | 1 answer | 9210 views

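Can you explain outlier filtering?

I have a dataframe that a priori contains outliers. I'd like to remove the outliers from at least the "Rainfall" variable. I proceed as follows. It seems to work, but I still have outliers in the second plot. Is that normal?

[Figure: before removing outliers]

[Figure: after removing outliers]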

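import numpy as np

rainfall = df["Rainfall"]
q3 = np.quantile(rainfall, 0.75)
q1 = np.quantile(rainfall, 0.25)

iqr = q3 - q1

# Tukey fences: the upper bound extends from Q3, the lower bound from Q1.
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# Keep only the rows that fall inside the fences.
rainfall_wo_outliers = df[(rainfall >= lower_bound) & (rainfall <= upper_bound)]["Rainfall"]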
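
A small sketch of why a second box plot can still show outliers after filtering, assuming the plot uses the usual 1.5 * IQR whiskers: the fences are recomputed from the filtered data, whose quartiles are tighter. The iqr_bounds helper and the toy rainfall values are assumptions, not the data from the question.

```python
import numpy as np


def iqr_bounds(values):
    """Return the (lower, upper) Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Toy rainfall values (an assumption, not the data from the question).
rainfall = np.array([0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 40], dtype=float)

lower, upper = iqr_bounds(rainfall)
kept = rainfall[(rainfall >= lower) & (rainfall <= upper)]

# A second box plot recomputes the fences from the filtered data only.
new_lower, new_upper = iqr_bounds(kept)
still_flagged = kept[(kept < new_lower) | (kept > new_upper)]

print(kept)
print(still_flagged)
```

Here the value 40 is removed by the first pass, but 8 lies outside the recomputed fences, so it would be drawn as an outlier in a plot of the filtered data.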
