Scikit-learn script gives vastly different results from the tutorial, and raises an error when I change the dataframe columns

Ric*_*ich 1 python csv dataframe pandas scikit-learn

I am working through a tutorial that includes the following section:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model.logistic import LogisticRegression
>>> from sklearn.cross_validation import train_test_split, cross_val_score
>>> df = pd.read_csv('data/sms.csv')
>>> X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
>>> vectorizer = TfidfVectorizer()
>>> X_train = vectorizer.fit_transform(X_train_raw)
>>> X_test = vectorizer.transform(X_test_raw)
>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
>>> print 'Precision', np.mean(precisions), precisions
>>> recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
>>> print 'Recalls', np.mean(recalls), recalls

I then copied it with a few modifications:

ddir = (sys.argv[1])
df = pd.read_csv(ddir + '/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])


vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)


precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')


print 'Precision', np.mean(precisions), precisions
print 'Recalls', np.mean(recalls), recalls

However, even though the code barely differs, the book's results are far better than mine:

Book: Precision 0.992137651822 [ 0.98717949 0.98666667 1. 0.98684211 1. ] Recall 0.677114261885 [ 0.7 0.67272727 0.6 0.68807339 0.72477064]

Mine: Precision 0.108435683974 [ 2.33542342e-06 1.22271611e-03 1.68918919e-02 1.97530864e-01 3.26530612e-01] Recalls 0.235220281632 [ 0.00152053 0.03370787 0.125 0.44444444 0.57142857]

Going back through the script to see what went wrong, I figured that line 18:

X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])

was the culprit, so I changed (df['label'], df['message']) to (df['message'], df['label']). But that gave me an error:

Traceback (most recent call last):
  File "Chapter4[B-FLGTLG]C[Y-BCPM][G-PAR--[00].py", line 30, in <module>
    precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1433, in cross_val_score
    for train, test in cv)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
    self.results = batch()
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1550, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1606, in _score
    score = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.py", line 90, in __call__
    **self._kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 1203, in precision_score
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 984, in precision_recall_fscore_support
    (pos_label, present_labels))
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], 
      dtype='|S4')

What could be the problem here? The data is available here, in case anyone wants to try it: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Chr*_*den 7

The error at the end of the stack trace is the key to understanding what is going on here.

ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')

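You are trying to score your model with precision and recall. Recall that these scoring methods are formulated in terms of true positives, false positives, and false negatives. But how does sklearn know what is positive and what is negative? Is it 'ham' or 'spam'? We need a way to tell sklearn that we consider 'spam' the positive label and 'ham' the negative label. According to the sklearn documentation, the precision and recall scorers expect a positive label of 1 by default, hence the pos_label=1 part of the error message.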
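
To make the role of the positive label concrete, here is a minimal pure-Python sketch of how precision and recall depend on which class is treated as positive. The helper name `precision_recall` and the toy labels are assumptions made up for illustration; they are not part of sklearn or of the original script, and the sketch assumes Python 3 division.

```python
def precision_recall(y_true, y_pred, pos_label):
    """Return (precision, recall), treating pos_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for illustration
y_true = ["ham", "ham", "ham", "ham", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "ham"]

spam_scores = precision_recall(y_true, y_pred, pos_label="spam")
ham_scores = precision_recall(y_true, y_pred, pos_label="ham")

# With pos_label=1 nothing matches, so every count is zero. This helper
# falls back to 0.0; sklearn raises the ValueError shown above instead.
int_scores = precision_recall(y_true, y_pred, pos_label=1)

print(spam_scores)  # (0.5, 0.5)
print(ham_scores)   # (0.75, 0.75)
print(int_scores)   # (0.0, 0.0)
```

The same predictions score differently depending on the chosen positive class, which is why the scorer cannot guess between 'ham' and 'spam' on its own.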

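There are at least three ways to fix this.

1. Encode the 'ham' and 'spam' values as 0 and 1 respectively, directly from the data source, to accommodate the precision/recall scorers: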

# Map dataframe to encode values and put values into a numpy array
encoded_labels = df['label'].map(lambda x: 1 if x == 'spam' else 0).values # ham will be 0 and spam will be 1

# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)
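
One caveat with the lambda above is that any value other than 'spam', including a typo or a stray header row, silently becomes 0. A possible variant, sketched here on a toy dataframe whose rows are made up for illustration, maps with an explicit dict so that unknown labels turn into NaN and can be caught:

```python
import pandas as pd

# Toy stand-in for the real dataframe (rows made up for illustration)
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "message": [
        "see you at noon",
        "win a free prize now",
        "ok thanks",
        "claim your cash reward",
    ],
})

# A dict maps each known label explicitly; labels missing from it become NaN
mapped = df["label"].map({"ham": 0, "spam": 1})

# Fail loudly if any label was not covered by the mapping
if mapped.isna().any():
    raise ValueError("unexpected value in df['label']")

encoded_labels = mapped.astype(int).values
print(encoded_labels)  # [0 1 0 1]
```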

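2. Use sklearn's built-in function (label_binarize) to convert the categorical data to encoded integers, to accommodate the precision/recall scorers:

This converts your categorical data to integers.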

# Encode labels
from sklearn.preprocessing import label_binarize
encoded_column_vector = label_binarize(df['label'], classes=['ham','spam']) # ham will be 0 and spam will be 1
encoded_labels = np.ravel(encoded_column_vector) # Reshape array

# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)
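
The np.ravel step is needed because, for a two-class problem, label_binarize returns a column vector rather than a flat array. A small sketch on toy labels (made up for illustration) shows the shapes involved:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

labels = ["ham", "spam", "spam", "ham"]  # toy labels, made up for illustration

# For two classes the result has shape (n_samples, 1)
encoded_column_vector = label_binarize(labels, classes=["ham", "spam"])
print(encoded_column_vector.shape)  # (4, 1)

# Flatten to the 1-D shape that classifiers and scorers expect for y
encoded_labels = np.ravel(encoded_column_vector)
print(encoded_labels.shape)  # (4,)
print(encoded_labels)        # [0 1 1 0]
```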

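3. Create scorer objects with a custom pos_label argument:

As the documentation states, the precision and recall scores have a pos_label parameter that defaults to 1, but this can be changed to tell the scorer which string represents the positive label. You can construct scorer objects with different parameters using make_scorer.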

# Start out as you did originally with string labels
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
# Fit classifier as normal ...


# Get precision and recall
from sklearn.metrics import precision_score, recall_score, make_scorer
# Precision
precision_scorer = make_scorer(precision_score, pos_label='spam')
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring=precision_scorer)
print 'Precision', np.mean(precisions), precisions

# Recall
recall_scorer = make_scorer(recall_score, pos_label='spam')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring=recall_scorer)
print 'Recalls', np.mean(recalls), recalls

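After making any of these changes to your code, I get mean precision and recall scores of roughly 0.990 and 0.704, consistent with the book's numbers.

Of the three options, I recommend #3 the most, because it is the hardest to get wrong.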
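
For readers on Python 3 and a newer scikit-learn release, where the `sklearn.cross_validation` module has been replaced by `sklearn.model_selection`, option 3 might look roughly like the sketch below. The messages and labels are a tiny toy set made up for illustration, not the SMS Spam Collection, so the printed scores say nothing about the real dataset; `cv=2` is chosen only because the toy set is so small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score

# Toy messages and labels, made up for illustration (not the real dataset)
messages = [
    "are we still meeting for lunch today",
    "win a free prize now click here",
    "can you call me when you get home",
    "free cash reward claim your prize now",
    "see you at the station at six",
    "urgent you have won a free holiday",
    "thanks for the notes from class",
    "claim your free ringtone now text win",
]
labels = np.array(["ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam"])

# Messages are the features, string labels are the targets
X = TfidfVectorizer().fit_transform(messages)
classifier = LogisticRegression()

# Tell each scorer that 'spam' is the positive class
precision_scorer = make_scorer(precision_score, pos_label="spam")
recall_scorer = make_scorer(recall_score, pos_label="spam")

precisions = cross_val_score(classifier, X, labels, cv=2, scoring=precision_scorer)
recalls = cross_val_score(classifier, X, labels, cv=2, scoring=recall_scorer)

print("Precision", np.mean(precisions), precisions)
print("Recall", np.mean(recalls), recalls)
```

Keeping the string labels and passing pos_label explicitly means no separate encoding step has to stay in sync with the data.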