How can I use a feature hasher to convert non-numeric discrete data so that it can be passed to an SVM?

use*_*106 2 python numpy machine-learning feature-extraction scikit-learn

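I'm trying to use the CRX dataset from the UCI Machine Learning Repository. This particular dataset contains some features that are not continuous variables, so I need to convert them to numeric values before they can be passed to an SVM.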

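I initially looked into using a one-hot encoder, which takes integer values and converts them into a matrix (e.g. if a feature has three possible values, 'red', 'blue' and 'green', it is converted into three binary features: 1,0,0 for 'red', 0,1,0 for 'blue' and 0,0,1 for 'green'). This would be ideal for my needs, except that it can only handle integer features.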

def get_crx_data(debug=False):

    with open("/Volumes/LocalDataHD/jt306/crx.data", "rU") as infile:
        features_array = []
        reader = csv.reader(infile,dialect=csv.excel_tab)
        for row in reader:
            features_array.append(str(row).translate(None,"[]'").split(","))
        features_array = np.array(features_array)
        print features_array.shape
        print features_array[0]
        labels_array = features_array[:,15]
        features_array = features_array[:,:15]
        print features_array.shape
        print labels_array.shape


        print("FeatureHasher on frequency dicts")

        hasher = FeatureHasher(n_features=44)
        X = hasher.fit_transform(line for line in features_array)

        print X.shape



get_crx_data()

This returns:

Reading CRX data from disk
Traceback (most recent call last):
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 38, in <module>
    get_crx_data()
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 32, in get_crx_data
    X = hasher.fit_transform(line for line in features_array)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 426, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 44, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1649)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 125, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'numpy.ndarray' object has no attribute 'items'

(690, 16)
['0' ' 30.83' ' 0' ' u' ' g' ' w' ' v' ' 1.25' ' 1' ' 1' ' 1' ' 0' ' g'
 ' 202' ' 0' ' +']
(690, 15)
(690,)
FeatureHasher on frequency dicts

Process finished with exit code 1


How can I use feature hashing (or an alternative method) to convert this data from classes (some of which are strings, others are discrete numerical values) into data which can be handled by an SVM? I have also looked into using one-hot coding, but that only takes integers as input.

sen*_*rle 5

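The problem is that the FeatureHasher object expects each row of input to have a particular structure, or really one of three different possible structures. The first possibility is a dictionary of feature_name: value pairs. The second is a list of (feature_name, value) tuples. The third is a flat list of feature_names. In the first two cases, the feature names are mapped to columns in the matrix, and the given values are stored at those columns for each row. In the last case, the presence or absence of a feature in the list is implicitly understood as a True or False value. Here are some simple, concrete examples: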

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='dict')
>>> X_new = hasher.fit_transform([{'a':1, 'b':2}, {'a':0, 'c':5}])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])

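This illustrates the default mode, which is what FeatureHasher uses if you don't pass input_type, as in your original code. As you can see, the expected input is a list of dictionaries, one for each input sample or row of data. Each dictionary contains an arbitrary number of feature names, mapped to the values for that row.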

The output, X_new, contains a sparse representation of the array; calling toarray() returns a new copy of the data as a vanilla numpy array.

If you want to pass pairs of tuples instead, pass input_type='pair'. Then you can do this:

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='pair')
>>> X_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])

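Finally, if you only have boolean values, you don't have to pass values explicitly at all; FeatureHasher will simply assume that if a feature name is present, its value is True (represented here as the floating point value 1.0).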

>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='string')
>>> X_new = hasher.fit_transform([['a', 'b'], ['a', 'c']])
>>> X_new.toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])

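Unfortunately, your data doesn't seem to be consistently in any of these formats. However, it shouldn't be too hard to modify what you have to fit the 'dict' or 'pair' format. Let me know if you need help with that; in that case, please say more about the format of the data you're trying to convert.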
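As a rough illustration of that conversion, here is a minimal sketch that turns rows of string fields into the 'dict' format. The helper names (row_to_dict, hash_rows), the generated column names (f0, f1, ...) and the choice of which columns count as numeric are all assumptions made for the example, not something taken from the dataset's documentation. Categorical fields become "column=value" keys with value 1, which amounts to a one-hot encoding expressed as dictionary keys. The sketch also leaves out the non_negative argument used above, since newer scikit-learn releases no longer accept it.

```python
def row_to_dict(row, numeric_columns):
    """Turn one row of string fields into a {feature_name: value} dict.

    Columns listed in numeric_columns keep their numeric value under a
    per-column name. Every other column becomes a "column=value" key with
    value 1, so each distinct category gets its own feature name.
    """
    features = {}
    for index, raw in enumerate(row):
        value = raw.strip()   # the question's fields carry leading spaces
        name = "f%d" % index  # hypothetical column names: f0, f1, ...
        if index in numeric_columns:
            features[name] = float(value)
        else:
            features["%s=%s" % (name, value)] = 1
    return features


def hash_rows(rows, numeric_columns, n_features=44):
    """Hash rows into a sparse matrix using FeatureHasher's 'dict' input."""
    # Imported here so row_to_dict stays usable without scikit-learn installed.
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=n_features, input_type="dict")
    return hasher.transform(row_to_dict(row, numeric_columns) for row in rows)


# The first row reuses the leading fields printed in the question;
# the second row is made up for illustration.
rows = [
    ["0", " 30.83", " 0", " u", " g"],
    ["1", " 58.67", " 4.46", " u", " p"],
]
example = [row_to_dict(row, numeric_columns={1, 2}) for row in rows]
print(example[0])
```

Calling hash_rows(features_array, numeric_columns=...) on the real feature rows should then give a sparse matrix that can be passed to an SVM. Note that float() raises a ValueError on any non-numeric placeholder for a missing value, so such fields would need handling first if the file contains them.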