Why is my scikit-learn HashingVectorizer giving me floats with binary=True set?

Question

Mar*_*zzi 3 python feature-extraction scikit-learn

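I'm trying to use scikit-learn's Bernoulli Naive Bayes classifier. I had the classifier working fine on a small dataset using CountVectorizer, but ran into trouble when I tried to use HashingVectorizer to handle a larger dataset. Holding everything else constant (training documents, test documents, classifier and feature-extraction settings) and only switching from CountVectorizer to HashingVectorizer caused my classifier to always output the same label for every document.

I wrote the following script to investigate the difference between the two feature extractors: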

from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

cv = CountVectorizer(binary=True, decode_error='ignore')
h = HashingVectorizer(binary=True, decode_error='ignore')

with open('moby_dick.txt') as fp:
    doc = fp.read()

cv_result = cv.fit_transform([doc])
h_result = h.transform([doc])

print(cv_result)
print(repr(cv_result))
print(h_result)
print(repr(h_result))

(where 'moby_dick.txt' is Project Gutenberg's copy of Moby Dick)

(Abbreviated) results:

  (0, 17319)    1
  (0, 17320)    1
  (0, 17321)    1
<1x17322 sparse matrix of type '<type 'numpy.int64'>'
    with 17322 stored elements in Compressed Sparse Column format>

  (0, 1048456)  0.00763203138591
  (0, 1048503)  0.00763203138591
  (0, 1048519)  0.00763203138591
<1x1048576 sparse matrix of type '<type 'numpy.float64'>'
    with 17168 stored elements in Compressed Sparse Row format>

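As you can see, CountVectorizer in binary mode returns the integer 1 for the value of each feature (we would only expect to see 1, since there is only one document); HashingVectorizer, on the other hand, returns floats (all identical within a document, but different documents yield different values). I suspect my problem stems from passing these floats on to BernoulliNB.

Ideally, I would like to get the same binary-format data from HashingVectorizer that I get from CountVectorizer. Failing that, I could use BernoulliNB's binarize parameter if I knew a sane threshold for this data, but I'm not clear on what these floats represent (they are obviously not token counts, since they are all the same and less than 1).

Any help would be appreciated.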

Answer 1

Fre*_*Foo 5

With default settings, HashingVectorizer normalizes your feature vectors to unit Euclidean length:

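>>> import numpy as np
>>> import scipy.linalg
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"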
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.57735027,
         0.        , -0.57735027,  0.        ]])
>>> scipy.linalg.norm(np.abs(X.toarray()))
1.0

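Setting binary=True only postpones this normalization until after the features have been binarized, i.e. after all non-zero values have been set to one. You also have to set norm=None to turn it off: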

>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5,  0. ,  0. ,  0. ,  0.5,  0.5,  0.5,  0. ]])
>>> scipy.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.]])
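
The values in these outputs can be checked by hand. As a minimal sketch, assuming every stored entry is 1 before the L2 normalization is applied, a row with n non-zero features ends up with each entry equal to 1/sqrt(n). That gives 0.5 for the four non-zero features above, and matches the float reported in the question for 17168 stored elements. The helper name `binarized_l2_value` is hypothetical and exists only for illustration.

```python
import math


def binarized_l2_value(n_nonzero):
    """Value of each non-zero entry after L2-normalizing a 0/1 vector.

    Hypothetical helper: a row with n ones has Euclidean norm sqrt(n),
    so dividing by the norm leaves 1/sqrt(n) in every non-zero slot.
    """
    return 1.0 / math.sqrt(n_nonzero)


# Four non-zero features, as in the n_features=8 example.
small = binarized_l2_value(4)

# 17168 stored elements, as reported for the Moby Dick document.
moby = binarized_l2_value(17168)

print(small)
print(moby)
```

This also suggests why different documents yield different values: the float depends only on how many distinct hashed features the document contains.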

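This is also why it returns float arrays: normalization requires them. While the vectorizer could be rigged to return another dtype, that would require a conversion inside the transform method, and probably one back to float in the next estimator.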
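
Putting this together, here is a rough sketch of feeding unnormalized binary hashed features to BernoulliNB. The three toy documents and their labels are made up for illustration, and the integer cast is optional; it is shown only to illustrate the conversion mentioned above, assuming a recent scikit-learn and SciPy.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, invented purely for this example.
docs = ["call me ishmael", "the whale the whale", "call the ship"]
labels = [0, 1, 0]

# binary=True sets every non-zero entry to 1; norm=None skips the
# L2 normalization that would otherwise rescale those ones.
vec = HashingVectorizer(binary=True, norm=None)
X = vec.transform(docs)

# The stored (non-zero) entries are all 1.0, still as floats.
unique_values = np.unique(X.data)

# Optional: cast the sparse matrix to an integer dtype after the fact.
X_int = X.astype(np.int64)

# BernoulliNB accepts the sparse float matrix directly.
clf = BernoulliNB().fit(X, labels)
predictions = clf.predict(X)
```

Casting after `transform` keeps the vectorizer untouched, at the cost of one extra copy of the matrix.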

Archived:	11 years, 8 months ago
Views:	1067
Last activity:	11 years, 7 months ago
