TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] while using an RF classifier?

tum*_*eed 8 python nlp numpy machine-learning scikit-learn

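I am learning about random forests in scikit-learn and, as an example, I would like to use a random forest classifier for text classification with my own dataset. So first I vectorized the text with tf-idf, and then for classification: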

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10) 
classifier.fit(X_train, y_train)           
prediction = classifier.predict(X_test)

When I run the classification, I get this:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I used .toarray() on X_train and got the following:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

Judging from a previous question of mine, I need to reduce the dimensionality of the numpy array, so I did that as well:

from sklearn.decomposition.truncated_svd import TruncatedSVD        
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing) 

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

I tried the following:

prediction = classifier.predict(X_train.getnnz()) 

And got this:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()

Two questions arise from this: how can I use random forests to classify correctly, and what is happening with X_train?

Then I tried the following:

df = pd.read_csv('/path/file.csv',
header=0, sep=',', names=['id', 'text', 'label'])



X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report
print '\nscore:', classifier.score(a_train, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\n confussion matrix:\n',confusion_matrix(b_test, prediction)
print '\n clasification report:\n', classification_report(b_test, prediction)

hpa*_*ulj 8

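I don't know much about sklearn, though I vaguely recall some earlier questions that were triggered by its switch to using sparse matrices. Internally, some of the matrices had to be replaced with m.toarray() or m.todense().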
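As a small illustration of those two conversions (a sketch with a made-up 2x2 matrix, not something taken from sklearn's internals): toarray() returns a plain numpy.ndarray, while todense() on a sparse matrix returns a numpy.matrix. Both are dense, and both support len().

```python
import numpy as np
from scipy import sparse

# Made-up example data; any 2-d array would do here.
M = sparse.csr_matrix(np.array([[0, 1], [3, 4]]))

as_array = M.toarray()    # plain numpy.ndarray
as_matrix = M.todense()   # numpy.matrix, a 2-d subclass of ndarray

# Unlike the sparse matrix itself, both dense forms have a length (row count).
print(type(as_array).__name__, len(as_array))
print(type(as_matrix).__name__, len(as_matrix))
```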

But to give you an idea of what the error message means, consider:

In [907]: A=np.array([[0,1],[3,4]])
In [908]: M=sparse.coo_matrix(A)
In [909]: len(A)
Out[909]: 2
In [910]: len(M)
...
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

In [911]: A.shape[0]
Out[911]: 2
In [912]: M.shape[0]
Out[912]: 2

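len() is normally used in Python to count the number of first-level items in a list. Applied to a 2d array, it gives the number of rows. But A.shape[0] is a better way of counting the rows, and M.shape[0] gives the same thing. In this case you are not interested in .getnnz, which is the number of nonzero terms of a sparse matrix. A does not have this method, though the count can be derived from A.nonzero().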
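Here are the same points as a plain script, for anyone who wants to run them outside IPython. This is only a sketch: np.count_nonzero is my choice for counting the nonzero entries of the dense array, and the exact wording of the TypeError may differ between scipy versions.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 1], [3, 4]])
M = sparse.coo_matrix(A)

# Row counts: shape[0] works for both the dense array and the sparse matrix.
n_rows_dense = A.shape[0]
n_rows_sparse = M.shape[0]

# Number of stored nonzero entries, which is NOT the number of rows.
nnz_sparse = M.getnnz()
nnz_dense = np.count_nonzero(A)  # dense counterpart (my choice of helper)

# len() is defined for the dense array but refused by the sparse matrix.
try:
    len(M)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

print(n_rows_dense, n_rows_sparse, nnz_sparse, nnz_dense)
print(len_error)
```

Passing M.getnnz() to predict therefore hands it a single integer rather than a matrix, which is why the second traceback in the question complains about an int having no len().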


JAB*_*JAB 5

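It is a bit unclear whether the same data structure (type and shape) is being passed to the classifier's fit method and its predict method. Random forests take a long time to run with a large number of features, hence the suggestion to reduce the dimensionality in the post you link to.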

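You should apply the SVD to both the training and the test data, so that the classifier is trained on input of the same shape as the data you want to predict on. Check that the input to fit and the input to the predict method have the same number of features, and that both are arrays rather than sparse matrices.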

Updated example (updated to use a DataFrame):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# train_test_split lives in sklearn.model_selection since 0.18
# (older releases had it in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': [0, 0, 0, 1, 1, 1, 0, 3]})

X = tfidf_vect.fit_transform(df['text'].values)  # sparse tf-idf matrix
y = df['class'].values

from sklearn.decomposition import TruncatedSVD
pca = TruncatedSVD(n_components=2)
X_reduced_train = pca.fit_transform(X)  # dense numpy array with 2 columns

# split the reduced (dense) array rather than the sparse X
a_train, a_test, b_train, b_test = train_test_split(X_reduced_train, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)          # already dense, so no .toarray() is needed
prediction = classifier.predict(a_test)

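Note that the SVD happens before the split into training and test sets, so that the array passed to predict has the same n (number of features) as the array the fit method is called on.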
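If the test documents only arrive after training, the same advice about matching feature counts can be followed by fitting the vectorizer and the SVD on the training text and reusing the fitted objects on the test text. The following is a minimal sketch of that variation, not the approach used in the example above; the toy texts, labels and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, only here to make the sketch runnable.
train_texts = ['cat on the', 'angel eyes has', 'blue red angel',
               'one two blue', 'blue whales eat', 'hot tin roof']
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ['angel eyes has', 'have a cat']

vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)

# Fit both transformers on the training text only...
X_train = svd.fit_transform(vectorizer.fit_transform(train_texts))
# ...and reuse the fitted objects on the test text (transform, not fit_transform).
X_test = svd.transform(vectorizer.transform(test_texts))

# Both are dense arrays with the same number of columns.
print(type(X_train).__name__, X_train.shape)
print(type(X_test).__name__, X_test.shape)

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train, train_labels)
prediction = classifier.predict(X_test)
print(prediction)
```

Calling fit_transform on the test text instead would build a new vocabulary and a new projection, so the columns would no longer mean the same thing as in training even if their count happened to match.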