Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

Sim*_*ely 17 python arrays numpy pandas scikit-learn

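I want to join two numpy arrays. The first holds a set of columns/features produced by running TF-IDF on a single column of text. The second holds one column/feature, which is an integer. So I read in a column of train and test data, run TF-IDF on it, and then want to add the extra integer column, because I think it will help my classifier learn more accurately how it should behave.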

Unfortunately, when I try to run hstack to add this single column to my other numpy array, I get the error in the title.

Here is my code:

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #reduce to size of training set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Here is the output and the error message:

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem, and how can I fix it? As far as I can tell, I should be able to join these columns. What am I misunderstanding?

Thanks.

Edit:

Using the method from the answer below gives the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried printing the dtype of X, and that works fine:

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo in the same way:

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype) 

produces:

'DataFrame' object has no attribute 'dtype'

YS-*_*S-L 18

Since X is a sparse array, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion, the error message is a bit misleading here.

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)  # sparse matrix standing in for the TF-IDF output
xt = np.random.random((10, 1))  # dense column to append
print('X shape:', X.shape)
print('xt shape:', xt.shape)
print('Stacked shape:', np.hstack((X, xt)).shape)  # raises ValueError
#print('Stacked shape:', sparse.hstack((X, xt)).shape)  # This works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one might expect the np.hstack call in the last line of the example to work, but in fact it throws this error:

ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
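
As a rough sketch of that advice (written for Python 3, with made-up stand-in data rather than the asker's TF-IDF matrix), the following appends a dense column to a sparse matrix. The names X, xt and stacked are only illustrative.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the TF-IDF matrix and the extra integer column (made-up data).
X = sparse.random(10, 10000, density=0.01, format="csr")
xt = np.arange(10, dtype=np.float64).reshape(-1, 1)

# sparse.hstack accepts a mix of sparse matrices and dense 2-D arrays
# and returns a sparse result, so the big matrix is never densified.
stacked = sparse.hstack((X, xt), format="csr")
print("Stacked shape:", stacked.shape)
```

Asking for format="csr" keeps the result row-sliceable, which is convenient if the stacked matrix is later split back into train and test rows.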


In fact, I have already answered this in your other question, where you mentioned that another error message comes up:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

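First of all, AllAlexaAndGoogleInfo does not have a dtype because it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values, then check the dtype of that array. Based on the error message, its dtype is object, which means it probably contains non-numeric elements such as strings.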

Here is a minimal example that reproduces the situation:

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print('X:', X.shape, X.dtype)
print('xt:', xt.shape, xt.dtype)
print('Stacked shape:', sparse.hstack((X, xt)).shape)

Error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, before doing the stacking, check whether AllAlexaAndGoogleInfo contains any non-numeric values and fix them.
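
One possible way to do that check and fix, sketched for Python 3 with a small hypothetical DataFrame (the "n/a" entry and the choice to fill missing values with 0 are assumptions for illustration, not something taken from the asker's CSV files):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical data: an "alexarank" column read as strings, with one bad entry.
df = pd.DataFrame({"alexarank": ["12", "305", "n/a", "7"]})
print(df.dtypes)  # a DataFrame exposes .dtypes (plural), not .dtype

# Coerce to numbers; entries that cannot be parsed become NaN.
# Filling NaN with 0 is an arbitrary choice made for this sketch.
rank = pd.to_numeric(df["alexarank"], errors="coerce").fillna(0)
xt = rank.to_numpy(dtype=np.float64).reshape(-1, 1)

# Stand-in for the TF-IDF matrix (made-up data).
X = sparse.random(4, 50, density=0.1, format="csr")
stacked = sparse.hstack((X, xt), format="csr")
print("Stacked:", stacked.shape, stacked.dtype)
```

How the unparseable entries should really be handled (dropped, imputed, or flagged) depends on the data, so the fillna(0) step is the part most likely to need changing.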


Dre*_*ess 12

Use np.column_stack, like this:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
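
A small sketch of the behaviour the docs describe, using made-up dense arrays in Python 3. Note that this covers ordinary dense numpy arrays only; the edit in the question shows column_stack raising a ValueError when X is the sparse TF-IDF matrix.

```python
import numpy as np

# Made-up dense data: a 1-D array and a 2-D array with the same number of rows.
a = np.array([1, 2, 3])
b = np.array([[10, 20], [30, 40], [50, 60]])

# The 1-D array is turned into a column; the 2-D array is stacked as-is.
stacked = np.column_stack((a, b))
print(stacked.shape)  # (3, 3)
```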