Posts by tum*_*eed

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] when using the RF classifier?

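I am learning about random forests in scikit-learn. As an example, I want to use a random forest classifier for text classification on my own dataset. So first I vectorized the text with tf-idf and ran the classification: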

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10) 
classifier.fit(X_train, y_train)           
prediction = classifier.predict(X_test)

When I ran the classification, I got this:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I used .toarray() on X_train, and I got the following:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

From an earlier question of mine, I understood that I need to reduce the dimensionality of the numpy array, so I did that as well:

from sklearn.decomposition.truncated_svd import TruncatedSVD        
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing)

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X) …
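
One possible reading of the traceback is that the reduction was fitted on the training data only, while predict() still received the unreduced sparse test matrix. The following is a minimal sketch under that assumption, not a confirmed diagnosis: the toy corpus, labels, and n_components=2 are hypothetical stand-ins for the asker's data, and the point it illustrates is that the same fitted vectorizer and TruncatedSVD are applied to both splits.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the asker's dataset.
train_texts = [
    "good movie", "great film", "bad movie",
    "awful film", "good great film", "bad awful movie",
]
y_train = [1, 1, 0, 0, 1, 0]
test_texts = ["great movie", "awful movie"]

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # scipy sparse matrix
X_test = vectorizer.transform(test_texts)        # scipy sparse matrix

# Fit the reduction on the training matrix and apply the SAME fitted
# transformer to the test matrix, so both become dense arrays of equal width.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced_train = svd.fit_transform(X_train)
X_reduced_test = svd.transform(X_test)

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_reduced_test)
```

Recent scikit-learn releases also accept sparse input in RandomForestClassifier directly, so the reduction step may be optional there.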

python nlp numpy machine-learning scikit-learn

tum*_*eed

2015 02-04

8
votes

2
answers

20k
views

How do I apply a function to a pandas DataFrame in chunks?

I know that the apply() function can be used to apply a function to the columns of a DataFrame:

df.applymap(my_fun)

How can I apply my_fun in chunks? For example, in chunks of 1, 5, 10, and 20 rows?
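
A minimal sketch of one way to do this, assuming that my_fun accepts a DataFrame holding one block of rows and returns a DataFrame. The helper name apply_in_chunks and the sample data are hypothetical, not part of pandas.

```python
import pandas as pd


def apply_in_chunks(df, func, chunk_size):
    """Apply func to consecutive blocks of chunk_size rows and concatenate the results."""
    pieces = [
        func(df.iloc[start:start + chunk_size])
        for start in range(0, len(df), chunk_size)
    ]
    return pd.concat(pieces)


def my_fun(chunk):
    # Hypothetical example function: doubles every value in the chunk.
    return chunk * 2


df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
result = apply_in_chunks(df, my_fun, chunk_size=5)
```

The last chunk is simply shorter when the row count is not a multiple of chunk_size, and a chunk_size larger than the frame yields a single chunk.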

python python-3.x pandas

tum*_*eed

lucky-day

8
votes

1
answer

5748
views

Error: inreplace failed when installing with Homebrew?

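I want to install treetagger on OS X. To make this easier, I searched for whether it could be done with Homebrew. I looked around the web and found this formula from the user pepijnkokke. Next, I tried to install treetagger as follows: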

user@MacBook-Pro-User-2:~$ brew install /Users/user/Downloads/tree-tagger.rb

However, I got the following error:

==> Installing dependencies for tree-tagger: openssl, wget
==> Installing tree-tagger dependency: openssl
==> Downloading https://homebrew.bintray.com/bottles/openssl-1.0.2g.el_capitan.
######################################################################## 100.0%
==> Pouring openssl-1.0.2g.el_capitan.bottle.tar.gz
==> Caveats
A CA file has been bootstrapped using certificates from the system
keychain. To add additional certificates, place .pem files in
  /usr/local/etc/openssl/certs

and run
  /usr/local/opt/openssl/bin/c_rehash

This formula is keg-only, which means it was not symlinked into /usr/local.

Apple has deprecated use of OpenSSL in favor of its own …

ruby macos homebrew treetagger

tum*_*eed

2016 04-04

6
votes

1
answer

384
views

How can I install regular Python (via Homebrew) and Miniconda on the same machine?

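I downloaded conda, but I want to use pip and the regular Python build (from Homebrew) for different purposes. If I install python and pip through brew and then install conda, will that work?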

Update

After installing miniconda, I tried to install python via Homebrew, and both Python versions crashed. How can I install miniconda and then python via Homebrew?

python homebrew python-3.x anaconda

tum*_*eed

2017 10-17

6
votes

1
answer

1294
views

How do I plot an SVC classification for an imbalanced dataset with scikit-learn and matplotlib?

I have a text classification task with 2599 documents and 5 labels, from 1 to 5. The documents are distributed as follows:

label | texts
----------
5     |1190
4     |839
3     |239
1     |204
2     |127

I have already tried to classify this text data, with very poor performance, and I also get warnings about ill-defined metrics:

Accuracy: 0.461057692308

score: 0.461057692308

precision: 0.212574195636

recall: 0.461057692308

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
 confussion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

 clasification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94 …
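
The confusion matrix above has every test document predicted as label 5, which is also why the precision for labels 1 to 4 is ill-defined. As a hedged sketch using only NumPy, the reported accuracy and precision can be reproduced from that matrix. The support-weighted averaging used here is an assumption about how the precision was computed, chosen because it matches the numbers shown.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true label, columns: predicted label).
cm = np.array([
    [0, 0, 0, 0, 153],
    [0, 0, 0, 0, 94],
    [0, 0, 0, 0, 194],
    [0, 0, 0, 0, 680],
    [0, 0, 0, 0, 959],
])

support = cm.sum(axis=1)    # number of true documents per label
predicted = cm.sum(axis=0)  # number of predictions per label
true_pos = np.diag(cm)      # correctly classified documents per label

# Accuracy: correct predictions over all test documents.
accuracy = float(true_pos.sum() / cm.sum())

# Per-label precision, set to 0 where a label was never predicted
# (the situation the ill-defined-metric warning refers to).
precision = np.divide(
    true_pos, predicted, out=np.zeros(len(cm)), where=predicted > 0
)

# Support-weighted average of the per-label precision.
weighted_precision = float((precision * support).sum() / support.sum())
```

Because only label 5 is ever predicted, the weighted precision collapses to the square of the accuracy, (959/2080)^2.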

nlp artificial-intelligence machine-learning svm scikit-learn

tum*_*eed

2015 02-24

5
votes

1
answer

1495
views

What is the correct way to combine sparse feature matrices with sklearn?

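The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices to disk as numpy arrays so that I could use them later in some estimator (this was a classification task). Then, when I wanted to use all the features, I simply concatenated the matrices into one large feature matrix and passed it to an estimator.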

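I do not know whether this is the correct way to work with a feature matrix that contains many patterns (counts). What other approaches should I use to combine several types of features correctly? Looking through the documentation, however, I found FeatureUnion, which seems to do this task.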

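For example, say I want to build one large feature matrix from three vectorization approaches: TfidfVectorizer, CountVectorizer, and HashingVectorizer. This is what I tried, following the documentation example: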

#Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
                     header=0, sep=',', names=['id', 'text', 'labels'])

#vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2,2))
#vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2,2))

#vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2,2))


#Combine the above vectorizers in one single feature matrix:

from sklearn.pipeline import  FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", …
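
The asker's snippet is cut off at this point. As a separate, self-contained sketch (not a completion of the truncated code), three vectorizers can be combined with FeatureUnion as follows. The toy texts and the n_features=2**10 setting are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus standing in for the 'text' column of the CSV file.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play",
]

# Each vectorizer sees the same raw texts; FeatureUnion stacks the
# resulting sparse matrices side by side (column-wise).
combined_features = FeatureUnion([
    ("tfidf_vect", TfidfVectorizer(ngram_range=(2, 2))),
    ("bow", CountVectorizer(ngram_range=(2, 2))),
    ("hash_vect", HashingVectorizer(ngram_range=(2, 2), n_features=2**10)),
])

X = combined_features.fit_transform(texts)  # one row per document
```

The result stays sparse, so it can be passed to estimators that accept sparse input without calling toarray().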

python numpy pandas scikit-learn

tum*_*eed

2015 09-04

5
votes

1
answer

3017
views

How do I use pyprind correctly with scikit-learn?

I am currently using pyprind, a library that implements progress bars:

import pyprind
from sklearn.svm import SVC

# Compute training time elapsed
pbar = pyprind.ProgBar(45, width=120, bar_char='?')
for _ in range(45):
    # Fitting
    clf = SVC().fit(X_train, y_train)
    pbar.update()
# End of bar

However, I do not know whether this is the correct way to use pbar, because I think I am fitting clf 45 times. So how should I use pbar correctly?

python python-2.7 python-3.x scikit-learn

tum*_*eed

lucky-day

5
votes

1
answer

395
views

How do I call pypdfocr functions to use them in a Python script?

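I recently downloaded pypdfocr, but the documentation has no example of how to call pypdfocr as a library. Could someone help me call it to convert a single file? I have only found a terminal command: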

$ pypdfocr filename.pdf
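
Since the terminal command is the only documented entry point mentioned here, one fallback is to invoke that command from a script with the standard library. This is a sketch of that workaround, not pypdfocr's library API; the helper names build_pypdfocr_command and run_pypdfocr are hypothetical, and it assumes the pypdfocr executable is on the PATH.

```python
import subprocess


def build_pypdfocr_command(pdf_path):
    """Return the argument list equivalent to `pypdfocr <pdf_path>` in a terminal."""
    return ["pypdfocr", pdf_path]


def run_pypdfocr(pdf_path):
    """Run the pypdfocr command on one file.

    check=True makes subprocess raise CalledProcessError if the command
    exits with a non-zero status.
    """
    return subprocess.run(build_pypdfocr_command(pdf_path), check=True)
```

Calling run_pypdfocr("filename.pdf") would then behave like the terminal command shown above.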

python pdfbox python-2.7 python-3.x

tum*_*eed

lucky-day

5
votes

1
answer

1820
views

How do I reindex malformed columns from pandas read_html?

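I am retrieving content from a website that has several tables with the same number of columns, using pandas read_html. When I read a single link that contains several tables with the same number of columns, pandas effectively reads all of them as one table (something like a flat/normalized table). However, I want to do the same for a list of links from the website (that is, a single flat table for several links), so I tried the following: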

In:

import multiprocessing
def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False) 
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']

pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df

Nevertheless, I think I am not specifying the columns correctly to read_html(), so I get this malformed list of lists:

Out:

[[                Form     Disponibility  \
  0  290090 01780-500-01)  Unavailable - no product available for release.   

                             Relation  \

     Relation drawbacks  
  0                  NaN                        Removed 
  1                  NaN                        Removed ],
 [                                        Form  \

                                   Relation  \
  0  American Regent is currently releasing the 0.4...   
  1  American Regent is currently releasing the 1mg...   

     drawbacks  
  0 …
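
The output above is a list of lists of DataFrames, which is consistent with process() returning df_url (the raw list from read_html) rather than the concatenated df. A minimal sketch of flattening such a nested result into one table follows; the helper name flatten_tables and the sample frames are hypothetical, and no network access is involved.

```python
import pandas as pd


def flatten_tables(tables_per_url):
    """Concatenate a list of per-URL lists of DataFrames into a single DataFrame."""
    frames = [table for tables in tables_per_url for table in tables]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-in for what pool.map(process, links) returns
# when process() hands back the raw read_html list for each link.
tables_per_url = [
    [
        pd.DataFrame({"Form": ["a"], "Relation": ["x"]}),
        pd.DataFrame({"Form": ["b"], "Relation": ["y"]}),
    ],
    [pd.DataFrame({"Form": ["c"], "Relation": ["z"]})],
]

flat = flatten_tables(tables_per_url)
```

Alternatively, process() could return its concatenated frame, in which case a single pd.concat over the pool.map result would be enough.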

python multiprocessing dataframe python-3.x pandas

tum*_*eed

2016 11-17

5
votes

1
answer

77
views

Unhashable type: 'dict' when applying a function with pandas?

I am using the requests library to wrap an API in a function:

import pandas as pd
import requests, json

def foo(text):
    payload = {'key': '00ac1ef82687c7533d54be2e9', 'of': 'json', \
               'nko': text, \
               'woei': 'm', \
               'nvn': 'es'}

    r = requests.get('http://api.example.com/foo', params=payload)
    data = json.loads(r.text)
    return data

Then I want to apply the function above to the following DataFrame:

df:

    colA
0   lore lipsum dolor done
1   lore lipsum
2   done lore
3   dolor lone lipsum

So I tried the following:

df['new_col'] = df['colA'].apply(foo)
df

However, I got the following exception:

/usr/local/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2287
   2288         if is_extension_type(self.dtype):
-> 2289             mapped = self._values.map(f)
   2290         else:
   2291             values …
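
The traceback goes through the is_extension_type branch, which suggests (this is an assumption, since the traceback is truncated) that colA has an extension dtype such as category rather than plain strings. Under that assumption, a hedged sketch of a workaround is to cast the column to object before calling apply. The function fake_foo is a hypothetical stand-in for foo() that returns a dict without making a network request.

```python
import pandas as pd


def fake_foo(text):
    """Hypothetical stand-in for foo(): returns a dict, as json.loads would."""
    return {"text": text, "n_words": len(text.split())}


df = pd.DataFrame({
    "colA": [
        "lore lipsum dolor done",
        "lore lipsum",
        "done lore",
        "dolor lone lipsum",
    ]
})
# Assumed dtype for illustration; the question does not state it.
df["colA"] = df["colA"].astype("category")

# Cast to object first so apply() calls the function once per row value
# and stores each returned dict as an ordinary Python object.
df["new_col"] = df["colA"].astype(object).apply(fake_foo)

# Optionally expand the dicts into separate columns.
expanded = pd.json_normalize(df["new_col"].tolist())
```

If the column is already of object dtype, the cast is a no-op and the cause of the exception lies elsewhere.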

python python-3.x pandas python-requests

tum*_*eed

2017 01-03

5
votes

1
answer

4140
views