Posts by Max*_*ong

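Python TfidfVectorizer throws "ValueError: empty vocabulary; perhaps the documents only contain stop words"

I am trying to use Python's Tfidf to transform a corpus of text. However, when I try to fit_transform it, I get a ValueError: empty vocabulary; perhaps the documents only contain stop words.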

In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217         vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779 
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782 

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = …
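
A minimal sketch of one way to diagnose this error, assuming the vectorizer is created with its default arguments. scikit-learn's default token_pattern is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, and the default stop_words is None, so a corpus made of single-character tokens can leave the vocabulary empty without any stop words being involved. Whether that is what happened with smallcorp cannot be told from the traceback. The helpers `count_default_tokens` and `has_empty_vocabulary` are hypothetical (not part of scikit-learn) and use only the standard `re` module to mimic that pattern on lowercased text.

```python
import re

# Same regular expression as scikit-learn's default token_pattern
# (assumption: the vectorizer was created with default arguments).
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_default_tokens(documents):
    """Return the number of default-pattern tokens found in each document."""
    return [len(DEFAULT_TOKEN_PATTERN.findall(doc.lower())) for doc in documents]


def has_empty_vocabulary(documents):
    """True if no document yields any token, i.e. the vocabulary would be empty."""
    return sum(count_default_tokens(documents)) == 0


# Hypothetical corpora, for illustration only.
single_chars = ["a b c", "x y z"]
normal_text = ["the cat sat", "a dog"]

print(count_default_tokens(single_chars))
print(has_empty_vocabulary(single_chars))
print(has_empty_vocabulary(normal_text))
```

Running the same check on the real corpus before calling fit_transform shows which documents contribute no tokens at all.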


python tf-idf pandas scikit-learn

Max*_*ong

2017-05-23

12
votes

1
answer

20k
views

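Why doesn't statsmodels GLM have R^2 in its results?

I ran a simple GLM experiment in statsmodels and was puzzled to find that the GLM results do not contain any R^2 attribute.

I feel like I am missing something very simple here: why doesn't GLM compute an R^2, and how can I calculate it myself?

Thanks!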

In [1]: import pandas as np

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import statsmodels.api as sm

In [5]: x  = np.arange(0,10,0.5)

In [6]: y = np.zeros(len(x))

In [7]: y[0] = 0

In [8]: for i in range(1,len(x)):
   ...:         y[i] = 0.5*x[i] + 2.5*y[i-1] + 10*np.random.rand()
   ...:     

In [9]: print y
[  0.00000000e+00   9.35177024e-01   8.18487881e+00   2.95126464e+01
   8.08584645e+01   2.11423251e+02 …
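
A minimal sketch of computing an R^2-style figure by hand, assuming the observed responses and the model's fitted values are available as plain sequences. The classic R^2 is defined through least-squares residual sums of squares, which does not carry over directly to every GLM family, so a deviance-based pseudo-R^2 is a common substitute. The functions `r_squared` and `deviance_pseudo_r_squared` are hypothetical helpers, not statsmodels APIs; with a fitted statsmodels GLM, the inputs would come from the results object's `fittedvalues`, `deviance`, and `null_deviance` attributes. The numbers below are made up for illustration.

```python
def r_squared(observed, fitted):
    """Classic R^2: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def deviance_pseudo_r_squared(deviance, null_deviance):
    """Deviance-based pseudo-R^2: 1 - deviance / null deviance."""
    return 1.0 - deviance / null_deviance


# Hypothetical data, not taken from the session above.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]

print(r_squared(observed, fitted))
print(deviance_pseudo_r_squared(25.0, 100.0))
```

For a Gaussian family with an identity link the two figures coincide, because the deviance is then the residual sum of squares.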


python statsmodels

Max*_*ong

2014-10-24

6
votes

2
answers

2158
views

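How to use find -exec and tr to process a large number of files

I am having a bit of trouble with find and -exec in bash:

Suppose I have a bunch of files in which I need to remove the '\r' characters. (Previous question: Joining columns on the command line with paste or pr not working.) For each file, I want to read it, strip every '\r', and write the result back to the same filename.

The command I am using is find . -exec cat {} | tr -d "\r" > {} \;, but I get two errors: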

tr: extra operand `;'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.
find: missing argument to `-exec'


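It looks like tr is interpreting ';' as an argument, while -exec does not recognize it. Is there a way to change this? I also end up with a file literally named {} in the directory, instead of {} being replaced by the filename.

I have also tried: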

find . -exec cat {} | tr -d "\r" > "new_{}"; \;


But "new_{}" does not become "new_filename"; bash just takes it literally and creates a file named "new_{}".

Thanks!
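
A minimal Python sketch of the same in-place cleanup, offered as an alternative and not as a repair of the find command itself. In the command above, the shell splits the line at `|` and handles `>` before find ever runs, so find never receives the `\;` terminator, tr receives `;` as an extra operand, and the output is redirected to a file literally named `{}`. The function name `strip_carriage_returns` is hypothetical, and the sketch assumes every regular file under the directory should be rewritten and is small enough to read into memory.

```python
import os
import tempfile


def strip_carriage_returns(root):
    """Remove every carriage-return byte from each regular file under root, in place.

    Returns the number of files that were actually modified.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                data = handle.read()
            cleaned = data.replace(b"\r", b"")
            if cleaned != data:
                with open(path, "wb") as handle:
                    handle.write(cleaned)
                changed += 1
    return changed


# Demonstration on a throwaway directory, so no real files are touched.
with tempfile.TemporaryDirectory() as demo_dir:
    sample = os.path.join(demo_dir, "sample.txt")
    with open(sample, "wb") as handle:
        handle.write(b"a\r\nb\r\n")
    print(strip_carriage_returns(demo_dir))
    with open(sample, "rb") as handle:
        print(handle.read())
```

Reading the whole file before reopening it for writing avoids the problem of a redirect truncating the file while it is still being read.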

bash exec find tr

Max*_*ong

2017-05-23

5
votes

1
answer

1914
views

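Getting the mean with pandas GroupBy: DataError: No numeric types to aggregate

I know there are many questions about this, such as "getting daily averages with pandas" and "how to get monthly averages in pandas using groupby", but I am getting a strange error.

I have a simple dataset with one index column (of type timestamp) and one value column, and I want the monthly mean of the data.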

In [76]: df.head()
Out[76]: 
                          A
2008-01-02                1
2008-01-03                2
2008-01-04                3
2008-01-07                4
2008-01-08                5


However, when I group, I only get groups of the index and not the values:

In [74]: df.head().groupby(lambda x: x.month).groups
Out[74]: 
{1: [Timestamp('2008-01-02 00:00:00'),
  Timestamp('2008-01-03 00:00:00'),
  Timestamp('2008-01-04 00:00:00'),
  Timestamp('2008-01-07 00:00:00'),
  Timestamp('2008-01-08 00:00:00')]}


Trying to take the mean results in an error.

I tried both df.head().resample("M", how='mean') and df.head().groupby(lambda x: x.month).mean()

and got the error: DataError: No numeric types to aggregate

In [75]: df.resample("M", how='mean')
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-75-79dc1a060ba4> in <module>()
----> 1 df.resample("M", how='mean')

/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, …
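
A plain-Python sketch of the monthly mean the question is after; it is not pandas code. One common cause of "No numeric types to aggregate" is a value column stored as strings (object dtype) instead of numbers. The excerpt does not show the dtypes, so that cause is an assumption here. The helper `monthly_mean` is hypothetical and converts each value with `float` before averaging, which is the step object-typed data would be missing. In pandas, the analogous check is inspecting `df.dtypes`, and `df.astype(float)` is one way to convert.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(rows):
    """Average values per (year, month); rows are (date, value) pairs.

    Values may be numbers or numeric strings; each is converted with float().
    """
    buckets = defaultdict(list)
    for day, value in rows:
        buckets[(day.year, day.month)].append(float(value))
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# The five rows shown in the question, with values deliberately kept as strings
# to mimic an object-typed column (an assumption about the original data).
rows = [
    (date(2008, 1, 2), "1"),
    (date(2008, 1, 3), "2"),
    (date(2008, 1, 4), "3"),
    (date(2008, 1, 7), "4"),
    (date(2008, 1, 8), "5"),
]

print(monthly_mean(rows))
```

Grouping on the (year, month) pair, not on the month alone, keeps January 2008 separate from January of any other year.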


python group-by pandas

Max*_*ong

2017-05-23

4
votes

1
answer

10k
views