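I'm trying to use Python's Tfidf to transform a corpus of text. However, when I try to fit_transform it, I get a ValueError: empty vocabulary; perhaps the documents only contain stop words.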
In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
1217 vectors : array, [n_samples, n_features]
1218 """
-> 1219 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1220 self._tfidf.fit(X)
1221 # X is already a transformed view of raw_documents so
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
778 max_features = self.max_features
779
--> 780 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
781 X = X.tocsc()
782
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
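725 vocabulary = …

I ran a simple GLM experiment in statsmodels, and I'm puzzled as to why the GLM results do not contain any R^2 attribute.
I feel like I'm missing something very simple here about why GLM has no R^2 calculation, and about how I can calculate it myself.
Thanks!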
In [1]: import pandas as np
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: import statsmodels.api as sm
In [5]: data = pd.DataFrame({'col1':np.arange(10),'col2':np.arange(
KeyboardInterrupt
In [5]: x = np.arange(0,10,0.5)
In [6]:
In [6]: y = np.zeros(len(x))
In [7]: y[0] = 0
In [8]: for i in range(1,len(x)):
...: y[i] = 0.5*x[i] + 2.5*y[i-1] + 10*np.random.rand()
...:
In [9]: print y
[ 0.00000000e+00 9.35177024e-01 8.18487881e+00 2.95126464e+01
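8.08584645e+01 2.11423251e+02 …

I'm having a bit of trouble with find and exec in bash:
Say I have a bunch of files in which I need to replace the '\r' characters. (Previous question: Joining columns on the command line with paste or pr not working.) For each file, I want to read it, remove every '\r', and then write it back to the same filename:
The command I'm using is find . -exec cat {} | tr -d "\r" > {} \;, but I'm getting two errors: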
tr: extra operand `;'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.
find: missing argument to `-exec'
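It seems that tr is interpreting the ';' as an argument, while -exec is not recognizing it. Is there a way to change this? I'm also getting a file literally named {} created in the directory, instead of {} being replaced by the filename.
I've also tried: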
find . -exec cat {} | tr -d "\r" > "new_{}"; \;
But "new_{}" doesn't turn into "new_filename"; bash just takes it literally and creates a file named "new_{}".
Thanks!
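The shell splits the command at the pipe and the redirection before find ever runs, so find is left with an unterminated -exec, tr receives the stray ';', and the output goes to a file literally named {}. The stated goal (read each file, drop every '\r', write it back under the same name) can also be sketched without find or tr. The following is a minimal Python sketch, not the find/tr approach from the question; the function name strip_carriage_returns and the throwaway demo file are made up for illustration.

```python
import os
import tempfile


def strip_carriage_returns(root):
    """Remove every carriage-return byte from each file under root, in place."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Read the whole file first, so it is never truncated by a
            # write while it is still being read.
            with open(path, "rb") as fh:
                data = fh.read()
            cleaned = data.replace(b"\r", b"")
            if cleaned != data:
                with open(path, "wb") as fh:
                    fh.write(cleaned)
                changed.append(path)
    return changed


# Small demonstration in a throwaway directory (hypothetical file name).
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "sample.txt")
with open(demo_file, "wb") as fh:
    fh.write(b"a\r\nb\r\n")

changed_files = strip_carriage_returns(demo_dir)
with open(demo_file, "rb") as fh:
    result = fh.read()
print(changed_files, result)
```

Reading the full contents before reopening the file for writing avoids the trap in the original pipeline, where redirecting output onto the input file would empty it before it had been read.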
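I know there are plenty of questions about this, such as Getting daily averages with pandas and How to get the monthly mean in pandas using groupby, but I'm getting a strange error.
It's a simple dataset, with one index column (of type timestamp) and one value column. I want to get the monthly mean of the data.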
In [76]: df.head()
Out[76]:
A
2008-01-02 1
2008-01-03 2
2008-01-04 3
2008-01-07 4
2008-01-08 5
However, when I group, I only get groups of the index and not of the values:
In [74]: df.head().groupby(lambda x: x.month).groups
Out[74]:
{1: [Timestamp('2008-01-02 00:00:00'),
Timestamp('2008-01-03 00:00:00'),
Timestamp('2008-01-04 00:00:00'),
Timestamp('2008-01-07 00:00:00'),
Timestamp('2008-01-08 00:00:00')]}
Trying to take the mean results in an error:
I tried both df.head().resample("M", how='mean') and df.head().groupby(lambda x: x.month).mean()
and got the error: DataError: No numeric types to aggregate
In [75]: df.resample("M", how='mean')
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-75-79dc1a060ba4> in <module>()
----> 1 df.resample("M", how='mean')
/usr/local/lib/python2.7/site-packages/pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, …
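
This error commonly appears when the value column is not numeric, for example when A was read in as strings and has dtype object. That is an assumption here, since the question does not show df.dtypes. A minimal sketch, using a made-up frame that mirrors the five rows above, converts the column before averaging:

```python
import pandas as pd

# Hypothetical reconstruction of the frame shown above. The values are
# deliberately stored as strings (dtype object) to mimic a non-numeric column.
idx = pd.to_datetime(["2008-01-02", "2008-01-03", "2008-01-04",
                      "2008-01-07", "2008-01-08"])
df = pd.DataFrame({"A": ["1", "2", "3", "4", "5"]}, index=idx, dtype=object)

print(df.dtypes)  # A is object, not a numeric dtype

# Convert to numbers; errors="coerce" turns unparseable entries into NaN.
df["A"] = pd.to_numeric(df["A"], errors="coerce")

# Group by the calendar year and month of the index, then average.
monthly = df.groupby([df.index.year, df.index.month]).mean()
print(monthly)
```

Grouping on both year and month keeps January 2008 separate from January of any other year. In current pandas, resample no longer accepts how=; the mean is taken by chaining .mean() onto the resampler instead.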