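Scikit-learn's CountVectorizer class lets you pass the string 'english' to the stop_words parameter. I want to add some entries to this predefined list. Can anyone tell me how to do this?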
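One possible approach, as a minimal sketch: scikit-learn exposes its built-in English list as the frozenset ENGLISH_STOP_WORDS in sklearn.feature_extraction.text, so it can be merged with custom words and passed to stop_words as a plain list. The helper names extend_stop_words and make_vectorizer are made up for illustration.

```python
def extend_stop_words(base_words, extra_words):
    """Merge a base stop-word collection with extra words into a sorted list."""
    return sorted(set(base_words) | set(extra_words))


def make_vectorizer(extra_words):
    """Build a CountVectorizer that ignores the built-in English list plus extra_words."""
    # Imported here so extend_stop_words can be used without scikit-learn installed.
    from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

    # stop_words accepts a list of strings as well as the string 'english'.
    return CountVectorizer(stop_words=extend_stop_words(ENGLISH_STOP_WORDS, extra_words))
```

Calling make_vectorizer(["foo", "bar"]) should then give a vectorizer that skips the usual English stop words as well as foo and bar.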
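I am working on a logistic regression model, and I can't figure out how to apply the model fitted on my training set to my test set. Sorry, I'm new to Python and also new to statsmodels.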
import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation
independent_vars = phy_train.columns[3:]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)
X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']
logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()
y_pred = logit.predict(X_test[subset])
From the last line, I get this error:
>>> y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
File "", line 1
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned
My training and test datasets have the same number of variables, so I'm sure I am misunderstanding what logit.predict() actually does.
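A likely explanation, sketched with NumPy only and made-up data: the traceback shows the model-level predict computing cdf(np.dot(exog, params)), which suggests its first argument is the parameter vector. Passing X_test there puts the test matrix in the params slot, and the dot product against the training design matrix cannot align. In statsmodels, the object returned by logit.fit() (result in the code above) has its own predict method that already carries the fitted parameters, so result.predict(X_test[subset]) is probably what was intended. The function model_predict and the parameter values below are stand-ins, not statsmodels code.

```python
import numpy as np


def logistic_cdf(z):
    """Logistic CDF, the link used by a logit model."""
    return 1.0 / (1.0 + np.exp(-z))


def model_predict(params, exog):
    """Stand-in for the line in the traceback: cdf(np.dot(exog, params))."""
    return logistic_cdf(np.dot(exog, params))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(140, 3))          # made-up training design matrix
X_test = rng.normal(size=(60, 3))            # made-up test design matrix
fitted_params = np.array([0.5, 1.0, -1.0])   # stand-in for result.params

# Intended call: fitted parameters applied to the test rows.
y_pred = model_predict(fitted_params, X_test)

# What the failing call amounts to: X_test lands in the params slot,
# and the training matrix is used as exog by default.
try:
    model_predict(X_test, X_train)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Here y_pred has one probability per test row, while the second call raises a ValueError about misaligned shapes, much like the one in the question.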
I am running Python 2.7 on Windows 7.
Here is what I ran:
>>> import matplotlib.pyplot as plt
Then I get this:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
import matplotlib.pyplot as plt
File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 29, in <module>
from matplotlib.figure import Figure, figaspect
File "C:\Python27\lib\site-packages\matplotlib\figure.py", line 36, in <module>
from matplotlib.axes import Axes, SubplotBase, subplot_class_factory
File "C:\Python27\lib\site-packages\matplotlib\axes.py", line 20, in <module>
import matplotlib.dates as _ # <-registers a date unit converter
File "C:\Python27\lib\site-packages\matplotlib\dates.py", line 119, in <module>
from dateutil.rrule import (rrule, MO, TU, WE, TH, FR, SA, …
I am working through Wes McKinney's book "Python for Data Analysis", and in the "Correlation and Covariance" section on page 139 I get an error when I try to run his code to fetch data from Yahoo! Finance.
Here is what I am running:
#CORRELATION AND COVARIANCE
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2003', '1/1/2013')
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})
Here is the error I get:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\io\data.py", line 390, in get_data_yahoo
adjust_price, ret_index, chunksize, 'yahoo', name)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\io\data.py", line 336, in _get_data_from
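hist_data = …
I want to include a Fortran subroutine in an R package. I have only ever built packages using devtools and roxygen (so my knowledge may be quite limited). I am getting an error that prevents me from installing the package, saying it is not a Win32 application...
I am using Rtools 3.3. My session info: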
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] roxygen2_5.0.1 devtools_1.9.1
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.2 Rcpp_0.12.1 memoise_0.2.1 stringi_1.0-1 stringr_1.0.0 digest_0.6.8
To initially build the package, I run this:
library(devtools)
library(roxygen2)
setwd("C:/panterasBox")
create("myPack")
setwd("C:/panterasBox/myPack")
dir.create("C:/panterasBox/myPack/src")
Here is the Fortran code, saved as myFunc.f in the /src directory:
subroutine myFunc(x)
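implicit …
I am applying a transformation to a variable in a pandas DataFrame, and then I want to replace the column with the new values. The problem seems to be that after the transformation, the length of the array is not the same as the length of my DataFrame's index. I don't believe that is true.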
>>> df['variable'] = stats.boxcox(df.variable)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2119, in __setitem__
self._set_item(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2165, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2205, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
When I check the lengths, they do seem to disagree. len() of the result says it is 2, but when I call stats.boxcox the output appears to have 50000 entries. What is going on here?
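A probable explanation, as a small sketch on synthetic data (the original df is not shown): when no lmbda argument is given, scipy.stats.boxcox returns a tuple of two items, the transformed array and the fitted lambda. len() of that tuple is 2, even though the first item has one value per input row.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
values = rng.lognormal(size=1000)   # synthetic, strictly positive data

result = stats.boxcox(values)       # tuple: (transformed array, fitted lambda)
transformed, fitted_lambda = result

tuple_length = len(result)          # length of the tuple, not of the data
data_length = len(transformed)      # matches the number of input values
```

Under that reading, unpacking the tuple first and assigning only the transformed array back to the column, for example df['variable'] = stats.boxcox(df.variable)[0], should keep the lengths consistent.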
>>> len(df)
50000
>>> len(stats.boxcox(df.variable))
2
>>> stats.boxcox(df.variable)
(0 -0.079496
1 -0.117982
2 -0.104637
...
49985 -0.041300
49986 0.651771
49987 -0.115660 …
At my job, I would like to know which functions are available in our HiveQL environment. Is there a statement I can run that lists all available functions? For example:
SELECT * FROM all_available_functions;
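I have some data that I need to group into bins. Instead of labeling the bins as 0, 1, 2, 3, and so on, I want the output to be the mean or median of each bin. Is there a way to do this?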
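One way this might be done with pandas, as a sketch on made-up numbers: pd.cut with labels=False assigns an integer bin code to each value, and groupby(...).transform then replaces each value with a statistic of its bin. The choice of three equal-width bins is an assumption for illustration.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0])

# Integer bin code (0, 1, 2) for each value, using three equal-width bins.
bin_codes = pd.cut(values, bins=3, labels=False)

# Replace each value by the mean (or median) of the bin it falls into.
binned_mean = values.groupby(bin_codes).transform("mean")
binned_median = values.groupby(bin_codes).transform("median")
```

With this data the first three values map to 2.0, the next three to 11.0, and the last three to 21.0, for both the mean and the median.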
I want to convert two lists into a PySpark DataFrame, where each list becomes a column.
I tried
a=[1, 2, 3, 4]
b=[2, 3, 4, 5]
sqlContext.createDataFrame([a, b], schema=['a', 'b']).show()
but I got
+---+---+---+---+
| a| b| _3| _4|
+---+---+---+---+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
+---+---+---+---+
What I really want is:
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
Is there a convenient way to create this result?
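A sketch of one option: createDataFrame treats each element of the input list as a row, so passing [a, b] yields two rows of four values. Zipping the lists first turns them into one tuple per row. The helper names lists_to_rows and to_spark_dataframe are made up for illustration, and sql_context is assumed to be an existing SQLContext or SparkSession.

```python
def lists_to_rows(*columns):
    """Turn equal-length column lists into a list of row tuples."""
    return list(zip(*columns))


def to_spark_dataframe(sql_context, rows, names):
    """Build a DataFrame from row tuples; sql_context is assumed to already exist."""
    return sql_context.createDataFrame(rows, schema=names)


a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
rows = lists_to_rows(a, b)   # [(1, 2), (2, 3), (3, 4), (4, 5)]
```

Calling to_spark_dataframe(sqlContext, rows, ['a', 'b']).show() should then print the two-column table with four rows.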
I am trying to convert someone else's Fortran program into a subroutine so that I can call it from R. I am compiling the Fortran program (called "midpSS9.f") by calling
R CMD SHLIB midpSS9.f
gfortran -m64 -O2 -mtune=core2 -c midpSS9.f -o midpSS9.o
but I get several (essentially identical) warnings:
Warning: Real constant underflows its kind at (1)
midpSS9.f:59.44
 if (part3 .e. 0.0) part3 = 1.0E-307
 1
I declared the part3 variable as a real at the top of the subroutine. From my understanding (taken from here), if you are on a 64-bit machine (which I am), the smallest number should be 0.5E–308. So why is it complaining here?
PS: This is my first time using Fortran, so apologies if this is obvious.
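A likely cause, illustrated in Python rather than Fortran: in gfortran a variable declared as plain REAL, and a constant written with an E exponent, are single precision (32-bit) by default, whether or not the machine is 64-bit. The smallest normal single-precision value is about 1.18E-38, so 1.0E-307 underflows. Declaring the variable as DOUBLE PRECISION and writing the constant as 1.0D-307 would probably avoid the warning. The sketch below compares the two ranges; to_single_precision is a made-up helper.

```python
import struct
import sys


def to_single_precision(value):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


smallest_normal_single = 2.0 ** -126          # about 1.18e-38
smallest_normal_double = sys.float_info.min   # about 2.23e-308

as_single = to_single_precision(1.0e-307)     # too small for 32 bits: becomes 0.0
as_double = 1.0e-307                          # still representable in 64 bits
```

The value survives as a double but collapses to zero as a single, which matches what the compiler warning is reporting.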