我如何绘制线性回归结果用于大熊猫的线性回归?
import pandas as pd
from pandas.stats.api import ols
df = pd.read_csv('Samples.csv', index_col=0)
control = ols(y=df['Control'], x=df['Day'])
one = ols(y=df['Sample1'], x=df['Day'])
two = ols(y=df['Sample2'], x=df['Day'])
Run Code Online (Sandbox Code Playgroud)
我试过plot()但它没用.我想在一个图上绘制所有三个样本是否有任何pandas代码或matplotlib代码以这些摘要的格式的hadle数据?
无论如何,结果看起来像这样:
控制
------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 7
Number of Degrees of Freedom: 2
R-squared: 0.5642
Adj R-squared: 0.4770
Rmse: 4.6893
F-stat (1, 5): 6.4719, p-value: 0.0516
Degrees of Freedom: model 1, resid 5
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value …Run Code Online (Sandbox Code Playgroud) 如何使用statsmodels ARMA过程来拟合表单的差分方程.
y[k] = - a1 * y[k-1] + b0 * u[k] + b1 * u[k-1] + c0 * e[k] + c1 * e[k-1]
Run Code Online (Sandbox Code Playgroud)
我不知道如何设置exog矩阵.例如
import statsmodels.api as sm
# some stupid data
y = np.random.randn(100)
u = np.ones((100,2))
armax = sm.tsa.ARMA(y, order=(1, 1), exog=u).fit()
Run Code Online (Sandbox Code Playgroud)
结果是
ValueError: could not broadcast input array from shape (2) into shape (3)
Run Code Online (Sandbox Code Playgroud)
它可能很容易解决,但我是该领域的新手.
谢谢.
(我正在使用statsmodels 0.6)
Seaborn是一个很棒的包,用于进行一些具有漂亮输出的高级绘图.然而,我正在努力使用Seaborn来覆盖来自外部拟合模型的数据和模型预测.在这个例子中,我在Statsmodels中拟合模型,这些模型对于Seaborn而言过于复杂而无法开箱即用,但我认为问题更为一般(即如果我有模型预测并希望使用Seaborn将它们和数据可视化).
让我们从导入和数据集开始:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import patsy
import itertools
import matplotlib.pyplot as plt
np.random.seed(12345)
# make a data frame with one continuous and two categorical variables:
df = pd.DataFrame({'x1': np.random.normal(size=100),
'x2': np.tile(np.array(['a', 'b']), 50),
'x3': np.repeat(np.array(['c', 'd']), 50)})
# create a design matrix using patsy:
X = patsy.dmatrix('x1 * x2 * x3', df)
# some random beta weights:
betas = np.random.normal(size=X.shape[1])
# create the response variable as …Run Code Online (Sandbox Code Playgroud) 试图弄清楚如何计算Tukey的HSD statsmodel.我可以使它工作,结果看起来很棒,但有一个我无法看到的手段差异的情节.一定是我正在做的傻事.
它是plot_simultaneous来自对象的方法TukeyHSDResults(参见doc).
这是我用来尝试的代码:
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from scipy import stats
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
MultiComparison)
red_wine = pd.DataFrame.from_csv(".../winequality-red.csv",
sep=';', header=0, index_col=False)
white_wine = pd.DataFrame.from_csv(".../winequality-white.csv",
sep=';', header=0, index_col=False)
white1, white2 = train_test_split(white_wine['quality'], test_size=0.5, random_state=1812)
# compute anova
f, p = stats.f_oneway(red_wine['quality'], white1, white2)
print("F value: " + str(f))
print("p value: " + str(p))
# tukey HSD
red = pd.DataFrame(red_wine['quality'], columns=['quality'])
red['wine'] = map(lambda x: …Run Code Online (Sandbox Code Playgroud) 如果我正确理解OLS模型,那绝不应该这样吗?
trades['const']=1
Y = trades['ret']+trades['comms']
#X = trades[['potential', 'pVal', 'startVal', 'const']]
X = trades[['potential', 'pVal', 'startVal']]
from statsmodels.regression.linear_model import OLS
ols=OLS(Y, X)
res=ols.fit()
res.summary()
Run Code Online (Sandbox Code Playgroud)
如果我打开常数,我得到一个0.22的平方并且关闭它,我得到0.43.这怎么可能呢?
在使用statsmodels时,我遇到了这个奇怪的错误:ValueError: endog must be in the unit interval.有人能给我更多关于这个错误的信息吗?谷歌没有帮助.
产生错误的代码:
"""
Multiple regression with dummy variables.
"""
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv('cost_data.csv')
df.columns = ['Cost', 'R(t)', 'Day of Week']
dummy_ranks = pd.get_dummies(df['Day of Week'], prefix='days')
cols_to_keep = ['Cost', 'R(t)']
data = df[cols_to_keep].join(dummy_ranks.ix[:,'days_2':])
data['intercept'] = 1.0
print(data)
train_cols = data.columns[1:]
logit = sm.Logit(data['Cost'], data[train_cols])
result = logit.fit()
print(result.summary())
Run Code Online (Sandbox Code Playgroud)
追溯:
Traceback (most recent call last):
File "multiple_regression_dummy.py", …Run Code Online (Sandbox Code Playgroud) 我想使用pandas和statsmodels在数据帧的子集上拟合线性模型并返回预测值.但是,我无法弄清楚使用正确的熊猫成语.这是我想要做的:
import pandas as pd
import statsmodels.formula.api as sm
import seaborn as sns
tips = sns.load_dataset("tips")
def fit_predict(df):
m = sm.ols("tip ~ total_bill", df).fit()
return pd.Series(m.predict(df), index=df.index)
tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
Run Code Online (Sandbox Code Playgroud)
这会引发以下错误:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-139-b3d2575e2def> in <module>()
----> 1 tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
3033 return self._transform_general(func, *args, **kwargs)
3034 except:
-> 3035 return self._transform_general(func, *args, **kwargs)
3036
3037 # a reduction transform
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _transform_general(self, func, *args, **kwargs)
2988 group.T.values[:] = …Run Code Online (Sandbox Code Playgroud) 当我尝试在Statsmodels中运行多变量线性回归时出现错误。当我仅硬编码XData变量中的一个X列时,一切工作正常。
有人可以给我一些建议,因为我在这里缺少什么吗?我将不胜感激。
错误:
ValueError:形状(747,2)和(747,2)不对齐:2(dim 1)!= 747(dim 0)
码:
import pandas as pd
import statsmodels.api as sm
import itertools
data = pd.read_csv("deaconFoodData.csv")
for i in range(2,10,1):
xCombinations = itertools.combinations(["Food Exp","HH Size","HH Inc","Highest Ed Head","Age Head","Shopping Time","Kid <6","Kid 6-18","Eating Healthy"], i)
print(str(i) + " variables")
for combination in xCombinations:
comb = list(combination)
print(comb)
xData = data[["Food Exp", "HH Size"]] # data[comb]
yData = data["Shopping LH"]
yData = sm.add_constant(yData, prepend=False)
print(yData)
# Fit and summarize OLS model
mod = sm.OLS(xData, yData)
res = …Run Code Online (Sandbox Code Playgroud) 我有两个变量的相关图,x轴上的预测变量(温度)和y轴上的响应变量(密度)。我最适合的最小二乘回归线是二阶多项式。我还要绘制置信度和预测间隔。此答案中描述的方法似乎很完美。但是,我的数据集(n = 2340)对许多(x,y)对都有重复的条目。我得到的情节看起来像这样:

这是我的相关代码(从上面的链接答案中略作修改):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp …Run Code Online (Sandbox Code Playgroud) 假设我们有一个时间序列数据,其中包含最近两年的每日订单计数:

我们可以使用Python的statsmodels库来预测未来的订单:
fit = statsmodels.api.tsa.statespace.SARIMAX(
train.Count, order=(2, 1, 4),seasonal_order=(0,1,1,7)
).fit()
y_hat_avg['SARIMA'] = fit1.predict(
start="2018-06-16", end="2018-08-14", dynamic=True
)
Run Code Online (Sandbox Code Playgroud)
结果(不介意数字):
现在假设由于公司假期或促销活动,我们的输入数据有一些异常的增加或减少。因此,我们添加了两列,分别说明每天是否是“假期”以及公司是否有“促销”。
是否有一种方法(以及在Python中实现它的方法)使用这种新型的输入数据,并帮助该模型理解异常值的原因,并通过提供“假期”和“ promotion_day”信息来预测未来的订单?
fit1.predict('2018-08-29', holiday=True, is_promotion=False)
# or
fit1.predict(start="2018-08-20", end="2018-08-25", holiday=[0,0,0,1,1,0], is_promotion=[0,0,1,1,0,1])
Run Code Online (Sandbox Code Playgroud) python ×10
statsmodels ×10
pandas ×4
matplotlib ×2
regression ×2
plot ×1
seaborn ×1
time-series ×1