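ARIMA from statsmodels is giving me inaccurate answers for my output. I was wondering whether someone could help me understand what is wrong with my code.
Here is a sample: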
import pandas as pd
import numpy as np
import datetime as dt
from statsmodels.tsa.arima_model import ARIMA
# Setting up a data frame that looks twenty days into the past,
# and has linear data, from approximately 1 through 20
counts = np.arange(1, 21) + 0.2 * (np.random.random(size=(20,)) - 0.5)
start = dt.datetime.strptime("1 Nov 01", "%d %b %y")
daterange = pd.date_range(start, periods=20)
table = {"count": counts, "date": daterange}
data = pd.DataFrame(table)
data.set_index("date", inplace=True)
print(data)
count
date
2001-11-01 0.998543 …
I am using an ARIMA model to forecast the sales volume of a product. The data is in a CSV file covering 1 January 2015 to 24 November 2016, at 1-week intervals. I am trying to forecast 9 steps, i.e. the next 9 weeks.
Data in the CSV:
"01-01-2015",9
"08-01-2015",8
"15-01-2015",13
"22-01-2015",10
"29-01-2015",12
"05-02-2015",5
"12-02-2015",4
"19-02-2015",6
"26-02-2015",9
"05-03-2015",3
"12-03-2015",3
"19-03-2015",2
...
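A minimal sketch of loading the rows shown above into a weekly series, in case it helps to check what frequency the index carries before fitting. It assumes the file has no header row (the question's own code uses `header = 0`, so the real file may differ), and the column names `date` and `sales` are made up for illustration.

```python
import io
import pandas as pd

# The twelve rows shown above; assumed here to have no header row.
csv_text = '''"01-01-2015",9
"08-01-2015",8
"15-01-2015",13
"22-01-2015",10
"29-01-2015",12
"05-02-2015",5
"12-02-2015",4
"19-02-2015",6
"26-02-2015",9
"05-03-2015",3
"12-03-2015",3
"19-03-2015",2
'''

# "date" and "sales" are hypothetical column names.
y = pd.read_csv(io.StringIO(csv_text), header=None, names=["date", "sales"])
y["date"] = pd.to_datetime(y["date"], format="%d-%m-%Y")

# 1 January 2015 was a Thursday, so the weekly anchor is W-THU.
y = y.set_index("date")["sales"].asfreq("W-THU")
print(y.index.freqstr)
```

With weekly data like this, a seasonal period of 12 corresponds to twelve weeks rather than twelve months, which may or may not be what is intended.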
Here is the code I am using:
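import datetime
import pandas as pd
import statsmodels.api as sm

def parser(x):
    # Parse day-month-year strings such as "01-01-2015"
    return datetime.datetime.strptime(x, '%d-%m-%Y')

fn = 'filename.csv'
y = pd.read_csv(fn, header = 0, parse_dates = [0], index_col = 0, squeeze = True, date_parser = parser)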
newmod = sm.tsa.statespace.SARIMAX(y,order=(1, 1, 0),seasonal_order=(1, 1, 0, 12),enforce_stationarity=False,enforce_invertibility=False)
newresults = newmod.fit()
pred_uc = newresults.get_forecast(steps = 9)
pred_ci = pred_uc.conf_int()
y1 = pred_ci.iloc[:,0]
y2 = pred_ci.iloc[:,1]
ax = y.plot(label = "observed")
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index, pred_ci.iloc[:,0],pred_ci.iloc[:,1], color = 'k' , …
I am using statsmodels to run an OLS regression. When I cluster the standard errors, a warning appears that suggests a multicollinearity problem. However, if I fit the model without clustered errors, no such warning appears.
mod = smf.ols(formula = 'var ~ treatment_r1 + block + has_multiple_treat', data = df)
mod_res = mod.fit(cov_type='cluster', cov_kwds={'groups': df['block']}, use_t=True)
ValueWarning: covariance of constraints does not have full rank. The number of constraints is 3, but rank is 1
'rank is %d' % (J, J_), ValueWarning)
I checked for collinearity following the post "Capturing high multi-collinearity in statsmodels" and found no problems.
corr = np.corrcoef(df_new[["var", "has_multiple_treat", "treatment_r1", "block1"]], rowvar=0)
w, v = np.linalg.eig(corr)
w
np.linalg.det(corr)
var can be a 0/1 variable or a continuous variable; treatment_r1 and has_multiple_treat are 0 …
I am trying to implement a Python version of this R code, which compares two or more logistic regression models by computing deviance statistics:
anova(LogisticModel.1, LogisticModel.2)
There is a statsmodels implementation of the anova test for linear models, which works as follows:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
m01 = ols('sales~adverts', data=df).fit()
m02 = ols('sales~adverts+airplay', data=df).fit()
m03 = ols('sales~adverts+airplay+attract', data=df).fit()
anovaResults = anova_lm(m01, m02, m03)
print(anovaResults)
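I have already calculated by hand the residual df, residual deviance, and deviance shown in the logistic regression table, but I would like to know whether any library can do this automatically in Python.
A similar question has already been asked here, but it remains unanswered.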
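A minimal sketch of the manual deviance comparison described above, written as a likelihood-ratio test between two nested logit models. It runs on synthetic data; `x1`, `x2` and `y` are invented names. It relies on the fact that for ungrouped binary data the residual deviance equals -2 times the log-likelihood. This is a hand-rolled calculation, not a built-in statsmodels counterpart of R's `anova`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Synthetic binary outcome that depends on both predictors.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
linear_part = 0.5 + 1.0 * df["x1"] - 0.8 * df["x2"]
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-linear_part))).astype(int)

# Two nested models: m1 is m2 without x2.
m1 = smf.logit("y ~ x1", data=df).fit(disp=False)
m2 = smf.logit("y ~ x1 + x2", data=df).fit(disp=False)

# Residual deviance of each model, and the drop between them.
dev1, dev2 = -2 * m1.llf, -2 * m2.llf
lr_stat = dev1 - dev2
df_diff = int(m2.df_model - m1.df_model)
p_value = chi2.sf(lr_stat, df_diff)

table = pd.DataFrame(
    {
        "Resid. Df": [m1.df_resid, m2.df_resid],
        "Resid. Dev": [dev1, dev2],
        "Df": [np.nan, df_diff],
        "Deviance": [np.nan, lr_stat],
        "Pr(>Chi)": [np.nan, p_value],
    },
    index=["y ~ x1", "y ~ x1 + x2"],
)
print(table)
```

The same pattern extends to more than two models by comparing each one with the model before it.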
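I am having some problems using the ccf() method from the statsmodels (Python) library. The equivalent operation works fine in R.
ccf produces the cross-correlation function between two variables, A and B in my example. I am interested in understanding the extent to which A is a leading indicator of B.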
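To make the lead/lag question concrete, here is a small NumPy-only sketch that correlates a[t] with b[t + k]. It is not the statsmodels `ccf` implementation and uses a simpler normalization, so the values will not match `ccf` exactly. `lagged_corr` is a hypothetical helper and the data is synthetic, built so that `a` leads `b` by three steps.

```python
import numpy as np

def lagged_corr(a, b, k):
    """Pearson correlation between a[t] and b[t + k], for k >= 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if k > 0:
        a, b = a[:-k], b[k:]
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(123)
a = rng.normal(size=200)
# b repeats a three steps later, plus a little noise: a leads b by 3.
b = np.roll(a, 3) + 0.1 * rng.normal(size=200)

corrs = [lagged_corr(a, b, k) for k in range(6)]
best_lag = int(np.argmax(corrs))
print(best_lag, corrs[best_lag])
```

A clear peak at a positive k under this definition means that earlier values of `a` line up with later values of `b`.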
I am using the following:
import pandas as pd
import numpy as np
import statsmodels.tsa.stattools as smt
I can simulate A and B as follows:
np.random.seed(123)
test = pd.DataFrame(np.random.randint(0,25,size=(79, 2)), columns=list('AB'))
When I run ccf, I get the following:
ccf_output = smt.ccf(test['A'],test['B'], unbiased=False)
ccf_output
array([ 0.09447372, -0.12810284, 0.15581492, -0.05123683, 0.23403344,
0.0771812 , 0.01434263, 0.00986775, -0.23812752, -0.03996113,
-0.14383829, 0.0178347 , 0.23224969, 0.0829421 , 0.14981321,
-0.07094772, -0.17713121, 0.15377192, -0.19161986, 0.08006699,
-0.01044449, -0.04913098, 0.06682942, -0.02087582, 0.06453489,
0.01995989, -0.08961562, 0.02076603, …
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_decomp(df, model="additive"):
    # Pass the caller's choice of model through instead of hard-coding it
    seasonal_df = seasonal_decompose(df, model=model)
    return seasonal_df

seasonal_decomp(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-93-00543113a58a> in <module>
----> 1 seasonal_decompose(df, model='additive')
e:\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
e:\Anaconda3\lib\site-packages\statsmodels\tsa\seasonal.py in seasonal_decompose(x, model, filt, period, two_sided, extrapolate_trend)
185 for …Run Code Online (Sandbox Code Playgroud) 我最近正在阅读 Susan Li 撰写的关于 Python 时间序列分析的教程。我正在以下系列上拟合时间序列 SARIMAX 模型:
y['2017':]
OUT:
Order Date
2017-01-01 397.602133
2017-02-01 528.179800
2017-03-01 544.672240
2017-04-01 453.297905
2017-05-01 678.302328
2017-06-01 826.460291
2017-07-01 562.524857
2017-08-01 857.881889
2017-09-01 1209.508583
2017-10-01 875.362728
2017-11-01 1277.817759
2017-12-01 1256.298672
Freq: MS, Name: Sales, dtype: float64
Using the following:
mod = sm.tsa.statespace.SARIMAX(y,
order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
Everything works fine up to this point, but when I try to visualize the results, I get the following error:
results.plot_diagnostics(figsize=(16, 8))
OUT:
ValueError Traceback (most recent call last)
<ipython-input-16-6cfeaa52b7c1> in <module>
----> 1 results.plot_diagnostics(figsize=(16, 8))
2 plt.show()
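~/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/statespace/mlemodel.py in …
Because statsmodels.tseries models need an index with a given frequency in order to forecast, I need my data to have a non-standard frequency.
I therefore want to create a new frequency to assign to a pandas.DatetimeIndex. It is a dekad frequency with 36 periods per year, three per month. The first period always ends on the 10th of the month, the second on the 20th, and the last on the last day of the month.
The difficulty is the last day of the month:
In the end, however, it is a fixed frequency (3 per month, 36 periods per year).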
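A minimal sketch of building the dekad dates described above (the 10th, the 20th and the last day of each month). `dekad_index` is a hypothetical helper. It only constructs the dates and does not register a new pandas frequency, so the resulting index still has `freq` set to `None`.

```python
import pandas as pd

def dekad_index(start, end):
    """Dates on the 10th, 20th and last day of every month from start to end."""
    months = pd.period_range(start=start, end=end, freq="M")
    dates = []
    for m in months:
        # days_in_month handles 28/29/30/31-day months, including leap years.
        for day in (10, 20, m.days_in_month):
            dates.append(pd.Timestamp(year=m.year, month=m.month, day=day))
    return pd.DatetimeIndex(dates)

idx = dekad_index("2020-01", "2020-12")
print(len(idx))
print(idx[:6])
print(idx.freq)
```

Because the spacing between dates varies (8 to 11 days), pandas cannot infer a regular frequency from such an index.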
The reason is that statsmodels.tsa.holtwinters models need an index with a given frequency in order to forecast. When I try to run a holtwinters forecast, I get the following warning message:
/home/tommy/miniconda3/envs/ml/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:216: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting. …
I am currently trying to import MSTL from statsmodels.tsa.seasonal (https://www.statsmodels.org/devel/generated/statsmodels.tsa.seasonal.MSTL.html), but it raises an ImportError. I have installed statsmodels with conda on a 2020 M1 Mac.
statistics machine-learning time-series statsmodels deep-learning
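The link above points to the devel documentation, so one possibility (an assumption, not a diagnosis) is that the installed release predates MSTL; to my knowledge it was added in statsmodels 0.14. A quick way to see what is installed:

```python
import statsmodels

# Try the import and record whether this installation provides MSTL.
try:
    from statsmodels.tsa.seasonal import MSTL
    has_mstl = True
except ImportError:
    has_mstl = False

print("statsmodels version:", statsmodels.__version__)
print("MSTL importable:", has_mstl)
```

If the import fails, comparing the printed version with the release notes should show whether an upgrade is needed.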
I have the following Python code, with which I have tried to perform a VCA analysis using REML:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
data = {'Part':[1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3],
        'Employee':[1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3],
        'Measurement':[103.3, 103.1, 103.1, 103.3, 102.9, 103.6, 103.2, 103.6, 103.1, 104.5, 104.8, 103.9, 104.5, 104, 103.8, 103.8, 103.6, 104, 104, 103.6, 103.5, 103.9, 104.1, 104.5, 104.3, 103.9, 103.8]}
df = pd.DataFrame(data)
lm = ols('Measurement ~ C(Part) + C(Part):(Employee)', data=df).fit()
table = sm.stats.anova_lm(lm, typ=2)
table['Percentage of Total Variance'] = (table['sum_sq'] / table['sum_sq'].sum()) * 100
When I run this analysis in JMP using the following steps: