Converting a statsmodels summary object to a Pandas DataFrame

Sag*_*tha 10 python pandas statsmodels

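I am running a multiple linear regression with statsmodels.formula.api (version 0.9.0) on Windows 10. After fitting the model and requesting the summary with the lines below, I get the summary back as a summary object.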

X_opt = X[:, [0, 1, 2, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()


                          OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.0
Date:                Wed, 08 Aug 2018   Prob (F-statistic):           4.53e-30
Time:                        00:46:48   Log-Likelihood:                -525.39
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.012e+04   6572.353      7.626      0.000    3.69e+04    6.34e+04
x1             0.8057      0.045     17.846      0.000       0.715       0.897
x2            -0.0268      0.051     -0.526      0.602      -0.130       0.076
x3             0.0272      0.016      1.655      0.105      -0.006       0.060
==============================================================================
Omnibus:                       14.838   Durbin-Watson:                   1.282
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.442
Skew:                          -0.949   Prob(JB):                     2.21e-05
Kurtosis:                       5.586   Cond. No.                     1.40e+06
==============================================================================

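I want to perform backward elimination on the P-values at a significance level of 0.05. To do that, I need to remove the predictor with the highest P-value and run the code again.

I would like to know whether there is a way to extract the P-values from the summary object, so that I can run a loop with a conditional statement and find the significant variables without repeating these steps by hand.

Thanks.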

Zax*_*axR 12

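@Michael B's answer works well, but it requires "re-creating" the table. The table itself is actually available directly from the summary().tables attribute. Each table in this attribute (which is a list of tables) is a SimpleTable, which has methods for outputting different formats. We can then read any of those formats back in as a pd.DataFrame: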

from io import StringIO

import pandas as pd
import statsmodels.api as sm

model = sm.OLS(y, x)
results = model.fit()
results_summary = results.summary()

# Note that tables is a list. The table at index 1 is the "core" (coefficient) table.
# read_html also returns a list of DataFrames, so we take index 0.
# StringIO wraps the HTML string, since newer pandas versions deprecate passing literal HTML.
results_as_html = results_summary.tables[1].as_html()
pd.read_html(StringIO(results_as_html), header=0, index_col=0)[0]

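  • This does not work when using the formula API. `AttributeError: 'OLSResults' object has no attribute 'tables'` (2 upvotes)
  • How did I not think of that? Borderline hacky, but very neat. Here is an alternative using the `csv` method, in case it comes in handy: `pd.read_csv(pd.compat.StringIO(table.as_csv()), index_col=0)` (2 upvotes)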
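
Regarding the first comment: `tables` is an attribute of the object returned by `summary()`, not of the fitted results object, so the quoted `AttributeError` is what appears when `.tables` is accessed on the results directly. The following is a minimal sketch, on synthetic data invented for illustration (the names `data`, `x1`, `x2` and `y` are assumptions, not from the question), suggesting that the same approach should also work with a model built through the formula API. It additionally prints `results.pvalues` for comparison, because the values read back from the summary table are the rounded figures shown in the printout.

```python
from io import StringIO

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x1", "x2"])
data["y"] = 1.0 + 2.0 * data["x1"] + rng.normal(size=100)

results = smf.ols("y ~ x1 + x2", data=data).fit()

# .tables lives on the Summary object, not on the results object
coef_table = results.summary().tables[1]
coef_df = pd.read_html(StringIO(coef_table.as_html()), header=0, index_col=0)[0]

print(coef_df["P>|t|"])  # values as displayed in the summary (rounded)
print(results.pvalues)   # unrounded values straight from the results object
```

Note that `pd.read_html` needs an HTML parser such as lxml to be installed.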

Mic*_*l B 11

Store the model fit in a variable, results, like so:

import statsmodels.api as sm
model = sm.OLS(y,x)
results = model.fit()

Then create a function like the one below:

import pandas as pd

def results_summary_to_dataframe(results):
    '''Take a fitted statsmodels results object and return its key statistics as a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
    # conf_int() returns a DataFrame (columns 0 and 1) when the model was fit on pandas data
    conf_int = results.conf_int()
    conf_lower = conf_int[0]
    conf_higher = conf_int[1]

    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher})

    # Reorder the columns
    results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
    return results_df

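You can explore all the attributes of the results object further by printing them with dir(), and then add whichever ones you need to the function and the df.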


Dan*_*hou 5

A simple solution is a single line of code:

LRresult = (result.summary2().tables[1])

This gives you a DataFrame object:

type(LRresult)

pandas.core.frame.DataFrame

To get the significant variables and run the test again:

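import statsmodels.formula.api as smf

# Keep the terms with P <= 0.05; [1:] skips the first row, assumed here to be a significant Intercept
newlist = list(LRresult[LRresult['P>|z|'] <= 0.05].index)[1:]
myform1 = 'binary_Target' + ' ~ ' + ' + '.join(newlist)

M1_test2 = smf.logit(formula=myform1, data=myM1_1)

result2 = M1_test2.fit(maxiter=200)
LRresult2 = result2.summary2().tables[1]
LRresult2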

  • Works with summary() too. This should be the accepted answer (2 upvotes)
  • @user3357177, no, it doesn't. `.summary2()` returns pandas.DataFrame, but `.summary()` returns `statsmodels.SimpleTable`. (2 upvotes)
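
The difference discussed in the two comments above can be checked with a short sketch. It uses synthetic data and an OLS model through the formula API, both assumptions made purely for illustration; for OLS the p-value column is labelled `P>|t|`, rather than the `P>|z|` seen in the logit example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.table import SimpleTable

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x1", "x2"])
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=100)

result = smf.ols("y ~ x1 + x2", data=df).fit()

table_v1 = result.summary().tables[1]   # coefficient table as a SimpleTable
table_v2 = result.summary2().tables[1]  # coefficient table as a DataFrame

print(type(table_v1))
print(type(table_v2))
print(table_v2["P>|t|"])
```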