Creating dummy variables for the interaction of two columns with pandas or statsmodels

Question

Meh*_*hdi 1 python pandas statsmodels patsy

I have a data frame like this:

Index ID  Industry  years_spend       asset
6646  892         4            4  144.977037
2347  315        10            8  137.749138
7342  985         1            5  104.310217
137    18         5            5  156.593396
2840  381        11            2  229.538828
6579  883        11            1  171.380125
1776  235         4            7  217.734377
2691  361         1            2  148.865341
815   110        15            4  233.309491
2932  393        17            5  187.281724

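I want to create dummy variables for Industry x years_spend, which gives len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables. For example, d_11_4 = 1 for all rows with Industry == 11 and years_spend == 4, and d_11_4 = 0 otherwise. I can then use these variables in regressions.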

I know I can form the groups I want with df.groupby(['Industry','years_spend']), and I know I can create such variables for a single column using patsy syntax in statsmodels:

import statsmodels.formula.api as smf

mod = smf.ols("income ~   C(Industry)", data=df).fit()

But when I try to do this with two columns, I get an error: IndexError: tuple index out of range

How can I do this with pandas, or with some function in statsmodels?

Answer 1

Nat*_*ith 7

With patsy syntax, it is just:

import statsmodels.formula.api as smf

mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()

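The : operator means "interaction"; you can also generalize it to interactions of more than two terms (C(a):C(b):C(c)), interactions between numerical and categorical values, and so on. You may find the patsy documentation useful.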
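If you want the dummy columns themselves in pandas rather than inside a formula, here is a minimal sketch. It is not part of the answer above. It assumes the d_<Industry>_<years_spend> naming from the question and uses only the Industry and years_spend values from the ten sample rows; the names pair, dummies, all_cols and full are made up for illustration.

```python
from itertools import product

import pandas as pd

# Industry and years_spend values copied from the sample rows in the question.
df = pd.DataFrame({
    "Industry":    [4, 10, 1, 5, 11, 11, 4, 1, 15, 17],
    "years_spend": [4, 8, 5, 5, 2, 1, 7, 2, 4, 5],
})

# One label per row, e.g. "11_2" for Industry == 11 and years_spend == 2.
pair = df["Industry"].astype(str) + "_" + df["years_spend"].astype(str)

# One 0/1 column per pair that actually occurs in the data, named d_<Industry>_<years_spend>.
dummies = pd.get_dummies(pair, prefix="d", dtype=int)

# Optional: add all-zero columns for pairs that never occur, so that there is
# one column for every Industry level crossed with every years_spend level.
all_cols = [
    f"d_{i}_{y}"
    for i, y in product(sorted(df["Industry"].unique()),
                        sorted(df["years_spend"].unique()))
]
full = dummies.reindex(columns=all_cols, fill_value=0)

# Attach the dummies to the original frame.
out = df.join(full)
```

Note that get_dummies only creates columns for combinations present in the data; the reindex step is what pads the result out to the full product of levels. Columns that are all zero carry no information in a regression, so you would probably drop them before fitting.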
