ekt*_*kta 3 python-2.7 statsmodels logistic-regression
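I have a data frame whose columns I intend to treat as categorical variables.
The first column is country, with values such as SGP, AUS and MYS. The second column is time, with values in 24-hour format such as 00, 11, 14 and 15. event is a binary variable with a 1/0 flag. I understand that, to treat them as categorical, I need to use patsy before running the logistic regression. I build the design matrices with dmatrices.
Use case: consider only the interaction effect of country & time_day (alongside other attributes, say "operating system").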
import patsy

# Formula with only the time_day:country interaction on the right-hand side
f = 'event_int ~ time_day:country'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
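I expected to see only column names that contain BOTH country & time_day, but that is not what I get. I can take the subset manually by specifying X = X.ix[:,range(7,len(X.columns))], but that means HARDCODING it for every dataset.
My understanding is that A*B differs from A:B in that A:B does not include A + B. Interestingly, though, I do not see A, i.e. the categorical values of time_day on its own, in the output above.
Also, when I do the following to exclude "country" on its own from the "X" data frame, it does not work; I get the same output as above.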
f='event_int ~ time_day:country-country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
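This makes me think ":" is a reduced form of "*", since only one of the categorical variables is missing. Perhaps it does not understand that BOTH are categorical variables?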
f='event_int ~ time_day*country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'time_day[T.02]', u'time_day[T.03]', u'time_day[T.04]', u'time_day[T.05]', u'time_day[T.06]', u'time_day[T.07]', u'time_day[T.08]', u'time_day[T.09]', u'time_day[T.10]', u'time_day[T.11]', u'time_day[T.12]', u'time_day[T.NA]', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[T.HKG]', u'time_day[T.03]:country[T.HKG]', u'time_day[T.04]:country[T.HKG]', u'time_day[T.05]:country[T.HKG]', u'time_day[T.06]:country[T.HKG]', u'time_day[T.07]:country[T.HKG]', u'time_day[T.08]:country[T.HKG]', u'time_day[T.09]:country[T.HKG]', u'time_day[T.10]:country[T.HKG]', u'time_day[T.11]:country[T.HKG]', u'time_day[T.12]:country[T.HKG]', u'time_day[T.NA]:country[T.HKG]', u'time_day[T.02]:country[T.IDN]', u'time_day[T.03]:country[T.IDN]', u'time_day[T.04]:country[T.IDN]', u'time_day[T.05]:country[T.IDN]', u'time_day[T.06]:country[T.IDN]', u'time_day[T.07]:country[T.IDN]', u'time_day[T.08]:country[T.IDN]', u'time_day[T.09]:country[T.IDN]', u'time_day[T.10]:country[T.IDN]', u'time_day[T.11]:country[T.IDN]', u'time_day[T.12]:country[T.IDN]', u'time_day[T.NA]:country[T.IDN]', u'time_day[T.02]:country[T.IND]', u'time_day[T.03]:country[T.IND]', u'time_day[T.04]:country[T.IND]', u'time_day[T.05]:country[T.IND]', u'time_day[T.06]:country[T.IND]', u'time_day[T.07]:country[T.IND]', u'time_day[T.08]:country[T.IND]', u'time_day[T.09]:country[T.IND]', u'time_day[T.10]:country[T.IND]', u'time_day[T.11]:country[T.IND]', u'time_day[T.12]:country[T.IND]', u'time_day[T.NA]:country[T.IND]', u'time_day[T.02]:country[T.MYS]', u'time_day[T.03]:country[T.MYS]', u'time_day[T.04]:country[T.MYS]', u'time_day[T.05]:country[T.MYS]', u'time_day[T.06]:country[T.MYS]', u'time_day[T.07]:country[T.MYS]', u'time_day[T.08]:country[T.MYS]', u'time_day[T.09]:country[T.MYS]', u'time_day[T.10]:country[T.MYS]', u'time_day[T.11]:country[T.MYS]', u'time_day[T.12]:country[T.MYS]', 
u'time_day[T.NA]:country[T.MYS]', u'time_day[T.02]:country[T.NZL]', u'time_day[T.03]:country[T.NZL]', u'time_day[T.04]:country[T.NZL]', u'time_day[T.05]:country[T.NZL]', u'time_day[T.06]:country[T.NZL]', u'time_day[T.07]:country[T.NZL]', u'time_day[T.08]:country[T.NZL]', u'time_day[T.09]:country[T.NZL]', u'time_day[T.10]:country[T.NZL]', u'time_day[T.11]:country[T.NZL]', u'time_day[T.12]:country[T.NZL]', u'time_day[T.NA]:country[T.NZL]', u'time_day[T.02]:country[T.PHL]', u'time_day[T.03]:country[T.PHL]', u'time_day[T.04]:country[T.PHL]', u'time_day[T.05]:country[T.PHL]', u'time_day[T.06]:country[T.PHL]', u'time_day[T.07]:country[T.PHL]', u'time_day[T.08]:country[T.PHL]', u'time_day[T.09]:country[T.PHL]', u'time_day[T.10]:country[T.PHL]', u'time_day[T.11]:country[T.PHL]', u'time_day[T.12]:country[T.PHL]', u'time_day[T.NA]:country[T.PHL]', u'time_day[T.02]:country[T.SGP]', u'time_day[T.03]:country[T.SGP]', u'time_day[T.04]:country[T.SGP]', u'time_day[T.05]:country[T.SGP]', u'time_day[T.06]:country[T.SGP]', u'time_day[T.07]:country[T.SGP]', u'time_day[T.08]:country[T.SGP]', u'time_day[T.09]:country[T.SGP]', ...], dtype='object')
If I explicitly declare them as "categorical" variables, I get this:
f='event_int ~ C(time_day):C(country)'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'C(country)[T.HKG]', u'C(country)[T.IDN]', u'C(country)[T.IND]', u'C(country)[T.MYS]', u'C(country)[T.NZL]', u'C(country)[T.PHL]', u'C(country)[T.SGP]', u'C(time_day)[T.02]:C(country)[AUS]', u'C(time_day)[T.03]:C(country)[AUS]', u'C(time_day)[T.04]:C(country)[AUS]', u'C(time_day)[T.05]:C(country)[AUS]', u'C(time_day)[T.06]:C(country)[AUS]', u'C(time_day)[T.07]:C(country)[AUS]', u'C(time_day)[T.08]:C(country)[AUS]', u'C(time_day)[T.09]:C(country)[AUS]', u'C(time_day)[T.10]:C(country)[AUS]', u'C(time_day)[T.11]:C(country)[AUS]', u'C(time_day)[T.12]:C(country)[AUS]', u'C(time_day)[T.NA]:C(country)[AUS]', u'C(time_day)[T.02]:C(country)[HKG]', u'C(time_day)[T.03]:C(country)[HKG]', u'C(time_day)[T.04]:C(country)[HKG]', u'C(time_day)[T.05]:C(country)[HKG]', u'C(time_day)[T.06]:C(country)[HKG]', u'C(time_day)[T.07]:C(country)[HKG]', u'C(time_day)[T.08]:C(country)[HKG]', u'C(time_day)[T.09]:C(country)[HKG]', u'C(time_day)[T.10]:C(country)[HKG]', u'C(time_day)[T.11]:C(country)[HKG]', u'C(time_day)[T.12]:C(country)[HKG]', u'C(time_day)[T.NA]:C(country)[HKG]', u'C(time_day)[T.02]:C(country)[IDN]', u'C(time_day)[T.03]:C(country)[IDN]', u'C(time_day)[T.04]:C(country)[IDN]', u'C(time_day)[T.05]:C(country)[IDN]', u'C(time_day)[T.06]:C(country)[IDN]', u'C(time_day)[T.07]:C(country)[IDN]', u'C(time_day)[T.08]:C(country)[IDN]', u'C(time_day)[T.09]:C(country)[IDN]', u'C(time_day)[T.10]:C(country)[IDN]', u'C(time_day)[T.11]:C(country)[IDN]', u'C(time_day)[T.12]:C(country)[IDN]', u'C(time_day)[T.NA]:C(country)[IDN]', u'C(time_day)[T.02]:C(country)[IND]', u'C(time_day)[T.03]:C(country)[IND]', u'C(time_day)[T.04]:C(country)[IND]', u'C(time_day)[T.05]:C(country)[IND]', u'C(time_day)[T.06]:C(country)[IND]', u'C(time_day)[T.07]:C(country)[IND]', u'C(time_day)[T.08]:C(country)[IND]', u'C(time_day)[T.09]:C(country)[IND]', u'C(time_day)[T.10]:C(country)[IND]', u'C(time_day)[T.11]:C(country)[IND]', u'C(time_day)[T.12]:C(country)[IND]', u'C(time_day)[T.NA]:C(country)[IND]', 
u'C(time_day)[T.02]:C(country)[MYS]', u'C(time_day)[T.03]:C(country)[MYS]', u'C(time_day)[T.04]:C(country)[MYS]', u'C(time_day)[T.05]:C(country)[MYS]', u'C(time_day)[T.06]:C(country)[MYS]', u'C(time_day)[T.07]:C(country)[MYS]', u'C(time_day)[T.08]:C(country)[MYS]', u'C(time_day)[T.09]:C(country)[MYS]', u'C(time_day)[T.10]:C(country)[MYS]', u'C(time_day)[T.11]:C(country)[MYS]', u'C(time_day)[T.12]:C(country)[MYS]', u'C(time_day)[T.NA]:C(country)[MYS]', u'C(time_day)[T.02]:C(country)[NZL]', u'C(time_day)[T.03]:C(country)[NZL]', u'C(time_day)[T.04]:C(country)[NZL]', u'C(time_day)[T.05]:C(country)[NZL]', u'C(time_day)[T.06]:C(country)[NZL]', u'C(time_day)[T.07]:C(country)[NZL]', u'C(time_day)[T.08]:C(country)[NZL]', u'C(time_day)[T.09]:C(country)[NZL]', u'C(time_day)[T.10]:C(country)[NZL]', u'C(time_day)[T.11]:C(country)[NZL]', u'C(time_day)[T.12]:C(country)[NZL]', u'C(time_day)[T.NA]:C(country)[NZL]', u'C(time_day)[T.02]:C(country)[PHL]', u'C(time_day)[T.03]:C(country)[PHL]', u'C(time_day)[T.04]:C(country)[PHL]', u'C(time_day)[T.05]:C(country)[PHL]', u'C(time_day)[T.06]:C(country)[PHL]', u'C(time_day)[T.07]:C(country)[PHL]', u'C(time_day)[T.08]:C(country)[PHL]', u'C(time_day)[T.09]:C(country)[PHL]', u'C(time_day)[T.10]:C(country)[PHL]', u'C(time_day)[T.11]:C(country)[PHL]', u'C(time_day)[T.12]:C(country)[PHL]', u'C(time_day)[T.NA]:C(country)[PHL]', u'C(time_day)[T.02]:C(country)[SGP]', u'C(time_day)[T.03]:C(country)[SGP]', u'C(time_day)[T.04]:C(country)[SGP]', u'C(time_day)[T.05]:C(country)[SGP]', u'C(time_day)[T.06]:C(country)[SGP]', u'C(time_day)[T.07]:C(country)[SGP]', u'C(time_day)[T.08]:C(country)[SGP]', u'C(time_day)[T.09]:C(country)[SGP]', ...], dtype='object')
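Questions:
1. How do I include only the interaction effects, without the columns for the individual variables?
2. Why did -country not exclude country in the second case?
Related: Statsmodels formula API (patsy): How to exclude a subset of interaction components?
Edit, a bit of self-troubleshooting based on @Nathaniel J. Smith's answer below: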
f2='event_int ~ country:time_day'
y2,X2 = patsy.dmatrices(f2, df, return_type='dataframe')
X2.design_info.term_names
['Intercept', 'country:time_day']
f1='event_int ~ country:time_day-1'
y1,X1 = patsy.dmatrices(f1, df, return_type='dataframe')
X1.design_info.term_names
['country:time_day']
Short answer: try event_int ~ -1 + time_day:country
Long answer:
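The first thing to understand is that patsy decides how to build your design matrix in two distinct stages. First, it works out which terms to include. Terms are things like a, or a:b. (The a and b in a:b are called factors; the term a contains a single factor, which is also spelled a.) Working out which terms exist involves expanding and simplifying the formula you gave it until you have an expression that uses only + and :. a*b expands to a + b + a:b, and so on. Subtraction is an operation that happens at this stage: a + b - a simplifies to just b. So a*b - a expands to a + b + a:b - a, which simplifies to b + a:b, but a:b - a is the same as a:b, because there is no a to subtract, so the - a is simply ignored. That is why writing time_day:country - country is the same as writing time_day:country.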
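The term-expansion stage described above can be inspected without any data. The following is a minimal sketch, assuming patsy's ModelDesc.from_formula and Term.name() behave as documented; the helper term_names and the placeholder factor names y, a and b are made up for illustration.

```python
import patsy


def term_names(formula):
    """Return the names of the right-hand-side terms patsy derives from a formula."""
    desc = patsy.ModelDesc.from_formula(formula)
    return [term.name() for term in desc.rhs_termlist]


# Compare what survives expansion and subtraction for a few formulas.
for formula in ("y ~ a*b", "y ~ a*b - a", "y ~ a:b", "y ~ a:b - a"):
    print(formula, "->", term_names(formula))
```

The last two formulas should report the same terms, since a:b never contained a separate a term that could be subtracted.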
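Then, in the second stage, once patsy has decided which terms to include, it has to decide how to code those terms. This is the stage where you are running into trouble.
The general rule is that patsy goes through each term that has categorical factors and works out a set of columns that makes the model flexible enough to include the specified interaction, without being redundant with any terms that have already been added.
In this case, your trouble is caused by the intercept term that patsy adds by default: event_int ~ time_day:country is interpreted as event_int ~ 1 + time_day:country. This tells patsy that you want one column to represent the intercept on its own, and then a second set of columns that covers the interaction but does not overlap with the intercept. The obvious approach of dummy-coding both time_day and country would be redundant (collinear) with the intercept, so patsy instead finds a somewhat more complicated scheme that does not have this property. If you remove the intercept, you are telling patsy that it can go ahead and use the simple scheme, and it does.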
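The effect of the default intercept can be seen on a small example. This is only a sketch: the six-row dict below is invented toy data standing in for the question's data frame, and the default DesignMatrix return type is used instead of return_type='dataframe' to keep the example free of pandas.

```python
import patsy

# Hypothetical toy data: 3 countries x 2 hours, one row per combination.
data = {
    "event_int": [1, 0, 0, 1, 1, 0],
    "country": ["AUS", "AUS", "MYS", "MYS", "SGP", "SGP"],
    "time_day": ["00", "11", "00", "11", "00", "11"],
}

# Default: patsy adds an intercept, so the interaction gets the more complicated coding.
_, X_with = patsy.dmatrices("event_int ~ time_day:country", data)
# Intercept removed: patsy can use plain dummy coding for every combination.
_, X_without = patsy.dmatrices("event_int ~ -1 + time_day:country", data)

cols_with = X_with.design_info.column_names
cols_without = X_without.design_info.column_names
print(cols_with)
print(cols_without)
```

With the intercept, some columns should name country alone, as in the question; without it, every column should name both time_day and country.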
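For details on how patsy chooses coding schemes, see: http://patsy.readthedocs.org/en/latest/formulas.html#redundancy-and-categorical-factors
The first part of that manual section may have a bit too much math, but if you scroll down there are some hopefully-nice diagrams that make it clearer what is going on (and give the math some context). If you search for y ~ 1 + a:b you will find the diagram that shows specifically the case you entered, event_int ~ time_day:country. If you search for y ~ 1 + a + b + a:b you will find a picture of what happens in the event_int ~ time_day*country case.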
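Besides looking at X.columns, it is useful to look at X.design_info.term_names and X.design_info.term_slices, which show which "terms" patsy thinks exist and which columns they correspond to. (a and a:b are terms; each one generates multiple columns.) The thick outline in the diagram for y ~ 1 + a:b is meant to indicate that in this case the single term a:b generates two sets of columns: one set that codes b with treatment coding, and another set that codes the pairwise products of dummy-coded b and treatment-coded a.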
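A short sketch of that inspection, again on invented toy data rather than the question's data frame; the dictionary term_columns is a hypothetical helper built here only to make the term-to-column mapping easy to read.

```python
import patsy

# Hypothetical toy data: 3 countries x 2 hours.
data = {
    "country": ["AUS", "AUS", "MYS", "MYS", "SGP", "SGP"],
    "time_day": ["00", "11", "00", "11", "00", "11"],
}

X = patsy.dmatrix("time_day:country", data)
info = X.design_info

# Which terms patsy thinks exist.
print(info.term_names)

# Map each term name to the design-matrix columns it generated.
term_columns = {
    term.name(): info.column_names[cols]
    for term, cols in info.term_slices.items()
}
for name, cols in term_columns.items():
    print(name, "->", cols)
```

The country-only columns are listed under the time_day:country term, which matches the point that one term can generate more than one set of columns.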
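Finally, two tricks for interpreting the output you get: (1) You can be sure that patsy really is treating these factors as categorical, because the column names look like varname[something involving the var's value]. Numeric factors look like just varname or (in the rare case where you pass a 2d matrix as a predictor) varname[column index]. (2) Note the difference between country[T.HKG] and country[HKG]: the former means patsy used reduced-rank "treatment" coding to avoid redundancy, while the latter means simple dummy coding. As it turns out, the individual columns are identical in the two cases, but conceptually the difference is very important: the T. pattern means that one of the columns was dropped (note that there is no country[T.AUS]), so subsetting the columns the way you were considering will not work well!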
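Both naming tricks can be checked on a toy example. This is a sketch under stated assumptions: the data are invented, and clicks is a hypothetical numeric column that does not appear in the question.

```python
import patsy

# Hypothetical toy data: one categorical factor and one numeric factor.
data = {
    "country": ["AUS", "AUS", "MYS", "MYS", "SGP", "SGP"],
    "clicks": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}

# With an intercept, country is treatment-coded: names carry "T." and the
# reference level (AUS) gets no column of its own.
cols = patsy.dmatrix("country + clicks", data).design_info.column_names
print(cols)

# Without an intercept, country is plainly dummy-coded: one column per level, no "T.".
full = patsy.dmatrix("0 + country", data).design_info.column_names
print(full)
```

The numeric factor should appear under its bare name, while the categorical one shows bracketed level names in both codings.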
Hope this helps!