Pandas: convert a column of lists to dummy variables within a wider dataframe

ocu*_*cut 1 python list dataframe pandas dummy-variable

I have imported a JSON file and now have a dataframe in which one column (code) holds a list in each row.

index year   gvkey    code
0    1998    15686    ['TAX', 'ENVR', 'HEALTH']
1    2005    15372    ['EDUC', 'TAX', 'HEALTH', 'JUST']
2    2001    27486    ['LAB', 'TAX', 'HEALTH']
3    2008    84967    ['HEALTH','LAB', 'JUST']

What I would like to get is something like this:

index year   gvkey  TAX  ENVR HEALTH EDUC JUST LAB
0    1998    15686   1     1     1    0    0    0
1    2005    15372   1     0     1    1    1    0
2    2001    27486   1     0     1    0    0    1
3    2008    84967   0     0     1    0    1    1

Following "Pandas convert a column of list to dummies", I tried the following code (where df is my dataframe):

s = pd.Series(df["code"])
l = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)

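This gives me the second part of the data correctly (the variables TAX, ENVR, HEALTH, EDUC, JUST and LAB), but I lose the first part (year and gvkey).

How can I keep the year and gvkey variables?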

jez*_*ael 5

I think a better solution here is to use DataFrame.pop with Series.str.join and Series.str.get_dummies:

df = df.join(df.pop('code').str.join('|').str.get_dummies())
print (df)
       year  gvkey  EDUC  ENVR  HEALTH  JUST  LAB  TAX
index                                                 
0      1998  15686     0     1       1     0    0    1
1      2005  15372     1     0       1     1    0    1
2      2001  27486     0     0       1     0    1    1
3      2008  84967     0     0       1     1    1    0
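
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the table in the question, and the names dummies and result are only illustrative. It assumes that no label contains the '|' character, because that is the separator str.get_dummies splits on by default.

```python
import pandas as pd

# Sample data typed in from the question's table
df = pd.DataFrame({
    "year": [1998, 2005, 2001, 2008],
    "gvkey": [15686, 15372, 27486, 84967],
    "code": [
        ["TAX", "ENVR", "HEALTH"],
        ["EDUC", "TAX", "HEALTH", "JUST"],
        ["LAB", "TAX", "HEALTH"],
        ["HEALTH", "LAB", "JUST"],
    ],
})

# pop() removes 'code' from df and returns it as a Series;
# str.join('|') turns each list into a string such as 'TAX|ENVR|HEALTH';
# str.get_dummies() splits on '|' and builds one indicator column per label.
dummies = df.pop("code").str.join("|").str.get_dummies()

# year and gvkey are still in df, so joining on the index keeps them
result = df.join(dummies)
print(result)
```

If a label could contain '|', joining with a different character and passing the same character as sep to str.get_dummies should work in the same way.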

If performance is important, use MultiLabelBinarizer from scikit-learn:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array; reuse df.index so the join below aligns rows
df1 = pd.DataFrame(mlb.fit_transform(df.pop('code')), columns=mlb.classes_, index=df.index)

df = df.join(df1)
print (df)
       year  gvkey  EDUC  ENVR  HEALTH  JUST  LAB  TAX
index                                                 
0      1998  15686     0     1       1     0    0    1
1      2005  15372     1     0       1     1    0    1
2      2001  27486     0     0       1     0    1    1
3      2008  84967     0     0       1     1    1    0

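Your solution also works, but it is slow, so it is better avoided. Note too that sum gives correct 0/1 values only when each list holds unique values; for a general solution use max. The level argument of sum and max was removed in pandas 2.0, so group on the index level instead (dtype=int keeps the 0/1 output shown below):

df = df.join(pd.get_dummies(df.pop('code').apply(pd.Series).stack(), dtype=int).groupby(level=0).max())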
print (df)
       year  gvkey  EDUC  ENVR  HEALTH  JUST  LAB  TAX
index                                                 
0      1998  15686     0     1       1     0    0    1
1      2005  15372     1     0       1     1    0    1
2      2001  27486     0     0       1     0    1    1
3      2008  84967     0     0       1     1    1    0
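
To see why max is the safer choice than sum, here is a small sketch with made-up data in which one list repeats a label. The series codes and the names by_sum and by_max are hypothetical, and Series.explode is swapped in for apply(pd.Series).stack() only to keep the example short; both produce one row per label while keeping the original row number in the index.

```python
import pandas as pd

# Hypothetical data: the first list contains 'TAX' twice
codes = pd.Series([["TAX", "TAX", "HEALTH"], ["LAB"]])

# One row per (original row, label); the original row number stays in the index
dummies = pd.get_dummies(codes.explode(), dtype=int)

# Collapse back to one row per original row
by_sum = dummies.groupby(level=0).sum()  # counts occurrences, so TAX becomes 2
by_max = dummies.groupby(level=0).max()  # presence/absence, so TAX stays 1

print(by_sum)
print(by_max)
```

With sum the repeated label yields a count rather than an indicator, while max always stays within 0 and 1.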