Mel*_*uce 47 python numpy pandas scikit-learn sklearn-pandas
我想将一个由一系列元素组成的pandas列分解为多个列,因为它们具有唯一的元素,即one-hot-encode它们(其值1表示存在于行0中的给定元素,在缺少的情况下).
例如,采用数据帧df
Col1 Col2 Col3
C 33 [Apple, Orange, Banana]
A 2.5 [Apple, Grape]
B 42 [Banana]
Run Code Online (Sandbox Code Playgroud)
我想将其转换为:
DF
Col1 Col2 Apple Orange Banana Grape
C 33 1 1 1 0
A 2.5 1 0 0 1
B 42 0 0 1 0
Run Code Online (Sandbox Code Playgroud)
我如何使用pandas/sklearn来实现这一目标?
Max*_*axU 51
我们也可以使用sklearn.preprocessing.MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
columns=mlb.classes_,
index=df.index))
Run Code Online (Sandbox Code Playgroud)
结果:
In [77]: df
Out[77]:
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
piR*_*red 37
选项1
简答
pir_slow
df.drop('Col3', 1).join(df.Col3.str.join('|').str.get_dummies())
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
选项2
快速回答
pir_fast
v = df.Col3.values
l = [len(x) for x in v.tolist()]
f, u = pd.factorize(np.concatenate(v))
n, m = len(v), u.size
i = np.arange(n).repeat(l)
dummies = pd.DataFrame(
np.bincount(i * m + f, minlength=n * m).reshape(n, m),
df.index, u
)
df.drop('Col3', 1).join(dummies)
Col1 Col2 Apple Orange Banana Grape
0 C 33.0 1 1 1 0
1 A 2.5 1 0 0 1
2 B 42.0 0 0 1 0
Run Code Online (Sandbox Code Playgroud)
选项3
pir_alt1
df.drop('Col3', 1).join(
pd.get_dummies(
pd.DataFrame(df.Col3.tolist()).stack()
).astype(int).sum(level=0)
)
Col1 Col2 Apple Orange Banana Grape
0 C 33.0 1 1 1 0
1 A 2.5 1 0 0 1
2 B 42.0 0 0 1 0
Run Code Online (Sandbox Code Playgroud)
时间结果
代码如下
def maxu(df):
mlb = MultiLabelBinarizer()
d = pd.DataFrame(
mlb.fit_transform(df.Col3.values)
, df.index, mlb.classes_
)
return df.drop('Col3', 1).join(d)
def bos(df):
return df.drop('Col3', 1).assign(**pd.get_dummies(df.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).sum(level=0))
def psi(df):
return pd.concat([
df.drop("Col3", 1),
df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0)
], axis=1)
def alex(df):
return df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3]
for fruit in set(fruit for fruits in df.Col3
for fruit in fruits)})
def pir_slow(df):
return df.drop('Col3', 1).join(df.Col3.str.join('|').str.get_dummies())
def pir_alt1(df):
return df.drop('Col3', 1).join(pd.get_dummies(pd.DataFrame(df.Col3.tolist()).stack()).astype(int).sum(level=0))
def pir_fast(df):
v = df.Col3.values
l = [len(x) for x in v.tolist()]
f, u = pd.factorize(np.concatenate(v))
n, m = len(v), u.size
i = np.arange(n).repeat(l)
dummies = pd.DataFrame(
np.bincount(i * m + f, minlength=n * m).reshape(n, m),
df.index, u
)
return df.drop('Col3', 1).join(dummies)
results = pd.DataFrame(
index=(1, 3, 10, 30, 100, 300, 1000, 3000),
columns='maxu bos psi alex pir_slow pir_fast pir_alt1'.split()
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.set_value(i, j, timeit(stmt, setp, number=10))
Run Code Online (Sandbox Code Playgroud)
Myk*_*tko 13
您可以使用这些功能explode(0.25.0 版中的新功能。)和crosstab:
s = df['Col3'].explode()
df[['Col1', 'Col2']].join(pd.crosstab(s.index, s))
Run Code Online (Sandbox Code Playgroud)
输出:
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
用途get_dummies:
df_out = df.assign(**pd.get_dummies(df.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).sum(level=0))
Run Code Online (Sandbox Code Playgroud)
输出:
Col1 Col2 Col3 Apple Banana Grape Orange
0 C 33.0 [Apple, Orange, Banana] 1 1 0 1
1 A 2.5 [Apple, Grape] 1 0 1 0
2 B 42.0 [Banana] 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
清理栏:
df_out.drop('Col3',axis=1)
Run Code Online (Sandbox Code Playgroud)
输出:
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
你可以通过循环Col3带apply和每个元素转换成与该列表作为成为在结果数据帧报头中的索引中的系列:
pd.concat([
df.drop("Col3", 1),
df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0)
], axis=1)
#Col1 Col2 Apple Banana Grape Orange
#0 C 33.0 1.0 1.0 0.0 1.0
#1 A 2.5 1.0 0.0 1.0 0.0
#2 B 42.0 0.0 1.0 0.0 0.0
Run Code Online (Sandbox Code Playgroud)
您可以Col3使用set comprehension 获得所有独特的成果,如下所示:
set(fruit for fruits in df.Col3 for fruit in fruits)
Run Code Online (Sandbox Code Playgroud)
使用字典理解,您可以浏览每个独特的水果,看看它是否在列中.
>>> df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3]
for fruit in set(fruit for fruits in df.Col3
for fruit in fruits)})
Col1 Col2 Apple Banana Grape Orange
0 C 33.0 1 1 0 1
1 A 2.5 1 0 1 0
2 B 42.0 0 1 0 0
Run Code Online (Sandbox Code Playgroud)
计时
dfs = pd.concat([df] * 1000) # Use 3,000 rows in the dataframe.
# Solution 1 by @Alexander (me)
%%timeit -n 1000
dfs[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in dfs.Col3]
for fruit in set(fruit for fruits in dfs.Col3 for fruit in fruits)})
# 10 loops, best of 3: 4.57 ms per loop
# Solution 2 by @Psidom
%%timeit -n 1000
pd.concat([
dfs.drop("Col3", 1),
dfs.Col3.apply(lambda x: pd.Series(1, x)).fillna(0)
], axis=1)
# 10 loops, best of 3: 748 ms per loop
# Solution 3 by @MaxU
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
%%timeit -n 10
dfs.join(pd.DataFrame(mlb.fit_transform(dfs.Col3),
columns=mlb.classes_,
index=dfs.index))
# 10 loops, best of 3: 283 ms per loop
# Solution 4 by @ScottBoston
%%timeit -n 10
df_out = dfs.assign(**pd.get_dummies(dfs.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).sum(level=0))
# 10 loops, best of 3: 512 ms per loop
But...
>>> print(df_out.head())
Col1 Col2 Col3 Apple Banana Grape Orange
0 C 33.0 [Apple, Orange, Banana] 1000 1000 0 1000
1 A 2.5 [Apple, Grape] 1000 0 1000 0
2 B 42.0 [Banana] 0 1000 0 0
0 C 33.0 [Apple, Orange, Banana] 1000 1000 0 1000
1 A 2.5 [Apple, Grape] 1000 0 1000 0
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
11370 次 |
| 最近记录: |