Ana*_*tha 4 python numpy pandas
I have a DataFrame of binary values, obtained by running pandas get_dummies:
df=
Values A1 A2 B1 B2 B3 B4 C1 C2 C3
10 1 0 1 0 0 0 1 0 0
12 0 1 0 0 1 0 0 1 0
3 0 1 0 1 0 0 0 0 1
5 1 0 0 0 0 1 1 0 0
I want a new column that holds the combination of all column names whose value is 1.
Expected output:
Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
The actual matrix can be 25000+ rows by 1000+ columns.
There is a similar solution in R, but I need it in Python because all my other dependencies are in Python and R is new to me.
The R code is below. I need equivalent code, or any other Python code, that produces my expected output.
Solution 1 :
as.matrix(apply(m==1,1,function(a) paste0(colnames(m)[a], collapse = "")))
Solution 2:
t <- which(m==1, arr.ind = TRUE)
as.matrix(aggregate(col~row, cbind(row=rownames(t), col=t[,2]), function(x)
paste0(colnames(m)[x], collapse = "")))
How can I get my expected output with something similar in Python?
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
# Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
# 0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
# 1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
# 2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
# 3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
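Explanation:
1. To build the Combination column, the Values column has to be ignored. There are several ways to do that; here I use drop: df.drop("Values", axis=1).
2. Apply a custom function to each row using apply with axis=1.
3. Keep only the values different from 0 using x[x != 0].
4. Take the index of what remains (here, the column names) using x[x != 0].index.
5. Use str.join to match the desired output: "~~".join(x[x != 0].index).
Full illustration: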
# Step 1
print(df.drop("Values", axis=1))
# A1 A2 B1 B2 B3 B4 C1 C2 C3
# 0 1 0 1 0 0 0 1 0 0
# 1 0 1 0 0 1 0 0 1 0
# 2 0 1 0 1 0 0 0 0 1
# 3 1 0 0 0 0 1 1 0 0
# Step 3
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0], axis=1))
# A1 A2 B1 B2 B3 B4 C1 C2 C3
# 0 1.0 NaN 1.0 NaN NaN NaN 1.0 NaN NaN
# 1 NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN
# 2 NaN 1.0 NaN 1.0 NaN NaN NaN NaN 1.0
# 3 1.0 NaN NaN NaN NaN 1.0 1.0 NaN NaN
# Step 4
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0].index, axis=1))
# 0 Index(['A1', 'B1', 'C1'], dtype='object')
# 1 Index(['A2', 'B3', 'C2'], dtype='object')
# 2 Index(['A2', 'B2', 'C3'], dtype='object')
# 3 Index(['A1', 'B4', 'C1'], dtype='object')
# Step 5
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
# Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
# 0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
# 1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
# 2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
# 3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
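Since the question mentions 25000+ rows by 1000+ columns, a row-wise apply may turn out slow at that size. As a hedged alternative (not part of the original answer, and not benchmarked here), the same result can be sketched with a NumPy boolean mask and a list comprehension. The names flags, names and mask are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Values": [10, 12, 3, 5],
    "A1": [1, 0, 0, 1], "A2": [0, 1, 1, 0],
    "B1": [1, 0, 0, 0], "B2": [0, 0, 1, 0],
    "B3": [0, 1, 0, 0], "B4": [0, 0, 0, 1],
    "C1": [1, 0, 0, 1], "C2": [0, 1, 0, 0], "C3": [0, 0, 1, 0],
})

# Illustrative names (assumptions): the dummy columns, their labels,
# and a boolean mask marking the non-zero cells.
flags = df.drop(columns="Values")
names = np.asarray(flags.columns, dtype=object)
mask = flags.to_numpy() != 0

# For each row, pick the labels where the mask is True and join them.
df["Combination"] = ["~~".join(names[row]) for row in mask]
print(df["Combination"].tolist())
```

This avoids building a pandas Series per row, which is where apply with axis=1 tends to spend its time; whether it helps in practice depends on the data.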
df["Combination"] = df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]
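We select the columns other than Values with iloc, then take a dot product in which the second operand is the corresponding column names of df with ~~ appended to each. The result also ends with a trailing ~~, so we strip it with .str[:-2]
to get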
Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
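Why the dot product yields strings can be illustrated in plain Python for a single row. This is only a sketch of the idea, not pandas internals: multiplying a string by 1 keeps it, multiplying by 0 gives an empty string, and summing the pieces concatenates them. The names labels and row are made up for the example.

```python
# Hypothetical single row with three dummy columns; labels carry the "~~" suffix.
labels = ["A1~~", "B1~~", "C1~~"]
row = [1, 0, 1]

# int * str repeats the string: 1 keeps the label, 0 yields "".
pieces = [flag * label for flag, label in zip(row, labels)]

# Concatenating the pieces plays the role of the sum in the dot product.
joined = "".join(pieces)

# Drop the trailing "~~", as .str[:-2] does in the one-liner.
result = joined[:-2]
print(pieces, joined, result)
```

Note that this relies on the cells being integers 0 or 1; other values would repeat a label more than once.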