If value = 1 (binary), extract the column names, join them with a separator, and put the result in a new column

Ana*_*tha 4 python numpy pandas

I have a DataFrame of binary values, produced by running pandas get_dummies:

df= 
Values  A1  A2  B1  B2  B3  B4  C1  C2  C3
10      1   0   1   0   0   0   1   0   0
12      0   1   0   0   1   0   0   1   0
3       0   1   0   1   0   0   0   0   1
5       1   0   0   0   0   1   1   0   0

I want a new column that combines the names of all columns containing a 1.

Expected output:

Values  A1  A2  B1  B2  B3  B4  C1  C2  C3  Combination
10      1   0   1   0   0   0   1   0   0   A1~~B1~~C1
12      0   1   0   0   1   0   0   1   0   A2~~B3~~C2
3       0   1   0   1   0   0   0   0   1   A2~~B2~~C3
5       1   0   0   0   0   1   1   0   0   A1~~B4~~C1

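The actual matrix can be 25,000+ rows by 1,000+ columns.

There is a similar solution in R, but I need it in Python because all my other dependencies are in Python and R is new to me.

Extract column names with value 1 in binary matrix

The R code is below; I need something similar, or any other Python code, that gets me to my expected output.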
Solution 1 : 
as.matrix(apply(m==1,1,function(a) paste0(colnames(m)[a], collapse = "")))

Solution 2: 
t <- which(m==1, arr.ind = TRUE)
as.matrix(aggregate(col~row, cbind(row=rownames(t), col=t[,2]), function(x) 
                                                    paste0(colnames(m)[x], collapse = "")))

How can I achieve my expected output in Python with something similar?

Ale*_* B. 5

You can try apply with str.join:

df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)

print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1

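Explanation

  1. To compute Combination, ignore the Values column. There are several possible ways to do that (see this topic). Here I use drop: df.drop("Values", axis=1)
  2. Use apply with axis=1 to apply a custom function to each row
  3. In the function, keep the values that differ from 0 using x[x != 0]
  4. Select the column names (here, the index of the Series) using .index
  5. Use str.join to match the desired output: "~~".join(x[x != 0].index)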
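
For readers who want to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds the question's sample data by hand (an assumption, since the original frame came from get_dummies) and moves the lambda into a named helper, join_active_columns, which is a hypothetical name used only for illustration.

```python
import pandas as pd

# Sample data copied from the question; in practice this frame
# would come out of pd.get_dummies.
df = pd.DataFrame(
    {
        "Values": [10, 12, 3, 5],
        "A1": [1, 0, 0, 1],
        "A2": [0, 1, 1, 0],
        "B1": [1, 0, 0, 0],
        "B2": [0, 0, 1, 0],
        "B3": [0, 1, 0, 0],
        "B4": [0, 0, 0, 1],
        "C1": [1, 0, 0, 1],
        "C2": [0, 1, 0, 0],
        "C3": [0, 0, 1, 0],
    }
)


def join_active_columns(row, sep="~~"):
    """Join the labels of the non-zero entries of one row (hypothetical helper)."""
    # row is a Series whose index holds the column names (steps 3 to 5).
    return sep.join(row[row != 0].index)


# Steps 1 and 2: drop the non-binary column, then apply the helper row by row.
df["Combination"] = df.drop(columns="Values").apply(join_active_columns, axis=1)
print(df)
```

Because apply with axis=1 calls the helper once per row in Python, this may be slow on a matrix of the size mentioned in the question; it is meant to show the logic, not to be the fastest option.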

Full explanation

# Step 1
print(df.drop("Values", axis=1))
#    A1  A2  B1  B2  B3  B4  C1  C2  C3
# 0   1   0   1   0   0   0   1   0   0
# 1   0   1   0   0   1   0   0   1   0
# 2   0   1   0   1   0   0   0   0   1
# 3   1   0   0   0   0   1   1   0   0

# Step 3
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0], axis=1))
#     A1   A2   B1   B2   B3   B4   C1   C2   C3
# 0  1.0  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN
# 1  NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN
# 2  NaN  1.0  NaN  1.0  NaN  NaN  NaN  NaN  1.0
# 3  1.0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  NaN

# Step 4
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0].index, axis=1))
# 0    Index(['A1', 'B1', 'C1'], dtype='object')
# 1    Index(['A2', 'B3', 'C2'], dtype='object')
# 2    Index(['A2', 'B2', 'C3'], dtype='object')
# 3    Index(['A1', 'B4', 'C1'], dtype='object')

# Step 5
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1


Mus*_*dın 5

df["Combination"] = df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]

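We select the columns other than Values with iloc, then take a dot product whose second operand is the corresponding column names of df with ~~ appended to each. The result also ends with ~~, so we strip it with .str[:-2]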

to get

   Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1
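
It may not be obvious why a dot product works on strings. The pure-Python sketch below reproduces the arithmetic for a single row, assuming the values are strictly 0 or 1; the variable names are illustrative and not part of the answer's code.

```python
# First row of the question's sample data (without the Values column).
columns = ["A1", "A2", "B1", "B2", "B3", "B4", "C1", "C2", "C3"]
row = [1, 0, 1, 0, 0, 0, 1, 0, 0]
sep = "~~"

# A dot product multiplies element-wise, then adds the products.
# For int * str, multiplication repeats the string:
#   1 * "A1~~" == "A1~~"   and   0 * "A2~~" == ""
products = [flag * (name + sep) for flag, name in zip(row, columns)]

# Adding strings concatenates them, so the "sum" is the joined labels
# followed by one trailing separator.
joined = "".join(products)

# Drop the trailing separator, as .str[:-2] does in the answer.
combination = joined[: -len(sep)]
print(combination)
```

One consequence of this arithmetic is that a value greater than 1 would repeat its label, so the trick fits 0/1 data such as get_dummies output.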