Convert a pandas column of lists into a matrix representation (one-hot encoding)

Question


I have a pandas column that holds lists of values of varying lengths, like this:

idx  lists
  0  [1,3,4,5]
  1  [2]
  2  [3,5]
  3  [2,3,5]


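I want to convert them into a matrix format in which each possible value is a column, and each row holds 1 if the value is present in that row's list and 0 otherwise, for example: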

idx  1 2 3 4 5
  0  1 0 1 1 1
  1  0 1 0 0 0
  2  0 0 1 0 1
  3  0 1 1 0 1


I thought the term for this was one-hot encoding, so I tried pd.get_dummies, which is documented as doing one-hot encoding. But when I give it input like the above:

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])
pd.get_dummies(test_hot)


I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 899, in get_dummies
    dtype=dtype)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 906, in _get_dummies_1d
    codes, levels = _factorize_from_iterable(Series(data))
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
    cat = Categorical(values, ordered=True)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 347, in __init__
    codes, categories = factorize(values, sort=False)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'list'


If I pass in just a single flat list of values, the method works fine:

[1,2,3,4,5]


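It produces a 5x5 matrix, but with only a single 1 in each row. I am trying to extend this so that, given a column of lists, each row can contain more than one 1.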

Answer 1

jez*_*ael 3

If performance matters, use MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])

# fit_transform returns a 0/1 indicator array; classes_ holds the sorted labels
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(test_hot), columns=mlb.classes_)
print(df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1

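Building the DataFrame from the bare array discards the index of the original Series. The sketch below applies the same MultiLabelBinarizer approach to the example column from the question and passes the index through explicitly. The index labels 10 to 13 and the name `lists` are made up for illustration and are not from the original post.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Example lists from the question; the index labels are hypothetical,
# chosen only to show that they survive the transformation.
lists = pd.Series(
    [[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]],
    index=[10, 11, 12, 13],
    name="lists",
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(lists),  # 0/1 indicator array, one row per list
    columns=mlb.classes_,      # sorted unique values found in the lists
    index=lists.index,         # keep the original row labels
)
print(encoded)
```

Keeping the index this way should make it straightforward to join the encoded columns back onto the frame the lists came from.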

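Your get_dummies approach needs to be changed: create a DataFrame from the lists, reshape it with DataFrame.stack, then call get_dummies and aggregate by taking the max over the first index level: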

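# Parentheses let the method chain span several lines.
# groupby(level=0).max() is equivalent to the original .max(level=0, axis=0),
# which was removed in pandas 2.0.
df = (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
        .groupby(level=0).max())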

print (df)
   1  2  3  4  5  6
0  1  1  1  0  0  0
1  0  0  1  1  1  0
2  1  0  0  0  0  1

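As a variation that is not part of the original answer, the same reshape-then-aggregate idea can be sketched with Series.explode, assuming pandas 0.25 or later where that method is available. Passing dtype=int to get_dummies is also an addition here, so that the result stays 0/1 on pandas versions where get_dummies returns booleans by default.

```python
import pandas as pd

test_hot = pd.Series([[1, 2, 3], [3, 4, 5], [1, 6]])

# explode() yields one row per list element and repeats the original
# index label for every element of the same list.
exploded = test_hot.explode()

# One indicator column per distinct value; dtype=int forces 0/1 output.
dummies = pd.get_dummies(exploded, dtype=int)

# Collapse the repeated index labels back to one row per original list.
df = dummies.groupby(level=0).max()
print(df)
```

This skips the intermediate DataFrame and the astype(int) cast on the stacked values, at the cost of requiring a newer pandas than the one shown in the question's traceback.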

Details:

Create a Series with a MultiIndex:

print(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
0  0    1
   1    2
   2    3
1  0    3
   1    4
   2    5
2  0    1
   1    6
dtype: int32


Call pd.get_dummies:

print (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int)))
     1  2  3  4  5  6
0 0  1  0  0  0  0  0
  1  0  1  0  0  0  0
  2  0  0  1  0  0  0
1 0  0  0  1  0  0  0
  1  0  0  0  1  0  0
  2  0  0  0  0  1  0
2 0  1  0  0  0  0  0
  1  0  0  0  0  0  1


Finally, aggregate with max over each value of the first index level.
