Ben*_*ang 6 python list pandas
I have a pandas column that contains lists of values of varying lengths, like this:
idx lists
0 [1,3,4,5]
1 [2]
2 [3,5]
3 [2,3,5]
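I want to convert them into a matrix format in which every possible value becomes a column, and each row holds 1 where the value is present and 0 otherwise, for example: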
idx 1 2 3 4 5
0 1 0 1 1 1
1 0 1 0 0 0
2 0 0 1 0 1
3 0 1 1 0 1
I thought the term for this was one-hot encoding, so I tried pd.get_dummies, which is documented as doing one-hot encoding. But when I give it input like the above:
test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])
pd.get_dummies(test_hot)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 899, in get_dummies
dtype=dtype)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 906, in _get_dummies_1d
codes, levels = _factorize_from_iterable(Series(data))
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2515, in _factorize_from_iterable
cat = Categorical(values, ordered=True)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 347, in __init__
codes, categories = factorize(values, sort=False)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 630, in factorize
na_value=na_value)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/algorithms.py", line 476, in _factorize_array
na_value=na_value)
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'list'
If I pass in just a single flat list of values, the method works fine:
[1,2,3,4,5]
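That produces a 5x5 matrix, but with only a single 1 in each row. I am trying to extend this so that, by passing a column of lists, each row can contain more than one 1.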
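The last frames of the traceback above go through `factorize` and a `PyObjectHashTable`, which suggests pandas is trying to hash each element of the Series to find the distinct categories. A minimal sketch of why that fails for a column of lists follows; `is_hashable` is a hypothetical helper introduced only for illustration, not part of pandas.

```python
# Lists are mutable, so Python does not define a hash for them.
def is_hashable(value):
    """Return True if `value` can be hashed (hypothetical helper for illustration)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


row = [1, 3, 4, 5]
print(is_hashable(row))         # a list cannot be hashed
print(is_hashable(tuple(row)))  # a tuple of ints can
print(is_hashable(row[0]))      # so can a single int
```

This is why a flat list of scalars works with pd.get_dummies while a Series whose elements are lists does not.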
If performance matters, use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test_hot = pd.Series([[1,2,3],[3,4,5],[1,6]])
mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array; mlb.classes_ holds the sorted labels
df = pd.DataFrame(mlb.fit_transform(test_hot), columns=mlb.classes_)
print(df)
1 2 3 4 5 6
0 1 1 1 0 0 0
1 0 0 1 1 1 0
2 1 0 0 0 0 1
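As a usage sketch, the same MultiLabelBinarizer approach can be applied to the column from the question. The variable names `lists` and `encoded` are assumptions chosen for illustration, and passing `index=lists.index` is an optional addition that keeps the original row labels on the result.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The column from the question, with its idx values 0..3 as the index.
lists = pd.Series([[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]], name="lists")

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(lists),  # 2-D array of 0/1 indicators
    columns=mlb.classes_,      # sorted distinct labels found in the lists
    index=lists.index,         # keep the original row labels
)
print(encoded)
```

Keeping the index matters mostly when the Series is a filtered or re-ordered slice of a larger DataFrame and the encoded columns need to be joined back.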
Your own approach can also work with some changes: create a DataFrame from the lists, reshape it with DataFrame.stack, and finally call get_dummies and aggregate with max per original row:
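# Parentheses allow the chained call to continue on the next line.
# max(level=0) was removed in pandas 2.0; groupby(level=0).max() is equivalent.
df = (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
        .groupby(level=0).max())
print(df)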
1 2 3 4 5 6
0 1 1 1 0 0 0
1 0 0 1 1 1 0
2 1 0 0 0 0 1
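A pandas-only alternative, sketched here under the assumption that pandas 0.25 or later is available (for `Series.explode`) and that no list in the column is empty. This is a different route from the stack-based reshape and is not the answer author's method.

```python
import pandas as pd

test_hot = pd.Series([[1, 2, 3], [3, 4, 5], [1, 6]])

# explode() yields one row per list element and repeats the original index label.
exploded = test_hot.explode().astype(int)

# One indicator row per element, then collapse rows that share an index label.
df = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(df)
```

Passing `dtype=int` requests 0/1 integers explicitly, since newer pandas versions return booleans from get_dummies by default.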
Details:
Create a MultiIndex Series:
print(pd.DataFrame(test_hot.values.tolist()).stack().astype(int))
0 0 1
1 2
2 3
1 0 3
1 4
2 5
2 0 1
1 6
dtype: int32
Call pd.get_dummies:
print (pd.get_dummies(pd.DataFrame(test_hot.values.tolist()).stack().astype(int)))
1 2 3 4 5 6
0 0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 1 0 0 0
1 0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 0 0 0 1 0
2 0 1 0 0 0 0 0
1 0 0 0 0 0 1
Finally, aggregate with max over the first index level.
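The three steps above can be sketched end to end on the data from the question. The intermediate names (`stacked`, `dummies`, `result`) are assumptions for illustration, and the explicit `dropna()` is a defensive addition so the padding NaNs are removed regardless of how `stack()` treats missing values in a given pandas version.

```python
import pandas as pd

lists = pd.Series([[1, 3, 4, 5], [2], [3, 5], [2, 3, 5]])

# Step 1: pad the lists into a rectangular frame, then stack into a MultiIndex Series.
stacked = (
    pd.DataFrame(lists.tolist(), index=lists.index)
    .stack()
    .dropna()
    .astype(int)
)

# Step 2: one indicator row per (row label, position) pair.
dummies = pd.get_dummies(stacked, dtype=int)

# Step 3: collapse on the first index level; for 0/1 values, max acts as a logical OR.
result = dummies.groupby(level=0).max()
print(result)
```

Because each stacked row holds exactly one 1, taking the maximum within each original row marks every value that appeared anywhere in that row's list.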