Pandas: get unique values from a column of lists

Question

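How can I get the unique values of a column of lists in pandas or numpy, so that the second column, `Genre`, which holds a list of genres per row, would result in 'action', 'crime', 'drama'?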

The closest (but non-functional) solution I could come up with is:

 genres = data['Genre'].unique()

But this predictably results in a TypeError, because lists are not hashable:

TypeError: unhashable type: 'list'

A set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also fails, this time with TypeError: set() takes no keyword arguments
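
As a minimal sketch of what the second attempt seems to be reaching for: pass `set` itself (the type, not a call) to `Series.apply` so that each list becomes a set, then union the per-row sets. The DataFrame below is an assumed stand-in for the question's data, not the asker's actual table.

```python
import pandas as pd

# Assumed stand-in for the question's data: a 'Genre' column holding lists.
data = pd.DataFrame({
    "Genre": [["crime", "drama"], ["action", "crime", "drama"]],
})

# Pass the set type itself as the function; each list becomes a set.
row_sets = data["Genre"].apply(set)

# Union all per-row sets into a single set of unique genres.
genres = set().union(*row_sets)

print(sorted(genres))  # ['action', 'crime', 'drama']
```

Selecting the column first (`data["Genre"]`) avoids passing extra keyword arguments through `apply`, which would otherwise be forwarded to `set`.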

Answer 1

PMe*_*nde 17

You can use explode:

import pandas as pd

data = pd.DataFrame([
    {
        "title": "The Godfather: Part II",
        "genres": ["crime", "drama"],
        "director": "Francis Ford Coppola"
    },
    {
        "title": "The Dark Knight",
        "genres": ["action", "crime", "drama"],
        "director": "Christopher Nolan"
    }
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()

The result is:

array(['crime', 'drama', 'action'], dtype=object)
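
One detail worth knowing when using `explode` this way: a row whose list is empty is turned into NaN, which then shows up among the unique values. The sketch below assumes pandas 0.25 or later (where `Series.explode` was introduced) and adds a hypothetical third row, "Untitled", with an empty list purely for illustration.

```python
import pandas as pd

# Hypothetical data: the third row ("Untitled") has an empty genre list.
data = pd.DataFrame({
    "title": ["The Godfather: Part II", "The Dark Knight", "Untitled"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"], []],
})

# explode() emits one row per list element; an empty list becomes NaN.
exploded = data["genres"].explode()

# Drop the NaN placeholder, then sort for a stable, readable order.
unique_genres = sorted(exploded.dropna().unique())

print(unique_genres)  # ['action', 'crime', 'drama']
```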

Answer 2

raf*_*elc 14

If you only want to find the unique values, I suggest using itertools.chain.from_iterable to concatenate all of those lists:

>>> import itertools
>>> import numpy as np
>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')

Or, even faster:

>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
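
Because a `set` is unordered, the order in which it prints is not something to rely on. If order matters, here is a small sketch of two options, assuming the same two-row `df` used in the timings: `dict.fromkeys` keeps first-seen order (Python 3.7+), while `sorted` gives alphabetical order.

```python
import itertools

import pandas as pd

df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})

# dict keys keep insertion order, so this de-duplicates in first-seen order.
first_seen = list(dict.fromkeys(itertools.chain.from_iterable(df["Genre"])))
print(first_seen)  # ['crime', 'drama', 'action']

# Sorting the set gives a deterministic alphabetical order instead.
alphabetical = sorted(set(itertools.chain.from_iterable(df["Genre"])))
print(alphabetical)  # ['action', 'crime', 'drama']
```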

Timings

df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)

%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop

%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop

%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop

%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop

%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
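
The `%timeit` lines above need IPython. As a rough sketch of how the comparison could be reproduced in a plain Python script, the standard-library `timeit` module can be used instead. The labels in `candidates` are made up for this example, the two `.sum()` variants are left out because they take seconds per call, and the absolute numbers will differ by machine and library version.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

# Same benchmark data as above: 20,000 rows of short genre lists.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
df = pd.concat([df] * 10000)

# Illustrative labels mapped to the approaches being compared.
candidates = {
    "set + chain": lambda: set(itertools.chain.from_iterable(df["Genre"])),
    "set + comprehension": lambda: set([x for y in df["Genre"] for x in y]),
    "np.unique + chain": lambda: np.unique([*itertools.chain.from_iterable(df["Genre"])]),
}

for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported as time per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.2f} ms per call")
```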
