Split list values in a Pandas DataFrame column into duplicated rows

Question

I have a DataFrame that looks like this:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
.
.
.

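What I want to do is take the "authors" column and split each list into separate rows, duplicating all the other columns. I also want to store the result in a new column named "author" and keep the original column.

The following shows exactly what I want to achieve: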

publication_title    authors                             author          type ...
title 1              ['author1', 'author2', 'author3']   author1         proceedings
title 1              ['author1', 'author2', 'author3']   author2         proceedings
title 1              ['author1', 'author2', 'author3']   author3         proceedings
title 2              ['author4', 'author5']              author4         collections
title 2              ['author4', 'author5']              author5         collections
title 3              ['author6', 'author7']              author6         books
title 3              ['author6', 'author7']              author7         books
.
.
.

I tried to do this with the pandas DataFrame explode method, but I could not find a way to store the result in a new column.

Thanks for your help.

Answer 1

Erf*_*fan 6

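Since pandas 0.25.0 we have the explode method. First we copy the authors column and give it a new name in the same step using assign, then we explode this new column into rows, which duplicates the other columns: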

df.assign(author=df['authors']).explode('author')

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
0           title_1  [author1, author2, author3]  proceedings  author2
0           title_1  [author1, author2, author3]  proceedings  author3
1           title_2           [author4, author5]  collections  author4
1           title_2           [author4, author5]  collections  author5
2           title_3           [author6, author7]        books  author6
2           title_3           [author6, author7]        books  author7
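
For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. The sample frame is an assumption reconstructed from the question's example (the "..." columns are left out), and it assumes the authors column holds real Python lists and pandas is 0.25.0 or newer.

```python
import pandas as pd

# Sample data reconstructed from the question; values are illustrative.
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [
        ["author1", "author2", "author3"],
        ["author4", "author5"],
        ["author6", "author7"],
    ],
    "type": ["proceedings", "collections", "books"],
})

# Copy the list column under a new name, then expand only that copy.
# The original "authors" column keeps its full list on every row.
result = df.assign(author=df["authors"]).explode("author")

print(result)
```

Both assign and explode return new objects, so df itself is left unchanged and still has only its three original columns.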

If you want to get rid of the duplicated index values, use reset_index:

df.assign(author=df['authors']).explode('author').reset_index(drop=True)

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
1           title_1  [author1, author2, author3]  proceedings  author2
2           title_1  [author1, author2, author3]  proceedings  author3
3           title_2           [author4, author5]  collections  author4
4           title_2           [author4, author5]  collections  author5
5           title_3           [author6, author7]        books  author6
6           title_3           [author6, author7]        books  author7
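
One caveat, offered as an assumption rather than something the question confirms: if the authors column actually holds strings that merely look like lists (for example after a round trip through CSV), explode leaves those rows as they are, because there is nothing list-like to expand. In that case a possible sketch is to parse the strings first with ast.literal_eval from the standard library. The sample data below is hypothetical.

```python
import ast

import pandas as pd

# Hypothetical data: the lists are stored as text, not as Python lists.
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2"],
    "authors": ["['author1', 'author2']", "['author3']"],
    "type": ["proceedings", "books"],
})

# Parse each string into a real list before exploding.
parsed = df["authors"].apply(ast.literal_eval)

result = (
    df.assign(author=parsed)      # new column holds the parsed lists
      .explode("author")          # one row per author
      .reset_index(drop=True)     # replace the repeated index with 0..n-1
)

print(result)
```

The original authors column is kept as the unparsed text here. Assigning the parsed Series back to "authors" as well would turn it into real lists.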
