Tim*_*qua 5 python dataframe pandas
I have the following dynamically created DataFrame, but all related values are stored in a single row:
df
+----------------+----------------+----------------+----------------+
| 1 | 2 | 3 | 4 |
+----------------+----------------+----------------+----------------+
| a1, b1, c1, d1 | a2, b2, c2, d2 | a3, b3, c3, d3 | a4, b4, c4, d4 |
+----------------+----------------+----------------+----------------+
I need all the a_i values in one row, all the b_i values in the next, and so on (the columns are predefined and constant):
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| a1 | a2 | a3 | a4 |
| b1 | b2 | b3 | b4 |
| c1 | c2 | c3 | c4 |
| d1 | d2 | d3 | d4 |
+----+----+----+----+
Because the number of distinct letters in df varies from case to case, I need a dynamic solution that converts df into the form above.
df.explode(df.columns.tolist())
Output:
1 2 3 4
0 a1 a2 a3 a4
0 b1 b2 b3 b4
0 c1 c2 c3 c4
0 d1 d2 d3 d4
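The table in the question displays each cell as one comma-separated string rather than a list. If the cells really are strings (an assumption, since the question does not show how df was built), they would need to be split before exploding. A minimal sketch, assuming a ", " separator and pandas 1.3 or later, which is when explode started accepting a list of columns:

```python
import pandas as pd

# Assumed input: one row, each cell a single comma-separated string.
df = pd.DataFrame({
    1: ["a1, b1, c1, d1"],
    2: ["a2, b2, c2, d2"],
    3: ["a3, b3, c3, d3"],
    4: ["a4, b4, c4, d4"],
})

# Turn every string cell into a list of its parts (separator is an assumption).
lists = df.apply(lambda col: col.str.split(", "))

# Explode all columns together; every cell must hold the same number of items.
# ignore_index=True gives a fresh 0..n-1 index instead of repeating 0.
result = lists.explode(lists.columns.tolist(), ignore_index=True)
print(result)
```

Because the split happens per cell, this does not depend on how many letters each cell holds, as long as the count is the same in every column.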
Given that df has this structure:
import numpy as np
import pandas as pd

df = pd.DataFrame({1: [np.array('a1 b1 c1 d1'.split(' '))],
                   2: [np.array('a2 b2 c2 d2'.split(' '))],
                   3: [np.array('a3 b3 c3 d3'.split(' '))],
                   4: [np.array('a4 b4 c4 d4'.split(' '))]})
Input DataFrame:
1 2 3 4
0 [a1, b1, c1, d1] [a2, b2, c2, d2] [a3, b3, c3, d3] [a4, b4, c4, d4]
You can use pd.Series.explode:
df.apply(pd.Series.explode)
Output:
1 2 3 4
0 a1 a2 a3 a4
0 b1 b2 b3 b4
0 c1 c2 c3 c4
0 d1 d2 d3 d4
Somewhat similar to Scott Boston's answer, but much faster (apply is notoriously slow):
pd.DataFrame(df.values[0].tolist(), columns=df.columns)
# 1 2 3 4
#0 a1 b1 c1 d1
#1 a2 b2 c2 d2
#2 a3 b3 c3 d3
#3 a4 b4 c4 d4
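One thing worth noticing in the output just above: row 0 is a1 b1 c1 d1, so each original cell has become a row, which is the transpose of the layout the question asks for. It only fits here because the example happens to be 4x4. A possible variant that keeps each cell as a column is sketched below; it is not taken from any of the answers, and the 3-letter input is a made-up example to show the non-square case:

```python
import numpy as np
import pandas as pd

# Hypothetical non-square input: 4 columns, 3 letters per cell.
df = pd.DataFrame(
    {i: [np.array([x + str(i) for x in ["a", "b", "c"]])] for i in range(1, 5)}
)

# df.iloc[0] maps each column label to its array; building a DataFrame
# from that dict makes every array a column, so no transpose is needed.
result = pd.DataFrame(df.iloc[0].to_dict())
print(result)
```

This still avoids apply, so it should stay in the same speed class as the one-liner above, though that has not been timed here.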
Part of the original answer:
If each of your columns contains just one long comma-separated string in a single-row frame, like this:
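df = pd.DataFrame(
    [
        ",".join(["a" + str(i) for i in range(4)]),
        ",".join(["b" + str(i) for i in range(4)]),
    ],
).T
df.columns = list("ab")
df.apply(lambda x: pd.Series(x[0].split(',')))

Additional content:
(This builds on the other answers, which had already solved the problem, but to get a clear picture of execution efficiency I thought it would be helpful to test it and print the results here. I was surprised myself, and based on the results I will write this kind of code with better performance in the future. Thanks to @DIZ and @Scott Boston.)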
import pandas as pd
import numpy as np

df = pd.DataFrame({i: [np.array([x + str(i) for x in ['a','b','c','d']])] for i in range(1,5)})

def convert_using_explode(my_df):
    return my_df.apply(pd.Series.explode)

def convert_using_conversion_to_list(my_df):
    return pd.DataFrame(my_df.values[0].tolist(), columns=my_df.columns)

# this is what I would have most probably done before getting involved in this question
def convert_first_idx_to_series(my_df):
    another_df = pd.DataFrame()
    for col in my_df:
        another_df[col] = pd.Series(my_df.loc[0, col])
    return another_df

Now timing the execution:
%time convert_using_explode(df)
Wall time: 2 ms
Out[10]:
    1   2   3   4
0  a1  a2  a3  a4
0  b1  b2  b3  b4
0  c1  c2  c3  c4
0  d1  d2  d3  d4

%time convert_using_conversion_to_list(df)
Wall time: 966 µs
Out[11]:
    1   2   3   4
0  a1  b1  c1  d1
1  a2  b2  c2  d2
2  a3  b3  c3  d3
3  a4  b4  c4  d4

%time convert_first_idx_to_series(df)
Wall time: 1.99 ms
Out[61]:
    1   2   3   4
0  a1  a2  a3  a4
1  b1  b2  b3  b4
2  c1  c2  c3  c4
3  d1  d2  d3  d4

Note that @DIZ's version is roughly twice as fast as the others.
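A single %time call on a millisecond-scale operation can be quite noisy, so the comparison may be easier to trust when each function is run many times. The sketch below repeats the same three functions with the standard-library timeit module; the repeat counts (NUMBER, REPEAT) are arbitrary choices of mine, and the numbers it prints will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {i: [np.array([x + str(i) for x in ["a", "b", "c", "d"]])] for i in range(1, 5)}
)

def convert_using_explode(my_df):
    return my_df.apply(pd.Series.explode)

def convert_using_conversion_to_list(my_df):
    return pd.DataFrame(my_df.values[0].tolist(), columns=my_df.columns)

def convert_first_idx_to_series(my_df):
    another_df = pd.DataFrame()
    for col in my_df:
        another_df[col] = pd.Series(my_df.loc[0, col])
    return another_df

NUMBER = 50  # calls per measurement (arbitrary)
REPEAT = 3   # measurements per function (arbitrary)

timings = {}
for func in (
    convert_using_explode,
    convert_using_conversion_to_list,
    convert_first_idx_to_series,
):
    runs = timeit.repeat(lambda: func(df), number=NUMBER, repeat=REPEAT)
    # Keep the best run, expressed as seconds per single call.
    timings[func.__name__] = min(runs) / NUMBER

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```

Taking the minimum over several runs is the usual way to reduce the influence of background load on a micro-benchmark.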