Abh*_*kur 222 python pandas pandas-groupby
I have a pandas DataFrame like this:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get the second column as a list in each row:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
EdC*_*ica 317
You can do this by using groupby to group on the column of interest and then applying list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
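As a sanity check on the result above, the grouped lists can be expanded back into rows. This is only a sketch: it assumes pandas 0.25 or later (where `DataFrame.explode` is available), and the name `restored` is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Collect column 'b' into one list per value of 'a'.
df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')

# Undo the grouping: one row per list element.
# explode leaves the column as object dtype, so cast it back to integers.
restored = (
    df1.explode('new')
       .rename(columns={'new': 'b'})
       .reset_index(drop=True)
)
restored['b'] = restored['b'].astype('int64')

print(df1)
print(restored)
```

Because the input here is already ordered by `a`, the restored rows come back in the original order; with unsorted input they would come back grouped by key instead.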
B. *_* M. 41
If performance is important, go down to the numpy level:
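import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # sort rows by key so that equal keys are contiguous
    keys, values = df.sort_values('a').values.T
    # unique keys and the position where each one first appears
    ukeys, index = np.unique(keys, True)
    # split the values at each group boundary
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2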
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
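The sort, unique, split idea above can be checked against groupby on the question's small frame. The following is a sketch, not the answer's exact code: the function name `group_lists_numpy` and its `key`/`value` parameters are made up for illustration, and it assumes a stable sort (`kind='mergesort'`) is wanted so that the order of values inside each group matches what groupby returns.

```python
import numpy as np
import pandas as pd

def group_lists_numpy(df, key='a', value='b'):
    # A stable sort keeps the original row order inside each group.
    ordered = df.sort_values(key, kind='mergesort')
    keys = ordered[key].to_numpy()
    values = ordered[value].to_numpy()
    # Unique keys plus the position where each key first appears.
    ukeys, first = np.unique(keys, return_index=True)
    # Cut the value array at every group boundary.
    chunks = np.split(values, first[1:])
    return pd.Series([chunk.tolist() for chunk in chunks],
                     index=pd.Index(ukeys, name=key), name=value)

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

numpy_result = group_lists_numpy(df)
groupby_result = df.groupby('a')['b'].apply(list)

print(numpy_result)
print(groupby_result)
```

Whether the numpy route is actually faster depends on the data size and library versions, so it is worth timing both on your own data.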
Aco*_*rbe 20
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
gives an index-wise description of the groups.
To get the elements of a single group, you can do, for instance:
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
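To get from these group objects to the lists the question asks for, one option is to iterate over the groupby and collect each group's values. This is a minimal sketch; the name `lists_by_group` is introduced here for illustration only.

```python
import pandas as pd

L = ['A', 'A', 'B', 'B', 'B', 'C']
N = [1, 2, 5, 5, 4, 6]
df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# the same sub-DataFrames that get_group returns one at a time.
lists_by_group = {key: group['N'].tolist() for key, group in df.groupby('L')}

print(lists_by_group)
```

This builds a plain dict rather than a Series, which can be convenient when the result is consumed outside pandas.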
小智 19
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing custom aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
Sea*_*n.H 19
Just a supplement. pandas.pivot_table is more general and seems more convenient:
"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
                    'b':[1,2,5,5,4,6],
                    'c':[1,2,1,1,1,6]})
print(df)

   a  b  c
0  A  1  1
1  A  2  2
2  B  5  1
3  B  5  1
4  B  4  1
5  C  6  6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
           b       c
a
A     [1, 2]  {1, 2}
B  [5, 5, 4]     {1}
C        [6]     {6}
Mar*_*hke 15
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
...: :[3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Mit*_*ril 10
It is time to use agg instead of apply.
When
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as lists, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
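Note that producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so use the DataFrame form for the multi-column case.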
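The difference between the two return types can be seen directly, and the speed claim can be checked on your own machine. The sketch below uses the standard-library `timeit` module; the names `as_frame`, `as_series`, `t_frame` and `t_series` are illustrative, and no particular timing ratio is assumed.

```python
import timeit
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 5, 5, 4, 6]})

# Selecting several columns gives a DataFrame of lists.
as_frame = df.groupby('a')[['b', 'c']].agg(list)
# Selecting one column gives a Series of lists.
as_series = df.groupby('a')['b'].agg(list)
print(type(as_frame).__name__, type(as_series).__name__)

# Rough timing of the two ways to aggregate a single column.
t_frame = timeit.timeit(lambda: df.groupby('a').agg({'b': list}), number=50)
t_series = timeit.timeit(lambda: df.groupby('a')['b'].agg(list), number=50)
print(f"DataFrame path: {t_frame:.4f} s, Series path: {t_series:.4f} s")
```

On a frame this small the numbers are dominated by overhead, so a larger, more realistic frame gives a fairer comparison.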
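The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.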
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
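One thing to keep in mind with the tuple syntax: 'unique' drops repeated values and returns numpy arrays, so it is not the same as collecting every value into a list. The sketch below contrasts the two; it assumes pandas 0.25 or later for named aggregation, and the names `uniq` and `full` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# 'unique' keeps each value once, in order of first appearance.
uniq = df.groupby('a').agg(b=('b', 'unique'))
# Passing the built-in list keeps every value, duplicates included.
full = df.groupby('a').agg(b=('b', list))

print(uniq)
print(full)
```

If the question's exact output (with the repeated 5 in group B) is wanted, the `list` variant is the one to use.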
If you are looking for a list of unique values while grouping multiple columns, this may help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
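Since a set has no defined order, `list(set(x))` may return the unique values in a different order than they appear in the data. If the order of first appearance matters, one possible variant (a sketch, not part of the original answer) deduplicates with `dict.fromkeys`, which preserves insertion order in Python 3.7 and later. The name `ordered_unique` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 5, 5, 4, 6]})

# dict.fromkeys keeps the first occurrence of each value, in order.
ordered_unique = (
    df.groupby('a')
      .agg(lambda x: list(dict.fromkeys(x)))
      .reset_index()
)

print(ordered_unique)
```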
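Building on @B.M's answer, here is a more general version, updated to work with newer library versions (numpy version 1.19.2, pandas version 1.2.1).
This solution can also deal with multi-indices.
However, it has not been heavily tested, so use it with caution.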
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})


def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]

    # list of tuple of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #    rows = number of non-grouped data columns
    #    cols = number of data points grouped into that unique key pair

    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above
        col_name = tup[-1]  # last entry is data-column name

        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
In [227]: %timeit f_multi(df, ['a','d'])

2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df.groupby(['a','d']).agg(list)

4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For the random seed 0 one would get:
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
'a': ['A', 'A', 'B', 'B', 'B', 'C'],
'b': [1, 2, 5, 5, 4, 6],
'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object