Concatenating lists of strings by day in a Pandas DataFrame

Oli*_*lil 4 python pandas

I have the following:

import pandas as pd
import numpy as np

documents = [['Human', 'machine', 'interface'],
             ['A', 'survey', 'of', 'user'],
             ['The', 'EPS', 'user'],
             ['System', 'and', 'human'],
             ['Relation', 'of', 'user'],
             ['The', 'generation'],
             ['The', 'intersection'],
             ['Graph', 'minors'],
             ['Graph', 'minors', 'a']]

df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10', '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20', '2014-05-20'], dtype=np.datetime64), 'text': documents})

There are only 5 unique days. I would like to group by day and get the following result:

documents2 = [['Human', 'machine', 'interface'],
              ['A', 'survey', 'of', 'user'],
              ['The', 'EPS', 'user', 'System', 'and', 'human'],
              ['Relation', 'of', 'user', 'The', 'generation'],
              ['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]


df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-15', '2014-05-20'], dtype=np.datetime64), 'text': documents2})

raf*_*elc 5

IIUC, you can aggregate with sum:

df.groupby('date').text.sum() # or .agg(sum)

date
2014-05-01                          [Human, machine, interface]
2014-05-02                                [A, survey, of, user]
2014-05-10                 [The, EPS, user, System, and, human]
2014-05-15                [Relation, of, user, The, generation]
2014-05-20    [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object

Or use a list comprehension to flatten the lists. This has the same time complexity as chain.from_iterable but needs no extra import:

df.groupby('date').text.agg(lambda x: [item for z in x for item in z])


cs9*_*s95 5

sum has already been shown in another answer, so let me propose a faster (more efficient) solution using chain.from_iterable:

from itertools import chain
# flatten each group's lists in a single pass
df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))

date
2014-05-01                          [Human, machine, interface]
2014-05-02                                [A, survey, of, user]
2014-05-10                 [The, EPS, user, System, and, human]
2014-05-15                [Relation, of, user, The, generation]
2014-05-20    [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object

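The problem with sum is that every time two lists are added, a new intermediate list is created, so the operation is O(N^2) in the total number of elements. Using chain reduces this to linear time.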
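To make the difference concrete, here is a minimal sketch on plain Python lists, outside pandas. The variable names are only for illustration; the data is the last group from the question.

```python
from itertools import chain

# The three lists that fall on 2014-05-20 in the question
lists = [['The', 'intersection'], ['Graph', 'minors'], ['Graph', 'minors', 'a']]

# sum starts from [] and builds a brand-new list at every addition,
# re-copying everything accumulated so far each time
via_sum = sum(lists, [])

# chain.from_iterable visits each element exactly once
via_chain = list(chain.from_iterable(lists))

# Both approaches give the same flattened list; only the cost differs
assert via_sum == via_chain
```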


The performance difference is noticeable even with a relatively small DataFrame.

df = pd.concat([df] * 1000)  
%timeit df.groupby('date').text.sum()
%timeit df.groupby('date').text.agg('sum')
%timeit df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
%timeit df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))

71.8 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
68.9 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.25 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The problem becomes more pronounced when the groups are larger, especially because sum is not vectorized for object columns.
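The snippets above return a Series indexed by date, whereas the question asks for a DataFrame shaped like df2. A self-contained sketch of one way to get there, assuming that adding reset_index() to turn the index back into a column is acceptable (the name result is mine, not from the question):

```python
from itertools import chain

import pandas as pd

documents = [['Human', 'machine', 'interface'],
             ['A', 'survey', 'of', 'user'],
             ['The', 'EPS', 'user'],
             ['System', 'and', 'human'],
             ['Relation', 'of', 'user'],
             ['The', 'generation'],
             ['The', 'intersection'],
             ['Graph', 'minors'],
             ['Graph', 'minors', 'a']]

dates = pd.to_datetime(['2014-05-01', '2014-05-02', '2014-05-10',
                        '2014-05-10', '2014-05-15', '2014-05-15',
                        '2014-05-20', '2014-05-20', '2014-05-20'])

df = pd.DataFrame({'date': dates, 'text': documents})

# Flatten each day's lists in linear time, then move 'date'
# from the index back into a regular column
result = (df.groupby('date')['text']
            .agg(lambda x: list(chain.from_iterable(x)))
            .reset_index())
```

result then has one row per unique day, with a date column and a text column holding the concatenated list.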