Jac*_*top 5 python optimization tuples dataframe pandas
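So I am trying to figure out how to speed up this operation. Currently the real dataset takes about an hour to iterate over (about 50,000 rows each in df1 and df2), which doesn't seem very practical. Does anyone have any suggestions, e.g. pandas vectorization, pandas conditionals, etc.?
Basic operation: look at each row in df1 and compare it with each row in df2. If the agent_id matches and the df1 "created_at_email" date is greater than or equal to the df2 "created_at" date, pull that row. Edit: at most 4 rows may be pulled for each row in df1, ordered by most recent date first.
Example dataframes: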
import pandas as pd

df1 = pd.DataFrame({'unique_col': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                    'agent_id': [1, 2, 3, 1, 5, 6, 7],
                    'created_at_email': ['1/5/2020', '1/6/2020', '1/8/2020', '1/3/2020', '1/4/2020', '1/7/2020', '1/2/2020']
                    })
df2 = pd.DataFrame({'unique_col': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                    'agent_id': [1, 1, 3, 1, 1, 1, 1],
                    'created_at': ['1/4/2020', '1/5/2020', '1/6/2020', '1/9/2020', '1/2/2020', '1/3/2020', '1/4/2020']
                    })
Code (needs speeding up):
# pre-sort df2 by created_at so the loop visits orders from most recent to least recent.
# note: the dates are plain strings here, so sorting and comparison are lexicographic.
df2 = df2.sort_values(['created_at'], ascending=False)
# note: super not optimized
obj = []
for row in df1.itertuples():
    count = 0
    for row2 in df2.itertuples():
        if row[2] == row2[2]:          # agent_id matches
            if row2[3] <= row[3]:      # created_at <= created_at_email
                if count < 4:          # keep only the first 4 matches
                    c = [row2[3], row[3], row2[2], row[2], row[1], row2[1]]
                    obj.append(c)
                    count = count + 1
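Output (what it should look like):
Note: df1 can have multiple rows with the same agent_id, and so can df2.
Note: the date on the right is greater than or equal to the date on the left.
Note: the unique_ids are only there to check that everything lines up.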
cols: created_at, created_at_email, agent_id, agent_id, unique_id, unique_id
[['1/5/2020', '1/5/2020', 1, 1, 'a', 'b'],
['1/4/2020', '1/5/2020', 1, 1, 'a', 'a'],
['1/4/2020', '1/5/2020', 1, 1, 'a', 'g'],
['1/3/2020', '1/5/2020', 1, 1, 'a', 'f'],
['1/6/2020', '1/8/2020', 3, 3, 'c', 'c'],
['1/3/2020', '1/3/2020', 1, 1, 'd', 'f'],
['1/2/2020', '1/3/2020', 1, 1, 'd', 'e']]
Thanks,
A merge would be faster. But I'm not sure about merging two 50k dataframes:
import numpy as np

(df1.assign(row=np.arange(len(df1)))          # record the row number in `df1`
    .merge(df2, on=['agent_id'])
    .query('created_at_email >= created_at')  # keep rows where created_at_email is not earlier than created_at
    .groupby('row').head(4)                   # keep at most 4 rows for each row in df1
)
Output:
   unique_col_x  agent_id created_at_email  row unique_col_y created_at
0             a         1         1/5/2020    0            a   1/4/2020
1             a         1         1/5/2020    0            b   1/5/2020
3             a         1         1/5/2020    0            e   1/2/2020
4             a         1         1/5/2020    0            f   1/3/2020
9             d         1         1/3/2020    3            e   1/2/2020
10            d         1         1/3/2020    3            f   1/3/2020
12            c         3         1/8/2020    2            c   1/6/2020
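Note that the merge output above picks the first 4 matches in df2's original order rather than the 4 most recent ones, and the dates are still compared as strings. The following is a minimal sketch of one way to address both, assuming the dates are month/day/year strings as in the sample data. The names `left`, `right`, `merged`, `result` and the `_df1`/`_df2` suffixes are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'unique_col': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'agent_id': [1, 2, 3, 1, 5, 6, 7],
    'created_at_email': ['1/5/2020', '1/6/2020', '1/8/2020', '1/3/2020',
                         '1/4/2020', '1/7/2020', '1/2/2020'],
})
df2 = pd.DataFrame({
    'unique_col': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'agent_id': [1, 1, 3, 1, 1, 1, 1],
    'created_at': ['1/4/2020', '1/5/2020', '1/6/2020', '1/9/2020',
                   '1/2/2020', '1/3/2020', '1/4/2020'],
})

# Assumption: dates are month/day/year strings; parse them so that
# comparisons and sorting are chronological rather than lexicographic.
left = df1.assign(
    created_at_email=pd.to_datetime(df1['created_at_email'], format='%m/%d/%Y'),
    row=np.arange(len(df1)),  # remember which df1 row each match belongs to
)
right = df2.assign(
    created_at=pd.to_datetime(df2['created_at'], format='%m/%d/%Y')
)

# Pair every df1 row with every df2 row that shares its agent_id.
merged = left.merge(right, on='agent_id', suffixes=('_df1', '_df2'))

result = (
    merged[merged['created_at_email'] >= merged['created_at']]
    # most recent created_at first within each df1 row
    .sort_values(['row', 'created_at'], ascending=[True, False])
    .groupby('row')
    .head(4)  # at most 4 matches per df1 row
)
print(result[['created_at', 'created_at_email', 'agent_id',
              'unique_col_df1', 'unique_col_df2']])
```

On the sample data this selects the same 7 pairs as the expected output in the question. Rows that tie on created_at may appear in either order.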
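On the concern about merging two 50k dataframes: an inner merge on agent_id produces one row for every matching (df1 row, df2 row) pair, so its size is the sum over agent_ids of the df1 count times the df2 count. A rough sketch of checking that number before merging follows; `estimated_merge_rows` is a hypothetical helper written for illustration, not a pandas function.

```python
import pandas as pd

def estimated_merge_rows(left, right, key):
    """Return the number of rows an inner merge on `key` would produce."""
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # Multiply per-key counts; keys missing on one side contribute 0.
    return int(left_counts.mul(right_counts, fill_value=0).sum())

# Only the key column matters for the estimate (values from the sample data).
df1 = pd.DataFrame({'agent_id': [1, 2, 3, 1, 5, 6, 7]})
df2 = pd.DataFrame({'agent_id': [1, 1, 3, 1, 1, 1, 1]})

n_rows = estimated_merge_rows(df1, df2, 'agent_id')
print(n_rows)
```

If the estimate is too large for memory, one option is to run the merge on chunks of df1 and concatenate the per-chunk results.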