pandas DataFrame: duplicates based on a column and a time range

Question

dli*_*liv 6 python datetime duplicates conditional-statements pandas

I have a (heavily simplified here) pandas DataFrame that looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you


What I want to do now is get all duplicate messages whose timestamps lie within 3 seconds of each other. The desired output is:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you


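The third row is missing because, although its text is the same as in rows one and two, its timestamp is not within the 3-second range.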

I tried passing the columns datetime and msg as arguments to the duplicated() method, but it returns an empty DataFrame because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []

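To make the problem reproducible, here is a minimal sketch that rebuilds the sample frame from the question and runs the exact-match check. The names `exact_mask` and `msg_mask` are illustrative and not from the original post, and the extra MD5 column shown in the output above is omitted because its contents are not given.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world", "hello you", "hello you"],
})

# Exact matching on both columns: no two rows share a timestamp,
# so no row is flagged as a duplicate.
exact_mask = df.duplicated(subset=["datetime", "msg"], keep=False)

# Matching on msg alone flags every row, including row 2,
# whose timestamp is days away from the other "hello world" rows.
msg_mask = df.duplicated(subset=["msg"], keep=False)

print(df[exact_mask])
print(df[msg_mask])
```

Neither mask gives the desired rows, which is why the time distance has to be checked separately from the text match.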

Is there a way to define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)


As always, any help here would be greatly appreciated.

Answer 1

小智 6

This code gives the expected output:

df[df.groupby("msg")["datetime"].diff().dt.total_seconds().fillna(0) <= 3]


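I grouped the DataFrame by the "msg" column, selected the "datetime" column of each group, and applied the built-in diff function, which computes the difference between consecutive values of that column. The code then fills the NaT values (the first row of each group, which has no predecessor) with zero, so those rows always pass the filter, and keeps only the rows whose difference is at most 3 seconds.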

Before using the code above, make sure your DataFrame is sorted by datetime in ascending order.
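Since diff only looks backward, the first row of each group is kept whether or not anything follows it closely. One possible variant, sketched here under the assumption that a row counts as a duplicate when either of its neighbours with the same msg lies within 3 seconds, checks the gap in both directions. The names `window`, `gap_prev`, `gap_next` and `near_duplicate` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world", "hello you", "hello you"],
})

window = pd.Timedelta(seconds=3)

# Sort first so that diff() compares each row with its nearest neighbour in time.
ordered = df.sort_values("datetime")
times = ordered.groupby("msg")["datetime"]

# Gap to the previous and to the next row with the same msg.
# Rows at the edge of a group get NaT for the missing neighbour.
gap_prev = times.diff()
gap_next = times.diff(-1).abs()

# Comparisons against NaT evaluate to False, so a message without
# a close neighbour on either side is dropped.
near_duplicate = (gap_prev <= window) | (gap_next <= window)
result = ordered[near_duplicate].sort_index()

print(result)
```

With this rule a message that appears only once is not reported, and a long chain of messages each 3 seconds apart is kept as a whole. Whether that chaining is wanted depends on how "within 3 seconds" is meant to be read.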

Answer 2

Tka*_*nno 1

You may still need to handle some edge cases, but this code works for your sample data.

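From your question, I assume you want to filter the messages relative to the first time each one appears in df. This approach will not work if you want to keep a string when it appears again later, after another threshold.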

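In short, I wrote a function that takes your DataFrame and the "msg" to filter on. It gets the timestamp of the message's first occurrence and compares it with all the other times the message appears.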

It then selects only the instances that occur within 3 seconds of the first occurrence.

    import pandas as pd

    # Return the rows for `msg` that fall within three seconds of its first occurrence.
    def get_info_within_3seconds(df, msg):
        df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
        t1 = df_of_msg['datetime'].iloc[0]  # timestamp of the first occurrence
        datetime_deltas = [(i - t1).total_seconds() for i in df_of_msg['datetime']]
        filter_list = [i <= 3.0 for i in datetime_deltas]
        return df_of_msg[filter_list]

    msgs = df['msg'].unique()
    # Apply the function to each unique message, then concatenate the results into a new df.
    new_df = pd.concat([get_info_within_3seconds(df, i) for i in msgs])

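The same "within 3 seconds of the first occurrence" rule can also be written without a Python-level loop over the messages. This is a rough sketch, not the answerer's code: `within_window_of_first` is a hypothetical helper name. It also adds one assumption beyond the function above, namely that a message seen only once should not be reported as a duplicate.

```python
import pandas as pd

def within_window_of_first(frame, seconds=3.0):
    """Hypothetical helper: keep rows within `seconds` of the first occurrence of their msg."""
    # Earliest timestamp per msg, broadcast back onto every row of that msg.
    first_seen = frame.groupby("msg")["datetime"].transform("min")
    in_window = (frame["datetime"] - first_seen) <= pd.Timedelta(seconds=seconds)
    # Assumption: require at least two in-window rows per msg,
    # so that a message that occurs only once is not returned.
    hits = in_window.astype(int).groupby(frame["msg"]).transform("sum")
    return frame[in_window & (hits > 1)]

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world", "hello you", "hello you"],
})

new_df = within_window_of_first(df)
print(new_df)
```

Unlike the loop version, this keeps the original row order and index, and it does not need the frame to be sorted because the first occurrence is found with `min`.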
