Group by time, then count unique entries only if they are present in a list

Muh*_*han 2 python dataframe pandas

Consider the following pandas DataFrame "df" and Python list "my_list".

df =

timestamp  address    type
1           1          A
2           9          B
3           3          A
4           6          B
5           6          B
6           2          B
7           3          A
8           2          B
9           1          B
10          3          A
11          3          A
12          3          A

my_list =

[1, 2, 3]

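Now, what I want is to group the DataFrame into 3-second bins by timestamp and, for each "type", count the unique addresses, considering only addresses that are present in "my_list".

The expected output should look like this: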

timestamp   A    B    
1           2    0 # One "B" ignored because address=9 is not in my_list
4           0    1 # Two "B" ignored because their address is not in my_list
7           1    2 # Two "B" with unique addresses, and one "A"
10          1    0 # Three rows with type "A", but their addresses are the same

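Note that the timestamp values are originally in timestamp format, so df.groupby and pd.TimeGrouper can be applied to group the rows into 3-second bins.

Only pandas-based (Python) answers would be appreciated.

Apologies for any confusion. I tried to keep it simple.

- Khan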

jez*_*ael 5

Use:

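# integer-divide the default RangeIndex by 3 so each run of three rows shares a label
# (note: this overwrites df's index)
df.index = df.index // 3
# keep only rows whose address is in my_list
df1 = df[df['address'].isin(my_list)]
# count unique addresses per (bin, type) and pivot the types into columns
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
# label each bin with its first timestamp
# (assumes every bin still has at least one row after filtering)
df1.index = df['timestamp'].groupby(df.index).first()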
print (df1)
type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0
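The same integer-binning idea can also be written without overwriting the index. The following is only a sketch, not the answer's own code: it assumes the timestamps are plain integers and that bins are anchored at the smallest timestamp. The names `width`, `start`, and `bin_start` are hypothetical, introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

width = 3                      # bin width, in timestamp units
start = df["timestamp"].min()  # assumption: bins are anchored at the first timestamp

result = (
    df[df["address"].isin(my_list)]
    # bin_start (hypothetical name) is the first timestamp of the bin each row falls in
    .assign(bin_start=lambda d: (d["timestamp"] - start) // width * width + start)
    .groupby(["bin_start", "type"])["address"]
    .nunique()
    .unstack(fill_value=0)
)
print(result)
```

Because the bin label is computed from the timestamp column itself, this variant leaves `df` unchanged and does not rely on the rows being numbered 0, 1, 2, ... in order.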

Setup:

print (df)
    timestamp  address type
0           1        1    A
1           2        9    B
2           3        3    A
3           4        6    B
4           5        6    B
5           6        2    B
6           7        3    A
7           8        2    B
8           9        1    B
9          10        3    A
10         11        3    A
11         12        3    A

The solution is simpler with datetimes:

# convert the integer timestamps to sample datetimes, one per day
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
                   origin=pd.Timestamp('2017-01-01'))

print (df)
    timestamp  address type
0  2017-01-02        1    A
1  2017-01-03        9    B
2  2017-01-04        3    A
3  2017-01-05        6    B
4  2017-01-06        6    B
5  2017-01-07        2    B
6  2017-01-08        3    A
7  2017-01-09        2    B
8  2017-01-10        1    B
9  2017-01-11        3    A
10 2017-01-12        3    A
11 2017-01-13        3    A

df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
          .nunique()
          .unstack(fill_value=0) )
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

One-line solution:

df1 = (df.query("address in @my_list")
         .groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
         .nunique()
         .unstack(fill_value=0))
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0
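The question asks for 3-second bins, whereas the sample above uses days. A possible adaptation is sketched below, assuming pandas 1.1 or later (where `pd.Grouper` accepts an `origin` argument) and assuming the integer timestamps stand for seconds after an arbitrary reference time. With sub-daily frequencies the default origin aligns bins to midnight, so `origin="start"` is used here to anchor the bins at the first timestamp instead.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Assumption: the integers are seconds counted from 2017-01-01 00:00:00.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s",
                                 origin=pd.Timestamp("2017-01-01"))

result = (
    df[df["address"].isin(my_list)]
    # origin="start" anchors the 3-second bins at the first remaining timestamp
    .groupby([pd.Grouper(key="timestamp", freq="3s", origin="start"), "type"])["address"]
    .nunique()
    .unstack(fill_value=0)
)
print(result)
```

Note that `origin="start"` refers to the first timestamp of the filtered data. In this sample the first row survives the filter, so the bins begin at second 1; if it did not, the bin edges would shift accordingly.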