Group by max or min in a numpy array

Abi*_*iel 7 python numpy

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example,

id  data
1     2
1     7
1     3
2     8
2     9
2    10
3     1
3   -10

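I would like to aggregate data by grouping on id and taking either the max or the min of each group. In SQL, this would be a typical aggregation query such as SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way to avoid Python loops and do this in a vectorized manner, or do I have to drop down to C?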

Bi *_*ico 9

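I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it will most likely be faster than anything you can do in a Python loop.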

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
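As a rough illustration, the same lexsort idea can be applied to the sample data from the question to get both extrema in one pass. This is only a sketch: the name `group_min_max` is made up for this example, and it assumes the inputs are non-empty 1D arrays of equal length.

```python
import numpy as np

def group_min_max(groups, data):
    """Return (unique group keys, per-group min, per-group max).

    Hypothetical helper; assumes non-empty 1D arrays of equal length.
    """
    # Sort by group first, then by data within each group.
    order = np.lexsort((data, groups))
    g = groups[order]
    d = data[order]
    # True wherever the group key changes between neighbouring rows.
    change = g[1:] != g[:-1]
    first = np.concatenate(([True], change))  # first row of each group holds the min
    last = np.concatenate((change, [True]))   # last row of each group holds the max
    return g[first], d[first], d[last]

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

keys, mins, maxs = group_min_max(ids, data)
print(keys)  # [1 2 3]
print(mins)  # [  2   8 -10]
print(maxs)  # [ 7 10  1]
```

Sorting once and reading both ends of each run avoids a second lexsort when both the minimum and the maximum are needed.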


jfs*_*jfs 6

In pure Python:

from itertools import groupby
from operator import itemgetter as ig

# Python 3: the built-in map() and zip() are lazy, replacing imap/izip
print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variation:

print([data[id==i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]
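The variation above rescans the whole array once per group. If the ids were not grouped at all, one possible single-pass alternative (not part of the original answer) is `np.unique` with `return_inverse=True` followed by `np.maximum.at`. The sketch below assumes an integer `data` array and uses a shuffled copy of the sample data; the variable names are illustrative only.

```python
import numpy as np

# Shuffled version of the sample data: ids are NOT sorted or grouped here.
ids = np.array([3, 1, 2, 1, 3, 2, 1, 2])
data = np.array([1, 2, 8, 7, -10, 9, 3, 10])

# keys: sorted unique ids; inverse: position of each id within keys.
keys, inverse = np.unique(ids, return_inverse=True)

# Start every group at the smallest representable integer (assumes integer data),
# then fold each value into its group's slot with an unbuffered maximum.
group_max = np.full(len(keys), np.iinfo(data.dtype).min, dtype=data.dtype)
np.maximum.at(group_max, inverse, data)

print(keys)       # [1 2 3]
print(group_max)  # [ 7 10  1]
```

Swapping in `np.iinfo(data.dtype).max` and `np.minimum.at` gives the per-group minimum in the same way.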

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max()
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]
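Since the question states that id is already ordered, sorting data may not be necessary at all. A possible alternative, sketched here as an assumption and not taken from the answers, is to find where each run of equal ids starts and let `np.maximum.reduceat` / `np.minimum.reduceat` reduce each run. It assumes `ids` is non-empty and that equal ids are contiguous.

```python
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Offsets where each run of equal ids begins (assumes ids is non-empty and grouped).
starts = np.flatnonzero(np.r_[True, ids[1:] != ids[:-1]])

# reduceat applies the ufunc to data[starts[i]:starts[i+1]] for each run,
# with the last run extending to the end of the array.
group_max = np.maximum.reduceat(data, starts)
group_min = np.minimum.reduceat(data, starts)

print(ids[starts])  # [1 2 3]
print(group_max)    # [ 7 10  1]
print(group_min)    # [  2   8 -10]
```

This skips the O(n log n) sort entirely, at the cost of relying on the ids being grouped already.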

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1    7
# 2    10
# 3    1

  • +1 for pandas. I think it is the simplest to read. (2 upvotes)