sal*_*ere 2 python arrays group-by numpy numpy-ndarray
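How can I compute the mean for each workerid below? Here is my sample NumPy ndarray. Column 0 is the workerid, column 1 is the latitude, and column 2 is the longitude.
I want to compute the mean latitude and longitude for each worker. I want to keep all of this in NumPy (ndarray), without converting to Pandas.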
import numpy
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
class WorkerPatientScores:
    '''
    I read from the Patient and Worker tables in SchedulingOptimization.
    '''
    def __init__(self, dist_weight=1):
        self.a = []
        self.a = ([[25302, 32.133598100000000, -94.395845200000000],
                   [25302, 32.145095132560200, -94.358041585705600],
                   [25302, 32.160400000000000, -94.330700000000000],
                   [25305, 32.133598100000000, -94.395845200000000],
                   [25305, 32.115095132560200, -94.358041585705600],
                   [25305, 32.110400000000000, -94.330700000000000],
                   [25326, 32.123598100000000, -94.395845200000000],
                   [25326, 32.125095132560200, -94.358041585705600],
                   [25326, 32.120400000000000, -94.330700000000000],
                   [25341, 32.173598100000000, -94.395845200000000],
                   [25341, 32.175095132560200, -94.358041585705600],
                   [25341, 32.170400000000000, -94.330700000000000],
                   [25376, 32.153598100000000, -94.395845200000000],
                   [25376, 32.155095132560200, -94.358041585705600],
                   [25376, 32.150400000000000, -94.330700000000000]])
        ndarray = numpy.array(self.a)
        ndlist = ndarray.tolist()
        geo_tuple = [(p[1], p[2]) for p in ndlist]
        nd1 = numpy.array(geo_tuple)
        mean_tuple = numpy.mean(nd1, 0)
        print(mean_tuple)
The output of the above is:
[ 32.14303108 -94.36152893]
Given this array, we want to group by the first column and take the mean of the other two columns:
import numpy as np
X = np.asarray([[25302, 32.133598100000000, -94.395845200000000],
                [25302, 32.145095132560200, -94.358041585705600],
                [25302, 32.160400000000000, -94.330700000000000],
                [25305, 32.133598100000000, -94.395845200000000],
                [25305, 32.115095132560200, -94.358041585705600],
                [25305, 32.110400000000000, -94.330700000000000],
                [25326, 32.123598100000000, -94.395845200000000],
                [25326, 32.125095132560200, -94.358041585705600],
                [25326, 32.120400000000000, -94.330700000000000],
                [25341, 32.173598100000000, -94.395845200000000],
                [25341, 32.175095132560200, -94.358041585705600],
                [25341, 32.170400000000000, -94.330700000000000],
                [25376, 32.153598100000000, -94.395845200000000],
                [25376, 32.155095132560200, -94.358041585705600],
                [25376, 32.150400000000000, -94.330700000000000]])
Using only numpy and without loops:
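# Split off the group keys (column 0) and keep only the value columns
groups = X[:,0].copy()
X = np.delete(X, 0, axis=1)
# Sort so that rows sharing a key become contiguous
_ndx = np.argsort(groups)
# Unique keys, start index of each group in sorted order, and group sizes
_id, _pos, g_count = np.unique(groups[_ndx],
                               return_index=True,
                               return_counts=True)
# Sum each contiguous block of rows, then divide by the group size
g_sum = np.add.reduceat(X[_ndx], _pos, axis=0)
g_mean = g_sum / g_count[:,None]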
Store the results in a dictionary:
>>> dict(zip(_id, g_mean))
{25302.0: array([ 32.14636441, -94.36152893]),
25305.0: array([ 32.11969774, -94.36152893]),
25326.0: array([ 32.12303108, -94.36152893]),
25341.0: array([ 32.17303108, -94.36152893]),
25376.0: array([ 32.15303108, -94.36152893])}
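To see why the sort-then-`reduceat` steps give per-group means, here is a minimal sketch on a smaller, deliberately unsorted array. The helper name `group_mean` and the sample values are made up for illustration and are not part of the answer above.

```python
import numpy as np

def group_mean(data):
    """Group rows by column 0 and average the remaining columns (illustrative helper)."""
    keys = data[:, 0]
    values = data[:, 1:]
    # Sort so that rows with equal keys sit next to each other
    order = np.argsort(keys, kind="stable")
    ids, starts, counts = np.unique(keys[order],
                                    return_index=True,
                                    return_counts=True)
    # reduceat sums the rows from each start index up to the next start index
    sums = np.add.reduceat(values[order], starts, axis=0)
    return ids, sums / counts[:, None]

# Made-up sample: two groups (keys 1 and 2), rows interleaved
sample = np.array([[2, 10.0, 1.0],
                   [1,  4.0, 3.0],
                   [2, 20.0, 3.0],
                   [1,  6.0, 5.0]])
ids, means = group_mean(sample)
print(ids)    # keys 1 and 2
print(means)  # group 1 -> [5, 4], group 2 -> [15, 2]
```

The sort matters because `np.add.reduceat` only sums contiguous slices; without it, rows of the same group that are scattered through the array would end up in different sums.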
You can solve this with some creative array slicing and the where function.
# a is the 2-D ndarray from the question, e.g. numpy.array(self.a)
means = {}
for i in numpy.unique(a[:,0]):
    tmp = a[numpy.where(a[:,0] == i)]
    means[i] = (numpy.mean(tmp[:,1]), numpy.mean(tmp[:,2]))
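The slice [:,0] is a handy way to extract a column (in this case the first) from a 2-D array. To get the means, we find the unique IDs in the first column, then for each ID we extract the matching rows with where and average the other two columns. The end result is a dict of tuples, where the key is the ID and the value is a tuple containing the means of the other two columns. When I run it, it produces the following dict: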
{25302.0: (32.1463644108534, -94.36152892856853),
25305.0: (32.11969774418673, -94.36152892856853),
25326.0: (32.12303107752007, -94.36152892856853),
25341.0: (32.17303107752007, -94.36152892856853),
25376.0: (32.15303107752007, -94.36152892856853)}
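Since the question asks to stay in NumPy, the same loop can also collect its results into an ndarray rather than a dict. This is a minimal sketch, assuming a 2-D float array with the ID in column 0; the name `group_mean_rows` and the shortened, rounded sample rows are made up for illustration.

```python
import numpy as np

def group_mean_rows(a):
    """Return one row per ID: [id, mean of column 1, mean of column 2, ...]."""
    ids = np.unique(a[:, 0])
    # a[:, 0] == i is a boolean mask; indexing with it selects the matching rows
    means = np.array([a[a[:, 0] == i, 1:].mean(axis=0) for i in ids])
    return np.column_stack((ids, means))

# Hypothetical sample: two workers, two rows each
a = np.array([[25302, 32.1336, -94.3958],
              [25302, 32.1451, -94.3580],
              [25305, 32.1336, -94.3958],
              [25305, 32.1151, -94.3580]])
result = group_mean_rows(a)
print(result)  # one row per workerid: id, mean latitude, mean longitude
```

A boolean mask does the same job as `numpy.where` here. The loop runs once per unique ID, so it stays readable for a small number of workers, while the sort-based approach avoids the Python-level loop entirely.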