加速python代码 - 我可以向量化双循环吗?

sau*_*sau 5 python numpy vectorization

我是python的新手.我正在使用dbscan代码进行聚类以进行一些更改.现在代码运行正常,但速度非常慢.所以我发现我必须从我的代码中删除'for loop'.这是代码的一部分:

class Point:
    def __init__(self, x = 0, y = 0, visited = False, isnoise = False):
        self.x = x  
        self.y = y  
        self.visited = False  
        self.isnoise = False  

    def show(self):  
        return self.x, self.y 

    def dist(self, p1, p2):  
        #Calculate the great circle distance between two points on the earth (specified in decimal degrees)return distance between two point  
        # convert decimal degrees to radians 
        dlat = radians(p2.x-p1.x)
        dlon = radians(p2.y-p1.y)
        a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1.x))* cos(radians(p2.x)) * sin(dlon/2) * sin(dlon/2)
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        d = 6371 * c
        return d 

    def distanceQuery(self,neighbor_pts):
        dista=[]
        for i in range(len(neighbor_pts)):
          for j in range(i+1,len(neighbor_pts)):
            z=self.dist(neighbor_pts[i],neighbor_pts[j])
            dista.append(z)
        return max(dista)
Run Code Online (Sandbox Code Playgroud)

distanceQuery函数使用double for循环.有什么办法可以删除吗?我可以将这个双向量化为循环吗?由于这是聚类代码,因此需要附加一些步骤.在附加时,我已经读过numpy数组的工作方式与python列表不同.附加numpy数组是低效的.

编辑:

所以这可以是矢量化.但是这里是代码的其他部分,在我检查某些条件之后发生追加.

def expandCluster(self, P, neighbor_points):  
     self.cluster[self.cluster_inx].append(P)  
     iterator = iter(neighbor_points)  
     while True:  
       try:   
         npoint_tmp = iterator.next()  
       except StopIteration:  
         # StopIteration exception is raised after last element  
         break  
       if (not npoint_tmp.visited):  
         #for each point P' in NeighborPts   
         npoint_tmp.visited = True  
         NeighborPts_ = self.regionQuery(npoint_tmp)  
         if (len(NeighborPts_) >= self.MinPts):  
           for j in range(len(NeighborPts_)):  
            neighbor_points.append(NeighborPts_[j])
            if self.distanceQuery(neighbor_points)>0.10:
              break
Run Code Online (Sandbox Code Playgroud)

现在,如果我也矢量化neighbor_points.我将不得不解决附加问题?所以每个点都会附加到neighbour_points然后它会成为一个distanceQuery.而这个过程也是迭代的一部分.所以这里有两个循环.我只想确保在numpy数组中追加不会效率低下

Eri*_*ric 5

import numpy as np

def dist(p1, p2): 
    # Initially, p1.shape() == (n, 2) and p2.shape() == (m, 2)
    # Now, p1.shape() == (1, n, 2) and p2.shape() == (m, 1, 2)
    p1 = p1[np.newaxis, :, :]
    p2 = p2[:, np.newaxis, :]

    # get all the vectory things
    from numpy import sin, cos, radians, sqrt, arctan2 as atan2 

    # do the same math as before, but use `p[..., 0]` instead of `p.x` etc
    dlat = radians(p2[..., 0] - p1[..., 0])
    dlon = radians(p2[..., 1] - p1[..., 1])
    a = sin(dlat/2) * sin(dlat/2) + cos(p1[..., 0])*cos(p2[..., 0]) * sin(dlon/2) * sin(dlon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = 6371 * c
    return d 

def distanceQuery(neighbor_pts):
    return np.max(dist(neighbor_pts, neighbor_pts))
Run Code Online (Sandbox Code Playgroud)

例如:

>>> points = np.array([[0, 0], [45, 0], [45, 45], [90, 0]], dtype=float) 
>>> dist(points, points)
array([[     0.        ,   5003.77169901,   6272.52596983,  10007.54339801],
       [  5003.77169901,      0.        ,   2579.12525679,   5003.77169901],
       [  6272.52596983,   2579.12525679,      0.        ,   4347.69702221],
       [ 10007.54339801,   5003.77169901,   4347.69702221,      0.        ]])
>>> np.max(_)
10007.543398010286
Run Code Online (Sandbox Code Playgroud)

定时:

def dist_slow(p1, p2):
    """your function, adjusted to take an array instead of a `Point`"""
    from math import radians, cos, sqrt, atan2
    # compute the distance for all possible pairs
    dlat = radians(p2[0]-p1[0])
    dlon = radians(p2[1]-p1[1])

    a = sin(dlat/2) * sin(dlat/2) + cos(radians(p1[0]))*cos(radians(p2[0])) * sin(dlon/2) * sin(dlon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = 6371 * c
    return d

def query_iter(p):
    return max(dist_slow(p1, p2) for p1, p2 in itertools.combinations(p, 2))

def query_orig(p):
    dista=[]
    for i in range(len(p)):
      for j in range(i + 1, len(p)):
        z = dist_slow(p[i], p[j])
        dista.append(z)
    return max(dista)

def query_mine(p):
    return dist(p, p).max()
Run Code Online (Sandbox Code Playgroud)

然后:

>>> points = np.random.rand(1000, 2)
>>> timeit query_orig(points)
1 loops, best of 3: 7.94 s per loop
>>> timeit query_iter(points)
1 loops, best of 3: 7.35 s per loop
>>> timeit query_mine(points)
10 loops, best of 3: 150 ms per loop
Run Code Online (Sandbox Code Playgroud)