Faster double iteration over a single array in Python

Rob*_*isi 7 python performance numpy python-3.x pandas

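I am looking for a faster way to compute pairwise accuracy, i.e. to compare elements of the same array (in this case a pandas DataFrame column), compute the difference between them, and then compare the two results. I have a DataFrame df with three columns (id, the document identifier; Judgement, the human evaluation, an int; and PR_Score, the PageRank of the document, a float), and I want to check whether the two scores agree on which document of a pair is better or worse.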


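For example:

id: id1, id2, id3

Judgement: 1, 0, 0

PR_Score: 0.18, 0.5, 0.12

In this case both scores rank id1 above id3, they disagree on id1 versus id2, and the human judgement is tied between id2 and id3, so my pairwise accuracy is:

agreement = 1

disagreement = 1

pairwise accuracy = agreement / (agreement + disagreement) = 1/2 = 0.5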
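The arithmetic above can be checked pair by pair. The following is a minimal sketch using only the standard library; the names `docs`, `compare` and `classify_pairs` are illustrative and do not come from the original post.

```python
from itertools import combinations

# Example data from the question: id -> (Judgement, PR_Score)
docs = {
    "id1": (1, 0.18),
    "id2": (0, 0.5),
    "id3": (0, 0.12),
}


def compare(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if they are equal."""
    return (a > b) - (a < b)


def classify_pairs(docs):
    """Label every unordered pair of documents as agree, disagree or tie."""
    outcomes = {}
    for (id_a, (h_a, pr_a)), (id_b, (h_b, pr_b)) in combinations(docs.items(), 2):
        human = compare(h_a, h_b)
        pr = compare(pr_a, pr_b)
        if human == 0 or pr == 0:
            outcomes[(id_a, id_b)] = "tie"  # ties are left out of the accuracy
        elif human == pr:
            outcomes[(id_a, id_b)] = "agree"
        else:
            outcomes[(id_a, id_b)] = "disagree"
    return outcomes


outcomes = classify_pairs(docs)
agree = sum(v == "agree" for v in outcomes.values())
disagree = sum(v == "disagree" for v in outcomes.values())
accuracy = agree / (agree + disagree)
```

Here (id1, id2) is a disagreement, (id1, id3) is an agreement and (id2, id3) is a tie, which gives the 0.5 stated above.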


Here is the code for my first solution, where I use the df columns as arrays (which helps reduce computation time):

import numpy as np

def pairwise(agree, disagree):
    # fraction of non-tied pairs on which the two scores agree
    return agree / (agree + disagree)

def pairwise_computing_array(df):

    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])

    total = 0
    agree = 0
    disagree = 0

    for i in range(len(df)-1):
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in PageRank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better

    pairwise_accuracy = pairwise(agree, disagree)

    return agree, disagree, total, pairwise_accuracy


I tried a list comprehension to speed up the computation, but it is actually slower than the first solution:

def pairwise_computing_list_comprehension(df):

    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])

    # one boolean per pair in which neither score is tied: True when the signs match
    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j])
            for i in range(len(df)) for j in range(i+1, len(df))
                if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0
                    and np.sign(humanScores[i] - humanScores[j]) != 0)]

    agreement = sum(sign)
    disagreement = len(sign) - agreement
    pairwise_accuracy = pairwise(agreement, disagreement)

    return agreement, disagreement, pairwise_accuracy


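I cannot run this on my whole dataset because it takes too long, so I am hoping for something that can be computed in under 1 minute.

On a small subset of 1000 rows, I get the following timings on my computer:

Code 1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Code 2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)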
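One plausible contributor to the gap: Code 2 calls `np.sign` up to four times per pair, while Code 1 calls it twice and only for pairs where both differences are nonzero. The sketch below is not from the original post and has not been timed here. It removes `np.sign` from the inner loop by using plain comparisons, and it assumes the inputs are ordinary Python lists (for example obtained with `.tolist()`). The name `pairwise_counts` is illustrative.

```python
def pairwise_counts(human_scores, pagerank_scores):
    """Count agreeing and disagreeing pairs using only built-in comparisons."""
    n = len(human_scores)
    agree = 0
    disagree = 0
    for i in range(n - 1):
        h_i = human_scores[i]
        p_i = pagerank_scores[i]
        for j in range(i + 1, n):
            human = h_i - human_scores[j]
            pr = p_i - pagerank_scores[j]
            if human == 0 or pr == 0:
                continue  # a tie in either score is ignored
            # both differences are nonzero, so equal ">" results mean equal signs
            if (human > 0) == (pr > 0):
                agree += 1
            else:
                disagree += 1
    return agree, disagree


agree, disagree = pairwise_counts([1, 0, 0], [0.18, 0.5, 0.12])
print(agree, disagree, agree / (agree + disagree))  # prints: 1 1 0.5
```

This is still an O(n²) loop, so it only trims per-pair overhead and does not change how the running time scales with the number of rows.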

Rob*_*isi 1

Here is the code that runs in a reasonable time, thanks to the suggestion from @juanpa.arrivilillaga:

from numba import jit
import numpy as np

@jit(nopython=True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0
    agree = 0
    disagree = 0

    for i in range(len(humanScores)-1):
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in PageRank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better
                else:
                    continue
            else:
                continue
    pairwise_accuracy = agree/(agree+disagree)
    return(agree, disagree, total, pairwise_accuracy)

This is the timing on my whole dataset (58k rows):


7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
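For comparison, the same counts can also be obtained without an explicit Python loop by letting NumPy build all pairs at once. This is an alternative sketch, not part of the original answer, and the function name `pairwise_accuracy_vectorized` is illustrative. It materializes n(n-1)/2 index pairs, so it suits subsets such as the 1000-row sample. For 58k rows that is roughly 1.7 billion pairs, which would probably need to be processed in chunks to fit in memory.

```python
import numpy as np


def pairwise_accuracy_vectorized(human_scores, pagerank_scores):
    """Compute agree/disagree counts over all pairs i < j with NumPy arrays."""
    human = np.asarray(human_scores)
    pr = np.asarray(pagerank_scores)

    # Index arrays for every pair (i, j) with i < j
    i, j = np.triu_indices(len(human), k=1)

    human_sign = np.sign(human[i] - human[j])
    pr_sign = np.sign(pr[i] - pr[j])

    # Keep only pairs where neither score is tied
    valid = (human_sign != 0) & (pr_sign != 0)
    agree = int(np.count_nonzero(valid & (human_sign == pr_sign)))
    disagree = int(np.count_nonzero(valid)) - agree
    total = int(i.size)

    # Assumption: return NaN instead of raising when every pair is tied
    if agree + disagree == 0:
        accuracy = float("nan")
    else:
        accuracy = agree / (agree + disagree)
    return agree, disagree, total, accuracy


result = pairwise_accuracy_vectorized([1, 0, 0], [0.18, 0.5, 0.12])
```

With the three-document example from the question, `result` holds 1 agreement, 1 disagreement, 3 pairs in total and an accuracy of 0.5.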
