Rob*_*isi 7 python performance numpy python-3.x pandas
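I am looking for a faster way to compute pairwise accuracy, i.e. to compare elements of the same array (in this case a pandas DataFrame column) by computing the difference between each pair and then comparing the two resulting differences. I have a DataFrame df with 3 columns (id, the document ID; Judgement, the human assessment, which is an int; and PR_Score, the PageRank of the document, which is a float), and I want to check whether the two scores agree on which document of each pair is ranked better or worse.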
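For example:
id: id1, id2, id3
Judgement: 1, 0, 0
PR_Score: 0.18, 0.5, 0.12
In this case the two scores agree in ranking id1 above id3, they disagree on id1 versus id2, and id2 and id3 are tied in the human judgement (so that pair is ignored). My pairwise accuracy is therefore:
agreement = 1
disagreement = 1
pairwise accuracy = agreement / (agreement + disagreement) = 1/2 = 0.5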
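As a sanity check on the arithmetic above, here is a minimal pure-Python sketch that enumerates the three pairs of the example and counts agreements and disagreements. The helper name `count_pairs` is only illustrative and is not part of my actual code; ties in either score are skipped, as in the description above.

```python
from itertools import combinations


def count_pairs(judgements, pr_scores):
    """Count pairs on which the two scores agree / disagree about the better document."""
    agree = disagree = 0
    for i, j in combinations(range(len(judgements)), 2):
        human = judgements[i] - judgements[j]  # difference in human judgement
        pr = pr_scores[i] - pr_scores[j]       # difference in PageRank score
        if human == 0 or pr == 0:
            continue  # a tie in either score is ignored
        if (human > 0) == (pr > 0):
            agree += 1      # same sign: both prefer the same document
        else:
            disagree += 1   # opposite sign: they prefer different documents
    return agree, disagree


agree, disagree = count_pairs([1, 0, 0], [0.18, 0.5, 0.12])
accuracy = agree / (agree + disagree)
print(agree, disagree, accuracy)
```

The pair (id1, id2) counts as a disagreement, (id1, id3) as an agreement, and (id2, id3) is skipped because the human judgements are equal.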
This is the code of my first solution, where I use the df columns as arrays (which helps reduce the computation time):
import numpy as np

def pairwise(agree, disagree):
    return agree / (agree + disagree)

def pairwise_computing_array(df):
    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])

    total = 0
    agree = 0
    disagree = 0

    for i in range(len(df) - 1):
        for j in range(i + 1, len(df)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in PageRank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better
                else:
                    continue
            else:
                continue
    pairwise_accuracy = pairwise(agree, disagree)
    return (agree, disagree, total, pairwise_accuracy)
I tried a list comprehension to get a faster computation, but it is actually slower than the first solution:
def pairwise_computing_list_comprehension(df):
    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])

    # One boolean per pair (i, j), i < j, where neither score is tied:
    # True if both differences have the same sign.
    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j])
            for i in range(len(df)) for j in range(i + 1, len(df))
            if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0
                and np.sign(humanScores[i] - humanScores[j]) != 0)]

    agreement = sum(sign)
    disagreement = len(sign) - agreement
    pairwise_accuracy = pairwise(agreement, disagreement)
    return (agreement, disagreement, pairwise_accuracy)
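I cannot run this on my whole dataset because it takes too long; I would like something that computes the result in less than 1 minute.
On my machine, a small subset of 1000 rows gives the following timings:
Code 1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Code 2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)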
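For comparison on small subsets like this one, the same counts can also be obtained without Python-level loops by using NumPy broadcasting. The sketch below is only an illustration, not something I benchmarked, and the function name `pairwise_accuracy_vectorized` is hypothetical. It builds full n × n arrays, so it only suits subsets: at 58k rows a single float64 array of that shape would need roughly 27 GB.

```python
import numpy as np


def pairwise_accuracy_vectorized(human_scores, pagerank_scores):
    """Vectorized pairwise accuracy; needs O(n^2) memory, so small n only."""
    human = np.asarray(human_scores, dtype=float)
    pr = np.asarray(pagerank_scores, dtype=float)

    # Sign of every pairwise difference, shape (n, n).
    human_sign = np.sign(human[:, None] - human[None, :])
    pr_sign = np.sign(pr[:, None] - pr[None, :])

    # +1 where the signs match, -1 where they differ, 0 where either is tied.
    product = human_sign * pr_sign

    # Keep only the pairs with i < j so each pair is counted once.
    upper = np.triu(product, k=1)

    agree = int(np.count_nonzero(upper > 0))
    disagree = int(np.count_nonzero(upper < 0))
    # Assumes at least one untied pair; otherwise this divides by zero.
    return agree, disagree, agree / (agree + disagree)


result = pairwise_accuracy_vectorized([1, 0, 0], [0.18, 0.5, 0.12])
print(result)
```

With a DataFrame, it could be called as `pairwise_accuracy_vectorized(df['Judgement'], df['PR_Score'])`. For larger inputs the same idea would have to be applied in row blocks to keep memory bounded.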
This is the code that runs in a reasonable time, thanks to the suggestion from @juanpa.arrivilillaga:
import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0
    agree = 0
    disagree = 0

    for i in range(len(humanScores) - 1):
        for j in range(i + 1, len(humanScores)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in PageRank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better
                else:
                    continue
            else:
                continue
    pairwise_accuracy = agree / (agree + disagree)
    return (agree, disagree, total, pairwise_accuracy)

This is the timing on my whole dataset (58k rows):

7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)