An efficient way to compute the h-index (impact/productivity of an author's publications) in a pandas DataFrame

Question

BKS*_*BKS 5 python dataframe python-2.7 pandas

I'm fairly new to pandas, but I've been reading about it and about how fast it is at handling large data.

I managed to create a DataFrame, and it looks like this:

    0     1
0    1    14
1    2    -1
2    3  1817
3    3    29
4    3    25
5    3     2
6    3     1
7    3    -1
8    4    25
9    4    24
10   4     2
11   4    -1
12   4    -1
13   5    25
14   5     1


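Column 0 is the author's ID, and column 1 is the number of citations one of that author's publications has received (-1 means zero citations). Each row represents a different publication by an author.

I'm trying to calculate the h-index for each author. The h-index is defined as the largest number h such that the author has h publications that have each been cited at least h times. So for these authors: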

Author 1 has an h-index of 1

Author 2 has an h-index of 0

Author 3 has an h-index of 3

Author 4 has an h-index of 2

Author 5 has an h-index of 1
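As a reference point, the definition can be written as a small plain-Python function and checked against the sample data. This is only a sketch: the name `h_index` and the `sample` dictionary are illustrative and not part of the original post, and the code is written for Python 3.

```python
def h_index(citations):
    """Return the largest h such that h publications have at least h citations each."""
    # Rank publications from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break  # Later publications have even fewer citations.
    return h

# Citation counts per author, copied from the table in the question
# (-1 stands for zero citations).
sample = {
    1: [14],
    2: [-1],
    3: [1817, 29, 25, 2, 1, -1],
    4: [25, 24, 2, -1, -1],
    5: [25, 1],
}
result = {author: h_index(counts) for author, counts in sample.items()}
print(result)
```

A function like this is handy for spot-checking a faster pandas solution on a handful of authors.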

Here is the approach I'm currently using, which involves a lot of looping:

current_author=1
hindex=0

for index, row in df.iterrows():
    if row[0]==current_author:
        if row[1]>hindex:
            hindex+=1
    else:
        print "author ",current_author," has h-index:", hindex
        current_author+=1
        hindex=0
        if row[1]>hindex:
            hindex+=1

print "author ",current_author," has h-index:", hindex


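My actual database has more than 3 million authors. If I loop over every row like this, it will take days to compute. What do you think is the fastest way to tackle this?

Thanks in advance!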

Answer 1

EdC*_*ica 5

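Here I've renamed your columns to 'author' and 'citations'. We can group by author and then apply a lambda. The lambda compares each citation count against the number of publications in the group (`x.count()`), which yields 1 where the comparison is true and 0 otherwise, and we can then sum the result: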

In [104]:

df['h-index'] = df.groupby('author')['citations'].transform( lambda x: (x >= x.count()).sum() )
df
Out[104]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1

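The count-based lambda above compares every citation to the author's total number of publications rather than to the publication's position in the sorted list, so it only happens to agree with the h-index on the sample data. A small check on a hypothetical author with citations 3, 3, 3, -1, -1 illustrates the difference. The names `df_ties` and `h_by_position` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one author with citations 3, 3, 3, -1, -1.
df_ties = pd.DataFrame({"author": [4] * 5, "citations": [3, 3, 3, -1, -1]})
grouped = df_ties.groupby("author")["citations"]

# Count-based rule: compare each citation to the number of publications (5 here).
count_based = grouped.agg(lambda x: int((x >= x.count()).sum()))

def h_by_position(x):
    """Compare each citation to its 1-based position in the descending order."""
    ranked = np.sort(x.to_numpy())[::-1]
    positions = np.arange(1, len(ranked) + 1)
    return int((ranked >= positions).sum())

position_based = grouped.agg(h_by_position)
print(count_based.loc[4], position_based.loc[4])
```

Here the count-based rule gives 0, because no citation reaches 5, while the position-based comparison gives 3, which matches the definition.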

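EDIT: As @Julien Spronck pointed out, the above won't work correctly if author 4 had citations of 3, 3, 3. Normally you can't access the index within a group, but we can compare the citation value against its `rank`, which acts as a pseudo-index. The answer notes that this only works when the citation values are unique; passing `method='first'`, as in the call that follows, gives tied values distinct ranks in order of appearance: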

In [129]:

df['h-index'] = df.groupby('author')['citations'].transform(lambda x: ( x >= x.rank(ascending=False, method='first') ).sum() )
df
Out[129]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1

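For a frame with millions of authors, calling a Python lambda once per group can itself become a bottleneck. The following is a sketch of a variant of the same rank comparison that avoids the lambda by sorting once and using `cumcount` for the within-author position. It is not part of the original answer, it assumes the column names 'author' and 'citations' used above, and whether it is actually faster is worth timing on your own data.

```python
import pandas as pd

# Sample data from the question, with the renamed columns.
df = pd.DataFrame({
    "author": [1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5],
    "citations": [14, -1, 1817, 29, 25, 2, 1, -1, 25, 24, 2, -1, -1, 25, 1],
})

# Sort each author's publications by citations, highest first.
ranked = df.sort_values(["author", "citations"], ascending=[True, False])

# 1-based position of each publication within its author's sorted list.
position = ranked.groupby("author").cumcount() + 1

# A publication counts toward the h-index when citations >= position;
# summing the boolean flags per author gives the h-index.
qualifies = ranked["citations"] >= position
h_index_by_author = qualifies.groupby(ranked["author"]).sum().astype(int)

print(h_index_by_author.to_dict())
```

The result is one value per author. If a per-row column is needed, as in the output above, `df["author"].map(h_index_by_author)` would broadcast it back onto the original rows.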

Archived:	10 years, 11 months ago
Views:	441
Last activity:	6 years, 11 months ago