Mann–Whitney U test on Pandas dataframe

gur*_*kan 1 python scipy pandas

I have a large dataframe similar to this one:

In [1]: grades
Out[1]: 
                          course1  course2
school  class  student                    
school1 class1 student1         2        2
               student2         3        2
               student3         1        3
               student4         3        1
               student5         3        1
...                           ...      ...
        class3 student86        3        1
               student87        2        2
               student88        1        1
               student89        3        3
               student90        0        1

[90 rows x 2 columns]

I want to compute the Mann-Whitney rank test on the grades from the sample school and each sub-sample class. How can I do this using pandas and scipy.stats.mannwhitneyu without iterating through the dataframe?

Bre*_*arn 5

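What you want to do is groupby on the index levels and apply a function that calls mannwhitneyu, passing it the two columns course1 and course2. Suppose this is your data: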

import numpy as np
import pandas
from scipy import stats

# 3 schools x 3 classes x 10 students = 90 rows
index = pandas.MultiIndex.from_product([
    ['school{0}'.format(n) for n in range(3)],
    ['class{0}'.format(n) for n in range(3)],
    ['student{0}'.format(n) for n in range(10)]
])
d = pandas.DataFrame({'course1': np.random.randint(0, 10, 90), 'course2': np.random.randint(0, 10, 90)},
                     index=index)

Then compute the Mann-Whitney U by school:

>>> d.groupby(level=0).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0    (426.5, 0.365937834646)
school1    (445.0, 0.473277409673)
school2    (421.0, 0.335714211748)
dtype: object
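Each entry in the result above is a (U, p-value) tuple, which is awkward to filter or sort. As a rough sketch, returning a pandas Series from the applied function should give a DataFrame with one column per value instead. The helper name `mwu`, the column labels `U` and `pvalue`, and the fixed random seed are assumptions made for illustration. `alternative` is passed explicitly because its default has differed between SciPy versions, so the p-values will not necessarily match the ones printed above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same shape as the sample data: 3 schools x 3 classes x 10 students
rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
index = pd.MultiIndex.from_product([
    ['school{0}'.format(n) for n in range(3)],
    ['class{0}'.format(n) for n in range(3)],
    ['student{0}'.format(n) for n in range(10)],
])
d = pd.DataFrame({'course1': rng.integers(0, 10, 90),
                  'course2': rng.integers(0, 10, 90)},
                 index=index)


def mwu(group):
    """Hypothetical helper: run the test on one group and label the outputs."""
    u, p = stats.mannwhitneyu(group['course1'], group['course2'],
                              alternative='two-sided')
    return pd.Series({'U': u, 'pvalue': p})


# One row per school, with columns 'U' and 'pvalue'
by_school = d.groupby(level=0).apply(mwu)
print(by_school)
```

With the values in columns, something like `by_school[by_school['pvalue'] < 0.05]` selects the groups of interest.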

And to do the same by class:

>>> d.groupby(level=[0, 1]).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0  class0     (38.5, 0.200247279189)
         class1     (37.0, 0.169040187814)
         class2     (46.5, 0.409559639829)
school1  class0     (33.5, 0.110329749527)
         class1     (47.5, 0.439276896563)
         class2    (30.0, 0.0684355963119)
school2  class0     (47.5, 0.439438219083)
         class1     (43.0, 0.308851989782)
         class2     (34.0, 0.118791221444)
dtype: object

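The numbers in the level argument of groupby refer to the levels of your MultiIndex. So grouping by level 0 groups by school, and grouping by levels 0 and 1 groups by school/class combination.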
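If the index levels are named, as they are in the question's dataframe (school, class, student), the level argument should also accept those names, which reads more clearly than positions. A minimal sketch, assuming a smaller made-up dataset and a hypothetical helper `u_test_pvalue` that keeps only the p-value:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small made-up dataset with named index levels (2 schools x 2 classes x 10 students)
index = pd.MultiIndex.from_product(
    [['school0', 'school1'],
     ['class0', 'class1'],
     ['student{0}'.format(n) for n in range(10)]],
    names=['school', 'class', 'student'],
)
rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
d = pd.DataFrame({'course1': rng.integers(0, 4, 40),
                  'course2': rng.integers(0, 4, 40)},
                 index=index)


def u_test_pvalue(t):
    """Hypothetical helper: p-value of course1 vs course2 within one group."""
    return stats.mannwhitneyu(t['course1'], t['course2'],
                              alternative='two-sided').pvalue


# Grouping by position and grouping by name select the same index levels
by_position = d.groupby(level=[0, 1]).apply(u_test_pvalue)
by_name = d.groupby(level=['school', 'class']).apply(u_test_pvalue)
print(by_name)
```

Both calls give one p-value per school/class pair; the named form keeps working if the order of the index levels changes.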