Mann–Whitney U test on Pandas dataframe

Question

I have a large dataframe similar to this one:

In [1]: grades
Out[1]: 
                          course1  course2
school  class  student                    
school1 class1 student1         2        2
               student2         3        2
               student3         1        3
               student4         3        1
               student5         3        1
...                           ...      ...
        class3 student86        3        1
               student87        2        2
               student88        1        1
               student89        3        3
               student90        0        1

[90 rows x 2 columns]


I want to run the Mann-Whitney rank test on the grades for the whole school sample and for each class sub-sample. How can I do this with pandas and scipy.stats.mannwhitneyu without iterating through the dataframe?

Answer 1

Bre*_*arn 5

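What you want to do is groupby on the index level(s) and apply a function that calls mannwhitneyu, passing in the two columns course1 and course2. Suppose this is your data: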

import numpy as np
import pandas
from scipy import stats

# 3 schools x 3 classes x 10 students = 90 rows
index = pandas.MultiIndex.from_product([
    ['school{0}'.format(n) for n in range(3)],
    ['class{0}'.format(n) for n in range(3)],
    ['student{0}'.format(n) for n in range(10)]
])
d = pandas.DataFrame({'course1': np.random.randint(0, 10, 90), 'course2': np.random.randint(0, 10, 90)},
                     index=index)


Then compute the Mann-Whitney U by school:

>>> d.groupby(level=0).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0    (426.5, 0.365937834646)
school1    (445.0, 0.473277409673)
school2    (421.0, 0.335714211748)
dtype: object


And to do it by class:

>>> d.groupby(level=[0, 1]).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0  class0     (38.5, 0.200247279189)
         class1     (37.0, 0.169040187814)
         class2     (46.5, 0.409559639829)
school1  class0     (33.5, 0.110329749527)
         class1     (47.5, 0.439276896563)
         class2    (30.0, 0.0684355963119)
school2  class0     (47.5, 0.439438219083)
         class1     (43.0, 0.308851989782)
         class2     (34.0, 0.118791221444)
dtype: object


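The numbers in the level argument to groupby refer to the levels of your MultiIndex. So grouping by level 0 groups by school, and grouping by levels 0 and 1 groups by school/class combination.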
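If the tuples in the output are awkward to work with, the same groupby calls can return a table with one column per value. The following is only a sketch under a few assumptions that are not in the original: the index levels are given the names school, class and student so that they can be referred to by name instead of by number, the helper mwu is a made-up name, and alternative="two-sided" is passed explicitly because the default has differed across SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data shaped like the example above (values are random, not real grades).
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [
        [f"school{n}" for n in range(3)],
        [f"class{n}" for n in range(3)],
        [f"student{n}" for n in range(10)],
    ],
    names=["school", "class", "student"],  # assumed level names
)
d = pd.DataFrame(
    {
        "course1": rng.integers(0, 10, len(index)),
        "course2": rng.integers(0, 10, len(index)),
    },
    index=index,
)


def mwu(group):
    """Hypothetical helper: compare course1 and course2 within one group."""
    result = stats.mannwhitneyu(
        group["course1"], group["course2"], alternative="two-sided"
    )
    # Returning a Series makes groupby.apply build one column per entry.
    return pd.Series({"U": result.statistic, "p": result.pvalue})


# Level names can be used in place of the level numbers 0 and 1.
by_school = d.groupby(level="school").apply(mwu)
by_class = d.groupby(level=["school", "class"]).apply(mwu)

print(by_school)
print(by_class)
```

Here by_school has one row per school and by_class one row per school/class pair, each with a U column and a p column, so the p-values can be filtered or sorted like any other column.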
