How do I run a t-test by group in a pandas DataFrame?

Question

I have a large pandas DataFrame with many columns. The DataFrame contains two groups. The basic setup looks like this:

import pandas as pd
csv = [{"air" : 0.47,"co2" : 0.43 , "Group" : 1}, {"air" : 0.77,"co2" : 0.13 , "Group" : 1}, {"air" : 0.17,"co2" : 0.93 , "Group" : 2} ]
df = pd.DataFrame(csv)

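I want to run a paired t-test on air and co2, comparing the two groups Group = 1 and Group = 2.

I have many more columns than air and co2, so I would like a procedure that works for all columns in the DataFrame. I believe I could use scipy.stats.ttest_rel together with pd.groupby or apply. How would that work? Thanks in advance /R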

Answer 1

err*_*ror 5

I would use the pandas DataFrame.where method.

group1_air = df.where(df.Group== 1).dropna()['air']
group2_air = df.where(df.Group== 2).dropna()['air']

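This code stores in group1_air all values of the air column where Group is 1, and in group2_air all values of the air column where Group is 2. The dropna() call is needed because the .where method returns NaN for every row in which the specified condition is not met. So when you use df.where(df.Group== 1), all rows with Group equal to 2 come back filled with NaN values.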
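
To make the role of dropna() concrete, here is a minimal sketch using the sample rows from the question. It also shows boolean indexing with df.loc as an alternative way to select the same values; that alternative is an assumption of this sketch and not part of the answer's approach.

```python
import pandas as pd

rows = [
    {"air": 0.47, "co2": 0.43, "Group": 1},
    {"air": 0.77, "co2": 0.13, "Group": 1},
    {"air": 0.17, "co2": 0.93, "Group": 2},
]
df = pd.DataFrame(rows)

# .where keeps the shape of df and fills every non-matching row with NaN
masked = df.where(df.Group == 1)
print(masked)

# dropna() then removes those NaN rows, leaving only the Group 1 rows
group1_air = masked.dropna()["air"]

# Alternative (not from the answer): boolean indexing selects the rows directly
group1_air_loc = df.loc[df.Group == 1, "air"]

print(group1_air.tolist())
print(group1_air_loc.tolist())
```

Both selections give the same values here. One difference to keep in mind: dropna() also drops rows of the selected group that have a missing value in any other column, whereas the df.loc form only looks at the Group column.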

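Whether you need scipy.stats.ttest_rel or scipy.stats.ttest_ind depends on your groups. If your samples come from independent groups, use ttest_ind; if they come from related (paired) groups, use ttest_rel.

So, if your samples are independent of each other, the last piece of code you need is: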

import scipy.stats
scipy.stats.ttest_ind(group1_air, group2_air)

Otherwise you need to use:

scipy.stats.ttest_rel(group1_air,group2_air)

If you also want to test co2, simply replace air with co2 in the example above.
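
As a rough illustration of the choice between the two tests, the sketch below runs both on made-up data with four measurements per group (the values are invented for this example and are not from the question). Note that ttest_rel pairs observations by position and therefore needs both samples to have the same length, which the three-row sample in the question does not satisfy.

```python
import pandas as pd
import scipy.stats

# Made-up data for illustration: four measurements per group
df = pd.DataFrame({
    "air":   [0.47, 0.77, 0.52, 0.61, 0.17, 0.35, 0.29, 0.41],
    "Group": [1, 1, 1, 1, 2, 2, 2, 2],
})

group1_air = df.where(df.Group == 1).dropna()["air"]
group2_air = df.where(df.Group == 2).dropna()["air"]

# Independent samples: the group sizes are allowed to differ
ind_result = scipy.stats.ttest_ind(group1_air, group2_air)

# Paired samples: equal lengths required, observations matched by position
rel_result = scipy.stats.ttest_rel(group1_air.to_numpy(), group2_air.to_numpy())

print("independent:", ind_result.statistic, ind_result.pvalue)
print("paired:     ", rel_result.statistic, rel_result.pvalue)
```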

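Edit:

Here is a rough sketch of the code you should run to perform the test on every column in the DataFrame (except the Group column). You may need to adjust column_list to fit your needs exactly (for example, you may not want to loop over every column).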

import scipy.stats

# get a list of all columns in the dataframe without the Group column
column_list = [x for x in df.columns if x != 'Group']
# create an empty dictionary
t_test_results = {}
# loop over column_list and execute code explained above
for column in column_list:
    group1 = df.where(df.Group == 1).dropna()[column]
    group2 = df.where(df.Group == 2).dropna()[column]
    # store the output explicitly as a (statistic, pvalue) pair
    result = scipy.stats.ttest_ind(group1, group2)
    t_test_results[column] = (result.statistic, result.pvalue)
# build a DataFrame with one row per tested column
results_df = pd.DataFrame.from_dict(t_test_results, orient='index')
results_df.columns = ['statistic', 'pvalue']

At the end of this code, you have a DataFrame containing the t-test output for each column you looped over.
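
Since the question mentions groupby, here is one possible variant, offered as a sketch and not as the answer's method. It splits the frame with groupby and passes all value columns to ttest_ind at once, relying on its axis=0 argument to run one test per column. The data values are made up for illustration, and the sketch assumes independent samples and exactly two groups labelled 1 and 2.

```python
import pandas as pd
import scipy.stats

# Made-up data for illustration: four rows per group
df = pd.DataFrame({
    "air":   [0.47, 0.77, 0.52, 0.61, 0.17, 0.35, 0.29, 0.41],
    "co2":   [0.43, 0.13, 0.38, 0.25, 0.93, 0.71, 0.80, 0.66],
    "Group": [1, 1, 1, 1, 2, 2, 2, 2],
})

value_columns = [c for c in df.columns if c != "Group"]

# Split the frame once by group and keep only the columns to be tested
grouped = df.groupby("Group")
group1 = grouped.get_group(1)[value_columns]
group2 = grouped.get_group(2)[value_columns]

# axis=0 tests along the rows, so each column gets its own statistic and p-value
result = scipy.stats.ttest_ind(group1.to_numpy(), group2.to_numpy(), axis=0)

results_df = pd.DataFrame(
    {"statistic": result.statistic, "pvalue": result.pvalue},
    index=value_columns,
)
print(results_df)
```

For paired data, scipy.stats.ttest_rel accepts the same axis argument, provided both groups have the same number of rows in matching order.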
