从 Pandas DataFrame 计算 pvalue

Question

从 Pandas DataFrame 计算 pvalue

Gab*_*ina 5 python statistics scipy dataframe pandas

我有一个带有 Multindex 和 8 个样本（此处仅显示两个）和每个样本的 8 个基因的 DataFrame 统计数据。

 In[13]:stats
    Out[13]: 
                       ARG/16S                                            \
                         count          mean           std           min   
    sample      gene                                                       
    Arnhem      IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03   
                Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00   
                Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03   
                TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04   
                TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05   
                blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06   
                ermB       4.0  4.111081e-05  7.894879e-06  3.288865e-05   
                ermF       4.0  2.335210e-05  4.519758e-06  1.832037e-05   
    Basel       Aph3a      4.0  7.815592e-06  1.757242e-06  5.539389e-06   
                IC        11.0  5.095161e-03  5.639278e-03  1.302205e-03   
                Int1      12.0  1.333068e+01  1.872207e+01  4.988048e-02   
                Sul1      11.0  1.618617e-02  1.988817e-02  2.970397e-03

Run Code Online (Sandbox Code Playgroud)

我正在尝试计算每个样本的 p 值（学生 t 检验），比较它们之间的每个基因。

我使用过 scipy.stats.ttest_ind_from_stats 但我设法获得了一个基因的不同样本的 p 值，并且只有那些彼此相邻的样本的 p 值。

Experiments = list(values1_16S['sample'].unique())
for exp in Experiments:
    if Experiments.index(exp)<len(Experiments)-1:
        second = Experiments[Experiments.index(exp)+1]
    else:
        second = Experiments[0]
    tstat, pvalue = scipy.stats.ttest_ind_from_stats(stats.loc[(exp,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(exp,'Sul1')]['ARG/16S','count'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','mean'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','std'],
                                    stats.loc[(second,'Sul1')]['ARG/16S','count'])
    d.append({'loc1':exp, 'loc2':second, 'pvalue':pvalue})


stats_Sul1 = pd.DataFrame(d)
stats_Sul1

Run Code Online (Sandbox Code Playgroud)

如何获得所有样本之间的 pvalues？有没有一种方法可以一次对所有基因执行此操作，而无需为每个基因逐一运行代码？

Answer 1

Ben*_*n.T 6

假设您对 Y 样本具有相同的 X 基因。我用 X=3 和 Y=2 尝试我的方法，但我想你可以概括。我开始于：

df1 = 
             count       mean        std       min
sample gene                                       
Arnhem IC       11   0.002319   0.000740  0.001503
       Int1     11   7.243040   6.848327  1.364879
       Sul1     11   0.003969   0.000919  0.002499
Basel  IC       11   0.005095   0.005639  0.001302
       Int1     12  13.330680  18.722070  0.049880
       Sul1     11   0.016186   0.019888  0.002970

Run Code Online (Sandbox Code Playgroud)

请注意，基因需要按相同顺序排列。首先reset_index()用df_reindex = df1.reset_index()，我不知道我在做什么是可能的多指标：

df_reindex =
   sample  gene  count       mean        std       min
0  Arnhem    IC     11   0.002319   0.000740  0.001503
1  Arnhem  Int1     11   7.243040   6.848327  1.364879
2  Arnhem  Sul1     11   0.003969   0.000919  0.002499
3   Basel    IC     11   0.005095   0.005639  0.001302
4   Basel  Int1     12  13.330680  18.722070  0.049880
5   Basel  Sul1     11   0.016186   0.019888  0.002970

Run Code Online (Sandbox Code Playgroud)

我创建了一个滚动的 DF 并将其加入df_reindex：

nb_genes = 3
df_rolled = pd.DataFrame(pd.np.roll(df_reindex,nb_genes,0), columns = df_reindex.columns)
df_joined = df_reindex.join(df_rolled, rsuffix='_')
# rsuffix='_' is to be able to perform the join

Run Code Online (Sandbox Code Playgroud)

现在在同一行，我有你需要计算pvalue和创建列的所有数据apply：

df_joined['pvalue'] = df_joined.apply(lambda x: stats.ttest_ind_from_stats(x['mean'],x['std'],x['count'], x['mean_'],x['std_'],x['count_'])[1],axis=1)

Run Code Online (Sandbox Code Playgroud)

最后，我用你想要的数据创建一个 DF 并重命名列：

df_output = df_joined[['sample','sample_','gene','pvalue']].rename(columns = {'sample':'loc1', 'sample_':'loc2'})

Run Code Online (Sandbox Code Playgroud)

你最终得到数据：

df_output = 
     loc1    loc2  gene    pvalue
0  Arnhem   Basel    IC  0.121142
1  Arnhem   Basel  Int1  0.321072
2  Arnhem   Basel  Sul1  0.055298
3   Basel  Arnhem    IC  0.121142
4   Basel  Arnhem  Int1  0.321072
5   Basel  Arnhem  Sul1  0.055298

Run Code Online (Sandbox Code Playgroud)

您可以根据需要重新索引。

如果你想对每个样本进行对比，我认为循环for可以做到。

编辑：使用pivot_table，我认为有一种更简单的方法。

仅将您的输入stats作为多索引表ARG/16S（不确定如何处理此级别），所以我从（可能是您的stats['ARG/16S']）开始：

df=
               count       mean           std       min
sample gene                                            
Arnhem IC         11   0.002319  7.396130e-04  0.001503
       Int1       11   7.243040  6.848327e+00  1.364879
       Sul1       11   0.003969  9.186019e-04  0.002499
       TetB        2   0.115475  1.627663e-01  0.000382
       TetM        4   0.000108  5.185259e-05  0.000052
       blaOXA      4   0.000004  3.783235e-07  0.000004
       ermB        4   0.000041  7.894879e-06  0.000033
       ermF        4   0.000023  4.519758e-06  0.000018
Basel  Aph3a       4   0.000008  1.757242e-06  0.000006
       IC         11   0.005095  5.639278e-03  0.001302
       Int1       12  13.330680  1.872207e+01  0.049880
       Sul1       11   0.016186  1.988817e-02  0.002970

Run Code Online (Sandbox Code Playgroud)

使用函数pivot_table，您可以重新排列数据，例如：

df_pivot = df.pivot_table(values = ['count','mean','std'], index = 'gene', 
                               columns = 'sample', fill_value = 0)

Run Code Online (Sandbox Code Playgroud)

在此df_pivot（我不会在这里打印它以提高可读性，但在新列的末尾），您可以使用itertoolsand为每对夫妇（sample1、sample2）创建一列apply：

import itertools
for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_pivot[sample1+ '_' + sample2 ] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

Run Code Online (Sandbox Code Playgroud)

我认为这种方法与样本数量、基因数量无关，如果基因不完全相同，您最终会得到df_pivot如下结果：

        count            mean                      std            Arnhem_Basel
sample Arnhem Basel    Arnhem      Basel        Arnhem      Basel             
gene                                                                          
Aph3a       0     4  0.000000   0.000008  0.000000e+00   0.000002     1.000000
IC         11    11  0.002319   0.005095  7.396130e-04   0.005639     0.121142
Int1       11    12  7.243040  13.330680  6.848327e+00  18.722070     0.321072
Sul1       11    11  0.003969   0.016186  9.186019e-04   0.019888     0.055298
TetB        2     0  0.115475   0.000000  1.627663e-01   0.000000     1.000000
TetM        4     0  0.000108   0.000000  5.185259e-05   0.000000     1.000000
blaOXA      4     0  0.000004   0.000000  3.783235e-07   0.000000     1.000000
ermB        4     0  0.000041   0.000000  7.894879e-06   0.000000     1.000000
ermF        4     0  0.000023   0.000000  4.519758e-06   0.000000     1.000000

Run Code Online (Sandbox Code Playgroud)

让我知道它是否有效

EDIT2：回复评论，我认为你可以这样做：

没有变化df_pivot，然后你创建一个多索引 DFdf_multi来写入你的结果：

df_multi = pd.DataFrame(index = df.index.levels[1], 
                        columns = pd.MultiIndex.from_tuples([p for p in itertools.combinations(df.index.levels[0],2)])).fillna(0)

Run Code Online (Sandbox Code Playgroud)

然后你使用循环for来实现这个数据df_multi：

for sample1, sample2 in itertools.combinations(df.index.levels[0],2):
    # itertools.combinations create all combinations between your samples
    df_multi.loc[:,(sample1,sample2)] = df_pivot.apply(lambda x: stats.ttest_ind_from_stats(x['mean'][sample1],x['std'][sample1],x['count'][sample1], 
                                                                                        x['mean'][sample2 ],x['std'][sample2 ],x['count'][sample2 ],)[1],axis=1).fillna(1)

Run Code Online (Sandbox Code Playgroud)

最后，您可以在级别 1 上使用transpose和unstack来获得您询问的方式（如果我误解了，请关闭）

df_output = df_multi.transpose().unstack(level=[1]).fillna(1)

Run Code Online (Sandbox Code Playgroud)

你会看到，你不必在列索引中最后一个样本和第一个样品（因为它们不存在我如何建立一切），如果你想要他们，你需要更换itertools.combinations由itertools.combinations_with_replacement两个创作的df_multi，并在环for（我没试过，但应该可以）

归档时间：	7 年，9 月前
查看次数：	11880 次
最近记录：	7 年，9 月前