Genarito (python, python-3.x, pandas)
I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a dataframe with the correlation and p-value for every possible combination, with a result like this:
userId petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the Cartesian product generates more than 3 million tuples, and it takes several minutes to finish. This is my code; I wrote two alternatives:
First, the initial dataframes:
df1 = pd.DataFrame({
'userID': ['John', 'Charles', 'Genarito', 'Mark'],
'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')
df2 = pd.DataFrame({
'petID': ['Firulais', 'Connie', 'PopCorn'],
'attr_1': [1.2501, 0.0, 2.7037],
'attr_77': [2.4196, 1.0618, 4.6707],
'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')
Option 1:
from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def compute_correlations(df1, df2):
    # Pre-allocate space
    df1_keys = df1.index
    res_row_count = len(df1_keys) * df2.values.shape[0]
    users = np.empty(res_row_count, dtype='object')
    pets = np.empty(res_row_count, dtype='object')
    coff = np.empty(res_row_count)
    p_value = np.empty(res_row_count)
    i = 0
    for df1_key in df1_keys:
        df1_values = df1.loc[df1_key, :].values
        for df2_key in df2.index:
            df2_values = df2.loc[df2_key, :]
            pearson_res = pearsonr(df1_values, df2_values)
            users[i] = df1_key
            pets[i] = df2_key
            coff[i] = pearson_res[0]
            p_value[i] = pearson_res[1]
            i += 1
    # After the loop, create the resulting DataFrame
    return pd.DataFrame(data={
        'userID': users,
        'petID': pets,
        'Correlation': coff,
        'p-value': p_value
    })
Option 2 (slower), taken from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res

# Computes the Pearson coefficient for every tuple
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
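As a side note, on pandas ≥ 1.2 the `_tmpkey` trick in `df_crossjoin` can be replaced by the built-in cross merge. A minimal sketch (the function name `df_crossjoin_builtin` is mine, not from the code above), assuming that pandas version is available:

```python
import pandas as pd

def df_crossjoin_builtin(df1, df2):
    # pandas >= 1.2: how="cross" builds the Cartesian product directly;
    # reset_index keeps the original indices as ordinary columns.
    return df1.reset_index().merge(df2.reset_index(), how="cross")
```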
Is there a faster way to solve this kind of problem with pandas, or do I have no choice but to parallelize the iterations?
As the dataframes grow, the second option gives a better runtime, but it still takes several seconds to finish.
Thanks in advance.
Of all the alternatives I tested, the one that gave me the best results was the following:
The iteration product is built with itertools.product().
All the iterations over the two sets of rows are performed in a parallel pool of processes (using the map function).
To squeeze out a bit more performance, the function compute_row_cython is compiled with Cython, as recommended in this section of the pandas documentation:
In the file cython_modules.pyx:
from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py:
from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))
Finally, I compiled it with: python setup.py build_ext --inplace
The final code is, then:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
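The zipped tuples can then be assembled into the final dataframe. A minimal single-process sketch of the same pipeline (using a pure-Python stand-in for the compiled compute_row_cython, so it runs without the Cython build; the names compute_row and build_result are mine):

```python
import itertools
import pandas as pd
from scipy.stats import pearsonr

def compute_row(row):
    # Pure-Python stand-in for compute_row_cython
    (df1_key, df1_values), (df2_key, df2_values) = row
    r, p = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, r, p

def build_result(df1, df2, mapper=map):
    # Passing mapper=pool.map gives the parallel version shown above
    rows = mapper(compute_row, itertools.product(df1.iterrows(), df2.iterrows()))
    users, pets, coeffs, p_values = zip(*rows)
    return pd.DataFrame({'userID': users, 'petID': pets,
                         'Correlation': coeffs, 'p-value': p_values})
```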
Neither the Dask-based merge nor the apply-based function gave me better results, not even with the Cython-optimized apply. In fact, both of those alternatives ran into memory errors, and when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations it required.
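For completeness, this particular problem can also be solved without any Python-level loop over the Cartesian product: after centering each row, every pairwise correlation is one entry of a single matrix product, and the p-values follow from the t distribution with n - 2 degrees of freedom, which is the same test scipy.stats.pearsonr performs. A sketch (the function name pairwise_pearson is mine, not from the code above):

```python
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_pearson(df1, df2):
    """All pairwise Pearson correlations between rows of df1 and rows of df2."""
    x = df1.to_numpy(dtype=float)
    y = df2.to_numpy(dtype=float)
    # Center each row; a correlation is then a normalized dot product.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc @ yc.T) / np.outer(np.linalg.norm(xc, axis=1),
                               np.linalg.norm(yc, axis=1))
    # Two-sided p-value from the t distribution with n - 2 degrees of freedom
    # (r = +/-1 yields t = inf and hence p = 0).
    n = x.shape[1]
    with np.errstate(divide='ignore', invalid='ignore'):
        t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2 * stats.t.sf(np.abs(t), n - 2)
    idx = pd.MultiIndex.from_product([df1.index, df2.index],
                                     names=[df1.index.name, df2.index.name])
    return pd.DataFrame({'Correlation': r.ravel(), 'p-value': p.ravel()},
                        index=idx).reset_index()
```

Since the whole computation stays inside NumPy/BLAS, this avoids both the per-pair pearsonr calls and the inter-process serialization cost of the pool.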