How to speed up pandas apply for string matching

Wou*_*ter 5 python apply pandas

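I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this.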

import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B'],})

    A   B   C   D
0   A   B   A   A
1   B   C   B   C
2   A   D   D   C
3   B   A   D   B

I have to create new columns containing the number of occurrences of certain strings in each row. I do it like this:

for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)

   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1

However, this takes several minutes per file, and I have to do it for around 900 files. Is there any way to speed this up?

Shu*_*rma 6

Use stack + str.get_dummies, then sum on level=0 and join the result back to df:

df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))

Result:

print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
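One caveat for readers on newer pandas: the `level=` argument of `DataFrame.sum` was deprecated in pandas 1.3 and removed in 2.0, so the one-liner above raises a `TypeError` there. A minimal sketch of the same stack + `str.get_dummies` idea, assuming pandas 2.x and using `groupby(level=0).sum()` in place of `sum(level=0)` (the intermediate name `counts` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'B', 'A', 'B'],
    'B': ['B', 'C', 'D', 'A'],
    'C': ['A', 'B', 'D', 'D'],
    'D': ['A', 'C', 'C', 'B'],
})

# stack()            -> one long Series indexed by (row, column)
# str.get_dummies()  -> one 0/1 indicator column per distinct string
# groupby(level=0)   -> group the indicators by the original row label
# sum()              -> add them up, giving per-row counts
counts = df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_')

df1 = df.join(counts)
print(df1)
```

This should print the same table as shown in the result above.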

Tom*_*Tom 3

Rather than using apply to loop over each row, I loop over the letters and count the matches for each one across all columns at once:

for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)

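This appears to be an improvement for this example, but (based on quick tests not shown here) it may be equal or worse depending on the shape and size of the data (and probably on the number of strings you are looking for).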
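Since the result depends on the shape of the data, it is worth checking on something closer to your real files. A rough sketch, assuming a randomly generated frame of 2,000 rows; the size, the seed, and the helper names `count_with_apply` and `count_with_compare` are made up for illustration and are not part of the question or answers:

```python
import time

import numpy as np
import pandas as pd

letters = ['A', 'B', 'C', 'D']

# Hypothetical test data: 2,000 rows of random letters in four columns.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.choice(letters, size=(2_000, 4)), columns=letters)


def count_with_apply(frame):
    """Row-wise apply, as in the question."""
    out = frame.copy()
    for elem in letters:
        out['n_' + elem] = frame[letters].apply(lambda x: (x == elem).sum(), axis=1)
    return out


def count_with_compare(frame):
    """Whole-frame comparison, as in this answer."""
    out = frame.copy()
    for elem in letters:
        out['n_' + elem] = (frame[letters] == elem).sum(axis=1)
    return out


results = {}
for fn in (count_with_apply, count_with_compare):
    start = time.perf_counter()
    results[fn.__name__] = fn(big)
    print(f'{fn.__name__}: {time.perf_counter() - start:.4f} s')

# Both approaches should produce identical counts.
n_cols = ['n_' + elem for elem in letters]
same = (results['count_with_apply'][n_cols].to_numpy()
        == results['count_with_compare'][n_cols].to_numpy()).all()
print('identical counts:', same)
```

Swapping `big` for one of your own frames (and `letters` for your own strings) gives a quick comparison before committing to either approach for all the files.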


Some timing comparisons:

%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And the other answers:

%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
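For anyone re-running the last two snippets on a current pandas: `sum(1, level=0)` no longer exists in pandas 2.0 and later, and the `value_counts` version returns floats after `fillna(0)`. The following is a sketch of how they might be written today, not the original authors' code. It assumes pandas 2.x, replaces the column-wise `sum(1, level=0)` with a transpose plus `groupby(level=0).sum()`, and passes `dtype=int` to `get_dummies` so the indicators are plain integers. The names `via_dummies` and `via_value_counts` are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'B', 'A', 'B'],
    'B': ['B', 'C', 'D', 'A'],
    'C': ['A', 'B', 'D', 'D'],
    'D': ['A', 'C', 'C', 'B'],
})

# get_dummies variant: every source column contributes indicator columns
# named n_A, n_B, ... so the names repeat; grouping the transposed frame
# by label adds the repeated columns together.
dummies = pd.get_dummies(df, prefix='n', prefix_sep='_', dtype=int)
via_dummies = df.join(dummies.T.groupby(level=0).sum().T)

# value_counts variant: count the values in each row, fill the gaps with 0
# and cast back to int (fillna alone leaves float columns).
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0).astype(int)
counts.columns = [f'n_{col}' for col in counts.columns]
via_value_counts = df.join(counts)

print(via_dummies)
print(via_value_counts)
```

The timings quoted above were measured with the original snippets, so they do not necessarily carry over to these rewritten forms.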