Python Pandas: remove a substring using another column

Lin*_*ink 8 python string replace series pandas

I have searched around and could not find a simple way to do this, so I am hoping your expertise can help.

I have a pandas DataFrame with two columns:

import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})

This gives me:

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

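What I want to do is take the value from the 'NAME' column and remove it from the 'FULL_NAME' column if it is present there. So the function would return: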

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

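So far I have defined the function below and used the apply method. It runs rather slowly on my large data set, though, and I am hoping there is a more efficient way. Thanks!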

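import re

def address_remove(x):
    # Remove the NAME value (used as a regex pattern) from FULL_NAME.
    try:
        new_name = re.sub(x['NAME'], '', x['FULL_NAME'])
        return new_name.strip()
    except TypeError:
        # NaN (a float) in either column makes re.sub raise TypeError
        return x['FULL_NAME']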

joh*_*ase 5

Here is a solution that is faster than your current one, although there may well be something faster still.

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing  = pd.DataFrame({'NAME':[
         'FIRST', np.nan, 'NAME2', 'NAME3', 
         'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST  LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

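It is a rather long one-liner, but it should do what you need.

The fastest solution I can come up with is to use replace, as mentioned in another answer: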

In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop
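
The comprehension above casts both columns with astype('str') and does not strip whitespace. As a minimal sketch, assuming the goal is the NEW column shown in the question, the variant below skips non-string values explicitly so that missing values pass through as NaN, and collapses leftover whitespace. remove_name is a hypothetical helper introduced here for illustration; it is not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3', 'NAME4', 'NAME5', 'NAME6'],
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'],
})


def remove_name(full_name, name):
    """Hypothetical helper: drop `name` from `full_name` as literal text."""
    # Leave missing or non-string values untouched (NaN is a float).
    if not isinstance(full_name, str) or not isinstance(name, str):
        return full_name
    # str.replace is a literal replacement; split/join then collapses
    # any leading, trailing, or doubled whitespace left behind.
    return ' '.join(full_name.replace(name, '').split())


testing['NEW'] = [
    remove_name(e, k) for e, k in zip(testing['FULL_NAME'], testing['NAME'])
]
print(testing)
```

The whitespace collapsing is a design choice: without it, removing NAME4 from 'FIRST NAME4 LAST' leaves two spaces between the remaining words.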

Original answer:

In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

Compared with your current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop

These should give you the same answers as your current solution.
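
The %timeit lines above rely on IPython. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The function names with_replace and with_split_join are made up here, and the isinstance checks are an added assumption so that NaN rows are skipped rather than cast to strings. Absolute timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

n = 1000
testing = pd.DataFrame({
    'NAME': ['FIRST', np.nan, 'NAME2', 'NAME3',
             'NAME4', 'NAME5', 'NAME6'] * n,
    'FULL_NAME': ['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3',
                  'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME'] * n,
})


def with_replace():
    # Literal replacement with str.replace; non-string rows pass through.
    return [
        e.replace(k, '') if isinstance(e, str) and isinstance(k, str) else e
        for e, k in zip(testing['FULL_NAME'], testing['NAME'])
    ]


def with_split_join():
    # Same result via split/join, as in the original answer.
    return [
        ''.join(e.split(k)) if isinstance(e, str) and isinstance(k, str) else e
        for e, k in zip(testing['FULL_NAME'], testing['NAME'])
    ]


for fn in (with_replace, with_split_join):
    # Average over 10 calls and report milliseconds per call.
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```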


Ant*_*pov 5

You can do this with the replace method and its regex argument, and then use str.strip:

In [605]: testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
Out[605]: 
0            LAST
1             NaN
2      FIRST LAST
3           FIRST
4     FIRST  LAST
5    ANOTHER NAME
6       LAST NAME
Name: FULL_NAME, dtype: object

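Note that you need to filter testing.NAME with notnull before passing it, because without that the NaN values would also be replaced with an empty string.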
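
One caveat with regex = True is that each NAME value is treated as a regular-expression pattern, so names containing characters such as + or . may not match as literal text. The following is a minimal sketch of that effect using only the standard re module. The sample strings and the helper remove_literal are made up for illustration and do not come from the question's data.

```python
import re


def remove_literal(full_name, name):
    """Hypothetical helper: remove `name` from `full_name` as literal text."""
    # re.escape backslash-escapes regex metacharacters in `name`.
    return re.sub(re.escape(name), '', full_name).strip()


full_name = 'A+B CORP'  # made-up sample value
name = 'A+B'            # '+' is a regex quantifier when left unescaped

# Unescaped: the pattern means "one or more A, then B", which does not
# occur in the sample string, so nothing is removed.
as_pattern = re.sub(name, '', full_name).strip()

# Escaped: the three characters A, +, B are matched literally.
as_literal = remove_literal(full_name, name)

print(repr(as_pattern))  # 'A+B CORP'
print(repr(as_literal))  # 'CORP'
```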

In benchmarks it is slower than @johnchase's fastest solution, but I think it is more readable and it uses only pandas methods on DataFrames and Series:

In [607]: %timeit testing['NEW'] = testing.FULL_NAME.replace(testing.NAME[testing.NAME.notnull()], '', regex = True).str.strip()
100 loops, best of 3: 4.56 ms per loop

In [661]: %timeit testing ['NEW'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
1000 loops, best of 3: 450 µs per loop