A smarter way to check whether a string contains an element of a list - Python

Dus*_*Sun 5 python pandas

The list top_brands contains brand names, for example:

top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]

items is a pandas.DataFrame with the structure shown below. My task is to fill in brand_name from item_title whenever brand_name is missing.

row     item_title                 brand_name

1    |  Apple 6S                  |  Apple
2    |  New Victoria\'s Secret    |  missing  <-- need to fill with Victoria\'s Secret
3    |  Used Samsung TV           |  missing  <--need fill with Samsung
4    |  Used bike                 |  missing  <--No need to do anything because there is no brand_name in the title 
    ....

My code is below. The problem is that it is too slow for a DataFrame with 2 million records. Is there a way to do this with pandas or numpy?

def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']

    item_title = row['item_title']

    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)): 
            print(brand)
            return brand

    return 'missing'    ### end of get_brand_name


items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)

r.o*_*ook 2

Try this:

pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)

Output:

              item_title         brand_name
0               Apple 6S              Apple
1  New Victoria's Secret  Victoria's Secret
2        Used Samsung TV            Samsung
3              Used Bike            missing
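Two details of the one-liner are worth watching. It builds a new brand_name column for every row rather than filling only the rows marked 'missing', and it joins the brand names into the regex unescaped and without word boundaries. The following is a minimal sketch of one way to address both. It is not part of the original answer: the short brand list, the sample rows, and the helper name fill_missing_brands are assumptions for illustration, and the whitespace lookarounds only approximate the space-delimited matching in the question's loop.

```python
import re

import pandas as pd

# Assumed sample data; the real top_brands list is truncated in the question.
top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung", "Sony"]
items = pd.DataFrame({
    "item_title": ["Apple 6S", "New Victoria's Secret", "Used Samsung TV",
                   "Used bike", "Used Sonya doll"],
    "brand_name": ["Apple", "missing", "missing", "missing", "missing"],
})


def fill_missing_brands(df, brands):
    """Hypothetical helper: fill 'missing' brand_name values from item_title."""
    # Escape regex metacharacters and try longer names first.
    escaped = [re.escape(b) for b in sorted(brands, key=len, reverse=True)]
    # The brand must be delimited by whitespace or the string boundaries.
    pattern = r"(?<!\S)(" + "|".join(escaped) + r")(?!\S)"
    found = df["item_title"].str.extract(pattern, expand=False)

    is_missing = df["brand_name"].eq("missing")
    out = df.copy()
    # Only touch rows whose brand is missing; keep 'missing' when nothing matched.
    out.loc[is_missing, "brand_name"] = found[is_missing].fillna("missing")
    return out


filled = fill_missing_brands(items, top_brands)
print(filled)
```

With the boundaries in place, a title such as "Used Sonya doll" is left as 'missing' instead of being tagged "Sony", and rows that already have a brand are not overwritten.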

I tested this on my machine with a random sample of 2 million items:

import time

import pandas as pd

def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
elapsed = time.time() - start  # includes reading the CSV and printing the frame
print(f'Took {elapsed}s to process')

Output:

                                   item_title         brand_name
0                                    LG watch                 LG
1                                  Sony watch               Sony
2                                 Used Burger            missing
3                                    New Bike            missing
4                               New underwear            missing
5                                    New Sony               Sony
6                        Used Apple underwear              Apple
7                       Refurbished Panasonic          Panasonic
8                   Used Victoria's Secret TV  Victoria's Secret
9                                Disney phone             Disney
10                                Used laptop            missing
...                                       ...                ...
1999990             Refurbished Disney tablet             Disney
1999991                    Refurbished laptop            missing
1999992                       Nintendo Coffee           Nintendo
1999993                      Nintendo desktop           Nintendo
1999994         Refurbished Victoria's Secret  Victoria's Secret
1999995                           Used Burger            missing
1999996                    Nintendo underwear           Nintendo
1999997                     Refurbished Apple              Apple
1999998                      Refurbished Sony               Sony
1999999                      New Google phone             Google

[2000000 rows x 2 columns]
Took 3.2660000324249268s to process

My machine specs:

Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40 GHz, 12.0 GB RAM

3.266 seconds is pretty fast... right?
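Note that the figure above covers reading the CSV and printing the result as well as the regex extraction. A rough sketch of timing only the extraction step could look like the following. The synthetic titles, the short brand list, and the row count are assumptions made for illustration, so the number it prints is not comparable to the 3.266 s above.

```python
import random
import re
import time

import pandas as pd

# Assumed sample vocabulary, loosely modelled on the output shown above.
top_brands = ["Apple", "Sony", "LG", "Disney"]
conditions = ["New", "Used", "Refurbished"]
products = ["watch", "phone", "laptop", "Bike"]

random.seed(0)
n_rows = 100_000  # raise to 2_000_000 to mirror the test above


def make_title():
    # Roughly one title in five carries no brand at all.
    parts = [random.choice(conditions)]
    brand = random.choice(top_brands + [None])
    if brand is not None:
        parts.append(brand)
    parts.append(random.choice(products))
    return " ".join(parts)


df = pd.DataFrame({"item_title": [make_title() for _ in range(n_rows)]})

# Same named-group pattern as the one-liner, with the brand names escaped.
pattern = "(?P<brand_name>{})".format("|".join(map(re.escape, top_brands)))

start = time.perf_counter()
brands = df["item_title"].str.extract(pattern, expand=True).fillna("missing")
elapsed = time.perf_counter() - start

print(f"Extraction alone took {elapsed:.3f}s for {n_rows} rows")
```

time.perf_counter is used here because it is intended for measuring short intervals; time.time would also work for a coarse estimate.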