The list top_brands contains brand names, for example:
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]
items is a pandas.DataFrame with the structure shown below. My task is to fill in brand_name from item_title whenever brand_name is missing.
row | item_title | brand_name
1 | Apple 6S | Apple
2 | New Victoria's Secret | missing <-- needs to be filled with Victoria's Secret
3 | Used Samsung TV | missing <-- needs to be filled with Samsung
4 | Used bike | missing <-- nothing to do, because the title contains no brand name
....
My code is below. The problem is that it is too slow for a DataFrame with 2 million records. Is there a way to handle this task with pandas or numpy?
def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)):
            print(brand)
            return brand
    return 'missing'  ### end of get_brand_name

items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)
Try this:
pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
Output:
item_title brand_name
0 Apple 6S Apple
1 New Victoria's Secret Victoria's Secret
2 Used Samsung TV Samsung
3 Used Bike missing
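The one-liner above extracts a brand for every row and matches brand names anywhere in the title, including inside longer words. A minimal sketch of a variant that stays closer to the question follows: it touches only rows still marked 'missing', escapes each brand with re.escape, and requires whitespace or a string boundary on both sides of the match. The sample data and the names alternatives, pattern, is_missing and found are assumptions made for illustration. Unlike the original loop, it returns the leftmost brand in the title rather than the first match in list order, and it also matches a title that consists of the brand alone.

```python
import re

import pandas as pd

# Hypothetical sample data mirroring the table in the question.
top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']
items = pd.DataFrame({
    'item_title': ['Apple 6S', "New Victoria's Secret", 'Used Samsung TV', 'Used bike'],
    'brand_name': ['Apple', 'missing', 'missing', 'missing'],
})

# Escape each brand so regex metacharacters are matched literally, and try
# longer names first so a short brand cannot shadow a longer one.
alternatives = '|'.join(
    re.escape(brand) for brand in sorted(top_brands, key=len, reverse=True)
)
# The brand must not be glued to other non-whitespace characters on either side.
pattern = r'(?<!\S)(' + alternatives + r')(?!\S)'

# Only search the rows whose brand is still missing.
is_missing = items['brand_name'] == 'missing'
found = items.loc[is_missing, 'item_title'].str.extract(pattern, expand=False)

# Rows with no match stay 'missing'; existing brands are left untouched.
items.loc[is_missing, 'brand_name'] = found.fillna('missing')
print(items)
```

Restricting the extraction to the missing rows also reduces the work when most rows already have a brand.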
I tested this on my machine with a random sample of 2 million items:
import time

import pandas as pd

def read_file():
    df = pd.read_csv('file1.txt')
    # top_brands is the list of brand names from the question
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
end = time.time() - start
print(f'Took {end}s to process')
Output:
item_title brand_name
0 LG watch LG
1 Sony watch Sony
2 Used Burger missing
3 New Bike missing
4 New underwear missing
5 New Sony Sony
6 Used Apple underwear Apple
7 Refurbished Panasonic Panasonic
8 Used Victoria's Secret TV Victoria's Secret
9 Disney phone Disney
10 Used laptop missing
... ... ...
1999990 Refurbished Disney tablet Disney
1999991 Refurbished laptop missing
1999992 Nintendo Coffee Nintendo
1999993 Nintendo desktop Nintendo
1999994 Refurbished Victoria's Secret Victoria's Secret
1999995 Used Burger missing
1999996 Nintendo underwear Nintendo
1999997 Refurbished Apple Apple
1999998 Refurbished Sony Sony
1999999 New Google phone Google
[2000000 rows x 2 columns]
Took 3.2660000324249268s to process
My machine specs:
Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40GHz, 12.0 GB RAM
3.266 seconds is pretty fast... right?
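Note that the timing above also covers reading the CSV file and printing the result. A rough sketch for timing only the extraction step on synthetic data is shown below. The original test file is not shown, so the brand list, the title vocabulary and the helpers make_titles and time_extract are all hypothetical, and the numbers will differ from machine to machine.

```python
import random
import re
import time

import pandas as pd

# Hypothetical brands and title words; the original test file is not shown.
top_brands = ['Apple', 'Sony', 'LG', "Victoria's Secret"]
prefixes = ['New', 'Used', 'Refurbished']
products = ['watch', 'phone', 'laptop', 'TV']


def make_titles(n, seed=0):
    """Build n synthetic titles; some of them contain no brand at all."""
    rng = random.Random(seed)
    titles = []
    for _ in range(n):
        # An empty string in the brand slot produces a brandless title.
        words = [rng.choice(prefixes), rng.choice(top_brands + ['']), rng.choice(products)]
        titles.append(' '.join(word for word in words if word))
    return pd.Series(titles, name='item_title')


def time_extract(titles):
    """Return (brands, elapsed seconds) for the regex extraction step only."""
    pattern = '({})'.format('|'.join(re.escape(brand) for brand in top_brands))
    start = time.perf_counter()
    brands = titles.str.extract(pattern, expand=False).fillna('missing')
    return brands, time.perf_counter() - start


titles = make_titles(100_000)  # raise toward 2_000_000 to mirror the test above
brands, elapsed = time_extract(titles)
print(f'Extracted {len(brands)} brands in {elapsed:.3f}s')
```

time.perf_counter is used here instead of time.time because it is intended for measuring short intervals.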