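I currently have this df, in which the rect column contains only strings. I need to extract x, y, w and h from it into separate columns. The data set is very large, so I need an efficient way to do this.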
df['rect'].head()
0 <Rect (120,168),260 by 120>
1 <Rect (120,168),260 by 120>
2 <Rect (120,168),260 by 120>
3 <Rect (120,168),260 by 120>
4 <Rect (120,168),260 by 120>
So far this solution works, but as you can see it is very messy:
df[['x', 'y', 'w', 'h']] = df['rect'].str.replace('<Rect (', '', regex=False).str.replace('),', ',', regex=False).str.replace(' by ', ',', regex=False).str.replace('>', '', regex=False).str.split(',', n=3, expand=True)
Is there a better way? Possibly a regex approach.
Use extractall
df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall(r'(\d+)').unstack().loc[:, 0]
Out[267]:
match 0 1 2 3
0 120 168 260 120
1 120 168 260 120
2 120 168 260 120
3 120 168 260 120
4 120 168 260 120
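Since every string shares one layout, a single pattern with named groups can also pull out the four fields with str.extract, which uses the group names as column names. This is only a minimal sketch, assuming every row matches the `<Rect (x,y),w by h>` layout shown in the question; the sample frame and the names `pattern` and `parts` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the same layout as the question.
df = pd.DataFrame({'rect': ['<Rect (120,168),260 by 120>'] * 5})

# One named group per field; str.extract names the result columns after them.
pattern = r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)'
parts = df['rect'].str.extract(pattern)

# The captured values are strings, so convert them if numbers are needed.
# astype(int) raises on rows that did not match (they come back as NaN).
df = df.join(parts.astype(int))
print(df.head())
```

Unlike extractall, this returns one row per input row, so no unstack step is needed.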
Make a copy
# zip(*...) transposes the per-row lists of matches into one tuple per column
df.assign(**dict(zip('xywh', zip(*df.rect.str.findall(r'\d+')))))
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
Or just reassign to df
df = df.assign(**dict(zip('xywh', zip(*df.rect.str.findall(r'\d+')))))
df
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
Modify the existing df
df[[*'xywh']] = pd.DataFrame(df.rect.str.findall(r'\d+').tolist(), index=df.index)
df
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
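Because the data set is large, it may be worth timing a plain-Python variant as well: a precompiled regex inside a list comprehension can sometimes keep up with or beat the .str methods on object columns, but that depends on the data and the pandas version, so measure on your own data first. This is a sketch under the same assumptions as the question (every row holds exactly four numbers); the sample frame and the names `NUM_RE`, `rows` and `parsed` are illustrative only.

```python
import re
import pandas as pd

# Hypothetical sample data in the same layout as the question.
df = pd.DataFrame({'rect': ['<Rect (120,168),260 by 120>'] * 5})

# Compile once and reuse for every row.
NUM_RE = re.compile(r'\d+')

# One list of four ints per row, in the order x, y, w, h.
rows = [[int(n) for n in NUM_RE.findall(s)] for s in df['rect']]

# Reuse df.index so the new columns line up with the original rows.
parsed = pd.DataFrame(rows, columns=['x', 'y', 'w', 'h'], index=df.index)
df = df.join(parsed)
print(df.head())
```

Converting to int inside the comprehension means the new columns are numeric from the start rather than strings.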