Sai*_*mar 2 python join string-matching partial pandas
I have a dataframe of movies, and I want to check whether each of them exists in another df.
after_h.sample(10, random_state=1)
                      movie  year  ratings
108  Mechanic: Resurrection  2016      4.0
206                Warcraft  2016      4.0
106               Max Steel  2016      3.5
107           Me Before You  2016      4.5
I want to check whether the movies above exist in this other df:
                             FILM  Votes
0  Avengers: Age of Ultron (2015)   4170
1               Cinderella (2015)    950
2                  Ant-Man (2015)   3000
3          Do You Believe? (2015)    350
4                Max Steel (2016)    560
I would like something like this as my final output:
        FILM  votes
0  Max Steel    560
There are two ways.
The first is to get the row indices of the partial matches, using FILM.str.startswith(title) or FILM.str.contains(title). Either of:
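# any(axis=1): the positional form any(1) is no longer accepted by recent pandas
df1[df1['movie'].apply(lambda title: df2['FILM'].str.startswith(title)).any(axis=1)]
# regex=False so that characters such as '?' or '(' in a title are matched literally
df1[df1['movie'].apply(lambda title: df2['FILM'].str.contains(title, regex=False)).any(axis=1)]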
         movie  year  ratings
106  Max Steel  2016      3.5
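The filters above return rows of df1, whereas the output asked for in the question comes from df2 (FILM and Votes). A minimal sketch of the same prefix-matching idea applied in the other direction follows. It assumes the two frames look like the samples in the question and that every FILM value ends in " (YYYY)"; the names titles, mask and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Python's str.startswith accepts a tuple of prefixes, so one pass over
# df2.FILM checks every title from df1 as a literal (non-regex) prefix.
titles = tuple(df1["movie"])
mask = df2["FILM"].apply(lambda film: film.startswith(titles))

# Keep the matching df2 rows and strip the trailing " (YYYY)" for display.
# Assumption: the year is always four digits in parentheses at the end.
result = df2.loc[mask, ["FILM", "Votes"]].copy()
result["FILM"] = result["FILM"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)
print(result)
```

Prefix matching ignores the year, so a remake with the same title but a different year would also match. The merge-based approach avoids that.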
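The second is to use merge(), after splitting the FILM column (which has the form movie_title (year)) into separate movie and year columns.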
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now that we have its subfields 'movie','year'
# (.copy() so the assignment below does not act on a slice of the old frame)
df2 = df2[['movie','year','Votes']].copy()
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5
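The merge above joins on whatever columns the two frames share. As a sketch under the same assumptions about the data, the variant below names the join keys explicitly and uses merge's indicator option to show which movies from df1 found no partner. The names parts, df2_split, matched and unmatched are hypothetical, and the regular expression assumes a four-digit year in parentheses at the end of FILM.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Named groups become the column names of the extracted frame.
# Rows that do not fit "Title (YYYY)" would come back as NaN and are dropped.
parts = df2["FILM"].str.extract(r"^(?P<movie>.*?)\s*\((?P<year>\d{4})\)$")
df2_split = pd.concat([parts, df2["Votes"]], axis=1).dropna(subset=["movie", "year"])
df2_split["year"] = df2_split["year"].astype(int)

# A left merge keeps every df1 row; the "_merge" column added by indicator=True
# says whether a row was found in both frames or only in df1.
merged = df1.merge(df2_split, on=["movie", "year"], how="left", indicator=True)
matched = merged.loc[merged["_merge"] == "both", ["movie", "Votes"]]
unmatched = merged.loc[merged["_merge"] == "left_only", "movie"].tolist()

print(matched)
print(unmatched)
```

Because unmatched rows carry NaN in Votes, that column is returned as float after the left merge. Cast it back with astype(int) on the matched rows if integers are needed.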
(Thanks to @user3483203 for the help here and in the Python chat room.)
Code to recreate the dataframes:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from recent pandas

# columns are separated by two or more spaces, so titles may contain single spaces
dat1 = """movie  year  ratings
108  Mechanic: Resurrection  2016  4.0
206  Warcraft  2016  4.0
106  Max Steel  2016  3.5
107  Me Before You  2016  4.5"""
dat2 = """FILM  Votes
0  Avengers: Age of Ultron (2015)  4170
1  Cinderella (2015)  950
2  Ant-Man (2015)  3000
3  Do You Believe? (2015)  350
4  Max Steel (2016)  560"""
df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')