Join dataframes based on partial string match between columns

Question

Sai*_*mar 2 python join string-matching partial pandas

I have a dataframe of movies, and I want to check whether they are present in another df.

after_h.sample(10, random_state=1)

             movie           year   ratings
108 Mechanic: Resurrection   2016     4.0
206 Warcraft                 2016     4.0
106 Max Steel                2016     3.5
107 Me Before You            2016     4.5

I want to check whether the movies above are present in another df:

              FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000 
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560

I would like something like this as my final output:

    FILM              votes
0  Max Steel           560

Answer 1

smc*_*mci 6

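There are two ways:

Get the row indices for partial matches, using either FILM.str.startswith(title) or FILM.str.contains(title). Either of:

# axis is passed by keyword, which recent pandas versions require
df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(axis=1) ]

# regex=False treats the title as a literal string rather than a regular expression
df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title, regex=False)).any(axis=1) ]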

     movie      year      ratings
106  Max Steel  2016      3.5
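
For illustration, the same filter can be written with a plain list comprehension, which avoids building the intermediate boolean table. This is a minimal sketch rather than the answer's own code: the dataframes are rebuilt inline from the question's sample, and regex=False is an added assumption so that titles containing characters such as ? or ( are matched literally.

```python
import pandas as pd

# Sample data rebuilt from the question (index values kept as shown there).
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# One boolean per df1 row: True if any FILM value contains the title as a literal substring.
mask = [
    df2["FILM"].str.contains(title, regex=False).any()
    for title in df1["movie"]
]

result = df1[mask]
print(result)
```

Each title triggers one pass over df2['FILM'], so the cost grows with len(df1) * len(df2); that is fine for small frames like these.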

Alternatively, you can use merge() if you convert the compound string column df2['FILM'], which has the form movie_title (year), into its two component columns.

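# see code at bottom to recreate your dataframes
# raw string, so the backslashes reach the regex engine unchanged
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# convert the extracted year from string to int so it matches df1['year']
df2['year'] = df2['year'].astype(int)
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]

# merges on the columns the two frames share: 'movie' and 'year'
df2.merge(df1)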
       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5
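
A variant of the merge approach, sketched under assumptions: the named groups in the pattern, the df2_split name, and the how='left' and indicator=True arguments are choices made for this example rather than part of the answer. A left merge keeps every row of df1, and the _merge column shows which titles were found in df2.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

# Named groups become the column names of the extracted frame.
parts = df2["FILM"].str.extract(r"^(?P<movie>.*?)\s*\((?P<year>\d{4})\)$")
parts["year"] = parts["year"].astype(int)

# df2_split is a hypothetical name for df2 with FILM replaced by its two parts.
df2_split = pd.concat([parts, df2["Votes"]], axis=1)

# Left merge: every df1 row survives; _merge is "both" where df2 had a match.
merged = df1.merge(df2_split, on=["movie", "year"], how="left", indicator=True)
matched = merged.loc[merged["_merge"] == "both", ["movie", "Votes"]]
print(merged)
print(matched)
```

Because unmatched rows get a missing Votes value, that column comes back as float in merged; cast it back to int after filtering if integer output matters.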

(Thanks to @user3483203 for the help here and in the Python chat room.)

Code to recreate the dataframes:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in current pandas

dat1 = """movie           year   ratings
108  Mechanic: Resurrection   2016     4.0
206  Warcraft                 2016     4.0
106  Max Steel                2016     3.5
107  Me Before You            2016     4.5"""

dat2 = """FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560"""

# raw strings keep the regex separator free of invalid-escape warnings
df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')

`df2[df1['movie'].apply(lambda movie_title: df2['FILM'].str.contains(movie_title)).any(axis=0)]` (2 upvotes)
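
To get from the df2 side to the output the question asks for, one possible sketch filters df2 with plain substring tests and then strips the trailing (year). The mask construction and the year-stripping pattern are assumptions of this example, not part of the comment above.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame(
    {
        "movie": ["Mechanic: Resurrection", "Warcraft", "Max Steel", "Me Before You"],
        "year": [2016, 2016, 2016, 2016],
        "ratings": [4.0, 4.0, 3.5, 4.5],
    },
    index=[108, 206, 106, 107],
)
df2 = pd.DataFrame(
    {
        "FILM": [
            "Avengers: Age of Ultron (2015)",
            "Cinderella (2015)",
            "Ant-Man (2015)",
            "Do You Believe? (2015)",
            "Max Steel (2016)",
        ],
        "Votes": [4170, 950, 3000, 350, 560],
    }
)

titles = df1["movie"].tolist()

# True for each df2 row whose FILM value contains at least one df1 title.
mask = df2["FILM"].apply(lambda film: any(title in film for title in titles))

result = df2[mask].copy()
# Drop the trailing " (YYYY)" so FILM holds only the title.
result["FILM"] = result["FILM"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)
result = result.reset_index(drop=True)
print(result)
```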
