I'm looking for a way to show the positions at which the strings in two columns differ.
Input:
import pandas as pd
df = pd.DataFrame({'A': ['this is my favourite one', 'my dog is the best'],
                   'B': ['now is my favourite one', 'my doggy is the worst']})
Expected output:
[A-B],[B-A]
0:4 ,0:3 #'this','now'
3:6 ,3:8 #'dog','doggy'
14:18,16:21 #'best','worst'
Right now I only have one approach for finding the differences (but it doesn't work, and I don't know why):
df['A-B']=df.apply(lambda x: x['A'].replace(x['B'], "").strip(),axis=1)
df['B-A']=df.apply(lambda x: x['B'].replace(x['A'], "").strip(),axis=1)
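A likely reason the attempt above returns the strings unchanged: `str.replace(old, new)` only removes exact occurrences of the whole `old` string, and the full text of one column never occurs inside the other. A minimal check on the first row (the variable names here are just for illustration):

```python
a = "this is my favourite one"
b = "now is my favourite one"

# str.replace looks for `b` as one contiguous substring of `a`.
# It is not a word-by-word subtraction, so nothing is removed here.
result = a.replace(b, "").strip()

print(result == a)  # True: `a` comes back unchanged
```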
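Your problem sounds simple. As mentioned in the comments, the best tool for it is probably `difflib.SequenceMatcher.get_matching_blocks`, but I couldn't get that to work. So here is a working solution: it won't be fast, but it produces the output.
First we get the words that differ, then we find the start and end position of each one in its column: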
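def get_diff_words(col1, col2):
    # Pair the words up by position and keep the pairs that differ.
    # zip stops at the shorter list, so both strings should have the same number of words.
    diff_words = [[w1, w2] for w1, w2 in zip(col1, col2) if w1 != w2]
    return diff_words

df['diff_words'] = df.apply(lambda x: get_diff_words(x['A'].split(), x['B'].split()), axis=1)
# str.find returns the first place the word occurs as a substring, so repeated words can be mislocated.
df['pos_A'] = df.apply(lambda x: [f'{x["A"].find(word[0])}:{x["A"].find(word[0])+len(word[0])}' for word in x['diff_words']], axis=1)
df['pos_B'] = df.apply(lambda x: [f'{x["B"].find(word[1])}:{x["B"].find(word[1])+len(word[1])}' for word in x['diff_words']], axis=1)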
Output:
   A                         B                        diff_words                     pos_A         pos_B
0  this is my favourite one  now is my favourite one  [[this, now]]                  [0:4]         [0:3]
1  my dog is the best        my doggy is the worst    [[dog, doggy], [best, worst]]  [3:6, 14:18]  [3:8, 16:21]
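For comparison, the `difflib` route can be made to work if the strings are compared word by word rather than character by character. The following is only a sketch under that assumption: `word_spans` and `diff_positions` are hypothetical helper names, and it uses `SequenceMatcher.get_opcodes` instead of `get_matching_blocks` because the opcodes report the non-matching ranges directly. Each word's offsets are recorded while splitting, so repeated words and strings with different word counts are handled without `str.find`.

```python
import re
from difflib import SequenceMatcher


def word_spans(text):
    """Return (word, start, end) for every whitespace-separated word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def diff_positions(a, b):
    """Return two lists of 'start:end' strings for the words that differ."""
    spans_a, spans_b = word_spans(a), word_spans(b)
    words_a = [w for w, _, _ in spans_a]
    words_b = [w for w, _, _ in spans_b]
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    pos_a, pos_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # words_a[i1:i2] and words_b[j1:j2] are the stretches that differ.
        pos_a.extend(f"{start}:{end}" for _, start, end in spans_a[i1:i2])
        pos_b.extend(f"{start}:{end}" for _, start, end in spans_b[j1:j2])
    return pos_a, pos_b


print(diff_positions("this is my favourite one", "now is my favourite one"))
print(diff_positions("my dog is the best", "my doggy is the worst"))
```

To fill the DataFrame columns, the function can be called once per row, for example by looping over `zip(df['A'], df['B'])` and collecting the two lists into `pos_A` and `pos_B`.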