Jos*_*gts 7 python algorithm fuzzy-comparison lcs levenshtein-distance
假设我有一个大字符串和一个子串数组,当连接时等于大字符串(差异很小).
例如(注意字符串之间的细微差别):
large_str = "hello, this is a long string, that may be made up of multiple
substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
Run Code Online (Sandbox Code Playgroud)
如何最好地对齐字符串以从原始字符串生成一组新的子字符串large_str?例如:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Run Code Online (Sandbox Code Playgroud)
附加信息
用例是从PDF文档中提取的文本的现有分页符中查找原始文本的分页符.从PDF中提取的文本是OCR并且与原始文本相比具有较小的错误,但原始文本没有分页符.目标是准确地分页原文,避免PDF文本的OCR错误.
from difflib import SequenceMatcher
from itertools import accumulate
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
"hello, ths is a lng strin",
", that ay be mad up of multiple",
"subsrings tat aproimately ",
"match the orginal strng"]
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)
match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
if tag == 'delete' or tag == 'insert' or tag == 'replace':
matches[match_index] += large_str[i1:i2]
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
j1 += submatch_len
else:
while j1 < j2:
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
while submatch_len == 0:
match_index += 1
submatch_len = min(sub_str_boundaries[match_index], j2) - j1
matches[match_index] += large_str[i1:i1+submatch_len]
j1 += submatch_len
i1 += submatch_len
print(matches)
Run Code Online (Sandbox Code Playgroud)
输出:
['hello, this is a long string',
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
Run Code Online (Sandbox Code Playgroud)