如何找到大字符串的最佳拟合子序列?

Jos*_*gts 7 python algorithm fuzzy-comparison lcs levenshtein-distance

假设我有一个大字符串和一个子串数组,当连接时等于大字符串(差异很小).

例如(注意字符串之间的细微差别):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]
Run Code Online (Sandbox Code Playgroud)

如何最好地对齐字符串以从原始字符串生成一组新的子字符串large_str?例如:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]
Run Code Online (Sandbox Code Playgroud)

附加信息

用例是从PDF文档中提取的文本的现有分页符中查找原始文本的分页符.从PDF中提取的文本是OCR并且与原始文本相比具有较小的错误,但原始文本没有分页符.目标是准确地分页原文,避免PDF文本的OCR错误.

Ant*_*ton 3

  1. 连接子字符串
  2. 将连接与原始字符串对齐
  3. 跟踪原始字符串中的哪些位置与子字符串之间的边界对齐
  4. 在与这些边界对齐的位置上分割原始字符串

使用 Python 的difflib 的实现:

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)

match_index = 0
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag == 'delete' or tag == 'insert' or tag == 'replace':
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len

print(matches)
Run Code Online (Sandbox Code Playgroud)

输出:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']
Run Code Online (Sandbox Code Playgroud)