没有剪切word-python的最长公共子串

alv*_*vas 3 python string nlp substring longest-substring

鉴于以下内容,我可以找到最长的公共子字符串:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)
Run Code Online (Sandbox Code Playgroud)

[OUT]:

foo bar
Run Code Online (Sandbox Code Playgroud)

但是,我如何确保最长的公共子字符串尊重英语单词边界并且不要删掉一个单词?例如,以下句子:

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)
Run Code Online (Sandbox Code Playgroud)

输出其后续NOT期望的,因为它打破了字kappa从S2:

a foo bar
Run Code Online (Sandbox Code Playgroud)

所需的输出仍然是:

foo bar
Run Code Online (Sandbox Code Playgroud)

我已经尝试了一种ngram方法来获得最长的公共子字符串,但是在没有计算ngrams的情况下还有其他处理字符串的方法吗?(见答案)

Ano*_*oop 8

这太简单了.我用你的代码完成了75%的工作.我首先将句子分成单词,然后将其传递给你的函数以获得最大的公共子字符串(在这种情况下它将是最长的连续单词),所以你的函数给了我['foo','bar'],我加入了该数组的元素,以产生所需的结果.

这是在线工作副本供您测试和验证并摆弄它.

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'
Run Code Online (Sandbox Code Playgroud)

边缘情况

  1. '' 和'?' 如果在最后一个单词和标点符号之间有空格,也会被视为有效单词.如果你不留空间,他们将被算作最后一个字的一部分.在那种情况下'羊'和'羊?' 不再是同一个词了.在调用此类函数之前,由您决定如何处理此类字符.在这种情况下

    import re
    s1 = re.sub('[.?]','', s1)
    s2 = re.sub('[.?]','', s2)

然后像往常一样继续.