Longest common word sequence from more than two strings

sal*_*hin 3 python string

I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).

Example:

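import pandas as pd

# Named `sentences` rather than `list` to avoid shadowing the built-in.
sentences = ['commercial van for movers', 'partial van for movers', 'commercial van for moving']
sents = pd.Series(sentences)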

The solution in this answer works, but it matches partial words and returns the following:

'ial van for mov'

The output should be:

'van for'

I cannot find a way to modify it so that it returns the desired output.

Ste*_*ski 6

The key is to change the search so that it works on sequences of whole words instead of characters.

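from itertools import islice

def is_sublist(source, target):
    # True if the words in source appear as one contiguous run inside target.
    slen = len(source)
    return any(
        all(item1 == item2 for item1, item2 in zip(source, islice(target, i, i + slen)))
        for i in range(len(target) - slen + 1)
    )

def long_substr_by_word(data):
    subseq = []
    # Compare lists of words instead of characters, so a match never splits a word.
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        # Try every word run data_seqs[0][i:i+j] taken from the first sentence.
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0])-i+1):
                # Keep a run only if it beats the best so far and occurs in every sentence.
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)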

Demo:

>>> data = ['commercial van for movers',
...         'partial van for movers',
...         'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
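
For comparison, here is a minimal sketch of another way to do the same whole-word search. It is not part of the answer above. It builds sets of word n-grams and tries the longest lengths first, so it can stop at the first match. The names `common_word_run` and `ngrams` are made up for this illustration. Note two assumptions that differ from the code above: it splits on any whitespace with `split()`, and for a single sentence it returns that whole sentence instead of an empty string.

```python
def common_word_run(sentences):
    """Return a longest run of consecutive words shared by every sentence."""
    token_lists = [s.split() for s in sentences]
    if not token_lists:
        return ''
    # Candidates come from the shortest sentence: no common run can be longer.
    shortest = min(token_lists, key=len)

    def ngrams(tokens, n):
        # All runs of n consecutive words, stored as hashable tuples.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    # Try the longest length first, so the first hit is a longest common run.
    for n in range(len(shortest), 0, -1):
        grams_per_sentence = [ngrams(t, n) for t in token_lists]
        for i in range(len(shortest) - n + 1):
            candidate = tuple(shortest[i:i + n])
            if all(candidate in grams for grams in grams_per_sentence):
                return ' '.join(candidate)
    return ''


data = ['commercial van for movers',
        'partial van for movers',
        'commercial van for moving']
print(common_word_run(data))  # van for
```

Set lookups replace the repeated list scans, which may help when the sentences are long. For short inputs like the ones in the question, either version should be fine.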