I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).
Example:
import pandas as pd

sentences = ['commercial van for movers', 'partial van for movers', 'commercial van for moving']
sents = pd.Series(sentences)
The solution in this answer works, but it matches partial words and returns the following:
'ial van for mov'
The output should be:
'van for'
I can't find a way to modify it so that it returns the desired output.
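To see where the partial words come from, here is a minimal sketch of a character-level search. The function name `long_substr_by_char` is hypothetical and this is not the code from the linked answer; it only assumes that such a search compares raw characters, so a match can start or end in the middle of a word. It also assumes `data` is a non-empty list of strings.

```python
def long_substr_by_char(data):
    """Longest run of characters shared by every string in data (illustrative only)."""
    best = ''
    first = data[0]  # assumes data is non-empty
    for i in range(len(first)):
        # Only try candidates that are longer than the best match so far.
        for j in range(i + len(best) + 1, len(first) + 1):
            candidate = first[i:j]
            if all(candidate in s for s in data):
                best = candidate
            else:
                # If this candidate fails, every longer one from the same start fails too.
                break
    return best


data = ['commercial van for movers',
        'partial van for movers',
        'commercial van for moving']
print(long_substr_by_char(data))
```

Because the comparison unit is a single character, the shared tail of "commercial" and "partial" and the shared head of "movers" and "moving" become part of the result.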
The key is to change the search so that it works on whole-word subsequences.
from itertools import islice

def is_sublist(source, target):
    # True if `source` occurs as a contiguous run of items inside `target`.
    slen = len(source)
    return any(all(item1 == item2 for (item1, item2) in zip(source, islice(target, i, i+slen))) for i in range(len(target) - slen + 1))

def long_substr_by_word(data):
    subseq = []
    # Compare lists of words instead of raw characters.
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0])-i+1):
                # Keep a candidate only if it is longer than the best so far
                # and appears in every sentence.
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)
Demo:
>>> data = ['commercial van for movers',
... 'partial van for movers',
... 'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
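For longer inputs, the same whole-word idea can be sketched with sets of word n-grams, trying the longest candidate length first. This is an alternative sketch, not part of the answer above: the names `word_ngrams` and `longest_common_word_run` are made up for illustration, and it uses `split()` without arguments, so repeated spaces are collapsed (unlike `split(' ')`).

```python
def word_ngrams(words, n):
    """Return the set of all runs of n consecutive words, as tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_word_run(sentences):
    """Longest run of whole words present in every sentence ('' if none)."""
    token_lists = [s.split() for s in sentences]
    if not token_lists:
        return ''
    first = token_lists[0]
    shortest_len = min(len(words) for words in token_lists)
    # Try the longest possible length first and stop at the first hit.
    for n in range(shortest_len, 0, -1):
        common = word_ngrams(first, n)
        for words in token_lists[1:]:
            common &= word_ngrams(words, n)
        if common:
            # Several runs may tie; return the one that appears first in the first sentence.
            for i in range(len(first) - n + 1):
                if tuple(first[i:i + n]) in common:
                    return ' '.join(first[i:i + n])
    return ''


print(longest_common_word_run(['commercial van for movers',
                               'partial van for movers',
                               'commercial van for moving']))
```

Note one behavioral difference: for a single-sentence input this sketch returns the whole sentence, whereas `long_substr_by_word` returns an empty string.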