Is there a way to remove duplicate and consecutive words/phrases in a string?

alv*_*vas 5 python regex string

Is there a way to remove duplicate and consecutive words/phrases in a string? For example:

[IN]: foo foo bar bar foo bar

[OUT]: foo bar foo bar

I've tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'

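What happens when it gets a little more complicated and I want to remove phrases (let's say a phrase can be made up of at most 5 words)? How can that be done? For example:

[IN]: foo bar foo bar foo bar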

[OUT]: foo bar

Another example:

[IN]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .

[OUT]: this is a sentence where phrases duplicate . sentence are not prhases .

sha*_*hmo 13

You can use the re module.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'
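To see why the first pattern leaves one repetition behind on the second input, it can help to look at what a single match actually covers. The sketch below only inspects the first match with re.search; the variable names are illustrative and not part of the answer.

```python
import re

s = 'foo bar foo bar foo bar'
pattern = r'\b(.+)\s+\1\b'

m = re.search(pattern, s)
matched_text = m.group(0)   # the whole span that re.sub would replace
phrase = m.group(1)         # what that span is replaced with
remainder = s[m.end():]     # text after the match; nothing in it repeats

print(repr(matched_text))   # 'foo bar foo bar'
print(repr(phrase))         # 'foo bar'
print(repr(remainder))      # ' foo bar'
```

The pattern consumes exactly two copies of the phrase, so the third copy is left over with nothing to pair with.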

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'

Edit: an addition for the last example. To handle it, you have to keep calling re.sub for as long as there are repeated phrases. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
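The loop above can be wrapped in a small helper. This is only a sketch: the name collapse_repeats is made up here, and it uses re.subn (which returns the new string plus the number of substitutions) so each pass scans the string once instead of calling both re.search and re.sub. Note that it does not enforce the five-word limit mentioned in the question.

```python
import re

# Same pattern as in the answer, compiled once.
REPEAT = re.compile(r'\b(.+)(\s+\1\b)+')

def collapse_repeats(text):
    """Apply the substitution until a pass changes nothing (hypothetical helper)."""
    while True:
        text, n = REPEAT.subn(r'\1', text)
        if n == 0:
            return text

s = ('this is a sentence sentence sentence this is a sentence '
     'where phrases phrases duplicate where phrases duplicate')
print(collapse_repeats(s))  # this is a sentence where phrases duplicate
```

Every successful substitution shortens the string, so the loop always terminates.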


Kir*_*ser 6

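I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups runs of consecutive, repeated items from that list into tuples of (item_value, iterator_of_those_values). Use it here like:

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'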
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
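As a quick illustration of what groupby hands back, this sketch counts the length of each run on the first example from the question. The variable names are illustrative.

```python
from itertools import groupby

words = 'foo foo bar bar foo bar'.split()

# Each (key, group) pair is one run of consecutive equal items.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the keys collapses every run to a single word.
collapsed = ' '.join(key for key, _ in groupby(words))
print(collapsed)  # foo bar foo bar
```

Only adjacent repeats are merged: the later 'foo' and 'bar' form their own runs, which is exactly the behavior the question asks for.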

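So, let's extend that with a function that returns a list with its consecutive duplicate values removed. Note that it expects an iterable of lists (chunks of words), which chain flattens back into a single list of words: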

from itertools import chain, groupby

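def dedupe(lst):
    # groupby collapses runs of equal chunks; item[0] is the chunk itself.
    # chain(*...) flattens the surviving chunks back into one list of words.
    return list(chain(*[item[0] for item in groupby(lst)]))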
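A short check of what dedupe does with different inputs may be useful here. The function is copied from above; the sample inputs are made up for illustration.

```python
from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

# Chunks of two words: the repeated chunk is dropped, then everything is flattened.
pairs = [['foo', 'bar'], ['foo', 'bar'], ['baz']]
print(dedupe(pairs))  # ['foo', 'bar', 'baz']

# One-word chunks behave like plain consecutive-word deduplication.
singles = [[w] for w in ['foo', 'foo', 'bar']]
print(dedupe(singles))  # ['foo', 'bar']

# Caveat: a flat list of strings is flattened into characters.
print(dedupe(['foo', 'foo']))  # ['f', 'o', 'o']
```

The last call shows why the words have to be wrapped into chunks before they are passed in.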

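That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original phrase: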

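def stride(lst, offset, length):
    # Yield the leading partial chunk (the first `offset` items), if any.
    if offset:
        yield lst[:offset]

    # Then yield consecutive chunks of `length` items; the last may be shorter.
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return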
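To make the chunking concrete, here is what stride produces for a five-item list with a chunk length of 2 at both possible offsets. The function is copied from above; the sample list is illustrative.

```python
def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = ['a', 'b', 'c', 'd', 'e']

at_offset_0 = list(stride(words, 0, 2))
at_offset_1 = list(stride(words, 1, 2))

print(at_offset_0)  # [['a', 'b'], ['c', 'd'], ['e']]
print(at_offset_1)  # [['a'], ['b', 'c'], ['d', 'e']]
```

Shifting the offset changes where the chunk boundaries fall, which is what lets a repeated phrase line up with the grid no matter where it starts.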

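Now we're cooking! OK. So our strategy is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 and then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've hit five-word duplicates: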

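def cleanse(list_of_words, max_phrase_length):
    # Try every phrase length from 1 up to the maximum...
    for length in range(1, max_phrase_length + 1):
        # ...and every alignment of the chunk grid for that length.
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words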
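To watch the strategy at work, the sketch below runs the same sequence of (length, offset) passes but also records which passes actually removed something. It is a standalone re-implementation for illustration, not the answer's code: chunks and cleanse_with_trace are hypothetical names, and chunks builds its list eagerly instead of using a generator.

```python
from itertools import groupby

def chunks(words, offset, length):
    """Leading chunk of `offset` items (if any), then chunks of `length` items."""
    head = [words[:offset]] if offset else []
    return head + [words[i:i + length] for i in range(offset, len(words), length)]

def cleanse_with_trace(words, max_phrase_length):
    """Same pass order as cleanse, plus a record of the passes that changed the list."""
    changed = []
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            before = len(words)
            # Keep one chunk per run of equal chunks, then flatten back to words.
            words = [w for chunk, _ in groupby(chunks(words, offset, length))
                     for w in chunk]
            if len(words) != before:
                changed.append((length, offset))
    return words, changed

a = ('this is a sentence sentence sentence this is a sentence where phrases '
     'phrases duplicate where phrases duplicate . sentence are not prhases .')

result, changed = cleanse_with_trace(a.split(), 5)
print(changed)           # [(1, 0), (3, 2), (4, 0)]
print(' '.join(result))  # this is a sentence where phrases duplicate . sentence are not prhases .
```

On this input only three passes do any work: single words, three-word chunks at offset 2, and four-word chunks at offset 0. With max_phrase_length set to 1, only consecutive single words are collapsed, so 'foo foo bar bar foo bar' becomes 'foo bar foo bar' rather than 'foo bar'.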

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)  # prints True