是否有快速算法来删除字符串中重复的子串?

hao*_*mao 9 python string algorithm

有一个像它的字符串

dxabcabcyyyydxycxcxz
Run Code Online (Sandbox Code Playgroud)

我想将它合并到

dxabcydxycxz
Run Code Online (Sandbox Code Playgroud)

其他例子: ddxddx - > dxdx,abbab - > abab.

规则是:

if (adjacent and same): merge

# Such as 'abc',they are same and , so I will delete one of them .
# Although 'dx' is same as 'dx',they are nonadjacent,so I do not delete any of them
# If one character has been deleted, we don't delete any sub-string include it 
Run Code Online (Sandbox Code Playgroud)

我在python的代码中完成了它,但是当它在一个长字符串中时它很慢.

# original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] *str_len #Use a list to mark which char is deleted

# enumerate the size of sub-str
for i in range(1,str_len):
    # enumerate the begin of the sub-str
    for j in range(0, str_len):
        offset = 2 #the size of sub-str + 1
        current_sub_str = mystr[j:j+i]
        s_begin = j+i*(offset-1)
        s_end = j+(i*offset)
        # delete all of the same char
        while((j+(i*offset) <= str_len) and current_sub_str == mystr[s_begin:s_end]
              and 0  not in vis[s_begin:s_end] and 0  not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin) #if I deleted it ,mark it as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + (i * offset)

res = []
for i in range(0,str_len):
    if(vis[i]!=0): res.append(mystr[i])

print "".join(res)
Run Code Online (Sandbox Code Playgroud)

有没有更快的方法来解决它?

2017年4月29日更新

对不起,它似乎是一个XY问题.另一方面,它可能不是.有内容

我正在编写一个网络蜘蛛编码,并获得了许多像这样的标记路径

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
Run Code Online (Sandbox Code Playgroud)

正如你所看到的,有一些'tag-path'以同样的方式做了,所以我想折叠它们以发现是否有任何其他'tag-path具有相同的结构.折叠后,我得到这样的"标记路径".

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
Run Code Online (Sandbox Code Playgroud)

这只是我的想法,我不知道这样做是否合适.(尝试之后,我选择了另一种方式来做到这一点.

然而,有一个有趣的问题,如ACM问题.

因此,我将一个"标记路径"简化为一个角色并寻求帮助.因为我自己并没有快速做到这一点.实际上,这个问题有许多我不介意的角落案例,感谢大家帮助我完成它.

谢谢大家.

jas*_*per 16

看看正则表达的力量:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'
Run Code Online (Sandbox Code Playgroud)

这将查找一个包含1个或多个任意字符的序列(.+?)(作为非贪婪匹配,以便首先尝试更短的序列),然后重复匹配序列的一次或多次\1+,并将其全部替换为匹配的序列\1.


sil*_*lel 0

这可以是一个开始:

for i in range(len(string)):
    for j in range(i + 1, len(string)):
        while string[i:j] == string[j:j + j - i]:
            string = string[:j] + string[j + j - i:]
Run Code Online (Sandbox Code Playgroud)

所提供示例的结果:

abbab  -> abab
ddxddx -> dxdx
abcabcabc -> abc
dxabcabcyyyydxycxcxz -> dxabcydxycxz
Run Code Online (Sandbox Code Playgroud)