删除列表中的子串,其复杂度优于O(n ^ 2)

Question

删除列表中的子串,其复杂度优于O(n ^ 2)

Dal*_*tor 12 python string algorithm list

我有一个包含许多单词(100.000+)的列表,我想要做的是删除列表中每个单词的所有子串.

所以为简单起见,让我们假设我有以下列表:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

Run Code Online (Sandbox Code Playgroud)

以下输出是所需的:

['Hello', 'Apple', 'Banana', 'Peter']

Run Code Online (Sandbox Code Playgroud)

'Hell' 被删除,因为它是一个子串 'Hello'
'Ban' 被删除,因为它是一个子串 'Banana'
'P' 被删除,因为它是一个子串 'Peter'
'e'被删除,因为它是一个字符串'Hello','Hell', 'Apple',等等.

我做了什么

这是我的代码,但我想知道是否有比这些嵌套理解更有效的方法.

to_remove = [x for x in words for y in words if x != y and x in y]
output = [x for x in words if x not in to_remove]

Run Code Online (Sandbox Code Playgroud)

如何提高性能？我应该用regex吗？

Answer 1

bti*_*lly 3

@wim 是正确的。

给定固定长度的字母表，以下算法与文本的总长度呈线性关系。如果字母表的大小是无限的，那么它就会是O(n log(n))无限的。无论哪种方式都比好O(n^2)。

Create an empty suffix tree T.
Create an empty list filtered_words
For word in words:
    if word not in T:
        Build suffix tree S for word (using Ukkonen's algorithm)
        Merge S into T
        append word to filtered_words

Run Code Online (Sandbox Code Playgroud)

归档时间：	7 年，10 月前
查看次数：	424 次
最近记录：	7 年，10 月前