过滤字符串列表,忽略其他项的子字符串

Kat*_*ova 18 python string algorithm

如何筛选包含字符串和子字符串的列表以仅返回最长的字符串.(如果列表中的任何项是另一项的子字符串,则只返回较长的字符串.)

我有这个功能.有更快的方法吗?

def filterSublist(lst):
    uniq = lst
    for elem in lst:
        uniq = [x for x in uniq if (x == elem) or (x not in elem)]
    return uniq

lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)

> ['abc', 'd', 'xyz']
> Function time: 0.000011
Run Code Online (Sandbox Code Playgroud)

Nik*_* B. 8

一个简单的二次时间解决方案是这样的:

res = []
n = len(lst)
for i in xrange(n):
    if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
        res.append(lst[i])
Run Code Online (Sandbox Code Playgroud)

但我们可以做得更好:

我们$是不会出现在任何你的字符串,并有比你的所有实际字符的较低值的字符.

让我们S所有字符串串联$起来.在你的例子中,S = a$abc$b$d$xy$xyz.

你可以建立的后缀数组S线性时间.您还可以使用我在另一个答案中描述的更简单的O(n log ^ 2 n)构造算法.

现在对于每个字符串lst,检查它是否恰好在后缀数组中出现一次.您可以执行两次二进制搜索以查找子字符串的位置,它们在后缀数组中形成连续范围.如果字符串出现多次,则将其删除.

通过预先计算LCP信息,也可以在线性时间内完成.

示例O(n log ^ 2 n)实现,改编自我的后缀数组答案:

def findFirst(lo, hi, pred):
  """ Find the first i in range(lo, hi) with pred(i) == True.
  Requires pred to be a monotone. If there is no such i, return hi. """
  while lo < hi:
    mid = (lo + hi) // 2
    if pred(mid): hi = mid;
    else: lo = mid + 1
  return lo

# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
  def __init__(self, s):
    """ build the suffix array of s in O(n log^2 n) where n = len(s). """
    n = len(s)
    log2 = 0
    while (1<<log2) < n:
      log2 += 1
    rank = [[0]*n for _ in xrange(log2)]
    for i in xrange(n):
      rank[0][i] = s[i]
    L = [0]*n
    for step in xrange(1, log2):
      length = 1 << step
      for i in xrange(n):
        L[i] = (rank[step - 1][i],
                rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
                i)
      L.sort()
      for i in xrange(n):
        rank[step][L[i][2]] = \
          rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
    self.log2 = log2
    self.rank = rank
    self.sa = [l[2] for l in L]
    self.s = s
    self.rev = [0]*n
    for i, j in enumerate(self.sa):
      self.rev[j] = i

  def lcp(self, x, y):
    """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """
    n = len(self.s)
    if x == y:
      return n - x
    ret = 0
    for k in xrange(self.log2 - 1, -1, -1):
      if x >= n or y >= n:
        break
      if self.rank[k][x] == self.rank[k][y]:
        x += 1<<k
        y += 1<<k
        ret += 1<<k
    return ret

  def compareSubstrings(self, x, lx, y, ly):
    """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
    l = min((self.lcp(x, y), lx, ly))
    if l == lx == ly: return 0
    if l == lx: return -1
    if l == ly: return 1
    return cmp(self.s[x + l], self.s[y + l])

  def count(self, x, l):
    """ count occurences of substring s[x:x+l] in O(log n). """
    n = len(self.s)
    cs = self.compareSubstrings
    lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
    hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
    return hi - lo

  def debug(self):
    """ print the suffix array for debugging purposes. """
    for i, j in enumerate(self.sa):
      print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"

def filterSublist(lst):
  splitter = "\x00"
  s = splitter.join(lst) + splitter
  sa = SuffixArray(s)
  res = []
  offset = 0
  for x in lst:
    if sa.count(offset, len(x)) == 1:
      res.append(x)
    offset += len(x) + 1
  return res
Run Code Online (Sandbox Code Playgroud)

然而,解释开销可能导致这比O(n ^ 2)方法慢,除非S非常大(大约10 ^ 5个字符或更多).

  • @KatrinaMalakhova那你原来的代码怎么了?你怎么决定它太慢了?此外,如果您想比较不同方法的性能,请使用实际数据(而不是包含10个元素的列表,它不会告诉您n = 10000的情况) (3认同)