智能过滤器与python

Dom*_*ane 3 python optimization text filter python-2.4

嗨,
我需要过滤掉所有不包含来自巨大"必要"列表的符号的行,示例代码:

def any_it(iterable):
      for element in iterable:
          if element: return True
      return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
Run Code Online (Sandbox Code Playgroud)

我有python 2.4,所以我不能使用内置any().
我等了很长时间才进行过滤,但有没有办法对其进行优化?例如,第1行和第4行包含"RED .."模式,如果我们发现"RED .."模式没问题,我们可以跳过搜索10000个成员列表中第4行相同的模式吗?
还有其他方法可以优化过滤吗?
谢谢.
... 编辑 ...
UPD:在这篇文章的评论中查看真实的示例数据.我也有兴趣用"水果"来分类结果.谢谢!
...... 结束编辑 ......

Zac*_*sch 5

如果您将necessary列表组织为trie,那么您可以查看该trie以检查是否fruit以有效前缀开头.这应该比比较fruit每个前缀更快.

例如(仅经过温和测试):

import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html                
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
Run Code Online (Sandbox Code Playgroud)

这改变了你的算法从O(N * k),其中N是的元件的数量necessaryk是的长度fruit,只是O(k)(更多或更少).它确实需要更多的内存,但这可能是一个值得你的案例权衡.