Simplify/optimize a chain of for-loops

Question

alv*_*vas 16 python reduce dictionary filter nested-loops

I have a chain of for-loops that works on an original list of strings and gradually filters the list as it goes down the chain, e.g.:

import re

# Regex to check that a cap exist in string.
pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

def check_no_caps(s):
    return None if re.match(pattern1, s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s,x):
    # s and x are both str.
    return None if s not in vocab else s+x

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps
slist = [check_no_caps(s) for s in slist]
# filter no more than 5.
slist = [check_nomorethan_five(s) for s in slist if s is not None]
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]

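The above is just an example; in reality my string-manipulating functions are more complicated, but they do return either the original string or a manipulated one.

I could use generators instead of lists and do something like this: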

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps and no more than 5.
slist = (check_nomorethan_five(s1)
         for s1 in (check_no_caps(s0) for s0 in slist)
         if s1 is not None)
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]

Or in one crazy nested generator:

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
slist = (check_in_vocab_plus_x(s2, str(i))
         for i, s2 in enumerate(
             check_nomorethan_five(s1)
             for s1 in (check_no_caps(s0) for s0 in slist)
             if s1 is not None)
         if s2 is not None)

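There must be a better way. Is there a way to make the chain of for-loops faster?

Is there a way to do it with map, reduce and filter? Would that be faster?

Imagine that my original slist is very, very large, on the order of billions of items. And my functions are not as simple as the ones above; they do some computation and manage about 1,000 calls per second.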

Answer 1

Ali*_*ssa 7

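First of all, look at the overall process you run on your strings. You take some strings and apply certain functions to each of them. Then you clean up the list. Let's assume for a moment that all the functions you apply to the strings run in constant time (that's not true, but for now it doesn't matter). In your solution you iterate through the list applying one function (that's O(N)). Then you take the next function and iterate again (another O(N)), and so on. So the obvious way to speed things up is to reduce the number of loops. That's not so difficult.

The next thing to do is to try to optimize your functions. For example, you use a regexp to check whether a string has capital letters, but there is str.islower (returns true if all cased characters in the string are lowercase and there is at least one cased character, false otherwise).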
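
To make the behaviour of str.islower concrete, here is a small sketch. The sample strings are made up for illustration. Note that str.islower is not a drop-in equivalent of pattern1 from the question; it only reports whether a string has cased characters and all of them are lowercase.

```python
# Sample strings are illustrative, not taken from the question.
samples = ['dog', 'Dog', 'dog1', '123', '']

# islower() is True only if the string contains at least one cased
# character and every cased character is lowercase.
results = {s: s.islower() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

Strings without any cased characters, such as '123' or the empty string, come out as False, which may or may not be what a given filter wants.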

So, this is a first attempt to simplify and speed up your code:

vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

# note that first two functions can be combined in one
def no_caps_and_length(s):
    return s if s.islower() and len(s)<=5 else None

# this one is more complicated and cannot be merged with first two
# (not really, but as you say, some functions are rather complicated)
def check_in_vocab_plus_x(s,x):
    # s and x are both str.
    return None if s not in vocab else s+x

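# now let's introduce a function that would pipe a string through all functions you need
def pipe_through_funcs(s, x=''):
    # x is the suffix handed to check_in_vocab_plus_x (empty by default);
    # wrapping that call in a lambda gives every step the same one-argument shape
    funcs = [no_caps_and_length, lambda t: check_in_vocab_plus_x(t, x)]
    for func in funcs:
        if s is None: return s
        s = func(s)
    return s

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# final step (x is taken to be the index in the original list, which is an assumption):
piped = (pipe_through_funcs(s, str(i)) for i, s in enumerate(slist))
slist = [s for s in piped if s is not None]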

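There is probably one more thing that could be improved. Currently you iterate through the list modifying elements, and only then filter it. But it might be faster to filter first and then modify. Like this: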

vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.

# make a function that does all the checks for filtering
# you can make a big expression and return its result,
# or a sequence of ifs, or anything in-between,
# it won't affect performance,
# but make sure you put cheaper checks first
def my_filter(s):
    if len(s)>5: return False
    if not s.islower(): return False
    if s not in vocab: return False
    # maybe more checks here
    return True

# now we need modifying function
# there is a concern: if you need indices as they were in original list
# you might need to think of some way to pass them here
# as you iterate through filtered out list
def modify(s,x):
    s += x
    # maybe more actions
    return s

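slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# final step: modify takes two arguments, so pair each surviving string
# with its index in the filtered sequence
slist = [modify(s, str(i)) for i, s in enumerate(filter(my_filter, slist))]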
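
If the suffix has to be the item's position in the original list rather than in the filtered one, one option is to enumerate before filtering. This is a minimal sketch, assuming a condensed my_filter, a set-based vocab, and that the index is appended as a string; none of these choices come from the question.

```python
vocab = {'dog', 'lazy', 'the', 'fly'}  # a set, for fast membership tests

def my_filter(s):
    # Same checks as in the filter above, cheapest first.
    return len(s) <= 5 and s.islower() and s in vocab

def modify(s, x):
    return s + x

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']

# Enumerate first so each string carries its original position,
# then test the string and modify only the survivors.
result = [modify(s, str(i)) for i, s in enumerate(slist) if my_filter(s)]
print(result)  # ['the0', 'dog1', 'the4', 'fly5']
```

The list is still traversed only once; the difference from enumerating the filtered sequence is only in which index each string receives.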

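Also note that in some cases generators, map and the like can be faster, but not always. I believe that if the number of items you filter out is substantial, a for-loop with append might be faster. I won't vouch that it will be faster, but you could try something like this: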

initial_list = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
new_list = []
for s in initial_list:
    processed = pipe_through_funcs(s)
    if processed is not None:
        new_list.append(processed)
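
Whether the explicit loop wins depends on the data, so it is worth measuring. Here is a rough sketch using timeit, with a stand-in process function (hypothetical, not the pipeline above) and made-up input. The timings will vary by machine and Python version.

```python
import timeit

vocab = {'dog', 'lazy', 'the', 'fly'}

def process(s):
    # Stand-in for the real pipeline: returns a string or None.
    return s if len(s) <= 5 and s.islower() and s in vocab else None

data = ['the', 'dog', 'jumps', 'over', 'the', 'fly'] * 10000

def with_filter_map():
    return list(filter(lambda a: a is not None, map(process, data)))

def with_loop():
    out = []
    for s in data:
        processed = process(s)
        if processed is not None:
            out.append(processed)
    return out

# Both variants must agree before the timings mean anything.
assert with_filter_map() == with_loop()

for fn in (with_filter_map, with_loop):
    print(fn.__name__, timeit.timeit(fn, number=5))
```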

Answer 2

Kár*_*agy 3

If you make your transformation functions uniform, then you can do something like this:

import random
slist = []
for i in range(0,100):
    slist.append(random.randint(0,1000))

# Unified functions which have the same function description
# x is the value
# i is the counter from enumerate
def add(x, i):
    return x + 2

def replace(x, i):
    return int(str(x).replace('2', str(i)))

# Specifying your pipelines as a list of tuples 
# Where tuple is (filter function, transformer function)
_pipeline = [
    (lambda s: True, add),
    (lambda s: s % 2 == 0, replace),
]

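# Execute your pipeline
for _filter, _fn in _pipeline:
    # enumerate yields (i, x) while the functions take (x, i), so swap the order;
    # fn=_fn binds the current function now, because map is lazy in Python 3
    slist = map(lambda item, fn=_fn: fn(item[1], item[0]), enumerate(filter(_filter, slist)))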

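This code works with both Python 2 and Python 3. The difference is that in Python 3 map, filter and enumerate each return a lazy iterator, so nothing is executed until it is needed. As a result, you effectively make a single pass over your list.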

print(slist)
<map object at 0x7f92b8315fd0>
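
The single-pass behaviour can be observed directly. In this sketch (Python 3; the trace list and the two helper functions are made up for illustration) nothing runs until the result is consumed, and each item goes through both stages before the next item is touched.

```python
trace = []

def is_even(x):
    trace.append(('filter', x))
    return x % 2 == 0

def double(x):
    trace.append(('map', x))
    return x * 2

lazy = map(double, filter(is_even, [1, 2, 3, 4]))
assert trace == []   # nothing has been evaluated yet

result = list(lazy)  # consuming the iterator drives both stages
print(result)        # [4, 8]
print(trace)
```

The trace shows the filter and map calls interleaved per item rather than one full filter pass followed by one full map pass.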

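However, as long as everything fits in memory, iterating once or several times won't make much difference, because the same number of transformations and filters has to be executed whichever way you iterate. So, to improve your code, try to make your filter and transformation functions as fast as possible.

For example, as @Rawing mentioned, having vocab as a set instead of a list would make a big difference, especially with a large number of items.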
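
The effect of switching vocab to a set is easy to check. Here is a sketch with a synthetic vocabulary (the sizes and words are made up); list membership scans the items one by one, while set membership is a hash lookup.

```python
import timeit

words = ['w%d' % i for i in range(10000)]
vocab_list = words
vocab_set = set(words)

probe = 'w9999'  # worst case for the list: the last element

# Both containers must give the same answer.
assert (probe in vocab_list) == (probe in vocab_set)

t_list = timeit.timeit(lambda: probe in vocab_list, number=1000)
t_set = timeit.timeit(lambda: probe in vocab_set, number=1000)
print('list: %.5fs  set: %.5fs' % (t_list, t_set))
```

Building the set costs one pass over the vocabulary, so it pays off when the same vocab is queried many times, as it is in the question.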

Archived:	9 years, 10 months ago
Views:	951
Last recorded:	9 years, 10 months ago