Is there a generator version of `string.split()` in Python?

Man*_*dan 107 python string generator

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

nin*_*cko 68

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

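Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (not a list comprehension, which would have allocated extra memory). This did not result in a noticeable growth in memory (that is, if memory grew at all, it grew by far less than the 1 GB string).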
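A rough way to repeat that kind of measurement on a smaller scale is the standard-library tracemalloc module. This is only a sketch under assumptions: the helper names peak_bytes, count_lazily and count_eagerly are made up for illustration, the test string is far smaller than 1 GB, and tracemalloc only sees allocations made through Python's allocators.

```python
import re
import tracemalloc

def split_iter(string):
    # Same generator as in the answer: only one match object is alive at a time.
    return (m.group(0) for m in re.finditer(r"[A-Za-z']+", string))

def peak_bytes(consume, text):
    """Return the peak number of traced bytes allocated while consume(text) runs."""
    tracemalloc.start()
    try:
        consume(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def count_lazily(text):
    # Plain for loop, so no list of words is ever built.
    n = 0
    for _ in split_iter(text):
        n += 1
    return n

def count_eagerly(text):
    # str.split() materializes every word before anything is counted.
    return len(text.split())

text = "word " * 200_000  # built before tracing starts, so it is not counted
lazy_peak = peak_bytes(count_lazily, text)
eager_peak = peak_bytes(count_eagerly, text)
print(lazy_peak, eager_peak)
```

On a typical CPython build the lazy peak should come out far below the eager one, but the exact numbers depend on the interpreter version.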

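  • Excellent! I had forgotten about finditer. If anyone is interested in doing something like splitlines, I would suggest this RE: '(.*\n|.+$)'. str.splitlines chops off the trailing newline though (something I don't really like...); if you want to replicate that part of the behavior, you can use grouping: (m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s)). PS: I guess the outer parentheses in the RE are not needed; I just feel uneasy about using | without parentheses :P (4 upvotes)
  • What about performance? Regex matching should be slower than an ordinary search. (3 upvotes)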
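The line-splitting idea from the first comment can be sketched as below. The name iter_lines is hypothetical, and the sketch assumes "\n" is the only line terminator (str.splitlines() recognizes several more). It also tests the groups against None instead of using `or`, because an empty line matches as '' and would otherwise come out as None.

```python
import re

# Assumption: only "\n" ends a line; "\r\n" and other separators are not handled.
_LINE_RE = re.compile(r"(.*)\n|(.+)$")

def iter_lines(s):
    """Lazily yield the lines of s without their trailing newline."""
    for m in _LINE_RE.finditer(s):
        # Group 1 is '' (falsy) for an empty line, so compare with None explicitly.
        yield m.group(1) if m.group(1) is not None else m.group(2)

print(list(iter_lines("a\n\nb")))  # ['a', '', 'b']
```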

Eli*_*ins 13

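The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids heavy memory use, and avoids the overhead of a regular expression when one is not needed.

[Edit 2016-8-2: updated this to optionally support regex separators]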

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used just as you would expect...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

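While there is a little bit of cost for seeking within the string each time find() or slicing is performed, it should be minimal, since strings are represented as contiguous arrays in memory.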
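One practical payoff of the str.find() approach is that the caller can stop early. The sketch below uses a stripped-down, hypothetical find_split (plain-string separators only, not the full isplit above) together with itertools.islice to take the first two fields of a long record without scanning the rest.

```python
from itertools import islice

def find_split(source, sep):
    """Lazily yield the pieces of source separated by the non-empty string sep."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            # No separator left: the remainder is the final piece.
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)

record = "id=7;name=Ada;" + "x;" * 1_000_000
# Only the first two separators are ever searched for.
first_two = list(islice(find_split(record, ";"), 2))
print(first_two)  # ['id=7', 'name=Ada']
```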


Ber*_*ohn 9

Here is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

Edit: corrected the handling of surrounding whitespace when no separator is given.

  • Why is this better than `re.finditer`? (11 upvotes)
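For comparison with the comment's question, the same splitting rules can be written on top of finditer. This is a sketch, not the answer author's code; the name itersplit_finditer is made up, and an empty sep is not handled (str.split raises ValueError for it).

```python
import re

def itersplit_finditer(s, sep=None):
    """Lazily split s like str.split(sep), driving the loop with finditer."""
    pattern = re.compile(r"\s+" if sep is None else re.escape(sep))
    pos = 0
    for m in pattern.finditer(s):
        # With sep=None, empty pieces (leading or repeated whitespace) are skipped.
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()
    # Emit the tail; with sep=None an empty tail is dropped, as str.split() does.
    if pos < len(s) or sep is not None:
        yield s[pos:]

print(list(itersplit_finditer(" Good evening, world! ")))
```

Both variants hold only one match object at a time, so the choice between them is mostly a matter of taste.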

c z*_*c z 8

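I did some performance testing on the various methods proposed (I won't repeat them here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
  • re.finditer (ninjagecko's answer) = 0.698872097000276
  • str.find (one of Eli Collins's answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A†

† The recursive answers (string.split with maxsplit=1) failed to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but I can't see a use case for short strings, where memory isn't an issue anyway.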

Tested using timeit on:

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question: why is string.split so much faster despite its memory usage?

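  • This is because memory is slower than the CPU, and in this case the list is loaded in chunks whereas all the others are loaded element by element. On the same note, many academics will tell you that linked lists are faster and have lower complexity, while your computer will often be faster with arrays, which it finds easier to optimise. **You can't assume one option is faster than another, test it!** +1 for testing. (2 upvotes)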

Ole*_*pin 6

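Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function: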


str_split(s, *delims, empty=None)

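Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.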

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

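When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.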

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''
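The default multi-delimiter behavior (split on the longest runs of delimiter characters and drop empty parts) can also be approximated with a negated character class and re.finditer. This is an alternative sketch, not part of the answer's implementation; split_on_chars is a hypothetical name and takes the delimiters as one string rather than as separate arguments.

```python
import re

def split_on_chars(s, delims):
    """Lazily yield the non-empty runs of s that contain no character from delims."""
    if not delims:
        raise ValueError("delims must not be empty")
    # re.escape keeps characters such as ']', '^' and '-' literal inside the class.
    pattern = "[^" + re.escape(delims) + "]+"
    return (m.group(0) for m in re.finditer(pattern, s))

print(list(split_on_chars("aaa, bb : c;", " ,:;")))  # ['aaa', 'bb', 'c']
```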

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except that this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'
The implementation:

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
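    if len(delims)==0:
        delims = string.whitespace
    # Validate before normalizing: after ''.join() every element is a single
    # character, so the length check would never fire.
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)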
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

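This function works in Python 3, and a simple, though quite ugly, fix can be applied to make it work in both Python 2 and 3. The first lines of the function should be changed to: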

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
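One caveat with kwargs.get('empty') is that a misspelled keyword is silently ignored, whereas a real keyword-only parameter would raise TypeError. A slightly stricter emulation might look like the sketch below; split_args is a hypothetical stand-in that only returns its parsed arguments, not the real str_split.

```python
def split_args(s, *delims, **kwargs):
    """Hypothetical helper: emulate the keyword-only argument `empty`."""
    # pop() removes the recognized keyword so that leftovers can be detected.
    empty = kwargs.pop('empty', None)
    if kwargs:
        # Mirror what a keyword-only signature does when given an unknown name.
        raise TypeError("unexpected keyword argument(s): " + ", ".join(sorted(kwargs)))
    return s, delims, empty

print(split_args('a,b', ',', empty=False))  # ('a,b', (',',), False)
```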