Python regex parse stream

Question

Is there any way to use regex matching on a stream in Python? Something like:

reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))

I don't want to do this by getting the value of the whole string. I want to know whether there is any way to match a regex against a stream on the fly.

Answer 1

Mar*_*ner 16

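I had the same problem. My first thought was to implement a LazyString class that acts like a string but reads only as much data from the stream as is currently needed. I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed.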

This didn't work out (I got a "TypeError: expected string or buffer" from re.match), so I looked into the implementation of the re module in the standard library.

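Unfortunately, using regexes on a stream does not seem to be possible. The core of the module is implemented in C, and that implementation expects the whole input to be in memory at once (I guess mainly for performance reasons). There seems to be no easy way to fix this.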
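The failure described above is easy to reproduce. The following is a minimal sketch for Python 3, using io.StringIO in place of the StringIO module from the question; the variable names are only illustrative.

```python
import io
import re

stream = io.StringIO('aa aaa aa')
pattern = re.compile(r'\w+')

try:
    # re only accepts str or bytes-like objects, not file-like objects.
    pattern.match(stream)
    accepted_stream = True
except TypeError:
    accepted_stream = False

# Reading the data out first works, but that is exactly the
# "whole string in memory" approach the question wants to avoid.
first_word = pattern.match(stream.getvalue()).group()

print(accepted_stream, first_word)
```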

I also had a look at PLY (Python Lex-Yacc), but its lexer uses re internally, so this wouldn't solve the issue.

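One possibility could be to use ANTLR, which supports a Python backend. It constructs the lexer in pure Python code and seems to be able to operate on input streams. Since the problem is not that important for me (I do not expect my input to be very large), I will probably not investigate this further, but it might be worth a look.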

Answer 2

use*_*ica 5

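In the specific case of a file, if you can memory-map the file with mmap and you're working with bytestrings instead of Unicode, you can feed the memory-mapped file to re as if it were a bytestring and it will just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.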

If you can do this, it's a really nice option. If you can't, you have to turn to messier options.
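As a rough sketch of the mmap approach, assuming Python 3 and a small temporary file created only for the demonstration (the file contents and variable names are made up):

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample file, written only so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'aa aaa aa')
    path = tmp.name

try:
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; it must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The pattern must be a bytes pattern, because the mapping
            # exposes raw bytes rather than decoded text.
            words = re.findall(rb'\w+', mm)
finally:
    os.remove(path)

print(words)
```

The operating system pages the file in on demand, so re sees one contiguous buffer without the program reading the file into a Python string first.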

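The 3rd-party regex module (not re) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches would be tricky to get right, and I don't know whether it would interact correctly with the other advanced features that regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.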

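If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could become a match if the data were extended: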

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

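It will report a partial match instead of a complete match if more data could change the match result, so, for example, regex.search(r'[\s\S]*', anything, partial=True) will always be a partial match.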
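To make the difference between partial, complete, and impossible matches concrete, here is a small sketch. It assumes the third-party regex module is installed (pip install regex), and the four-digit pattern is only an illustration.

```python
import regex  # third-party module, not the standard library's re

four_digits = regex.compile(r'\d{4}')

# Two digits so far: more input could still complete the match.
incomplete = four_digits.fullmatch('12', partial=True)

# All four digits present: this is reported as a complete match.
complete = four_digits.fullmatch('1234', partial=True)

# A letter can never be extended into four digits: no match at all.
impossible = four_digits.fullmatch('a', partial=True)

print(incomplete.partial, complete.partial, impossible)
```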

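With this, you can keep a sliding window of data to match against, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object: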

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^, \b, \B, lookbehinds, or backreferences, and might
    #   not play well with advanced features I'm not aware of that regex
    #   provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
