In PyParsing, how do I specify that a Word must not equal a given literal?

Question

I am trying to parse APK download pages from http://www.apkmirror.com, such as http://www.apkmirror.com/apk/google-inc/gmail/gmail-7-3-26-152772569-release-release/gmail-7-3-26-152772569-release-android-apk-download/. The "APK details" section typically has the following structure:

I want to parse "17329196" as the version_code, "arm" as the architecture, and "com.skype.m2" as the package. Sometimes, however, the line with the architecture is missing, as shown below:

So far, using Scrapy with the selector

apk_details = response.xpath('//*[@title="APK details"]/following-sibling::*[@class="appspec-value"]//text()').extract()

I have been able to extract a list containing the "lines" shown above. I am trying to write a function, parse_apk_details, such that the following tests pass:

import pytest

def test_parse_apk_details_with_architecture():
    apk_details = [u'Version: 3.0.38_ww (4030038)',
                   u'arm ',
                   u'Package: com.lenovo.anyshare.gps',
                   u'\n',
                   u'2,239 downloads ']

    version_code, architecture, package = parse_apk_details(apk_details)

    assert version_code == 4030038
    assert architecture == "arm"
    assert package == "com.lenovo.anyshare.gps"

@pytest.mark.skip(reason="This does not work yet, because 'Package:' is interpreted by the parser as the architecture.")
def test_parse_apk_details_without_architecture():
    apk_details = [u'Version: 3.0.38_ww (4030038)',
                   u'Package: com.lenovo.anyshare.gps',
                   u'\n',
                   u'2,239 downloads ']

    version_code, architecture, package = parse_apk_details(apk_details)

    assert version_code == 4030038
    assert package == "com.lenovo.anyshare.gps"


if __name__ == "__main__":
    pytest.main([__file__])

As noted, however, the second test does not pass yet. Here is the function so far:

from pyparsing import Word, printables, nums, Optional

def parse_apk_details(apk_details):
    apk_details = "\n".join(apk_details)    # The newline character is ignored by PyParsing (by default)
    version_name = Word(printables)         # The version name can consist of all printable, non-whitespace characters
    version_code = Word(nums)               # The version code is expected to be an integer
    architecture = Word(printables)
    package = Word(printables)

    expression = "Version:" + version_name + "(" + version_code("version_code") + ")" + Optional(architecture("architecture")) + "Package:" + package("package")
    result = expression.parseString(apk_details)

    return int(result.get("version_code")), result.get("architecture"), result.get("package")

The error I get when I try to run the second test is:

ParseException: Expected "Package:" (at char 38), (line:2, col:10)

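I believe what is happening is that the word "Package:" is being consumed as the architecture. One way to fix this would be to change the line architecture = Word(printables) to something like (in pseudocode) architecture = Word(printables) + ~"Package:", indicating that it can be anything made up of printable characters except the word "Package:".

How can I ensure that architecture is only parsed when it is not a specific word, namely "Package:"? (I would also be interested in alternative Scrapy-based solutions to the original problem.)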
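One way to check this diagnosis in isolation is a minimal sketch like the following. The sample string and the variable names `sample`, `consumed` and `failed` are illustrative only, not part of the function above.

```python
from pyparsing import Optional, ParseException, Word, printables

sample = "Package: com.lenovo.anyshare.gps"

# Word(printables) stops only at whitespace, so it happily matches "Package:".
consumed = Word(printables).parseString(sample)[0]

# Same shape as the tail of the original expression: the optional
# architecture swallows "Package:", so the literal that follows cannot match.
expression = Optional(Word(printables)("architecture")) + "Package:"
try:
    expression.parseString(sample)
    failed = False
except ParseException:
    failed = True

print(consumed, failed)
```

If the diagnosis is right, the first parse returns the token "Package:" and the second raises a ParseException, because PyParsing does not backtrack into the Optional once it has matched.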

Answer 1

Pau*_*McG 5

You were really close with architecture = Word(printables) + ~Literal("Package:"). To do a negative lookahead, start with the negation, and then the match:

architecture = ~Literal("Package:") + Word(printables)
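For context, here is a sketch of how that line might slot into the parse_apk_details function from the question. It assumes the rest of the grammar stays as posted. One deliberate difference: the results name is attached to the inner Word rather than to the combined expression, so that the named result is the matched token itself.

```python
from pyparsing import Literal, Optional, Word, nums, printables


def parse_apk_details(apk_details):
    # Newlines are treated as ordinary whitespace by PyParsing (by default).
    text = "\n".join(apk_details)

    version_name = Word(printables)
    version_code = Word(nums)
    # Negative lookahead first, then the match: only accept a word as the
    # architecture if the upcoming text is not the literal "Package:".
    architecture = ~Literal("Package:") + Word(printables)("architecture")
    package = Word(printables)

    expression = (
        "Version:" + version_name + "(" + version_code("version_code") + ")"
        + Optional(architecture)
        + "Package:" + package("package")
    )
    result = expression.parseString(text)

    # result.get() returns None when the optional architecture is absent.
    return int(result["version_code"]), result.get("architecture"), result["package"]


with_arch = parse_apk_details([u'Version: 3.0.38_ww (4030038)', u'arm ',
                               u'Package: com.lenovo.anyshare.gps', u'\n',
                               u'2,239 downloads '])
without_arch = parse_apk_details([u'Version: 3.0.38_ww (4030038)',
                                  u'Package: com.lenovo.anyshare.gps', u'\n',
                                  u'2,239 downloads '])
```

With this change both test inputs from the question should parse; when the architecture line is missing, the second element of the returned tuple is None.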
