Regex to break a URL into parts

Question

I have only just started learning regular expressions, so I am still unsure about several aspects of the whole problem.

Right now, my web page reads the URL, breaks it into parts, and uses only certain parts for processing:
e.g. 1) http://mycontoso.com/products/luggage/selloBag
e.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx

For some reason Sitefinity gives us both possibilities, which is fine, but all I need is the actual product detail, as in "luggage/selloBag".

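My current regex is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)". I combine it with a replace statement and extract the contents of group 4 (or $4), which works fine, but it does not work for example 2.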

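So the question is: can a single regex match both possibilities, where part of the string may or may not be present, and still let you reference the group whose value you actually want to use?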

Answer 1

rid*_*ner 5

RFC-3986 is the authority on URIs. Appendix B provides this regex, which breaks a URI down into its components:

    re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
    # Where:
    # scheme    = $2
    # authority = $4
    # path      = $5
    # query     = $7
    # fragment  = $9

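As a rough illustration of how the Appendix B pattern could be applied to the two URLs from the question, here is a minimal sketch. The helper name `split_uri` and the dictionary layout are assumptions made for this example, not part of the RFC or the original answer. Because every component in the pattern is optional, the match itself always succeeds; components that are absent come back as `None`.

```python
import re

# RFC 3986 Appendix B pattern, copied from the answer above.
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# The two example URLs from the question.
urls = [
    "http://mycontoso.com/products/luggage/selloBag",
    "http://mycontoso.com/products/luggage/selloBag.sf404.aspx",
]

def split_uri(uri):
    """Return the five generic URI components (hypothetical helper).

    Every part of the pattern is optional, so re.match always succeeds;
    a component that is not present is reported as None.
    """
    m = re.match(re_3986, uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

for url in urls:
    print(split_uri(url)["path"])
```

Both URLs yield a path that starts with `/products/`, so the Sitefinity suffix can then be handled on the path alone rather than on the whole URL.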

Here is an enhanced (and commented) regex (Python syntax) that uses named capture groups:

    import re

    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)

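To connect this back to the question, the sketch below first pulls out the path with the named-group pattern and then applies a second, much smaller regex in which the `.sf404.aspx` suffix is an optional non-capturing group. The pattern `product_re`, the group name `detail`, and the helper `product_detail` are hypothetical and not from the original answer; the sketch assumes the product detail always follows `/products/` and that `.sf404.aspx` is the only suffix Sitefinity appends.

```python
import re

# Enhanced RFC 3986 pattern, copied from the answer above (comments trimmed).
re_3986_enhanced = re.compile(r"""
    ^
    (?:  (?P<scheme>    [^:/?#\s]+): )?
    (?://(?P<authority>  [^/?#\s]*)  )?
         (?P<path>        [^?#\s]*)
    (?:\?(?P<query>        [^#\s]*)  )?
    (?:\#(?P<fragment>      [^\s]*)  )?
    $
    """, re.MULTILINE | re.VERBOSE)

# Hypothetical pattern: the lazy ".+?" stops as early as it can, so the
# optional "(?:\.sf404\.aspx)?" group consumes the suffix when it is present
# and the named group "detail" holds the same value either way.
product_re = re.compile(r"^/products/(?P<detail>.+?)(?:\.sf404\.aspx)?$")

def product_detail(url):
    """Return e.g. 'luggage/selloBag', or None if the URL does not fit."""
    uri = re_3986_enhanced.match(url)
    if uri is None:
        return None
    m = product_re.match(uri.group("path"))
    return m.group("detail") if m else None

print(product_detail("http://mycontoso.com/products/luggage/selloBag"))
print(product_detail("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```

The design choice here is to keep the optional part outside the capturing group, so the same group name can be referenced whether or not the suffix is present.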

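For more on picking out and validating URIs according to RFC-3986, you may want to take a look at an article I have been working on: Regular Expression URI Validation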
