What is the difference between a "group" and a "capture" in .NET regular expressions?

Question


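When it comes to the .NET regular expression language, I'm a little fuzzy on the difference between a "group" and a "capture". Consider the following C# code: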

MatchCollection matches = Regex.Matches("{Q}", @"^\{([A-Z])\}$");


I expected this to produce a single capture for the letter 'Q', but when I print the properties of the returned MatchCollection, I see:

matches.Count: 1
matches[0].Value: {Q}
        matches[0].Captures.Count: 1
                matches[0].Captures[0].Value: {Q}
        matches[0].Groups.Count: 2
                matches[0].Groups[0].Value: {Q}
                matches[0].Groups[0].Captures.Count: 1
                        matches[0].Groups[0].Captures[0].Value: {Q}
                matches[0].Groups[1].Value: Q
                matches[0].Groups[1].Captures.Count: 1
                        matches[0].Groups[1].Captures[0].Value: Q


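What exactly is going on here? I understand that there is also a capture for the entire match, but how do the groups come in? And why doesn't matches[0].Captures include the capture for the letter 'Q'?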

Answer 1

Abe*_*bel 123

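You won't be the first who's fuzzy about it. Here's what the famous Jeffrey Friedl has to say about it (page 437):

Depending on your view, it either adds an interesting new dimension to the match results, or adds confusion and bloat.

And further on:

The main difference between a Group object and a Capture object is that each Group object contains a collection of Captures representing all the intermediary matches by the group during the match, as well as the final text matched by the group.

And a few pages later, this is his conclusion:

After getting past the .NET documentation and actually understanding what these objects add, I've got mixed feelings about them. On one hand, it's an interesting innovation [...] on the other hand, it seems to add an efficiency burden [..] of a functionality that won't be used in the majority of cases

In other words: they are very similar, but occasionally you'll find a use for them. Before you grow another grey beard, you may even get fond of Captures...

Since neither the above nor what's said in the other post really seems to answer your question, consider the following. Think of Captures as a kind of history tracker. When the regex makes its match, it goes through the string from left to right (ignoring backtracking for a moment), and when it encounters a matching pair of capturing parentheses, it stores the matched text in $x (x being any digit), let's say $1.

Normal regex engines, when the capturing parentheses are repeated, throw away the current $1 and replace it with the new value. Not .NET: it keeps this history in the group's Captures collection, starting at Captures[0].

If we change your regex to look as follows: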

MatchCollection matches = Regex.Matches("{Q}{R}{S}", @"(\{[A-Z]\})+");


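You will notice that the first Group has a single Capture (the first group is always the entire match, i.e. equal to $0) and the second group holds {S}, i.e. only the last matched repetition. However, and here's the catch, if you want to find the other two, they are in that group's Captures collection, which contains all the intermediate captures: {Q}, {R} and {S}.

If you ever wondered how to get from the multiple-capture group, which only shows the last match, to the individual captures that are clearly there in the string, you must use Captures.

A final word on your final question: the total match always has exactly one Capture of its own; don't confuse that with the individual Groups. Captures are only interesting inside groups.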
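
The behaviour described above can be checked with a short program. This is a minimal sketch, assuming a C# 9+ project with top-level statements; it reuses the pattern and input from the answer, and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// The repeated group from the answer: one match, three repetitions of group 1.
Match match = Regex.Match("{Q}{R}{S}", @"(\{[A-Z]\})+");

// Groups[0] is the whole match and has exactly one capture.
Console.WriteLine(match.Groups[0].Value);          // {Q}{R}{S}
Console.WriteLine(match.Groups[0].Captures.Count); // 1

// Groups[1].Value only remembers the last repetition...
Console.WriteLine(match.Groups[1].Value);          // {S}

// ...while Groups[1].Captures keeps every repetition, in order.
foreach (Capture capture in match.Groups[1].Captures)
{
    Console.WriteLine($"{capture.Index}: {capture.Value}");
}
// 0: {Q}
// 3: {R}
// 6: {S}
```

Each Capture also carries its own Index and Length, so the position of every repetition is available, not just its text.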

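@Abel - I know I keep saying it, but you keep not hearing it. I take exception to Friedl's statement that this functionality won't be used in the majority of cases. In fact, it's the most sought-after functionality in regex land. Lazy/greedy? What does that have to do with my comments? It allows a variable number of capture buffers. It can sweep the entire string in a single match. If `.*?(dog)` finds the first `dog`, then `(?:.*?(dog))+` will find _all_ `dog` in the entire string in a single match. The performance increase is noticeable. (2 upvotes)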
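
The single-match sweep mentioned in the comment can be sketched as follows. The input string is an invented example, and the sketch only shows that one match collects every `dog`; it does not measure the performance claim.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input; the pattern is the one quoted in the comment.
string input = "a dog, a cat, a dog and another dog";
Match match = Regex.Match(input, @"(?:.*?(dog))+");

// Group 1 is repeated, so its Value is only the last "dog" that was found...
Console.WriteLine(match.Groups[1].Value);

// ...but its Captures collection holds one entry per repetition.
Console.WriteLine(match.Groups[1].Captures.Count); // 3
foreach (Capture capture in match.Groups[1].Captures)
{
    Console.WriteLine($"{capture.Value} at index {capture.Index}");
}
```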

Answer 2

Ger*_*ill 19

A Group is what we associate with groups in regular expressions.

"(a[zx](b?))"

Applied to "axb" returns an array of 3 groups:

group 0: axb, the entire match.
group 1: axb, the first group matched.
group 2: b, the second group matched.


Except that these are only the 'capturing' groups. Non-capturing groups (using the '(?:' syntax) are not represented here.

"(a[zx](?:b?))"

Applied to "axb" returns an array of 2 groups:

group 0: axb, the entire match.
group 1: axb, the first group matched.


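A Capture is also what we associate with "captured groups". But when a quantifier applies the group several times, only the last match is kept as the group's match. The captures array stores all of these matches.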

"(a[zx]\s+)+"

Applied to "ax az ax" returns an array of 2 captures of the second group.

group 1, capture 0 "ax "
group 1, capture 1 "az "


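As for your last question: before looking into this, I would have thought that Captures would be an array of the captures ordered by the group they belong to. Instead, it is just an alias for Groups[0].Captures. Pretty useless.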
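
The three observations in this answer can be verified with a small sketch. It reuses the answer's patterns and inputs; the variable names are illustrative, and C# 9+ top-level statements are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Capturing groups are counted; group 0 is always the entire match.
Match withInner = Regex.Match("axb", @"(a[zx](b?))");
Console.WriteLine(withInner.Groups.Count);    // 3

// A non-capturing group (?:...) does not show up in Groups.
Match nonCapturing = Regex.Match("axb", @"(a[zx](?:b?))");
Console.WriteLine(nonCapturing.Groups.Count); // 2

// A quantified group keeps only its last repetition in Value,
// but every repetition in Captures.
Match repeated = Regex.Match("ax az ax", @"(a[zx]\s+)+");
Console.WriteLine($"[{repeated.Groups[1].Value}]");    // [az ]
Console.WriteLine(repeated.Groups[1].Captures.Count);  // 2

// Match.Captures holds just one capture: the whole match,
// the same content as Groups[0].Captures.
Console.WriteLine(repeated.Captures.Count);            // 1
Console.WriteLine(repeated.Captures[0].Value == repeated.Groups[0].Captures[0].Value); // True
```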

Answer 3

pma*_*lee 14

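From the MSDN documentation:

The real utility of the Captures property occurs when a quantifier is applied to a capturing group so that the group captures multiple substrings in a single regular expression. In this case, the Group object contains information about the last captured substring, whereas the Captures property contains information about all the substrings captured by the group. In the following example, the regular expression \b(\w+\s*)+\. matches an entire sentence that ends in a period. The group (\w+\s*)+ captures the individual words in the collection. Because the Group collection contains information only about the last captured substring, it captures the last word in the sentence, "sentence". However, each word captured by the group is available from the collection returned by the Captures property.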
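
The quoted passage refers to an example that is not reproduced here. The following is a minimal sketch of the same idea, not the documentation's own listing: the input sentence is an assumption chosen so that the last word is "sentence".

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed input; the pattern is the one described in the quoted documentation.
Match match = Regex.Match("This is a sentence.", @"\b(\w+\s*)+\.");

// The Group only knows about the last substring it captured.
Console.WriteLine(match.Groups[1].Value); // sentence

// The Captures collection has one entry per word, in order.
foreach (Capture capture in match.Groups[1].Captures)
{
    Console.WriteLine($"[{capture.Value}]");
}
// [This ]
// [is ]
// [a ]
// [sentence]
```

Note that each capture except the last includes its trailing space, because \s* sits inside the capturing group.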

Answer 4

Eri*_*ith 11

This can be explained with a simple example (and pictures).

Matching 3:10pm against the regular expression ((\d)+):((\d)+)(am|pm), using the Mono interactive csharp shell:

csharp> Regex.Match("3:10pm", @"((\d)+):((\d)+)(am|pm)").
      > Groups.Cast<Group>().
      > Zip(Enumerable.Range(0, int.MaxValue), (g, n) => "[" + n + "] " + g);
{ "[0] 3:10pm", "[1] 3", "[2] 3", "[3] 10", "[4] 0", "[5] pm" }


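So where is the 1?

Since several digits match on the fourth group, we only "get at" the last match if we reference the group (with an implicit ToString(), that is). To expose the intermediate matches, we need to go deeper and reference the Captures property of the group in question: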

csharp> Regex.Match("3:10pm", @"((\d)+):((\d)+)(am|pm)").
      > Groups.Cast<Group>().
      > Skip(4).First().Captures.Cast<Capture>().
      > Zip(Enumerable.Range(0, int.MaxValue), (c, n) => "["+n+"] " + c);
{ "[0] 1", "[1] 0" }


Courtesy of this article.

Nice article. A picture is worth a thousand words. (2 upvotes)

Answer 5

And*_*yWD 5

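Imagine you have the text input dogcatcatcat and a pattern like dog(cat(catcat)).

In this case you have 3 groups; the first one (the major group) corresponds to the match.

Match == dogcatcatcat and Group0 == dogcatcatcat

Group1 == catcatcat

Group2 == catcat

So what is it all about?

Let's consider a small example written in C# (.NET) using the Regex class.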

using System;
using System.Text.RegularExpressions;

int matchIndex = 0;
int groupIndex = 0;
int captureIndex = 0;

foreach (Match match in Regex.Matches(
        "dogcatabcdefghidogcatkjlmnopqr", // input
        @"(dog(cat(...)(...)(...)))") // pattern
)
{
    Console.Out.WriteLine($"match{matchIndex++} = {match}");

    // Every match exposes its groups; group 0 is the whole match.
    foreach (Group @group in match.Groups)
    {
        Console.Out.WriteLine($"\tgroup{groupIndex++} = {@group}");

        // Every group exposes all the captures it made during this match.
        foreach (Capture capture in @group.Captures)
        {
            Console.Out.WriteLine($"\t\tcapture{captureIndex++} = {capture}");
        }

        captureIndex = 0;
    }

    groupIndex = 0;
    Console.Out.WriteLine();
}

Output:

match0 = dogcatabcdefghi
    group0 = dogcatabcdefghi
        capture0 = dogcatabcdefghi
    group1 = dogcatabcdefghi
        capture0 = dogcatabcdefghi
    group2 = catabcdefghi
        capture0 = catabcdefghi
    group3 = abc
        capture0 = abc
    group4 = def
        capture0 = def
    group5 = ghi
        capture0 = ghi

match1 = dogcatkjlmnopqr
    group0 = dogcatkjlmnopqr
        capture0 = dogcatkjlmnopqr
    group1 = dogcatkjlmnopqr
        capture0 = dogcatkjlmnopqr
    group2 = catkjlmnopqr
        capture0 = catkjlmnopqr
    group3 = kjl
        capture0 = kjl
    group4 = mno
        capture0 = mno
    group5 = pqr
        capture0 = pqr

Let's analyze only the first match (match0).

As you can see, there are three minor groups: group3, group4 and group5.

    group3 = abc
        capture0 = abc
    group4 = def
        capture0 = def
    group5 = ghi
        capture0 = ghi

These groups (3-5) were created because of the "subpattern" (...)(...)(...) of the main pattern (dog(cat(...)(...)(...))).

The value of group3 corresponds to its only capture (capture0), and the same holds for group4 and group5. That is because there is no group repetition like (...){3}.

OK, let's consider another example where there is a group repetition.

If we change the regular expression pattern (in the code shown above) from (dog(cat(...)(...)(...))) to (dog(cat(...){3})), you'll notice the following group repetition: (...){3}.

Now the output has changed:

match0 = dogcatabcdefghi
    group0 = dogcatabcdefghi
        capture0 = dogcatabcdefghi
    group1 = dogcatabcdefghi
        capture0 = dogcatabcdefghi
    group2 = catabcdefghi
        capture0 = catabcdefghi
    group3 = ghi
        capture0 = abc
        capture1 = def
        capture2 = ghi

match1 = dogcatkjlmnopqr
    group0 = dogcatkjlmnopqr
        capture0 = dogcatkjlmnopqr
    group1 = dogcatkjlmnopqr
        capture0 = dogcatkjlmnopqr
    group2 = catkjlmnopqr
        capture0 = catkjlmnopqr
    group3 = pqr
        capture0 = kjl
        capture1 = mno
        capture2 = pqr
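
Again, let's analyze only the first match (match0).

The minor groups group4 and group5 are gone: because of the (...){3} repetition ({n}, where n>=2), they have been merged into one single group, group3.

In this case, the value of group3 corresponds to its capture2 (in other words, the last capture).

Thus, if you need all 3 inner captures (capture0, capture1, capture2), you have to loop through the group's Captures collection.

In conclusion: pay attention to the way you design the groups in your pattern. Think in advance about what behaviour a group specification will produce, such as (...)(...), (...){2} or (.{3}){2}.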
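
To compare the three group designs named in the conclusion side by side, here is a minimal sketch. The input "abcdef" and the `summary` list are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "abcdef"; // assumed sample input
string[] patterns = { @"(...)(...)", @"(...){2}", @"(.{3}){2}" };
var summary = new List<string>();

foreach (string pattern in patterns)
{
    Match match = Regex.Match(input, pattern);

    // Skip group 0 (the whole match) and describe each capturing group.
    for (int g = 1; g < match.Groups.Count; g++)
    {
        Group group = match.Groups[g];
        var values = new List<string>();
        foreach (Capture capture in group.Captures)
        {
            values.Add(capture.Value);
        }

        string joined = string.Join(", ", values);
        summary.Add($"{pattern} group{g} = {group.Value}, captures = [{joined}]");
    }
}

foreach (string line in summary)
{
    Console.WriteLine(line);
}
// (...)(...) group1 = abc, captures = [abc]
// (...)(...) group2 = def, captures = [def]
// (...){2} group1 = def, captures = [abc, def]
// (.{3}){2} group1 = def, captures = [abc, def]
```

Two separate groups give two values directly, while a repeated group gives a single value plus a Captures history.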

Hopefully this also helps shed some light on the differences between Captures, Groups and Matches.
