Get the line number of a Text.RegularExpressions.Regex match

Question

ede*_*ter 3 regex powershell logging regex-group

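I use PowerShell to parse a directory of log files and extract all XML entries from them. This works well. However, since a log file can contain many of these XML fragments, I would also like to put the line number of each specific match into the file name of the XML file I write, so that I can open the log file and jump to that line for root-cause analysis.

There is a field, "Index", which I assume is a character count and which should probably lead me to the line number. But I think "Index" somehow counts something other than what Measure-Object -Character counts, because the Index value is larger than the size reported by Measure-Object -Character: for example, $m.Groups[0].Captures[0].Index is 9963166, whereas Measure-Object -Character over the whole file in the log directory gives 9838833 as the maximum, so I assume Index counts newlines as well.

So the question is probably: if a match gives me "Index" as a property, how do I know how many newlines precede that index? Do I have to take the first "Index" characters from the file, count the newlines they contain, and thereby get the line? Probably.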

$tag = 'data_json'
$input_dir = $absolute_root_dir + $specific_dir
$output_dir = $input_dir  + 'ParsedDataFiles\'
$OFS = "`r`n"
$nice_specific_dir = $specific_dir.Replace('\','_')
$nice_specific_dir = $nice_specific_dir.Replace(':','_')
$regex = New-Object Text.RegularExpressions.Regex "<$tag>(.+?)<\/$tag>", ('singleline', 'multiline')
New-Item -ItemType Directory -Force -Path $output_dir
Get-ChildItem -Path $input_dir -Name -File | % {   
    $output_file = $output_dir + $nice_specific_dir + $_ + '.'
    $content = Get-Content ($input_dir + $_)
    $i = 0
    foreach($m in $regex.Matches($content)) {        
        $outputfile_xml = $output_file + $i++ + '.xml'
        $outputfile_txt = $output_file + $i++ + '.txt'
        $xml = [xml] ("<" + $tag+ ">" + $m.Groups[1].Value + "</" + $tag + ">")
        $xml.Save($outputfile_xml)
        $j = 0
        $xml.data_json.Messages.source.item | % { $_.SortOrder + ", " + $_.StartOn + ", " + $_.EndOn + ", " + $_.Id } | sort | %  { 
            (($j++).ToString() + ", " + $_ )   | Out-File $outputfile_txt -Append
        }
    }
}

Answer 1

mkl*_*nt0 5

Note: If what your regular expression matches is guaranteed never to span multiple lines, i.e. if the matched text is guaranteed to be on a single line, consider a simpler Select-String-based solution as shown in js2010's answer; generally, though, a method/expression-based solution, as in this answer, will perform better.
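For orientation, here is a minimal sketch of what a Select-String-based approach to single-line matches might look like. It is not js2010's actual code; the temporary file name and the sample content are assumptions made purely for illustration. Select-String reports the 1-based line of each matching line in the LineNumber property of its output objects.

```powershell
# Hypothetical temp file, created only for this demonstration.
$path = Join-Path ([IO.Path]::GetTempPath()) 'select-string-demo.xml'
Set-Content -Path $path -Value @(
  '<catalog>'
  '  <book id="bk101">'
  '    <title>De Profundis</title>'
  '  </book>'
  '  <book id="bk102">'
  '    <title>Pygmalion</title>'
  '  </book>'
  '</catalog>'
)

# Select-String processes the file line by line, so this only works
# when a match never spans more than one line.
$hits = Select-String -Path $path -Pattern '<title>.+?</title>' | ForEach-Object {
  [pscustomobject] @{ Line = $_.LineNumber; Value = $_.Matches[0].Value }
}
$hits

Remove-Item $path
```

Because the file is processed line by line, this sketch cannot find elements whose opening and closing tags sit on different lines.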

Your first problem is that you use Get-Content without -Raw, which reads the input file as an array of lines rather than a single, multi-line string.

When you pass this array to $regex.Matches(), PowerShell stringifies the array by joining its elements with spaces (by default).

Therefore, read your input file with Get-Content -Raw, which ensures that it is read as a single, multi-line string with newlines intact:

# Read entire file as single string
$content = Get-Content -Raw ($input_dir + $_)

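The difference between the two ways of reading a file can be seen in a small experiment. This is only a sketch: the temporary file name and its three lines are made up for the demonstration, and it assumes that $OFS has not been changed from its default.

```powershell
# Hypothetical temp file, created only for this demonstration.
$path = Join-Path ([IO.Path]::GetTempPath()) 'raw-demo.log'
Set-Content -Path $path -Value 'line one', '<a>', 'x</a>'

$lines = Get-Content $path        # array of strings, one element per line
$raw   = Get-Content -Raw $path   # one single string, newlines preserved

# Stringifying the array joins its elements with $OFS (a space by default),
# so the newlines, and with them all line information, are lost.
$joined = "$lines"

"Array elements:            $($lines.Count)"
"Joined string has newline: $($joined.Contains("`n"))"
"Raw string has newline:    $($raw.Contains("`n"))"

Remove-Item $path
```

A match index into the joined string therefore says nothing about lines, whereas an index into the raw string can be mapped back to a line number.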

Once you match against a multi-line string, you can infer the line number by counting the number of lines in the substring up to the character index at which each match was found, via .Substring() and Measure-Object -Line.

Here's a simplified, self-contained example (see the bottom section if you also want to determine the column number):

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements.
# Inline option ('(?...)') 's' makes '.' match newlines too
$regex = [regex] '(?s)<title>.+?</title>'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  "Found '$($m.Value)' at index $($m.Index), line $lineNumber"
}


Note the + 1 in $m.Index + 1, which is needed to ensure that the substring doesn't end in a newline character, because Measure-Object -Line would disregard such a trailing newline. By including at least one additional (non-newline) character, the < of the matched element, the line count is always correct, even if the matched element starts at the very first column.

The above yields:

Found '<title>De Profundis</title>' at index 34, line 3
Found '<title>Pygmalion</title>' at index 96, line 6

In case you want to also get the column number (the 1-based index of the character that starts the match on the line it was found):

Determining the line and column numbers of regex matches in multi-line strings:

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, columns 5 and 7, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements, along with the
# string that precedes them on the same line:
# Due to use of capture groups, each match $m will contain:
#  * the matched element: $m.Groups[2].Value
#  * the preceding string on the same line: $m.Groups[1].Value
# Inline options ('(?...)'):
#   * 's' makes '.' match newlines too
#   * 'm' makes '^' and '$' match the starts and ends of *individual lines*
$regex = [regex] '(?sm)(^[^\n]*)(<title>.+?</title>)'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  $columnNumber = 1 + $m.Groups[1].Value.Length
  "Found '$($m.Groups[2].Value)' at line $lineNumber, column $columnNumber."
}


The above yields:

Found '<title>De Profundis</title>' at line 3, column 5.
Found '<title>Pygmalion</title>' at line 6, column 7.

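As an alternative to the prefix capture group, both numbers can be derived from the match index alone. The following is a minimal sketch under stated assumptions: Get-LineAndColumn is a hypothetical helper name, not part of PowerShell, and it counts LF characters directly instead of using Measure-Object -Line, so it works for both LF and CRLF line endings.

```powershell
# Hypothetical helper: maps a 0-based character index to 1-based line and column.
function Get-LineAndColumn {
  param([string] $Text, [int] $Index)
  # Line: number of LF characters before the index, plus 1.
  # Splitting on LF yields (number of LFs + 1) pieces.
  $line = $Text.Substring(0, $Index).Split([char] 10).Length
  # Column: distance from the last LF before the index (-1 if there is none).
  $lastNewline = if ($Index -gt 0) { $Text.LastIndexOf([char] 10, $Index - 1) } else { -1 }
  [pscustomobject] @{ Line = $line; Column = $Index - $lastNewline }
}

# Same sample as above, built with explicit LF separators.
$content = @(
  '<catalog>'
  '  <book id="bk101">'
  '    <title>De Profundis</title>'
  '  </book>'
  '  <book id="bk102">'
  '      <title>Pygmalion</title>'
  '  </book>'
  '</catalog>'
) -join "`n"

$regex = [regex] '(?s)<title>.+?</title>'

$positions = foreach ($m in $regex.Matches($content)) {
  Get-LineAndColumn -Text $content -Index $m.Index
}
$positions
```

With this variant the regex itself stays unchanged, at the cost of one extra backward search per match.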

Note: For simplicity, both solutions above count the lines from the start of the string in every iteration.
In most cases this will likely still perform well enough; if not, see the variant approach in the performance benchmarks below, where the line count is calculated iteratively, with only the lines between the current and the previous match getting counted in a given iteration.
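The iterative variant mentioned above might look roughly like the following. This is a sketch, not the benchmark code: the generated sample input is an assumption, and the newlines between matches are counted directly as LF characters rather than with Measure-Object -Line.

```powershell
# Deterministic LF-separated sample: 10 lines, with matches on lines 3, 6 and 10.
$content = (1..10 | ForEach-Object {
  if ($_ -in 3, 6, 10) { "  <title>Item $_</title>" } else { "filler line $_" }
}) -join "`n"

$regex = [regex] '(?s)<title>.+?</title>'

$lineNumber = 1   # line on which position $lastIndex lies
$lastIndex  = 0   # position at which the previous count stopped

$results = foreach ($m in $regex.Matches($content)) {
  # Count only the LF characters between the previous match and this one.
  $segment = $content.Substring($lastIndex, $m.Index - $lastIndex)
  $lineNumber += $segment.Split([char] 10).Length - 1
  $lastIndex = $m.Index
  [pscustomobject] @{ Line = $lineNumber; Value = $m.Value }
}
$results
```

Each character of the input is examined at most once for line counting, so the cost no longer grows with the number of matches times the length of the text before them.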

Optional reading: performance comparison of approaches to line counting:

sln's answer proposes using regular expressions also for line counting.

Comparing these approaches as well as the .Substring() plus Measure-Object -Line approach above in terms of performance may be of interest.

The following tests are based on the Time-Command function.

Sample results are from PowerShell Core 7.0.0-preview.3 on macOS 10.14.6, averaged over 100 runs; the absolute numbers will vary depending on the execution environment, but the relative ranking of the approaches (Factor column) seems to be similar across platforms and PowerShell editions:

With 1,000 lines and 1 match on the last line:

[Benchmark results table missing from this copy.]

With 20,000 lines and 20 evenly spaced matches starting at line 1,000:

[Benchmark results table missing from this copy.]

Notes and conclusions:

- Prefix capture group refers to (variations of) "Way1" from sln's answer, whereas Repeating capture group ... refers to "Way2".
  - Note: For Way2, an adaptation of the regex (?:.*(\r?\n))*?.*?(match_me) is used below, which is a much-improved version that sln added later in a comment, whereas the version still shown in the body of their answer (as of this writing), ^(?:.*((?:\r?\n)?))*?(match_me), wouldn't work for processing multiple matches in a loop.
- The .Substring() + Measure-Object -Line approach from this answer is the fastest in all cases, but, with many matches to loop over, only if an iterative, between-matches line count is performed (.Substring() + Measure-Object -Line, count iteratively…), whereas the solutions above count the lines from the start for every match, for simplicity (.Substring() + Measure-Object -Line, count from start…).
- With the Way1 approach (Prefix capture group), the specific method used for counting the newlines in the prefix match makes little difference, though Measure-Object -Line is the fastest there too.

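Here is the source code for the tests; by modifying the various variables near the bottom, it is easy to experiment with the match count, the total number of input lines, …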

[Benchmark source code missing from this copy.]

Archived:	6 years, 5 months ago
Viewed:	878 times
Last activity:	6 years, 5 months ago