Extract a string from a line between positions given by a pattern on another line

Question

Fre*_*eel 6 command-line text-processing

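I want to output the characters that lie between the two positions A and B marked on the preceding line. Within each pair, the two lines have the same length, but the length can differ from one pair to the next. Is there an efficient way (the files are huge) to do this with grep, sed, or awk?

Example file: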

xxxxxxAxxxxxxBxxxxxx
1234567890MNOPQRSTUV
xxAxxxxxxxxxxxxxxBxxxxxx
1234567890MNOPQRSTUVWXYZ

...

The output I want is:

7890MNOP
34567890MNOPQRST

...

Answer 1

αғs*_*нιη 8

Using awk:

$ awk '!seen{match($0, /A.*B/);seen=1;next} {print substr($0,RSTART,RLENGTH);seen=0}' infile
7890MNOP
34567890MNOPQRST

Explanation, quoting from man awk:

RSTART
          The index of the first character matched by match(); 0 if no
          match.  (This implies that character indices start at one.)

RLENGTH
          The length of the string matched by match(); -1 if no match.

match(s, r [, a])  
          Return the position in s where the regular expression r occurs, 
          or 0 if r is not present, and set the values of RSTART and RLENGTH. (...)

substr(s, i [, n])
          Return the at most n-character substring of s starting at i.
          If n is omitted, use the rest of s.
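
As a variation on the command above, here is a minimal sketch that stops at the first B after the first A (the pattern /A.*B/ is greedy, so it would run to the last B if an index line ever held more than one) and prints nothing for a pair whose index line has no match. The function name extract_between is made up for this example, and it assumes that index and content lines strictly alternate, starting with an index line.

```shell
# extract_between is a hypothetical helper name, not part of the original answer.
# It reads pairs of lines (index line, content line) from files or from stdin.
extract_between() {
  awk '
    NR % 2 == 1 {                    # odd lines: the index line holding A and B
      found = match($0, /A[^B]*B/)   # first A up to the first B after it
      start = RSTART; len = RLENGTH  # remember the column range for the next line
      next
    }
    found { print substr($0, start, len) }  # even lines: slice the same columns
  ' "$@"
}
```

Called as extract_between infile, it should print the same two lines as the command above for the example file.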

Answer 2

Eli*_*gan 7

Although you can do this with AWK, I suggest using Perl. Here is a script:

#!/usr/bin/env perl

use strict;
use warnings;

while (my $pattern = <>) {
    my $text = <>;
    my $start = index $pattern, 'A';
    my $stop = index $pattern, 'B', $start;
    print substr($text, $start, $stop - $start + 1), "\n";
}

You can name the script file whatever you like. If you name it interval and put it in the current directory, you can mark it executable with chmod +x interval. Then you can run:

./interval paths...

Replace paths... with the actual pathname or pathnames of the file(s) you want to parse. For example:

$ ./interval interval-example.txt
7890MNOP
34567890MNOPQRST

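The script works like this: until it reaches the end of input (i.e., there are no more lines), it:

Reads one line, $pattern, which is your string containing A and B, and then another line, $text, which is the string to be sliced.
Finds the index of the first A in $pattern and of the first B, ignoring any B that might come before the first A, and stores them in the $start and $stop variables, respectively.
Slices out just the part of $text whose indices range from $start to $stop. Perl's substr function takes offset and length arguments, which is the reason for the subtraction, and the character directly under B is included, which is the reason for adding 1.
Prints just that part, followed by a newline.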

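If for some reason you prefer a short one-liner that does the same thing and is easy to paste in (but is also harder to understand and maintain), then you can use this: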

./interval paths...

(As before, you must replace paths... with the actual pathnames.)

Answer 3

Dig*_*uma 7

Since you mentioned sed, you can also do this with a sed script:

/^x*Ax*Bx*$/{              # If an index line is matched, then
  N                        # append the next (content) line into the pattern buffer
  :a                       # label a
  s/^x(.*\n).(.*)/\1\2/    # remove "x" from the index line start and a char from the content line start
  ta                       # if a substitution happened in the previous line then jump back to a
  :b                       # label b
  s/(.*)x(\n.*).$/\1\2/    # remove "x" from the index line end and a char from the content line end
  tb                       # if a substitution happened in the previous line then jump back to b
  s/.*\n//                 # remove the index line
}

If you put it all on one command line, it looks like this:

$ sed -r '/^x*Ax*Bx*$/{N;:a;s/^x(.*\n).(.*)/\1\2/;ta;:b;s/(.*)x(\n.*).$/\1\2/;tb;s/.*\n//;}' example-file.txt
7890MNOP
34567890MNOPQRST
$

The -r option is needed so that sed understands the regex grouping parentheses without extra escaping.
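
With GNU sed, -E is accepted as a synonym for -r, so the same command can also be written as in the sketch below. The wrapper name sed_extract is hypothetical, and the script still relies on GNU sed behaviour (\n inside the regex, labels followed by semicolons), so it is not meant as a portable version.

```shell
# sed_extract is a hypothetical wrapper name; assumes GNU sed.
# Same script as the one-liner above, with -E in place of -r.
sed_extract() {
  sed -E '/^x*Ax*Bx*$/{N;:a;s/^x(.*\n).(.*)/\1\2/;ta;:b;s/(.*)x(\n.*).$/\1\2/;tb;s/.*\n//;}' "$@"
}
```

Running sed_extract example-file.txt should give the same two output lines as the -r form.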

FWIW, I don't think this can be done purely with grep, though I'd be happy to be proven wrong.
