Find files that match all lines of a file as whole lines

Question

ann*_*hri 4 grep awk perl text-processing

I have a file with this content:

$ cat compromised_header.txt
some unique string 1
some other unique string 2
another unique string 3

I want to find all files that contain every line of the above file, in exactly the same order, with no intervening lines between them.

Example input file:

$ cat a-compromised-file.txt
some unique string 1
some other unique string 2
another unique string 3
unrelated line x
unrelated line y
unrelated line z

I tried the following grep:

grep -rlf compromised_header.txt dir/

But I'm not sure it gives the expected files, because it would also match this file:

some unique string 1
unrelated line x
unrelated line y
unrelated line z

Answer 1

row*_*oat 8

Using an awk that supports nextfile:

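NR == FNR {
  a[++n] = $0; next              # first file: store the header lines
}
FNR == 1 {
  c = 0                          # new file: discard any partial match
}
$0 != a[c+1] {
  c = ($0 == a[1]); next         # mismatch: restart, counting this line if it is header line 1
}
++c >= n {
  print FILENAME; c = 0; nextfile
}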

Recursively, with find:

find dir -type f -exec gawk -f above.awk compromised_header.txt {} +
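
The same idea can also be written for an awk without nextfile. The following is a minimal sketch, not the answer's script: the function name `find_header` and the `found` flag are illustrative, and it assumes the header lines are all distinct, as in the question's example, so that a failed partial match can only restart at header line 1.

```shell
# Hypothetical helper: prints each file (from $2 onward) that contains the
# lines of the header file ($1) as one contiguous, in-order block.
# Assumption: the header lines are distinct from one another.
find_header() {
  awk '
    NR == FNR { a[++n] = $0; next }        # first file: remember the header lines
    FNR == 1  { c = 0; found = 0 }         # new file: reset the state
    found     { next }                     # already reported: skip the rest
    { c = ($0 == a[c+1]) ? c + 1 : ($0 == a[1]) }   # extend the match or restart it
    c == n    { print FILENAME; found = 1 }
  ' "$@"
}
```

Called as `find_header compromised_header.txt dir/*.txt`, it prints the matching file names, one per line.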

Or this might work:

pcregrep -rxlM "$( perl -lpe '$_=quotemeta' compromised_header.txt )" dir

Perl is used to escape the metacharacters because pcregrep does not seem to support --fixed-strings together with --multiline.

Using perl in slurp mode (not suitable for files too large to hold in memory):

find dir -type f -exec perl -n0777E 'BEGIN {$f=<>} say $ARGV if /^\Q$f/m' compromised_header.txt {} +

Answer 2

cas*_*cas 5

You need something more powerful than grep, which can only do single-line matching.
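
To see the limitation concretely, the sketch below builds two throwaway files in a temporary directory and runs the command from the question against them. The names `grep_demo`, `header.txt`, `good.txt` and `partial.txt` are made up for illustration. With `-f`, each line of the header file is an independent pattern, so a file that contains any one of them is listed.

```shell
# Hypothetical demo: function name and file names are illustrative only.
grep_demo() {
  tmp=$(mktemp -d) || return 1
  printf '%s\n' 'some unique string 1' 'some other unique string 2' \
    'another unique string 3' > "$tmp/header.txt"
  mkdir "$tmp/dir"
  # Contains the full header, in order, with nothing in between.
  { cat "$tmp/header.txt"; echo 'unrelated line x'; } > "$tmp/dir/good.txt"
  # Contains only the first header line.
  printf '%s\n' 'some unique string 1' 'unrelated line x' > "$tmp/dir/partial.txt"
  # Every line of header.txt is a separate pattern; one hit is enough to list a file.
  (cd "$tmp" && grep -rlf header.txt dir/ | sort)
  rm -rf "$tmp"
}

grep_demo
```

Both files are listed, including the one that holds only the first header line.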

perl can do multi-line matching and is well suited to this kind of job, combined with find to generate the list of files to search.

find dir/ -type f -iname '*.txt' -exec perl -e '
    local $/;    # slurp in entire files, instead of one line at a time

    my $firstfile = shift @ARGV;         # get name of the first file
    open(F,"<",$firstfile) or die "Error opening $firstfile: $!";
    my $first = <F>;                     # read it in
    close(F);
    my $search = qr/^\Q$first\E/m;       # fixed-string RE, anchored at the start of a line

    # now read in remaining files and see if they match
    while(<>) {
      next if ($ARGV eq $firstfile);
      if (m/$search/m) {
        print $ARGV,"\n";
      };
    }' ./compromised_header.txt {} +

This prints the filename of every *.txt file under dir/ that contains the exact text of the first file ("compromised_header.txt").

Notes:

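The qr// operator compiles a regular expression. Its main use is to pre-compile an RE before using it in a loop, so that no time and CPU cycles are wasted recompiling it on every pass through the loop.
The \Q and \E used in the qr// operation mark the beginning and end of the text in the RE pattern that is meant to be interpreted as a fixed string, i.e. any metacharacters that might be in the string are quoted to disable their special meaning. For details, see man perlre and search for "Quoting metacharacters", and see perldoc -f quotemeta.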

If this looks like an ugly, complicated, unreadable one-liner, try it like this instead, as a standalone script:

#!/usr/bin/perl

local $/;    # slurp in entire files, instead of one line at a time

my $firstfile = shift @ARGV;         # get name of the first file
open(F,"<",$firstfile) or die "Error opening $firstfile: $!";
my $first = <F>;                     # read it in
close(F);
my $search = qr/^\Q$first\E/m;       # fixed-string RE, anchored at the start of a line

# now read in remaining files and see if they match
while(<>) {
  next if ($ARGV eq $firstfile);
  if (m/$search/m) {
    print $ARGV,"\n";
  };
}

Save it as, e.g., check.pl and make it executable with chmod +x check.pl. Then run:

find dir/ -type f -iname '*.txt' \
  -exec ./check.pl ./compromised_header.txt {} +

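@EdMorton perl doesn't "focus on brevity over clarity". It focuses on expressiveness: being able to write and format your code in whatever style suits you, in general or at the moment, and to do anything in many different ways. Perl's motto has long been "There's more than one way to do it", and for good reason. For example, nobody could accuse my perl script above of being too terse. I could have written it in a line or two, but I like writing readable, understandable code, especially when I'm writing examples for others. (2 upvotes)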
