Remove lines from bash output based on a large number of possible prefixes

Question

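I'm trying to filter the lines of a large txt file (around 10GB) by the prefix of the called number, but only when the direction column equals 2.

This is the format of the file, which I receive through a pipe (from a different script):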

caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...

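This is only sample data; the real file contains many other columns, but I only want to filter clear_number based on direction, so this is enough.

I want to remove the lines whose clear_number does not start with one of a list of prefixes, so for example here I would do the following with grep: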

grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'

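This works fine. The only problem is that the number of prefixes I get is sometimes around 10K-50K, which is a lot of prefixes, and if I try to do this with grep I get grep: regular expression is too large.

Any ideas on how to solve this with Bash commands?

Update

For example, say I have the following: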

caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2

My list.txt contains:

642
3333
534234235

So it would return only this line:

caller_number=83479234234,     clear_number=64237384533, direction=2

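because the clear number starts with 642 and direction is 2. In my case it will go through a text file of more than 10GB and return at least 100K results.

Another update

Sorry, there is one more thing I did not make clear. I get the lines from a piped command, so I should use | awk ... to operate on the output received from the previous command.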

Answer 1

Rav*_*h13 7

With your shown samples, please try the following. Since the OP has changed the samples, the code below is written for the updated ones.

awk '
FNR==NR{
  arr[$0]
  next
}
/direction=2([, \t]|$)/ && match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1){
      print
      next
    }
  }
}
' list.txt  Input_file

Explanation: a detailed walk-through of the above.

awk '                  ##Starting awk program from here.
FNR==NR{               ##Checking condition FNR==NR, which is TRUE while list.txt is being read.
  arr[$0]              ##Creating arr array indexed by the current line (one prefix per line).
  next                 ##next will skip all further statements from here.
}
/direction=2([, \t]|$)/ && match($0,/clear_number=[^,]*/){  ##Requiring direction=2 (followed by a comma, blank or end of line), then matching clear_number up to the 1st comma.
  val=substr($0,RSTART+13,RLENGTH-13)  ##Creating val holding the number that follows clear_number= (13 characters).
  for(i in arr){       ##Traversing through the prefixes in arr here.
    if(index(val,i)==1){ ##Checking that val starts with the prefix i.
      print            ##Printing current line here.
      next             ##next will skip all further statements from here.
    }
  }
}
' list.txt  Input_file ##Mentioning Input_file names here.

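Because the question states that the lines arrive through a pipe, here is a minimal sketch of running this kind of program on standard input instead of a named file. It assumes awk accepts - as a file operand meaning standard input; the sample rows and list.txt contents are taken from the question, while filter_direction2 and the temporary directory are hypothetical names used only for illustration. The direction test is written as a regular expression so that it matches whether or not direction=2 is followed by a comma.

```shell
#!/bin/sh
# Hypothetical helper: read rows from stdin and keep those with direction=2
# whose clear_number starts with a prefix listed in the file named by $1.
filter_direction2() {
  awk '
  FNR==NR { arr[$0]; next }                      # first file: load the prefixes
  /direction=2([, \t]|$)/ && match($0, /clear_number=[^,]*/) {
    val = substr($0, RSTART+13, RLENGTH-13)      # digits after "clear_number="
    for (i in arr)
      if (index(val, i) == 1) { print; next }    # prefix found: print the row once
  }
  ' "$1" -
}

tmpdir=$(mktemp -d)
printf '%s\n' 642 3333 534234235 > "$tmpdir/list.txt"

# printf stands in for the upstream command that produces the rows.
result=$(printf '%s\n' \
  'caller_number=34234234324,     clear_number=982545345435, direction=1' \
  'caller_number=83479234234,     clear_number=348347384533, direction=2' \
  'caller_number=034082394234324, clear_number=33335345435,  direction=1' \
  'caller_number=83479234234,     clear_number=444447384533, direction=2' \
  'caller_number=83479234234,     clear_number=64237384533, direction=2' |
  filter_direction2 "$tmpdir/list.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the sample rows, only the row whose clear_number begins with 642 is printed; the row beginning with 3333 is dropped because its direction is 1.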

Answer 2

anu*_*ava 7

You can also try this awk:

your_command |
awk '
FNR == NR {
   rexp["=" $1]
   next
}
$3 == "direction=2" {
   for (s in rexp)
      if (index($2, s)) {
         print
         next
      }
}' list.txt -

caller_number=83479234234,     clear_number=64237384533, direction=2

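A loop like the one above calls index once per prefix for every input line, which may become slow with 10K-50K prefixes and a 10GB input. As an alternative sketch (not taken from any of the answers), one could record which prefix lengths occur in list.txt and then test only the leading substrings of those lengths with a hash lookup. The names filter_by_prefix_length, pre and lens are hypothetical, and the field separator is an assumption: it splits on commas, equals signs and spaces, so the clear_number value lands in field 4 and the direction value in field 6 for rows laid out as in the question.

```shell
#!/bin/sh
# Hypothetical helper: keep direction=2 rows from stdin whose clear_number
# starts with a prefix listed in the file named by $1.
filter_by_prefix_length() {
  awk -F'[,= ]+' '
  FNR==NR {                          # first file: the prefix list
    if ($0 != "") { pre[$0]; lens[length($0)] }
    next
  }
  $6 == 2 {                          # assumes the direction value is field 6
    for (n in lens) {                # one lookup per distinct prefix length
      key = substr($4, 1, n)         # assumes the clear_number value is field 4
      if (key in pre) { print; next }
    }
  }
  ' "$1" -
}

workdir=$(mktemp -d)
printf '%s\n' 642 3333 534234235 > "$workdir/list.txt"

matched=$(printf '%s\n' \
  'caller_number=2342334324, clear_number=5555345435, direction=1' \
  'caller_number=034082394234324, clear_number=33335345435, direction=1' \
  'caller_number=83479234234, clear_number=444447384533, direction=2' \
  'caller_number=83479234234, clear_number=64237384533, direction=2' |
  filter_by_prefix_length "$workdir/list.txt")

printf '%s\n' "$matched"
rm -rf "$workdir"
```

The per-line work then depends on the number of distinct prefix lengths rather than on the number of prefixes, which is usually far smaller for phone-number prefixes.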

Answer 3

Wik*_*żew 6

You can use awk to read in the prefixes and filter the lines with

... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile

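[,=[:space:]]+ is the field separator regex; it matches one or more commas, equals signs and whitespace characters.

The FNR==NR {hash[$0]; next} part reads in the contents of list.txt, which holds the prefixes, one per line.

$6 == 2 requires field 6 (the direction value) to equal 2.

Then {for (key in hash) { if (index($4, key) == 1) { print; next } }} looks for a key in hash that is a prefix of the current field 4; if one is found, it prints the line and moves on to the next line.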

This works, but be careful: it requires direction=2 to always be the third field. (2 upvotes)
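To avoid depending on where direction sits in the line, one option is to split each row into name=value pairs and look the columns up by name. The following is only a sketch under the assumption that columns are separated by commas and each has the form name=value; filter_by_name, kv and the sample rows are hypothetical and are not taken from any of the answers.

```shell
#!/bin/sh
# Hypothetical helper: keep rows from stdin where the column named direction
# is 2 and the column named clear_number starts with a prefix from file $1.
filter_by_name() {
  awk '
  FNR==NR { if ($0 != "") pre[$0]; next }     # first file: the prefix list
  {
    split("", kv)                             # clear the map left from the previous row
    n = split($0, cols, ",")
    for (f = 1; f <= n; f++) {
      col = cols[f]
      gsub(/^[ \t]+|[ \t]+$/, "", col)        # trim blanks around the column
      eq = index(col, "=")
      if (eq > 0) kv[substr(col, 1, eq - 1)] = substr(col, eq + 1)
    }
    if (kv["direction"] == "2")
      for (p in pre)
        if (index(kv["clear_number"], p) == 1) { print; next }
  }
  ' "$1" -
}

scratch=$(mktemp -d)
printf '%s\n' 642 3333 > "$scratch/list.txt"

by_name=$(printf '%s\n' \
  'direction=2, caller_number=1, clear_number=64299' \
  'caller_number=1, clear_number=64299, direction=1, extra=x' \
  'caller_number=1, clear_number=99964299, direction=2' \
  'caller_number=1, clear_number=64299, direction=2, extra=x' |
  filter_by_name "$scratch/list.txt")

printf '%s\n' "$by_name"
rm -rf "$scratch"
```

The first and last sample rows are kept even though direction appears in a different position in each; the cost is an extra split per row, which may matter on a 10GB input.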
