Remove lines from bash output that match a large number of possibilities

ufk*_*ufk 5 awk grep sed

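I'm trying to filter the lines of a large txt file (around 10GB) by the prefix of the called number, but only when the direction column equals 2.

This is the format of the file, which I get from a pipe (from a different script):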

caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...

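This is only sample data; the actual file contains many other columns, but I only want to filter clear_number based on direction, so this is enough.

I want to remove the lines whose number does not start with one of a list of prefixes, so for example here I would do the following with grep: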

grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'

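This works fine. The only problem is that the number of prefixes I get is sometimes around 10K-50K, which is a lot of prefixes, and if I try to do this with grep I get grep: regular expression is too large

Any ideas how to solve it using Bash commands?

Update

For example, say I have the following: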

caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2

My list.txt contains:

642
3333
534234235

So it will return only this line:

caller_number=83479234234,     clear_number=64237384533, direction=2

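because the clear number starts with 642 and direction equals 2. In my case this will run over 10GB of text and return at least 100K results.

Another update

Sorry, there is one more thing I wasn't clear about. I get the lines from a piped command, so I need to append | awk ... and work on the output received from the previous command.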

Rav*_*h13 7

With your shown samples, please try the following. Since the OP has changed the samples, the code below is written for the updated ones.

awk '
FNR==NR{
  arr[$0]
  next
}
match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1 && $NF ~ /^direction=2,?$/){
      print
      next
    }
  }
}
' list.txt  Input_file

Explanation: a detailed walkthrough of the code above.

awk '                  ##Starting awk program from here.
FNR==NR{               ##FNR==NR is TRUE only while the first file, list.txt, is being read.
  arr[$0]              ##Creating arr array with the current line (a prefix) as its index.
  next                 ##next will skip all further statements from here.
}
match($0,/clear_number=[^,]*/){  ##Using match to find clear_number= up to the 1st comma after it.
  val=substr($0,RSTART+13,RLENGTH-13)  ##val is the matched text minus the 13 characters of "clear_number=".
  for(i in arr){       ##Traversing through the prefixes in arr here.
    if(index(val,i)==1 && $NF ~ /^direction=2,?$/){ ##If val starts with prefix i AND the last field is direction=2 (trailing comma optional).
      print            ##Printing current line here.
      next             ##next will skip all further statements from here.
    }
  }
}
' list.txt  Input_file ##Mentioning Input_file names here.
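
The question's second update says the lines arrive through a pipe rather than from a file. A minimal sketch of the same program reading standard input, assuming awk's usual convention that - as a file operand means stdin; filter_stream is a made-up function name, and the direction test is moved in front of the loop so lines with another direction never scan the prefixes:

```shell
# filter_stream LIST: hypothetical wrapper around the awk program above.
# Reads records on stdin; prints those whose clear_number starts with a
# prefix from LIST and whose last field is direction=2.
filter_stream() {
  awk '
    FNR == NR { if ($0 != "") arr[$0]; next }   # first file: load prefixes
    $NF ~ /^direction=2,?$/ && match($0, /clear_number=[^,]*/) {
      val = substr($0, RSTART + 13, RLENGTH - 13)
      for (i in arr)
        if (index(val, i) == 1) { print; next }
    }
  ' "$1" -
}

# Usage (your_command stands for whatever produces the lines):
#   your_command | filter_stream list.txt
```

Like the original, this still assumes direction is the last field on the line.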


anu*_*ava 7

You can also try this awk:

your_command |
awk '
FNR == NR {
   rexp["=" $1]
   next
}
$3 == "direction=2" {
   for (s in rexp)
      if (index($2, s)) {
         print
         next
      }
}' list.txt -

caller_number=83479234234,     clear_number=64237384533, direction=2


Wik*_*żew 6

You can use awk to read in the prefixes and then filter the lines with them:

... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile

[,=[:space:]]+ is the field separator regex; it matches one or more commas, equals signs and whitespace characters.
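
To see which field numbers that separator produces, here is a quick check on one line from the question's sample. It is only a sketch: show_fields is a made-up helper name, and it assumes an awk that supports POSIX character classes such as [:space:].

```shell
# show_fields: hypothetical helper that prints every field of each input
# line as "index:value", using the same separator as the answer.
show_fields() {
  awk -F'[,=[:space:]]+' '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }'
}

printf '%s\n' 'caller_number=83479234234,     clear_number=64237384533, direction=2' |
  show_fields
# prints:
# 1:caller_number
# 2:83479234234
# 3:clear_number
# 4:64237384533
# 5:direction
# 6:2
```

So with the three-column sample, the called number is field 4 and the direction value is field 6.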

The FNR==NR {hash[$0]; next} part reads in the contents of list.txt, which holds the prefixes, one per line.

$6 == 2 requires field 6 (the direction) to equal 2.

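Then {for (key in hash) { if (index($4, key) == 1) { print; next } }} looks for a key in hash that is a prefix of the current field 4; if one is found, the line is printed and awk moves on to the next line.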
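
With 10K-50K prefixes, that loop tests every prefix against every direction=2 line. One alternative, offered as a sketch and not as a measured improvement, is to record the shortest and longest prefix lengths while loading list.txt and then look up each leading substring of field 4 in the array. That costs at most (max - min + 1) array lookups per line instead of one index() call per prefix. The function name filter_by_prefix is made up, and the code assumes the same three-column layout as the sample (direction value in field 6) and a non-empty list.txt.

```shell
# filter_by_prefix LIST: hypothetical helper. Reads records on stdin and
# prints those whose clear_number starts with a prefix from LIST and whose
# direction is 2.
filter_by_prefix() {
  awk -F'[,=[:space:]]+' '
    FNR == NR {                        # first file: the prefix list
      if ($0 == "") next               # ignore blank lines
      hash[$0]
      n = length($0)
      if (min == 0 || n < min) min = n # shortest prefix length seen
      if (n > max) max = n             # longest prefix length seen
      next
    }
    $6 == 2 {                          # assumes the sample field layout
      for (n = min; n <= max && n <= length($4); n++)
        if (substr($4, 1, n) in hash) { print; next }
    }
  ' "$1" -
}

# Usage (your_command stands for whatever produces the lines):
#   your_command | filter_by_prefix list.txt > outputfile
```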

  • This works, but be careful: it requires direction=2 to always be in the third field. (2 upvotes)
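
If the position of the direction column cannot be relied on, one option is to match both key=value pairs by regex instead of by field number. A rough sketch, under the assumptions that pairs are separated by commas, spaces or tabs and that clear_number holds only digits; filter_any_position is a made-up name:

```shell
# filter_any_position LIST: hypothetical helper. Same filter as above, but
# it finds direction=2 and clear_number= wherever they appear on the line.
filter_any_position() {
  awk '
    FNR == NR { if ($0 != "") prefixes[$0]; next }   # load the prefix list
    # direction=2 must be a whole value, so direction=21 does not match
    /(^|[, \t])direction=2([, \t]|$)/ && match($0, /clear_number=[0-9]+/) {
      num = substr($0, RSTART + 13, RLENGTH - 13)    # digits after "clear_number="
      for (p in prefixes)
        if (index(num, p) == 1) { print; next }
    }
  ' "$1" -
}

# Usage (your_command stands for whatever produces the lines):
#   your_command | filter_any_position list.txt
```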