I hope this finds you well. I'm writing to ask whether something like this can be done in awk.
I need to handle several NF values: for NF == 7 the PK is $1,$5, but for NF == 8 it is $1,$6.
Input
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
Desired output
File PK_OK_1
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
File DUPLICATE_PK_1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
File PK_OK_2
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
File DUPLICATE_PK_2
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
File INVALID_LENGHT
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
My code looks like this (NOM_ARCH is a variable):
BEGIN {
    FS="|"
    OFS="|"
}
NF == 7 {
    if (!seen[$1,$5]) {
        print > NOM_ARCH".PK_OK_1"; seen[$1,$5]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_1"
    }
    next
}
NF == 8 {
    if (!seen[$1,$6]) {
        print > NOM_ARCH".PK_OK_2"; seen[$1,$6]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_2"
    }
    next
}
{ print > NOM_ARCH".INVALID_LENGHT" }
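To see why the single-pass script above does not produce the desired output, here is a minimal sketch run on the two sample lines that share the key AAA, 444. The file name `sample.txt` and the prefix `demo` are placeholders chosen for illustration. Because `seen[]` is still empty when the first of the two lines arrives, that line is written to PK_OK_1 and only the second one is treated as a duplicate.

```shell
# Two 7-field sample lines with the same $1,$5 key (AAA, 444).
printf '%s\n' \
  'AAA|XXX|YYY|DDD|444|20210115|JONH2' \
  'AAA|BBB|MMM|DDD|444|20200131|JONH4' > sample.txt

# Start clean: awk's ">" truncates only the files it actually opens.
rm -f demo.PK_OK_1 demo.DUPLICATE_PK_1

# Single pass, same logic as the NF == 7 rule in the question.
awk -F'|' -v NOM_ARCH='demo' '
NF == 7 {
    if (!seen[$1,$5]) {
        print > (NOM_ARCH ".PK_OK_1")          # first occurrence lands here
        seen[$1,$5] = 1
    } else {
        print > (NOM_ARCH ".DUPLICATE_PK_1")   # only later occurrences
    }
}' sample.txt

cat demo.PK_OK_1          # the JONH2 line, although its key is duplicated
cat demo.DUPLICATE_PK_1   # the JONH4 line only
```

Deciding whether a key is unique requires having seen every line, so a single pass that prints as it goes cannot send both lines to the duplicates file.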
With your shown samples, please try the following awk code.
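awk '
BEGIN{ FS=OFS="|" }
{
  ## Reset key so a line with an invalid NF never reuses the previous key.
  key=""
  if(NF==7){ key=($1 FS $5) }
  if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
  ## First pass: count only keys from lines with a valid NF.
  if(key!=""){ arr1[key]++ }
  next
}
NF==7{
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file.DUPLICATE_PK_1")
}
NF==8{
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file.DUPLICATE_PK_2")
}
## Any other field count, shorter or longer, is invalid.
NF<7 || NF>8{
  outputFile="file.INVALID_LENGTH"
}
{
  print > (outputFile)
}
' Input_file Input_file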
Or, per the OP's request, use the following code without the ternary operator:
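awk '
BEGIN{ FS=OFS="|" }
{
  ## Reset key so a line with an invalid NF never reuses the previous key.
  key=""
  if(NF==7){ key=($1 FS $5) }
  if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
  ## First pass: count only keys from lines with a valid NF.
  if(key!=""){ arr1[key]++ }
  next
}
NF==7{
  if(arr1[key]==1){ outputFile="file.PK_OK_1" }
  else { outputFile="file.DUPLICATE_PK_1" }
}
NF==8{
  if(arr1[key]==1){ outputFile="file.PK_OK_2" }
  else { outputFile="file.DUPLICATE_PK_2" }
}
## Any other field count, shorter or longer, is invalid.
NF<7 || NF>8{
  outputFile="file.INVALID_LENGTH"
}
{
  print > (outputFile)
}
' Input_file Input_file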
Explanation: adding a detailed explanation of the above.
## Starting awk program from here.
awk '
## BEGIN section: set both FS and OFS to | here.
BEGIN{ FS=OFS="|" }
## Main program starts here.
{
  ## Reset key so a line with an invalid NF never reuses the previous key.
  key=""
  ## If NF is 7 then set key to $1 FS $5.
  if(NF==7){ key=($1 FS $5) }
  ## If NF is 8 then set key to $1 FS $6.
  if(NF==8){ key=($1 FS $6) }
}
## FNR==NR is TRUE only while Input_file is being read for the 1st time.
FNR==NR{
  ## Count how many times each key occurs; lines with an invalid NF have no key.
  if(key!=""){ arr1[key]++ }
  ## next skips all further statements for this line.
  next
}
## If NF==7 then do the following.
NF==7{
  ## Set outputFile (where the line will be written) to either file.PK_OK_1
  ## or file.DUPLICATE_PK_1, depending on the count stored in arr1.
  ## This uses the ternary operator ? :
  ## The value after ? is used if the condition arr1[key]==1 is TRUE.
  ## The value after : is used if the condition arr1[key]==1 is FALSE.
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file.DUPLICATE_PK_1")
}
## If NF==8 then do the following.
NF==8{
  ## Set outputFile to either file.PK_OK_2 or file.DUPLICATE_PK_2,
  ## depending on the count stored in arr1.
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file.DUPLICATE_PK_2")
}
## If NF is anything other than 7 or 8 then do the following.
NF<7 || NF>8{
  ## Set outputFile to file.INVALID_LENGTH here.
  outputFile="file.INVALID_LENGTH"
}
{
  ## Print the current line to outputFile (its value was set above).
  print > (outputFile)
}
## Mention the Input_file name twice so that it is read twice.
' Input_file Input_file
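The `FNR==NR` test explained above is what makes the two-pass approach work, so here is a small sketch of the idiom in isolation. The file names `sample.txt` and `counts.txt` are placeholders, and the key is simply $1 and $5 of three 7-field sample lines.

```shell
printf '%s\n' \
  'AAA|BBB|CCC|DDD|111|20220129|JONH1' \
  'AAA|XXX|YYY|DDD|444|20210115|JONH2' \
  'AAA|BBB|MMM|DDD|444|20200131|JONH4' > sample.txt

# The same file is listed twice, so awk reads it twice.
# NR counts lines across both reads; FNR restarts at 1 for each read,
# so FNR == NR holds only during the first read.
awk -F'|' '
FNR == NR { count[$1 FS $5]++; next }   # pass 1: count each key
{ print count[$1 FS $5], $5, $7 }       # pass 2: the counts are now final
' sample.txt sample.txt > counts.txt

cat counts.txt
# 1 111 JONH1
# 2 444 JONH2
# 2 444 JONH4
```

Both 444 lines report a count of 2, including the first one, which a single pass cannot know when it sees that line. Note that the idiom assumes a non-empty regular file: a pipe cannot be read twice, and with an empty first file `FNR==NR` would also be true while reading the second one.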
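Normally I'd suggest using sort and uniq -c for the first pass for efficiency, but I started out assuming the wrong requirements and wrote most of this under that assumption, so I've just adjusted it to the real requirements. Here is how to do it all in one awk script: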
$ cat tst.awk
BEGIN {
    FS=OFS="|"
    map[7] = 1
    map[8] = 2
}
{ key = $1 FS $(NF-2) FS NF }
NR==FNR {
    cnt[key]++
    next
}
{
    if ( NF in map ) {
        sfx = ( cnt[key]>1 ? "DUPLICATE_PK" : "PK_OK" ) "_" map[NF]
    }
    else {
        sfx = "INVALID_LENGTH"
    }
    print > (nom_arch "." sfx)
}
$ awk -v nom_arch='foo' -f tst.awk file file
$ head foo.*
==> foo.DUPLICATE_PK_1 <==
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
==> foo.DUPLICATE_PK_2 <==
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
==> foo.INVALID_LENGTH <==
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
==> foo.PK_OK_1 <==
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
==> foo.PK_OK_2 <==
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
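I corrected the spelling of LENGTH above.
Note that NF is included in key = $1 FS $(NF-2) FS NF so that we avoid the potential case pointed out by @rowboat, where a line with 7 fields has the same $1 and $(NF-2) as a line with 8 fields; otherwise we would end up with a single count of 2 where there should be 2 separate counts of 1.
We could have used NF-6 instead of map[NF] when setting sfx, but map[] is also useful for identifying valid NF values, and in future there may be other NF values for which the sfx cannot be determined just by subtracting 6.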
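A small sketch of that collision, using two invented lines (not from the question's sample) in which a 7-field line and an 8-field line share the same $1 and $(NF-2). The file names are placeholders for illustration.

```shell
# One 7-field line and one 8-field line; $1 and $(NF-2) are both AAA and 444,
# but the lines belong to different record layouts. Invented data.
printf '%s\n' \
  'AAA|BBB|CCC|DDD|444|20210115|JONH2' \
  'AAA|BBB|CCC|DDD|777|444|JONH5|MARY' > collide.txt

# Without NF in the key, the two lines share one counter.
awk -F'|' '{ cnt[$1 FS $(NF-2)]++ }
END { for (k in cnt) print k, cnt[k] }' collide.txt > without_nf.txt

# With NF in the key, each layout gets its own counter.
awk -F'|' '{ cnt[$1 FS $(NF-2) FS NF]++ }
END { for (k in cnt) print k, cnt[k] }' collide.txt | sort > with_nf.txt

cat without_nf.txt
# AAA|444 2
cat with_nf.txt
# AAA|444|7 1
# AAA|444|8 1
```

With the shorter key, both lines would be reported as duplicates of each other even though each is unique within its own layout.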