How do I add an integer to a difference calculation and print the result at the end of the line?

Gaw*_*ain 6 awk sed

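Goal: compute the difference between two colon-separated fields ($3 and $2), add an integer (+1) to that value, and append the result to the end of every line that starts with ">".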

A representative example of my file:

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP

Desired output:

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:111
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:42
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:36
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:39
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:90
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP

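My current awk script gets me very close: it prints the difference between $3 and $2 at the end of each line. However, it does not include the +1 step (required), and it is not restricted to lines starting with ">", despite my attempt to use /^ *>/ (not required, but it would be nice):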

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4=$3-$2} $4<0 {$4=-$4} 1' file

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:110
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR:::0
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:41
MFLLHYYLIIQVI:::0
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:35
MQWIKDKVLIK:::0
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:38
MFYPLYLDYLYY:::0
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:89
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP:::0

My attempts to add an integer (+1) to the difference calculation:

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4+1=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4+=1=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

$ awk -F":" -v n=1 'BEGIN {OFS=FS} /^ *>/ {$4+n=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

I am not sure how to implement functions in awk, but I think something like the following might be useful:

$ function add_one (number) {
      return number + 1
  }
$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {add_one($4)=$3-$2} $4<0 {$4=-$4} 1' file

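Although I have been trying to solve this with awk, I am interested in any solution (for example, since I am doing this calculation line by line, perhaps there is a more efficient solution with sed?).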

Rav*_*h13 5

With your shown samples, could you please try the following.

awk -F':|[[:space:]]+' -v OFS=":" '/^>/{$0=$0 OFS ($6>$5?($6-$5+1):($5-$6+1))} 1' Input_file

Or

awk -F':|[[:space:]]+' -v OFS=":" '/^>/{print $0,($6>$5?($6-$5+1):($5-$6+1));next} 1' Input_file

Explanation: a detailed breakdown of the first command above.

awk -F':|[[:space:]]+' -v OFS=":" '      ##Set the field separator to a colon or a run of whitespace, and OFS to a colon.
/^>/{                                    ##Act only on lines that start with >.
  $0=$0 OFS ($6>$5?($6-$5+1):($5-$6+1))  ##Append OFS and the result to the line: the ternary subtracts the smaller of fields 5 and 6 from the larger, then adds 1.
}
1                                        ##Print the current line, modified or not.
' Input_file                             ##Name of the input file.
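To see why the coordinates end up in $5 and $6 with this separator, it can help to list the fields of one header line. The sketch below is only an illustration, not part of the answer's command: it swaps `[[:space:]]+` for `[ \t]+`, an assumption made here because some older awk builds may not support POSIX character classes, and the variable names `line` and `fields` are hypothetical.

```shell
# Sample header line taken from the question.
line='>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product'

# Split on a colon or a run of blanks and list each field with its index.
# [ \t]+ stands in for [[:space:]]+ (an assumption made for portability;
# it matches the same separators in this sample).
fields=$(printf '%s\n' "$line" |
  awk -F':|[ \t]+' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')
printf '%s\n' "$fields"
```

The listing shows the two coordinates as fields 5 and 6, which is what the ternary in the command compares.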

EDIT: Since the OP mentioned that they are using mawk and that the code above did not fully work for them, here is another approach:

awk -v OFS=":" '
{
  gsub(/\r/,"")
}
/^>/ && match($0,/[0-9]+:[0-9]+/){
  split(substr($0,RSTART,RLENGTH),arr,":")
  print $0,1+(arr[1]>arr[2]?(arr[1]-arr[2]):(arr[2]-arr[1]))
  next
}
1' Input_file
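The match/split step in this version can be examined in isolation. The following is a minimal sketch on a single header line from the question. The variable names `line`, `span`, `pair` and `d` are hypothetical, and the absolute value is written with an `if` rather than the ternary used above. It prints the matched coordinate pair followed by the computed length.

```shell
# Header line from the question where the first coordinate is the larger one.
line='>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product'

span=$(printf '%s\n' "$line" | awk '
match($0, /[0-9]+:[0-9]+/) {
  pair = substr($0, RSTART, RLENGTH)   # the matched "start:end" text
  split(pair, arr, ":")                # arr[1] = start, arr[2] = end
  d = arr[1] - arr[2]
  if (d < 0) d = -d                    # absolute difference
  print pair, d + 1
}')
printf '%s\n' "$span"
```

Because the pattern requires digits on both sides of a colon, the plain numbers earlier in the line (17609, 17804) are skipped and the first match is the coordinate pair.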


anu*_*ava 5

Here is an alternative awk solution that works with all awk versions:

awk 'BEGIN {FS=OFS=":"} /^>/ {
   v3=$3+0
   diff = 1 + (v3 > $2 ? v3-$2 : $2-v3)
   $0 = $0 OFS diff
} 1' file

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:111
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:42
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:36
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:39
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:90
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP
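The `v3=$3+0` line matters because, with ":" as the separator, $3 holds text such as "111 unnamed protein product". A field like that is compared as a string unless it is first forced to a number. The sketch below illustrates the difference on one sample line. The names `line`, `cmp`, `raw` and `forced` are hypothetical and exist only for this demonstration.

```shell
# One header line from the sample; with ":" as separator, $3 is
# "111 unnamed protein product", which is not a purely numeric string.
line='>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product'

cmp=$(printf '%s\n' "$line" | awk -F: '{
  raw    = ($3 > $2)     ? "greater" : "not greater"  # compared as strings
  forced = ($3 + 0 > $2) ? "greater" : "not greater"  # compared as numbers
  print "raw: " raw "; forced: " forced
}')
printf '%s\n' "$cmp"
```

Here the raw comparison reports "not greater" because "1" sorts before "7" as text, while the forced comparison correctly treats 111 as larger than 70.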

PS: ... before running this awk command.
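On the syntax errors in the question: the left-hand side of an awk assignment has to be a variable or a field such as `$4`, so the +1 belongs on the right-hand side. In addition, an awk function is defined inside the awk program rather than in the shell. A minimal sketch under those assumptions follows. The helper name `abs` is hypothetical (awk has no built-in absolute value), and the two records are copied from the question's sample.

```shell
# Two header/sequence pairs copied from the sample in the question.
sample='>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP'

result=$(printf '%s\n' "$sample" | awk -F: '
# abs() is a hypothetical helper name; awk has no built-in absolute value.
function abs(x) { return x < 0 ? -x : x }
BEGIN { OFS = FS }
# Only header lines are modified; the +1 sits on the right-hand side.
/^>/ { $4 = abs($3 - $2) + 1 }
1')
printf '%s\n' "$result"
```

Keeping the sign handling inside the `/^>/` block also leaves the sequence lines untouched, which avoids the trailing `:::0` seen in the question's output.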