Ale*_*nte 2 bash perl awk grep fasta
I have been trying to count the number of 1s for each species in a fasta file that looks like this:
>111
1100101010
>102
1110000001
The desired output is:
>111
5
>102
4
I know how to get the number of 1s in the file:
grep -c 1 file
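My problem is that I cannot find a way to track the number of 1s per species (rather than the total for the whole file).
grep -c 1 gives you the number of matching lines, not the total number of 1s. You can use grep -o to print only the matched part of each matching line, one match per line, and then wc -l to count those lines.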
while read -r line
do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $count ]]; then
printf "%d\n" $count
fi
count=0
echo "$line"
else
((count += $(grep -o 1 <<< "$line" | wc -l)))
fi
done < fasta_file
if [[ -n $count ]]; then
printf "%d\n" $count
fi
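To see the difference between the two grep invocations in isolation, here is a small sketch that runs both on a single sequence line. The function names are made up for this illustration.

```shell
# Hypothetical helper names, chosen only for this illustration.
# grep -c counts lines that contain at least one "1".
lines_with_1() { printf '%s\n' "$1" | grep -c 1; }
# grep -o prints each "1" on its own line; wc -l then counts those lines.
total_1s() { printf '%s\n' "$1" | grep -o 1 | wc -l; }

lines_with_1 1100101010
total_1s 1100101010
```

On the sample line the first call reports 1 (one matching line) and the second reports 5 (five individual matches). Some wc implementations pad the number with leading spaces.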
Or in pure bash, using parameter expansion:
while read -r line
do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $count ]]; then
printf "%d\n" $count
fi
count=0
echo "$line"
else
line="${line//[^1]/}" # remove everything but 1's
((count += ${#line})) # add the length of line to count
fi
done < fasta_file
if [[ -n $count ]]; then
printf "%d\n" $count
fi
A similar setup in perl (a sketch that mirrors the bash loop above and counts with tr/1//):
perl -ne '
    if (/^>/) {
        print "$count\n" if defined $count;   # flush the previous species
        $count = 0;
        print;                                # print the header line
    } else {
        $count += tr/1//;                     # tr returns the number of 1s
    }
    END { print "$count\n" if defined $count }
' fasta_file
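The transliteration operator tr/// returns the number of transliterations it performed, and since 1 is the only argument, that number is the same as the count of 1s.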
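The same counting idea can be approximated outside perl. As a rough shell analogue (an illustration, not part of the answer above), the tr utility can delete every character except 1, and wc -c can count what is left. The helper name count_ones is an assumption made for this sketch.

```shell
# count_ones is a hypothetical helper name for this sketch.
count_ones() {
    # -c complements the set and -d deletes: every character except "1"
    # is removed (including the newline), so the remaining byte count
    # equals the number of 1s in the argument.
    printf '%s\n' "$1" | tr -cd 1 | wc -c
}

count_ones 1110000001
```

Unlike perl's tr///, the shell tr does not report a count itself, which is why the result is piped to wc -c.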
One awk idea:
awk '
/^>/ { print ; next } # print lines starting with ">"; skip to next input line
{ print gsub(/1/,"x") } # replace all "1" characters with dummy "x"; gsub() returns number of replacements (ie, number of "1" characters in the line)
' file
Or as a one-liner:
awk '/^>/ {print;next} {print gsub(/1/,"x")}' file
Collapsed into a ternary operator that decides what to print:
awk '{print ($0 ~ /^>/ ? $0 : gsub(/1/,"x"))}' file
These all generate:
>111
5
>102
4
A record such as
>111
11001010101110000001
can also be written as
>111
1100101010
1110000001
but none of the existing solutions accommodate the latter. This addresses that oversight:
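One way to accumulate the count across every sequence line of a record is sketched below in awk. It assumes each header line starts with ">" and that everything up to the next header belongs to the same record; the function name count_ones_per_record is hypothetical.

```shell
# count_ones_per_record is a hypothetical name chosen for this sketch.
# Reads the files given as arguments, or standard input if there are none.
count_ones_per_record() {
    awk '
        /^>/ {                          # a header line starts a new record
            if (seen) print count       # flush the count of the previous record
            print                       # echo the header itself
            count = 0; seen = 1
            next
        }
        { count += gsub(/1/, "1") }     # gsub returns how many 1s it replaced
        END { if (seen) print count }   # flush the final record
    ' "$@"
}
```

It could be called as count_ones_per_record fasta_file. Because the count is only printed when the next header (or the end of input) is reached, it does not matter how the sequence is wrapped across lines.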
For both files shown above, the program outputs
>111
9