asp*_*ill 6 command-line text-processing
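I have a file containing multiple sequences. The problem is that each ID is followed by a space and then the actual sequence, and I want to add a line break between the ID and the actual sequence.
This is what I have: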
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
This is what I want it to look like:
UniRef90_Q8YC41 Putative binding protein BMEII0691
MNRFIAFFRSVFLIGLVATAFGRACA
If possible, I would prefer it to look like this:
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
hee*_*ayl 10
Using awk, print the first and last fields with \n as the separator:
awk '{printf "%s\n%s\n", $1, $NF}' file.txt
Using sed, capture the first and last fields in groups and use them in the replacement (\n in the replacement text is a GNU sed extension):
sed -E 's/([^[:blank:]]+).*[[:blank:]]([^[:blank:]]+)$/\1\n\2/' file.txt
With perl, using logic similar to the sed version:
perl -pe 's/^([^\s]+).*\s([^\s]+)/$1\n$2/' file.txt
Using bash, a slower approach: read each line into an array, then print the first and last elements of the array separated by \n:
while read -ra line; do printf '%s\n%s\n' "${line[0]}" "${line[$((${#line[@]}-1))]}"; done <file.txt
Using python, split each line into a list of whitespace-separated elements, then print the first and last elements of the list separated by \n:
with open("file.txt") as f:
    for line in f:
        line = line.split()
        print(line[0]+'\n'+line[-1])
Examples:
$ cat file.txt
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
$ awk '{printf "%s\n%s\n", $1, $NF}' file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
$ sed -E 's/([^[:blank:]]+).*[[:blank:]]([^[:blank:]]+)$/\1\n\2/' file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
$ perl -pe 's/^([^\s]+).*\s([^\s]+)/$1\n$2/' file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
$ while read -ra line; do printf '%s\n%s\n' "${line[0]}" "${line[$((${#line[@]}-1))]}"; done <file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
>>> with open("file.txt") as f:
...     for line in f:
...         line = line.split()
...         print(line[0]+'\n'+line[-1])
...
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
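The commands above all produce the second, preferred layout (ID only). For the first layout in the question, where the description stays on the header line, a minimal Python sketch could look like the following. The names `split_record` and `convert` and the `keep_description` flag are made up for illustration. It assumes the sequence is always the last whitespace-separated field, and it skips lines with fewer than two fields instead of failing on them.

```python
def split_record(line, keep_description=False):
    """Return (header, sequence) for one record, or None if the line is too short."""
    fields = line.split()
    if len(fields) < 2:
        # Blank or single-field lines carry no header/sequence pair; skip them.
        return None
    # Assumption: the sequence is the last whitespace-separated field.
    # With keep_description, runs of whitespace in the header collapse to one space.
    header = " ".join(fields[:-1]) if keep_description else fields[0]
    return header, fields[-1]


def convert(lines, keep_description=False):
    """Yield the header and the sequence of every record as separate lines."""
    for line in lines:
        record = split_record(line, keep_description)
        if record is not None:
            yield from record


sample = ["UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA\n"]
print("\n".join(convert(sample)))                         # ID only
print("\n".join(convert(sample, keep_description=True)))  # ID plus description
```

To process a real file, something like `with open("file.txt") as f: print("\n".join(convert(f)))` should work, since `convert` accepts any iterable of lines.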
In this answer:
bash+xargs one-liner
python one-liner
Ruby one-liner

bash+xargs version

$> cat input_file.txt | xargs -L 1 bash -c 'for i; do : ; done ; echo $1;echo $i' bash
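This basically passes the words of each line to bash as command-line arguments, loops until we reach the last argument, and then echoes the first and the last one.
Demo: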
$> cat input_file.txt
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA
$> cat input_file.txt | xargs -L 1 bash -c 'for i; do : ; done ; echo $1;echo $i' bash
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
A shorter version:
$> cat input_file.txt | xargs -L 1 bash -c 'echo $1;echo ${@: -1}' bash
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA

python one-liner

This one-liner builds a list of strings, each of which is basically first word + newline + last word. Finally, it prints all list items as a single string joined with newlines. (The parentheses around the print argument let it run under both Python 2 and Python 3.)
python -c 'import sys ; print("\n".join([ l.split()[0] + "\n" + l.split()[-1] for l in sys.stdin ]))' < input_file.txt
Usage demo:
$ python -c 'import sys ; print("\n".join([ l.split()[0] + "\n" + l.split()[-1] for l in sys.stdin ]))' < input_file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
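
Ruby one-liner

In this one-liner, the -n flag acts as a while gets ... end loop. $_ holds each line as it is read, so for every line we split it into an array of words and then print the first and the last one.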
$ ruby -ne 'words=$_.split(); puts words[0],words[-1]' < input_file.txt
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
UniRef90_Q8YC41
MNRFIAFFRSVFLIGLVATAFGRACA
File.open(ARGV[0]) do |f|
  f.each do |line|
    puts "#{line.partition(' ')[0] + "\n" + line.rpartition(' ')[-1]}"
  end
end
Save it under any name, for example line_breaker.rb, and run it with ruby line_breaker.rb file.txt, where file.txt is the file that holds your sequences.
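For comparison, the same partition/rpartition idea can be sketched in Python, whose `str.partition` and `str.rpartition` behave much like the Ruby methods. The function name `first_and_last` is made up for illustration. Like the Ruby script, this splits only on a literal space, so it assumes the fields are space-separated rather than tab-separated and that lines carry no trailing spaces.

```python
def first_and_last(line):
    """Return the text before the first space and the text after the last space."""
    # Drop the trailing newline so it does not end up attached to the sequence.
    text = line.rstrip("\n")
    first = text.partition(" ")[0]   # everything before the first space
    last = text.rpartition(" ")[2]   # everything after the last space
    return first, last


record = "UniRef90_Q8YC41 Putative binding protein BMEII0691 MNRFIAFFRSVFLIGLVATAFGRACA\n"
for part in first_and_last(record):
    print(part)
```

Unlike `split()`, which splits on any run of whitespace, `partition(" ")` stops at the first literal space only, so this variant is the stricter of the two.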