I have the following data frame:
library(tidyverse)
dat <- structure(list(fasta_header = c(">seq1", ">seq2"), sequence = c("MPSRGTRPE",
"VSSKYTFWNF")), .Names = c("fasta_header", "sequence"), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
dat
#> # A tibble: 2 x 2
#>   fasta_header sequence
#>   <chr>        <chr>
#> 1 >seq1        MPSRGTRPE
#> 2 >seq2        VSSKYTFWNF
What I want to do is count the amino acid frequency for each row. The desired result (worked out by hand) is:
fasta_header sequence   M P S R G T E V K Y F W N
>seq1        MPSRGTRPE  1 2 1 2 1 1 1 0 0 0 0 0 0
>seq2        VSSKYTFWNF 0 0 2 0 0 1 0 1 1 1 2 1 1
How can I do this with a dplyr pipe approach?
The comment above is right, but if you really want a tidyverse pipeline...
library(tidyverse) # uses dplyr, purrr, tidyr and stringr
dat %>%
  mutate(split = map(sequence, ~ unlist(str_split(., "")))) %>% # split each sequence into single characters
  unnest(split) %>% # one row per character (naming the column avoids the warning in newer tidyr)
  group_by(fasta_header, sequence) %>% # group by sequence
  count(split) %>% # count letters within each group
  spread(key = split, value = n, fill = 0) # convert to wide format
  fasta_header sequence       E     F     G     K     M     N     P     R     S     T     V     W     Y
  <chr>        <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 >seq1        MPSRGTRPE     1.    0.    1.    0.    1.    0.    2.    2.    1.    1.    0.    0.    0.
2 >seq2        VSSKYTFWNF    0.    2.    0.    1.    0.    1.    0.    0.    2.    1.    1.    1.    1.
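For what it's worth, the same split / count / widen idea can probably be written with `pivot_wider()`, which tidyr now recommends over `spread()`. This is only a sketch, assuming tidyr 1.1.0 or later (for the scalar `values_fill`); the names `residue` and `freq` are made up for illustration and are not from the answer above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question, rebuilt with tibble()
dat <- tibble(
  fasta_header = c(">seq1", ">seq2"),
  sequence     = c("MPSRGTRPE", "VSSKYTFWNF")
)

freq <- dat %>%
  # str_split() returns a list with one character vector per sequence
  mutate(residue = str_split(sequence, "")) %>%
  # expand the list column: one row per residue
  unnest(residue) %>%
  # count each residue within each sequence
  count(fasta_header, sequence, residue) %>%
  # one column per residue; residues absent from a sequence become 0
  pivot_wider(names_from = residue, values_from = n, values_fill = 0L)

freq
```

Note that the counts stay integers here, and the residue columns come out in order of first appearance rather than alphabetically, so add a `select()` or `relocate()` step if a fixed column order matters.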