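I have this `mystring` with the delimiter `_`. The condition is: if there are two or more delimiters, I want to split at the second delimiter; if there is only one delimiter, I want to split at ".ReCal", and get the result shown below.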
mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")
Result:
"MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
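One possible approach, offered as a minimal sketch rather than a confirmed answer: first cut everything from ".ReCal" onward, then cut everything from the second `_` onward if one remains. The intermediate name `trimmed` is introduced here for illustration only.

```r
mystring <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam")

# Step 1: drop ".ReCal" and everything after it.
trimmed <- sub("\\.ReCal.*$", "", mystring)

# Step 2: if a second "_" is still present, keep only the text before it.
# Strings with a single "_" do not match and are returned unchanged.
result <- sub("^([^_]*_[^_]*)_.*$", "\\1", trimmed)

print(result)
```

Doing it in two `sub()` calls keeps each pattern simple and avoids lazy quantifiers, which would need `perl = TRUE`.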
I have this data frame `mydf`. The `nucleotide` column can contain the letters 'A', 'T', 'G', 'C'. If the `strand` column is '-', I want to change the letter A to T, C to G, G to C, and T to A. How can I do that?
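A minimal sketch of one way to do this, using base R's `chartr()` to swap each base for its complement. Because the `dput` that follows is truncated, the example runs on a small stand-in data frame, `mydf_demo`, whose values are made up for illustration. It also assumes the condition refers to the `strand` column.

```r
# Hypothetical stand-in for mydf (the real dput output is truncated).
mydf_demo <- data.frame(
  strand     = factor(c("+", "-", "+", "-"), levels = c("+", "-", "*")),
  nucleotide = factor(c("C", "C", "A", "G"), levels = c("A", "C", "G", "T"))
)

# Work on character values so the factor levels do not get in the way.
nuc   <- as.character(mydf_demo$nucleotide)
minus <- as.character(mydf_demo$strand) == "-"

# chartr() maps A->T, T->A, G->C, C->G in a single pass.
nuc[minus] <- chartr("ATGC", "TACG", nuc[minus])
mydf_demo$nucleotide <- factor(nuc, levels = c("A", "C", "G", "T"))

print(mydf_demo)
```

Because `chartr()` translates all characters at once, there is no risk of an A that became T being flipped back to A, which can happen with chained `gsub()` calls.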
mydf<- structure(list(seqnames = structure(c(1L, 1L, 1L, 1L), .Label = c("chr1",
"chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
"chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16",
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX",
"chrY", "chrM"), class = "factor"), pos = c(115258748, 115258748,
115258748, 115258748), strand = structure(c(1L, 2L, 1L, 2L), .Label = c("+",
"-", "*"), class = "factor"), nucleotide = structure(c(2L, 2L,
2L, 2L), .Label = c("A", "C", "G", "T", …Run Code Online (Sandbox Code Playgroud) 我有一个名为 的数据框tt。我想创建一个名为 Ethnicity 的新列,其中我希望为超过 80% 的每一行值设置一个列标题。如果没有行具有大于 80% 的值,那么我希望该行中有字符串“MIX”。
tt <- structure(list(INDIVIDUAL = c("SJL0253301", "SJL1073801", "SJL1066401",
"SJL1762813"), EUR = c(0.974378, 0.496489, 1e-05, 1e-05), EAS = c(0.010592,
0.438799, 0.99996, 1e-05), AMR = c(0.004699, 1e-05, 1e-05, 0.99996
), SAS = c(1e-05, 0.053618, 1e-05, 1e-05), AFR = c(0.010321,
0.011084, 1e-05, 1e-05)), row.names = c(1L, 44L, 19L, 911L), class = "data.frame")
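One way this could be done, as a sketch only: take the column with the largest value in each row using `max.col()`, and keep its name only when that value is above 0.8. The helper names `pop_cols`, `probs`, `top` and `top_val` are introduced here for illustration.

```r
tt <- structure(list(INDIVIDUAL = c("SJL0253301", "SJL1073801", "SJL1066401",
"SJL1762813"), EUR = c(0.974378, 0.496489, 1e-05, 1e-05), EAS = c(0.010592,
0.438799, 0.99996, 1e-05), AMR = c(0.004699, 1e-05, 1e-05, 0.99996
), SAS = c(1e-05, 0.053618, 1e-05, 1e-05), AFR = c(0.010321,
0.011084, 1e-05, 1e-05)), row.names = c(1L, 44L, 19L, 911L), class = "data.frame")

# Columns holding the ancestry proportions.
pop_cols <- c("EUR", "EAS", "AMR", "SAS", "AFR")
probs <- as.matrix(tt[pop_cols])

# Index of the largest value in each row, and that value itself.
top     <- max.col(probs, ties.method = "first")
top_val <- probs[cbind(seq_len(nrow(probs)), top)]

# Keep the column name only if its value exceeds 80%; otherwise "MIX".
tt$Ethnicity <- ifelse(top_val > 0.8, pop_cols[top], "MIX")

print(tt)
```

Since at most one proportion in a row can exceed 0.8 when the proportions sum to about 1, checking only the row maximum is enough.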
The result I want:
INDIVIDUAL EUR EAS AMR SAS AFR Ethnicity
SJL0253301 0.974378 0.010592 0.004699 0.000010 0.010321 EUR
SJL1073801 0.496489 0.438799 0.000010 0.053618 0.011084 MIX …
I am trying to install `RMySQL` through RStudio on Ubuntu, but I get the error below. Could someone please help me resolve it?
Installing package into ‘/R_latest/lib/R/library’
(as ‘lib’ is unspecified)
--2015-11-18 11:40:26-- https://cran.rstudio.com/src/contrib/RMySQL_0.10.7.tar.gz
Resolving cran.rstudio.com (cran.rstudio.com)... 54.230.132.47
Connecting to cran.rstudio.com (cran.rstudio.com)|54.230.132.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52422 (51K) [application/x-gzip]
Saving to: '/tmp/Rtmp52166B/downloaded_packages/RMySQL_0.10.7.tar.gz’
0K .......... .......... .......... .......... .......... 97% 1.73M 0s
50K . 100% 2276G=0.03s
2015-11-18 11:40:26 (1.77 MB/s) - '/tmp/Rtmp52166B/downloaded_packages/RMySQL_0.10.7.tar.gz’ saved [52422/52422]
* installing *source* package ‘RMySQL’ ...
** package ‘RMySQL’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
Using PKG_LIBS=-lmysqlclient -lz …
My variable names are formatted as follows:
PP_Sample_12.GT
or
PP_Sample-17.GT
I am trying to use string splitting to grep out the middle part, i.e. `Sample_12` or `Sample-17`. However, when I do this:
IDtmp <- sapply(strsplit(names(df[c(1:13)]),'_'),function(x) x[2])
IDs <- data.frame(sapply(strsplit(IDtmp,'.GT',fixed=T),function(x) x[1]))
I end up with `Sample` for `PP_Sample_12.GT`.
Is there another way to do this? Perhaps with a pattern/replacement type of function? I'm not sure whether one exists in R (though I think `gsub` might work here).
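A pattern/replacement approach does exist in base R. As a minimal sketch, `sub()` with a capture group can strip the prefix up to the first `_` and the `.GT` suffix in one call. The vector `nms` below is a stand-in for `names(df)[1:13]`, which is not shown in the question.

```r
# Stand-in for the real column names.
nms <- c("PP_Sample_12.GT", "PP_Sample-17.GT")

# Drop everything up to and including the first "_", and the trailing ".GT";
# the capture group keeps whatever lies in between, underscores included.
IDs <- sub("^[^_]*_(.*)\\.GT$", "\\1", nms)

print(IDs)
```

The original attempt loses `_12` because `strsplit(..., "_")` breaks the name at every underscore and only the second piece is kept.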
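I have a matrix called `mymat`. I have a vector called `geno <- c("01","N1","11","1N","10")`. I have another table called `key.table`. What I want to do is match the `key` column of `key.table` against the `key` column of `mymat`. If any column value in a matching row is one of the `geno` elements, I want to extract the name of that column from `mymat` and paste it into a new column `matched.extract` in `key.table`, for each corresponding `key` row, and get the result.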
mymat <- structure(c("chr5:12111", "chr5:12111", "chr5:12113", "chr5:12114",
"chr5:12118", "0N", "0N", "1N", "0N", "0N", "00", "00", "00",
"11", "10", "00", "00", "1N", "0N", "00"), .Dim = c(5L, 4L), .Dimnames = list(
c("34", "35", "36", "37", "38"), c("key", "AMLM12001KP",
"AMAS-11.3-Diagnostic", "AMLM12014N-R")))
key.table<- structure(c("chr5:12111", "chr5:12111", "chr5:12113", "chr5:12114",
"chr5:12118", "chr5:12122", "chr5:12123", "chr5:12123", "chr5:12125",
"chr5:12127", "chr5:12129", "9920068", …
I have a data frame `dd2` with hundreds of columns, and what I need to do is paste all of those column values together, omitting any NA values. If I do something like this
apply(dd2, 1, paste, collapse=",")
it actually includes the NAs as the string "NA". I want to avoid that. I could also do it as shown below, but that would force me to handle each column individually to get the result.
result <- cbind(
dd2,
combination = paste(dd2[,2], replace(dd2[,3], is.na(dd2[,3]), ""), sep = ",")
)
Is there an efficient way to do this? Here is the sample data:
dd2 <- structure(c("A", "B", "C", "D", "E", "AK2", "HFM1", NA, "TRR",
"RTT", NA, "PPT", "TRR", "RTT", NA, "PPT", NA, NA, "GGT", NA), .Dim = c(5L,
4L), .Dimnames = list(NULL, c("sample_id", "plant", "animal",
"more")))
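One possible way, sketched under the assumption that every column (including `sample_id`) should take part in the paste, as in the `apply()` call above: drop the NA entries of each row before collapsing.

```r
dd2 <- structure(c("A", "B", "C", "D", "E", "AK2", "HFM1", NA, "TRR",
"RTT", NA, "PPT", "TRR", "RTT", NA, "PPT", NA, NA, "GGT", NA), .Dim = c(5L,
4L), .Dimnames = list(NULL, c("sample_id", "plant", "animal",
"more")))

# For each row, remove the NA entries and join what is left with commas.
combination <- apply(dd2, 1, function(row) paste(row[!is.na(row)], collapse = ","))

result <- cbind(dd2, combination)
print(result)
```

This works for any number of columns, because the NA filter is applied per row rather than per column.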
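I need to read a file of at least 10 GB in R. To limit memory usage, I only want to read the lines that match a pattern. For example, in the text file `mytext.tsv` below, I want to start reading at the line beginning with `wanted`, which will become the header, and then read only the lines whose `col2` matches `coding` or `synonymous`, i.e. `patterns`.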
patterns <- c("coding", "synonymous")
mytext.tsv:
## lines unwanted
## lines unwanted1
## lines unwanted2
## lines unwanted3
wanted col1 col2
aaa variant1 coding
jhjh variant2 non-coding
ggg variant3 synonymous
fgg variant4 coding
gdg variant6 missense
My expected data frame should be:
wanted col1 col2
aaa variant1 coding
ggg variant3 synonymous
I know I can use a connection and `scan` and then loop over each pattern, but is there an efficient way to do this in R?
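A minimal base-R sketch of one option: read the file in chunks through a connection, discard the `##` lines, and keep only lines whose last field matches any pattern, so the full file is never held in memory. The function `read_matching()` and its `chunk_size` argument are hypothetical names introduced here. The sketch assumes the file is tab-separated, that `col2` is the last column, and that the patterns are plain words with no regex metacharacters. It writes a small temporary copy of the sample so it can run on its own.

```r
patterns <- c("coding", "synonymous")

# Write the sample from the question to a temporary tab-separated file.
path <- tempfile(fileext = ".tsv")
writeLines(c("## lines unwanted", "## lines unwanted1", "## lines unwanted2",
             "## lines unwanted3", "wanted\tcol1\tcol2",
             "aaa\tvariant1\tcoding", "jhjh\tvariant2\tnon-coding",
             "ggg\tvariant3\tsynonymous", "fgg\tvariant4\tcoding",
             "gdg\tvariant6\tmissense"), path)

read_matching <- function(path, patterns, chunk_size = 100000L) {
  # One regex for all patterns: the last tab-separated field must equal one of them.
  rx <- paste0("\t(", paste(patterns, collapse = "|"), ")$")
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- NULL
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break            # end of file
    lines <- lines[!startsWith(lines, "##")]  # skip unwanted comment lines
    if (is.null(header) && length(lines) > 0L) {
      header <- lines[1]                      # first non-comment line is the header
      lines <- lines[-1]
    }
    kept <- c(kept, lines[grepl(rx, lines)])
  }
  read.table(text = c(header, kept), header = TRUE, sep = "\t",
             stringsAsFactors = FALSE)
}

# A tiny chunk size is used here only to exercise the chunking logic.
mytext <- read_matching(path, patterns, chunk_size = 3L)
print(mytext)
```

On the sample above this keeps three rows, because `fgg variant4 coding` also has `coding` in `col2`. Combining the patterns into a single regex avoids looping over them, and anchoring on the tab keeps `non-coding` from matching.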
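I need help merging the rows of my data (`mydf`) that share the same name (i.e. the same value in the `start` column) and concatenating the contents of the "ALT" column, so that all duplicate rows with identical `start` values are removed. I want to merge the rows, concatenate the contents of the "ALT" column separated by commas, and get the result shown below. Thanks for your help.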
> mydf
chr start end REF ALT TYPE refGene
chr10 chr10:176131 176131 C A snp nonsynonymous SNV
chr10 chr10:159149 159149 C G snp:17659149 nonsynonymous SNV
chr10 chr10:159149 159149 C T snp:17659149 nonsynonymous SNV
chr10 chr10:241469 241469 T C snp splicing
> result
chr start end REF ALT TYPE refGene
chr10 chr10:176131 176131 C A snp nonsynonymous SNV
chr10 chr10:159149 159149 C G,T snp:17659149 nonsynonymous SNV
chr10 chr10:241469 241469 T C snp splicing
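One way to get from `mydf` to `result`, as a sketch: use `ave()` to paste together the `ALT` values within each `start` group, then drop the duplicated `start` rows. The data frame below is a stand-in rebuilt by hand from the printed table, since the `dput` output in the question is truncated.

```r
# Stand-in rebuilt from the printed mydf table above.
mydf <- data.frame(
  chr     = rep("chr10", 4),
  start   = c("chr10:176131", "chr10:159149", "chr10:159149", "chr10:241469"),
  end     = c(176131, 159149, 159149, 241469),
  REF     = c("C", "C", "C", "T"),
  ALT     = c("A", "G", "T", "C"),
  TYPE    = c("snp", "snp:17659149", "snp:17659149", "snp"),
  refGene = c("nonsynonymous SNV", "nonsynonymous SNV", "nonsynonymous SNV", "splicing"),
  stringsAsFactors = FALSE
)

# Replace each ALT with the comma-joined ALT values of its start group.
mydf$ALT <- ave(mydf$ALT, mydf$start, FUN = function(x) paste(x, collapse = ","))

# Keep the first row of every start group; the other columns come from that row.
result <- mydf[!duplicated(mydf$start), ]
rownames(result) <- NULL

print(result)
```

This keeps the original row order and takes the non-`ALT` columns from the first row of each group, which is only appropriate when those columns agree within a group, as they do in the sample.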
The dput is here:
structure(list(chr = c("chr3", "chr3", "chr3", …
I have a data frame `mydf` in which n columns share the same column name, `name`. I want to change them to `name1`, `name2`, `name3`, ..., up to the nth column. How can I do this in R?
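A minimal sketch of one way to do this: find the columns called `name` and overwrite their names with numbered versions. The example data frame, including its extra `id` column, is made up for illustration.

```r
# Hypothetical data frame with three columns that all share the name "name".
mydf <- setNames(data.frame(1:2, 3:4, 5:6, 7:8),
                 c("id", "name", "name", "name"))

# Number only the columns called "name"; other columns keep their names.
is_name <- names(mydf) == "name"
names(mydf)[is_name] <- paste0("name", seq_len(sum(is_name)))

print(names(mydf))
```

If the exact suffix format does not matter, base R's `make.unique(names(mydf))` is a shorter alternative, though it yields `name`, `name.1`, `name.2` rather than `name1`, `name2`, `name3`.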