I have a variable called name that I want to use as the column names of my matrix, but first I need to edit the names it contains.
>name
[722] "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt"
[723] "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt"
[724] "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"
I want to keep only the characters before the fourth "-".
Expected output:
>name
[722] "TCGA-OL-A66N-01A"
[723] "TCGA-OL-A66O-01A"
[724] "TCGA-OL-A66P-01A"
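Could someone help me do this in R?
In a regular expression, the "[" operator defines a character class, and inside a character class a "^" in the first position negates it, so "[^-]" matches any character that is not a hyphen. See: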
?regex
?sub
sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)
[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
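As a possible variation on the pattern above (not part of the original answer), the three repeated "field plus hyphen" pieces can be written with a `{3}` quantifier. This is a minimal sketch that assumes every name contains at least four hyphen-separated fields; names with fewer hyphens are returned unchanged.

```r
# Example names taken from the question
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# ([^-]*-){3} matches three fields, each followed by a hyphen;
# the trailing [^-]* matches the fourth field. The outer group is \\1.
short <- sub("^(([^-]*-){3}[^-]*).*$", "\\1", name)
short
#[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
```

The behaviour should match the longer pattern; the quantifier only makes the "four fields" intent easier to read and to change.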
The regex is simpler (IMO) than the strsplit approach, which looks like this:
sapply( lapply( strsplit(name, "\\-"), "[", 1:4),
# extracted the first 4 elements from each list element returned by strsplit
paste, collapse="-") # 'collapse' needed rather than 'sep'
#[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
If the lengths vary and you cannot rely on nchar(), you can use str_split_fixed() from stringr.
A stringr solution:
library(stringr)
name <- c(
"TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
"TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
"TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")
apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-")
which gives you:
## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
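str_split_fixed(name,"-",5) splits each element of name into 5 pieces at the first 4 occurrences of "-"; the fifth piece holds whatever remains
[,1:4] keeps the first 4 pieces of each name element (the first 4 columns of the resulting matrix)
apply(...,1,paste0,collapse="-") pastes those pieces back together row by row, using "-" as the collapse separator to restore the names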
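The three steps above can be sketched one at a time to make the intermediate objects visible. The variable names `parts`, `kept`, and `short` are illustrative and not from the original answer; `drop = FALSE` is an added precaution so that a single name still yields a matrix that `apply()` can handle.

```r
library(stringr)

name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt")

# Step 1: a character matrix with one row per name and 5 columns;
# column 5 holds everything after the fourth hyphen
parts <- str_split_fixed(name, "-", 5)
dim(parts)

# Step 2: keep the first 4 columns (drop = FALSE keeps the matrix shape)
kept <- parts[, 1:4, drop = FALSE]

# Step 3: paste each row back together with "-"
short <- apply(kept, 1, paste0, collapse = "-")
short
```

Without `drop = FALSE`, subsetting a one-row matrix returns a plain vector, and `apply()` would then fail because its input no longer has dimensions.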
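Here I compare my stringr + apply() approach with @BondedDust's regex approach (labelled grep_ninja in the benchmark) and the base strsplit approach.
First, let's scale the input up to about ten thousand names: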
name <- rep(name,3.334e3)
Then a microbenchmark:
library(microbenchmark)
microbenchmark(
stringr_apply = apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-"),
grep_ninja = sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name),
strsplit = sapply( lapply( strsplit(name, "\\-"), "[", 1:4), paste, collapse="-"),
times=25)
which gives:
# Unit: milliseconds
# expr min lq median uq max neval
# stringr_apply 845.44542 874.5674 899.27849 941.22628 976.88903 25
# grep_ninja 25.51796 25.7066 25.85404 25.95922 27.89165 25
# strsplit 115.10626 123.2645 126.45171 130.10334 147.39517 25
It seems base pattern matching/replacement scales much better: on these 10,000 names it is nearly a second faster than the slowest approach, or roughly 35 times faster by median.
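Timings only mean something if the approaches return the same result, so a quick equivalence check is worth running alongside the benchmark. The sketch below uses base R only and compares the regex and strsplit versions on the scaled-up input; `by_regex` and `by_split` are illustrative names, not from the original answers.

```r
# Rebuild the scaled-up input used for the benchmark (3 names x 3334 repeats)
name <- rep(c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"), 3.334e3)

# Regex version from the answer above
by_regex <- sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)

# strsplit version: split on a literal hyphen, keep 4 fields, re-join
by_split <- sapply(lapply(strsplit(name, "-", fixed = TRUE), "[", 1:4),
                   paste, collapse = "-")

# Element-wise comparison; stops with an error if any name differs
stopifnot(all(by_regex == by_split))
length(name)
```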