How do I edit colnames in R?

use*_*363 5 regex r

I have a variable called name that I want to use as the column names of my matrix, but first I need to edit the names stored in name.

>name
[722] "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt"
[723] "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt"
[724] "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"

I want to keep only the characters before the fourth "-".

Expected output:

  >name
    [722] "TCGA-OL-A66N-01A"
    [723] "TCGA-OL-A66O-01A"
    [724] "TCGA-OL-A66P-01A"

Could someone help me implement this in R?

42-*_*42- 8

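The regex "[" operator defines a character class, and inside a character class a "^" in the first position negates it, so "[^-]*" matches a run of characters that are not hyphens: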

?regex
?sub

sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)
[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"

This is simpler (IMO) than a split-based approach (str_split/strsplit):

 sapply( lapply( strsplit(name, "\\-"), "[", 1:4),   
                # extracted the first 4 elements from each list element returned by strsplit
         paste, collapse="-")  # 'collapse' needed rather than 'sep'

#[1] "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
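Since the question is ultimately about column names, here is a minimal sketch of the last step. It uses a shorter pattern with a repetition count, which should behave the same as the long one on names like these; the matrix m is a hypothetical stand-in for the asker's data, not something from the question.

```r
# Example file names taken from the question.
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Three "field plus hyphen" groups, then a fourth field, are captured as \\1;
# the fifth hyphen and everything after it are dropped.
short <- sub("^(([^-]*-){3}[^-]*)-.*$", "\\1", name)

# Hypothetical matrix with one column per file, for illustration only.
m <- matrix(seq_len(6), nrow = 2, ncol = length(name))
colnames(m) <- short
print(colnames(m))
```

A name with fewer than four hyphens does not match the pattern, so sub() returns it unchanged.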


npj*_*pjc 5

If the lengths vary, so that you cannot rely on fixed character positions (nchar), you can use str_split_fixed() from stringr.

stringr solution:

library(stringr)

name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-")

which gives you:

## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"

Explanation:

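  • str_split_fixed(name,"-",5)

splits each element of name into 5 pieces at the first 4 occurrences of "-"; the fifth piece holds the rest of the string

  • [,1:4]

keeps the first 4 pieces of each name element (columns of the resulting matrix)

  • apply(...,1,paste0,collapse="-")

pastes them back together row by row, using "-" as the collapse separator, to restore the name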
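The three steps above can be sketched one at a time, which makes the intermediate matrix easier to inspect. The variable names parts, first4 and short are illustrative, and drop = FALSE is a defensive addition not in the original answer: without it, a single input name turns the subset into a plain vector and apply() fails.

```r
library(stringr)

name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Step 1: character matrix with one row per name and 5 columns.
parts <- str_split_fixed(name, "-", 5)

# Step 2: keep the first four columns; drop = FALSE keeps the matrix shape
# even when there is only one name.
first4 <- parts[, 1:4, drop = FALSE]

# Step 3: glue each row back together with "-".
short <- apply(first4, 1, paste0, collapse = "-")
print(short)
```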


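But what if I have a lot of names?

Here I compare my stringr + apply() approach with @BondedDust's regex (sub) approach, labelled grep_ninja below, and the base strsplit approach.

First, let's scale this up to about ten thousand names: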

name <- rep(name,3.334e3)

Then a microbenchmark:

library(microbenchmark)

microbenchmark(
  stringr_apply = apply(str_split_fixed(name, "-", 5)[, 1:4], 1, paste0, collapse = "-"),
  grep_ninja = sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name),
  strsplit = sapply(lapply(strsplit(name, "\\-"), "[", 1:4), paste, collapse = "-"),
  times = 25)

which gives:

# Unit: milliseconds
#           expr       min       lq    median        uq       max neval
#  stringr_apply 845.44542 874.5674 899.27849 941.22628 976.88903    25
#     grep_ninja  25.51796  25.7066  25.85404  25.95922  27.89165    25
#       strsplit 115.10626 123.2645 126.45171 130.10334 147.39517    25

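It looks like base pattern matching/replacement scales better: at this size it is a bit under a second faster, or more than 30 times faster, than the slowest approach (medians of about 26 ms versus 899 ms).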
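A timing comparison is only meaningful if the approaches return the same result. As a rough sanity check using base R only, the sub() and strsplit() variants can be compared on the example names. The variable names by_sub and by_split are illustrative, and vapply() is swapped in for sapply() so the result type is explicit.

```r
name <- c(
  "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
  "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Regex approach: keep everything before the fourth hyphen.
by_sub <- sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)

# Split approach: split on literal hyphens, keep four pieces, re-join.
pieces <- lapply(strsplit(name, "-", fixed = TRUE), "[", 1:4)
by_split <- unname(vapply(pieces, paste, character(1), collapse = "-"))

# Stops with an error if the two approaches disagree.
stopifnot(identical(by_sub, by_split))
print(by_sub)
```

The two can diverge on malformed input: for a name with fewer than four fields, sub() leaves the string unchanged, while the split version pads the missing pieces with NA before pasting.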

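  • This looks very slow and complicated, and it requires an extra package. If you don't want to use the regex approach, `strsplit` is probably easier to explain. (2 upvotes)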