Remove the part of a string after "."

Lis*_*ann 63 regex string r bioinformatics biomart

I am working with NCBI Reference Sequence accession numbers, like those in the variable a:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")  

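To get information from the biomart package I need to remove the .1, .2, etc. after the accession number. I normally do this with the following code: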

b <- sub("..*", "", a)

# [1] "" "" "" "" "" ""

But as you can see, this is not the right approach for this variable. Can anyone help me with this?

Han*_*nsi 87

You just need to escape the period:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")

gsub("\\..*","",a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
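
To make the cause explicit: in a regular expression an unescaped `.` matches any character, so `"..*"` matches from the first character to the end of the string, which is why the question's attempt returned empty strings. A minimal sketch comparing the patterns on a shortened copy of the data (the variable names are only illustrative):

```r
ids <- c("NM_020506.1", "NM_001030297.2")

# Unescaped "." matches any character, so the whole string is removed.
unescaped <- sub("..*", "", ids)

# "\\." matches a literal period; "[.]" is an equivalent spelling.
escaped <- sub("\\..*", "", ids)
bracket <- sub("[.].*", "", ids)

print(unescaped)
print(escaped)
print(bracket)
```

Since each ID needs only one substitution, `sub` and `gsub` behave the same here.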


zx8*_*754 10

We can pretend they are file names and remove the extension:

tools::file_path_sans_ext(a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
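
As a quick check of how this behaves outside the exact data in the question, the sketch below uses made-up variations (a two-digit version and an ID with no version at all). It assumes the version suffix is purely alphanumeric, which is what `file_path_sans_ext` treats as an extension:

```r
# Hypothetical variations on the question's IDs.
ids <- c("NM_020506.1", "NM_001030297.12", "NM_010281")

# Only a trailing ".<alphanumerics>" is dropped; strings without a period are unchanged.
stripped <- tools::file_path_sans_ext(ids)
print(stripped)
```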


joh*_*nes 6

You could do this:

# strip a trailing ".<digits>" version suffix
sub("\\.[0-9]+$", "", a)

Or:

library(stringr)
str_sub(a, start=1, end=-3)

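  • The `str_sub(a, start = 1, end = -3)` solution assumes that **only two characters** need to be removed (the "." and the single digit after it). In many gene ID systems the version can have more than one digit (especially probe IDs). In that case a more flexible solution is `str_remove(a, pattern = "\\..*")`. In that pattern, `"\\."` finds the first period, `"."` matches *any* character after it, and `"*"` repeats that match *any* number of times. (6 upvotes)
  • Alternatives: `str_replace(a, "\\.[0-9]", "")` and `str_replace(a, "\\..*", "")` (3 upvotes)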
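
The point about fixed-width trimming can be illustrated without stringr. In the sketch below, base R's `substr(x, 1, nchar(x) - 2)` stands in for `str_sub(x, start = 1, end = -3)` (both drop the last two characters), and the two-digit version is a made-up example:

```r
# Hypothetical IDs: one single-digit and one two-digit version.
ids <- c("NM_020506.1", "NM_020506.12")

# Dropping a fixed two characters leaves a stray "." on the two-digit version.
fixed_width <- substr(ids, 1, nchar(ids) - 2)

# Removing everything from the first period onward handles both cases.
pattern_based <- sub("\\..*", "", ids)

print(fixed_width)
print(pattern_based)
```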

akr*_*run 6

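If the strings were all of a fixed length, `substr` from base R could be used on its own. More generally, we can get the position of the `.` with `regexpr` and use that in `substr`: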

substr(a, 1, regexpr("\\.", a)-1)
#[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"   
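
One thing to watch with this approach: `regexpr` returns -1 when there is no match, so an ID without a period would come back as an empty string. A minimal sketch of a guarded variant, using a made-up unversioned ID for illustration:

```r
# Hypothetical input: the second ID has no version suffix.
ids <- c("NM_020506.1", "NM_010281")

# Without a guard, the unversioned ID becomes "" because the stop position is -2.
naive <- substr(ids, 1, regexpr("\\.", ids) - 1)

# Guarded variant: only trim the elements that actually contain a period.
dot_pos <- as.integer(regexpr(".", ids, fixed = TRUE))  # -1 where there is no "."
has_dot <- dot_pos > 0
stripped <- ids
stripped[has_dot] <- substr(ids[has_dot], 1, dot_pos[has_dot] - 1)

print(naive)
print(stripped)
```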


ben*_*n23 5

We can use a lookahead regex to extract the string before the `.`:

library(stringr)

str_extract(a, ".*(?=\\.)")
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"
# [5] "NM_011419"    "NM_053155"
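
Because `.*` is greedy, this lookahead keeps everything up to the *last* period, which only matters if an ID contains more than one. A rough base R sketch of the difference, using `regexpr(..., perl = TRUE)` with `regmatches` in place of `str_extract`; the string "demo.1.2" is a made-up example, not a real accession:

```r
ids <- c("NM_020506.1", "demo.1.2")  # "demo.1.2" is hypothetical

# Greedy ".*" backtracks to the last period in the string.
greedy <- regmatches(ids, regexpr(".*(?=\\.)", ids, perl = TRUE))

# "[^.]*" cannot cross a period, so it stops at the first one.
first_dot <- regmatches(ids, regexpr("^[^.]*(?=\\.)", ids, perl = TRUE))

print(greedy)
print(first_dot)
```

Note that `regmatches` drops elements with no match, whereas `str_extract` returns `NA` for them.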