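First, my apologies if this question is too naive or has been asked before. I searched the forum but could not find an answer, so I am posting it as a question.
I have a data frame whose row names look like this: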
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use a regular expression in R to extract the string between the first dash and the last period.
The expected result is:
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried the following, but with no luck:
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Could someone help me solve this?
Many thanks in advance.
Shani.
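One possible approach, offered as a minimal sketch rather than a definitive answer: the first attempt above refers to `\\2` although the pattern contains no capture groups, so there is nothing to substitute back. Capturing the wanted part in a group and replacing the whole string with that group avoids this. The vector `x` below is a stand-in for `rownames(u)`, filled with the values shown in the question; the name `middle` is only illustrative.

```r
# Row names copied from the question; in practice this would be rownames(u).
x <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240",
       "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240",
       "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")

# ^[^-]*-     everything up to and including the first dash
# (.*)        the part to keep (capture group 1)
# \\.[^.]*$   the last period and whatever follows it
middle <- sub("^[^-]*-(.*)\\.[^.]*$", "\\1", x)
print(middle)
```

For these six inputs the result matches the expected output listed in the question. Strings that contain no dash or no period are left unchanged by `sub()`, so they may need separate handling.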
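First, apologies if I missed an answer to a similar question before posting. I have a set of (72) gene annotation files. I want to extract the GO terms in the format below (other annotation terms would be a bonus):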
HORVU1Hr1G002090 GO:0003824
HORVU1Hr1G002090 GO:0006527
HORVU1Hr1G002090 GO:0008295
HORVU1Hr1G002090 GO:0008792
HORVU1Hr1G005360 GO:0004497
HORVU1Hr1G005360 GO:0005506
HORVU1Hr1G005360 GO:0016705
HORVU1Hr1G005360 GO:0020037
HORVU1Hr1G005360 GO:0055114
HORVU1Hr1G087600 GO:0009055
HORVU1Hr1G087600 GO:0015035
HORVU1Hr1G087600 GO:0016705
.
.
.
My input_file looks like this:
HORVU1Hr1G002090.11 HORVU1Hr1G002090 chr1H:4283580-4286133 HC_G arginine decarboxylase 1 GO:0003824, GO:0006527, GO:0008295, GO:0008792 PF00278, PF02784 IPR000183, IPR002985, IPR009006, IPR022643, IPR022644, IPR022657, IPR029066 HORVU1Hr1G002090
HORVU1Hr1G005360.1 HORVU1Hr1G005360 chr1H:11579708-11582804 HC_G Cytochrome P450 superfamily protein GO:0004497, GO:0005506, GO:0016705, GO:0020037, GO:0055114 PF00067 IPR001128, IPR002403, IPR017972 HORVU1Hr1G005360
HORVU1Hr1G087600.1 HORVU1Hr1G087600 chr1H:539679073-539680597 HC_G Glutaredoxin family protein GO:0009055, GO:0015035, GO:0045454 PF00462 IPR002109, …
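A rough sketch of one way to get the two-column layout in R, under some assumptions: the post does not show the column separator, so the code below takes the gene ID to be the second whitespace-separated field and finds GO terms by pattern (`GO:` followed by digits) instead of by column position. The names `lines`, `gene`, `go` and `out` are illustrative, and the two sample lines are copied from the input shown above.

```r
# Two lines copied from the question. For a real file this could be
# lines <- readLines("input_file")
lines <- c(
  "HORVU1Hr1G002090.11 HORVU1Hr1G002090 chr1H:4283580-4286133 HC_G arginine decarboxylase 1 GO:0003824, GO:0006527, GO:0008295, GO:0008792 PF00278, PF02784 IPR000183, IPR002985, IPR009006, IPR022643, IPR022644, IPR022657, IPR029066 HORVU1Hr1G002090",
  "HORVU1Hr1G005360.1 HORVU1Hr1G005360 chr1H:11579708-11582804 HC_G Cytochrome P450 superfamily protein GO:0004497, GO:0005506, GO:0016705, GO:0020037, GO:0055114 PF00067 IPR001128, IPR002403, IPR017972 HORVU1Hr1G005360"
)

# Assumption: the gene ID is the second whitespace-separated field.
gene <- sapply(strsplit(lines, "[[:space:]]+"), `[`, 2)

# All GO terms on each line, returned as a list with one vector per line.
go <- regmatches(lines, gregexpr("GO:[0-9]+", lines))

# One row per (gene, GO term) pair.
out <- data.frame(gene = rep(gene, lengths(go)),
                  go = unlist(go),
                  stringsAsFactors = FALSE)

# Print in the layout requested in the question.
write.table(out, sep = " ", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The same pattern-based step could probably pick up the other annotation terms by swapping in a pattern such as `"PF[0-9]+"` or `"IPR[0-9]+"`. To cover all 72 files, the steps could be wrapped in a function and applied to the paths returned by `list.files()`.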