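I want to extract specific information from a column of a data frame and add it to a new column in the same data frame. The complication is that some rows do not contain the information I want to extract at all (the 6 characters after "UniProt:"), while other rows contain several occurrences. I would like each row to show whatever it contains, because this column holds the identifiers in my data frame.
Here is an example; I copied a few rows of Fasta.headers from my data frame: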
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I would like the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454
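Here strapplyc from the gsubfn package extracts the desired strings from x, and sapply collapses multiple matches for a row into a single semicolon-separated string. The pattern "UniProt:([^;]*)" captures everything after "UniProt:" up to the next semicolon, and a row with no match yields an empty string: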
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
giving:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
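Since the question asks for the identifiers in a new column, the result vector can be assigned directly to the data frame. The following is a minimal sketch, not part of the original answer: the data frame name `df`, the column name `UniProt`, and the placeholder header strings are assumptions for illustration, and converting empty strings to NA is an optional choice.

```r
# Result vector as printed above: one element per row,
# "" where no UniProt identifier was found
ids <- c("H2L0A8;Q9TXU2", "O44447", "", "O61907;Q76NP4;P34454")

# Hypothetical data frame standing in for the real one;
# the header strings are placeholders, not real data
df <- data.frame(Fasta.headers = c("row1", "row2", "row3", "row4"),
                 stringsAsFactors = FALSE)

# New column in the same data frame; rows without an identifier
# become NA instead of an empty string
df$UniProt <- ifelse(nzchar(ids), ids, NA_character_)
```

Keeping one element per row, including the empty or NA entries, is what lets the vector line up with the existing rows.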
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")
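If installing gsubfn is not an option, the same extraction can probably be done in base R. This is a sketch of an alternative technique rather than the method used in the answer: it swaps strapplyc for regmatches with gregexpr and a Perl-style lookbehind, and the shortened sample strings are made-up stand-ins for the real headers.

```r
# Shortened sample headers (assumed stand-ins for the Fasta.headers column)
x <- c("a;UniProt:H2L0A8;protein_id:P1;>b;UniProt:Q9TXU2;protein_id:P2",
       "c;UniProt:O44447;protein_id:P3",
       "d;status:Confirmed;protein_id:P4")

# gregexpr locates every match in each string; the lookbehind
# (?<=UniProt:) keeps only the text that follows "UniProt:"
m <- regmatches(x, gregexpr("(?<=UniProt:)[^;]*", x, perl = TRUE))

# Collapse each row's matches into one semicolon-separated string;
# rows with no match give ""
ids <- vapply(m, paste, character(1), collapse = ";")
```

Here vapply is used in place of sapply so that the result is guaranteed to be a character vector with exactly one string per input row.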