I have a one-column data frame with several thousand rows, all built on the same pattern, for example:
ids <- c("ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1","ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1", "ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1")
df <- as.data.frame(ids)
> df
ids
1 ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1
2 ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1
3 ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1
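I want to split these strings into two columns, var1 and var2. var1 should keep the text after the second pipe and before the first space; var2 should keep the text starting at the second T after that space. This pattern is common to all rows. The expected result is: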
> df
var1 var2
1 HMPI01000001.1 T05X3Ml2_CL10007Cordes1_1
2 HMPI31000002.1 T3X3Ml2_CL10157Cordes1_1
3 HMPI01000007.1 T1X3Ml2_CL11231Cordes1_1
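I tried several regular expressions inspired by other answers, but I couldn't figure it out.
This is what I currently have, but it doesn't give the expected result: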
df2 <- df %>% separate(col = "ids", into = c("var1", "var2"), sep = "\\|([^|]+)$")
> df2
var1 var2
1 ETC|HMPI01000001
2 ETC|HMPI31000002
3 ETC|HMPI01000007
Any help, preferably using regex and the tidyverse, would be appreciated.
We can use strcapture.
strcapture("^[^|]*\\|[^|]*\\|(\\S+)\\s.*? (T.*)",
ids, list(var1="", var2=""))
# var1 var2
# 1 HMPI01000001.1 T05X3Ml2_CL10007Cordes1_1
# 2 HMPI31000002.1 T3X3Ml2_CL10157Cordes1_1
# 3 HMPI01000007.1 T1X3Ml2_CL11231Cordes1_1
The regex:
"^[^|]*\\|[^|]*\\|(\\S+)\\s.*? (T.*)"
^           beginning of string
[^|]*\\|    leading up to the first |
[^|]*\\|    leading up to the second |
(\\S+)\\s   non-space (captured) and blank space
.*?         non-greedy "anything"
 (T.*)      literal T (after a space) and everything else
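One caveat with the breakdown above: the second capture begins at the first " T" after the ID, so a genus name starting with a capital T would likely start the capture too early. As a hedged sketch (not part of the original answer), the variant below anchors var2 on the ", " separator instead. It assumes that ", " occurs exactly once per row, directly before the final field; the `pattern`, `proto` and `res` names are illustrative.

```r
# Same sample data as in the question.
ids <- c(
  "ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1",
  "ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1",
  "ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1"
)

# Assumed variant: capture everything after ", " rather than after " T".
pattern <- "^[^|]*\\|[^|]*\\|(\\S+)\\s.*, (.*)$"

# A zero-row data frame describes the names and types of the output columns.
proto <- data.frame(var1 = character(), var2 = character(),
                    stringsAsFactors = FALSE)

res <- strcapture(pattern, ids, proto)
res
```

Rows that do not match the pattern come back as NA in both columns, which makes it easy to spot lines that break the assumed layout.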
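library(stringr)
library(dplyr)
# `group = 1` returns only the first capture group (requires stringr >= 1.5.0)
df |>
  mutate(var1 = str_extract(ids, ".*\\|(\\S+)", group = 1),  # after the last pipe, up to the first space
         var2 = str_extract(ids, ".*, (.*$)", group = 1))    # everything after the last ", "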
If their positions are fixed, you can also use separate_wider_position:
library(tidyr)
df |>
  separate_wider_position(ids,
                          widths = c(17, var1 = 14, 21, var2 = max(nchar(ids))),
                          too_few = "align_start")
# var1 var2
# <chr> <chr>
# 1 HMPI01000001.1 T05X3Ml2_CL10007Cordes1_1
# 2 HMPI31000002.1 T3X3Ml2_CL10157Cordes1_1
# 3 HMPI01000007.1 T1X3Ml2_CL11231Cordes1_1
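If the widths are not guaranteed to be fixed, tidyr also offers separate_wider_regex (tidyr >= 1.3.0), which may be closer to the "regex and tidyverse" request in the question. The following is a minimal sketch rather than a definitive solution: the patterns are my own and assume that ", " appears only before the final field. Named patterns become columns; unnamed ones must match but are dropped.

```r
library(tidyr)

# Same sample data as in the question.
ids <- c(
  "ETC|HMPI01000001|HMPI01000001.1 TAG: Genus Species, T05X3Ml2_CL10007Cordes1_1",
  "ETC|HMPI31000002|HMPI31000002.1 TAG: Genus Species, T3X3Ml2_CL10157Cordes1_1",
  "ETC|HMPI01000007|HMPI01000007.1 TAG: Genus Species, T1X3Ml2_CL11231Cordes1_1"
)
df <- data.frame(ids = ids)

res <- separate_wider_regex(
  df, ids,
  patterns = c(
    "[^|]*\\|[^|]*\\|",  # everything up to and including the second pipe (dropped)
    var1 = "\\S+",       # text up to the first space
    " .*, ",             # " TAG: Genus Species, " (dropped)
    var2 = ".*"          # the remaining field
  )
)
res
```

The patterns are matched in order against the whole string, so no explicit ^ or $ anchors are needed.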