Complex regular expression with varying patterns to match

Haa*_*kas 1 regex r

I have a data frame with a column containing the following information:

    c("GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
"GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation", 
"MARR.CARD_pvgb_U00096_1619119_1619554_ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation", 
"PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", 
"SOXS.CARD_pvgb_U00096_4277468_4277933_ARO_3003381_Escherichia_Multi_drug_resistance_MDR_regulator_SOXS_RequiresSNPConfirmation", 
"TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC", 
"parE.CARD_pvgb_NC_007779_3172159_3174052_ARO_3003316_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation", 
"GYRA.Flq_CP001918.1_gene3562_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
"PARC.Flq_NC_003197.1.1254697_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", 
"GYRA.Flq_NC_003197.1.1253794_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", 
"parE.CARD_pvgb_NC_003197_3343961_3345854_ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation", 
"ACRR.CARD_pvgb_NC_014121_1270697_1271351_ARO_3003374_Enterobacter_Multi_drug_resistance_MDR_regulator_ACRR_RequiresSNPConfirmation"
)

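What I want to do is pull a specific ID number out of each entry above (marked below, set off by spaces) and create a new column holding that ID for every row of the data frame.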

"GYRA.Flq_ NC_002695.1.916822 _Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", "GYRB.CARD_pvgb_ AP009048_3760295_3762710 _ARO_3003303_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRB_RequiresSNPConfirmation", "MARR.CARD_pvgb_ U00096_1619119_1619554 _ARO_3003378_Escherichia_Multi_drug_resistance_MDR_regulator_MARR_RequiresSNPConfirmation", "PARC.Flq_ M58408 _gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", "SOXS.CARD_pvgb_ U00096_4277468_4277933 _ARO_3003381_Escherichia_Multi_drug_resistance_MDR_regulator_SOXS_RequiresSNPConfirmation", "TOLC.CARD_phgb_ FJ768952_0_1488 _ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC", "parE.CARD_pvgb_ NC_007779_3172159_3174052 _ARO_3003316_Escherichia_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation","GYRA.Flq_ CP001918.1 _gene3562_Fluoroquinolones_Fluoroquinolone_resis tant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", "PARC.Flq_ NC_003197.1.1254697 _Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation", "GYRA.Flq_ NC_003197.1.1253794 _Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation", "parE.CARD_pvgb_ NC_003197_3343961_3345854 _ARO_3003317_Salmonella_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_parE_RequiresSNPConfirmation", "ACRR.CARD_pvgb_ NC_014121_1270697_1271351 _ARO_3003374_Enterobacter_Multi_drug_resistance_MDR_regulator_ACRR_RequiresSNPConfirmation"

I tried the following commands:

library(dplyr)
df %>% mutate(ref_name2 = sub("[A-z]+.[A-z]+.[A-z]+.([A-z][A-z].[0-9]+.[0-9].[0-9]+)", "\\1", ref_name),
         ref_name2 = sub("\\_ARO.*", "", ref_name2),
         ref_name2 = sub("\\_Fluoro.*", "", ref_name2),
         ref_name2 = sub("\\_gene.*", "", ref_name2))

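But this matches the strings above only partially, and it also strips some of the letters I want to keep. Is there a simpler way than chaining multiple sub/gsub calls?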

What I ultimately want is:

c(NC_002695.1.916822, AP009048_3760295_3762710, U00096_1619119_1619554, M58408, U00096_4277468_4277933, FJ768952_0_1488, NC_007779_3172159_3174052, CP001918.1, NC_003197.1.1254697, NC_003197.1.1253794, NC_003197_3343961_3345854, NC_014121_1270697_1271351)

I tried to work out the match visually at https://regexr.com/30u4a and also read a lot about complex matching, but I can't seem to find the right code.

Wik*_*żew 5

You can use

> sub("^.*?_([A-Z]+[0-9_.]*[0-9]).*", "\\1", x)
 [1] "NC_002695.1.916822"        "AP009048_3760295_3762710"  "U00096_1619119_1619554"    "M58408"                    "U00096_4277468_4277933"    "FJ768952_0_1488"          
 [7] "NC_007779_3172159_3174052" "CP001918.1"                "NC_003197.1.1254697"       "NC_003197.1.1253794"       "NC_003197_3343961_3345854" "NC_014121_1270697_1271351"

See the regex demo.
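Since the question asks for a new column rather than a bare vector, here is a minimal sketch of applying the same call to a data frame. The names `df`, `ref_name` and `ref_name2` follow the question's own attempt; the three-row data frame and the `id_pattern` variable are only illustrative.

```r
# Small illustrative data frame; `df` and `ref_name` follow the question's naming.
df <- data.frame(
  ref_name = c(
    "GYRA.Flq_NC_002695.1.916822_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_GYRA_RequiresSNPConfirmation",
    "PARC.Flq_M58408_gene_Fluoroquinolones_Fluoroquinolone_resistant_DNA_topoisomerases_PARC_RequiresSNPConfirmation",
    "TOLC.CARD_phgb_FJ768952_0_1488_ARO_3000237_tolC_Multi_drug_resistance_Multi_drug_efflux_pumps_TOLC"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the sub() call above, stored in a variable (name is an assumption).
id_pattern <- "^.*?_([A-Z]+[0-9_.]*[0-9]).*"

# Replace each whole string with capture group 1 and store it as a new column.
df$ref_name2 <- sub(id_pattern, "\\1", df$ref_name)

print(df$ref_name2)
# [1] "NC_002695.1.916822" "M58408"             "FJ768952_0_1488"
```

With dplyr loaded, the equivalent should be `df %>% mutate(ref_name2 = sub(id_pattern, "\\1", ref_name))`, which replaces the chain of sub calls in the question with a single one.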

Pattern details

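  • ^ - start of the string (can be omitted, since sub replaces only the first match)
  • .*? - zero or more characters, as few as possible (note that [^_]* cannot be used here, because the pattern we need may appear after 0 or more underscores)
  • _ - an underscore
  • ([A-Z]+[0-9_.]*[0-9]) - capturing group 1:
    • [A-Z]+ - one or more uppercase ASCII letters
    • [0-9_.]* - zero or more digits, _ or . characters
    • [0-9] - a digit
  • .* - the rest of the string.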
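One thing to keep in mind with this approach: sub returns the input unchanged when the pattern does not match, so a string without a recognizable ID would be copied into the result as-is. A minimal sketch of one way to guard against that, assuming NA is the wanted placeholder (the second test string and the variable names are made up for illustration):

```r
# Two inputs: one taken from the question (shortened), one made-up string
# that contains no ID matching the pattern.
x <- c(
  "GYRB.CARD_pvgb_AP009048_3760295_3762710_ARO_3003303_Escherichia",
  "no_identifier_here"
)

id_pattern <- "^.*?_([A-Z]+[0-9_.]*[0-9]).*"

# grepl() reports which strings match at all.
matched <- grepl(id_pattern, x)

# Extract group 1 for matching strings, use NA for the rest.
ids <- ifelse(matched, sub(id_pattern, "\\1", x), NA_character_)

print(ids)
# [1] "AP009048_3760295_3762710" NA
```

In the first string the lazy .*? first stops at the underscore before pvgb, fails there because [A-Z]+ needs an uppercase letter, and moves on to the next underscore, where the group matches. The trailing [0-9] is what keeps the capture from ending in the underscore before ARO.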