How to extract specific words from a string with a pattern in R

Mah*_*adi 1 regex string r gsub

I have a data frame that contains the names of the supervisors and advisors of students' theses in a faculty, for example:

 DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

I want to split the supervisors and advisors into two separate columns, so that the expected result looks like this:

DF1<-data.frame(Supervisor=c("Ali Ahmadi","Ali Ahmadi","Ali Ahmadi"),Advisors=c("Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"))

DF1
  Supervisor                                             Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

I tried the following code:

DF1<-strsplit(DF$Names, "Name :")

stopwords = c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")

DF2 <- lapply(DF1,function(x) unlist(strsplit(x," ")) )

DF3 <- lapply(DF2,function(x)  x[!x %in% stopwords] )

DF4<-lapply(DF3,function(x)  paste(x, collapse = " "))

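But the final result, shown below, is not what I expected, and it clearly needs further work before it can be converted into a data frame: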

DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[2]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[3]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

Is there a simple way to solve this? I found that regex could be helpful, but I don't know how to use it, at least for my example. Thanks in advance for any answer.

Chr*_*ann 5

Here is an attempt using extract:

library(tidyr)
library(dplyr)  # provides mutate()
DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names,
          into = c("Supervisor", "Advisor"),
          regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
  Supervisor                                              Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
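
To see what the cleaning step in the pipeline above does on its own, here is a minimal base R sketch. The string x is a shortened, hypothetical record in the same format as DF$Names, and the trimws() call is an addition for display only; it is not part of the original pipeline.

```r
# One shortened record in the same format as DF$Names (hypothetical sample)
x <- "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor 1"

# Same cleaning pattern as in the pipeline: removes the labels, the digits,
# and every " ," or " :" separator, each with an optional preceding space
cleaned <- gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])",
                "", x, perl = TRUE)

# A single leading space survives in front of the first name; drop it for display
cleaned <- trimws(cleaned)
print(cleaned)
```

After this step only the names remain, separated by single spaces, which is the form the regex passed to extract relies on.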

Explanation (as requested by the OP):

The regular expression in the regex argument of extract is designed to accomplish two tasks:

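  • (i) it must describe the string as a whole, from start to end
  • (ii) it must pick out those elements that should populate the newly created columns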

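Task (i) is accomplished as follows: (\\w+\\s\\w+) captures the two words that make up the Supervisor name, \\s describes (but does not capture) the whitespace that follows, and (.*) describes/matches whatever comes after that whitespace, which in this case is the four Advisor names.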

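Task (ii) is accomplished by wrapping the Supervisor name and the Advisor names in capture groups, which are written as parentheses; these parentheses are the "syntax" through which the function extract "realizes" that their contents should go into the new columns.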
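
The role of the two capture groups can also be checked outside of extract. The following is a minimal sketch that swaps in base R's regexec() and regmatches() for illustration only; it is not what the pipeline in this answer uses, and the variable names are assumptions.

```r
# Cleaned string as produced by the first gsub() step (names only)
cleaned <- "Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# regexec() reports the full match followed by one entry per capture group
m <- regmatches(cleaned,
                regexec("(\\w+\\s\\w+)\\s(.*)", cleaned, perl = TRUE))[[1]]

supervisor <- m[2]  # first capture group: the first two words
advisors   <- m[3]  # second capture group: everything after the next space

print(supervisor)
print(advisors)
```

The whitespace matched by the bare \\s between the groups appears in neither result, which is what "describes but does not capture" means here.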

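Finally, the commas between the Advisor names are inserted using a capture group again, which is recalled in the replacement argument of gsub with the backreference \\1. The expression (?!$) is a negative lookahead: it asserts that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the negative lookahead) the end of the string (denoted by $). Hope this helps!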
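
To make the effect of the negative lookahead concrete, here is a small base R sketch that runs the same substitution with and without (?!$). The input string and the variable names are assumptions chosen for illustration.

```r
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# With the negative lookahead: the last name sits at the end of the string,
# so it does not match and gets no comma
with_lookahead <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

# Without it: every two-word name gets a comma, including the last one
without_lookahead <- gsub("(\\w+\\s\\w+\\b)", "\\1,", advisors, perl = TRUE)

print(with_lookahead)
print(without_lookahead)
```

Note that perl = TRUE is needed in both calls, because lookaheads are not supported by R's default regex engine.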

Data:

DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))