How to use grep inside mutate in dplyr

Question


Kam*_*mil 5 r dplyr

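I need some help understanding what is happening in my dplyr pipeline, and I would welcome alternative solutions to this problem.

Problem

I have a list of institutes (the formal term for the affiliations of authors of research journal articles), and I want to extract the main institute name. If it is a university, that would be, for example, Univ. of XX; for simplicity I stick to that case here.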

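Attempted solution logic

Split the institute name on commas
grep for the term "univ" (or a list of other university-related terms)
Extract the index of the hit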
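
The three steps above can be sketched on a single string using base R only. This is a minimal illustration rather than the asker's exact code; the variable names `parts`, `hits`, and `guess` are made up for the example, and the input is one row from the sample data.

```r
# One affiliation string, taken from the example data
x <- "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany"

parts <- strsplit(x, ",")[[1]]   # step 1: split on commas
hits  <- grep("univ", parts)     # step 2: indices of the pieces containing "univ"
guess <- parts[hits][1]          # step 3: keep the first hit (NA if there is none)

guess
# [1] " friedrich-alexander-university"
```

Note the leading space in the result: splitting on "," alone keeps the space that follows each comma.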

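Edge cases / assumptions

The term I am searching for appears in only one of the splits
All institutes here are universities (to keep the question simple for Stack Overflow)

Code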

df %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  head()


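What I assumed was happening, but is not, is the logic I wrote above. What I see instead is that inside mutate only the first instance in institute is searched, and every row of df is filled with the exact same "university of new so~". I have a rough idea of what the error is, but I do not know why it happens this way or how to fix it while staying in dplyr. I could get this done with an apply function; I am curious what answers there are.

What it looks like: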

# A tibble: 6 x 2
  institute                                                                          instGuess              
  <chr>                                                                              <chr>                  
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india                                         " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~

Data used for the example

df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia", 
"department computer science, friedrich-alexander-university, erlangen-nuremberg, germany", 
"department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia", 
"school of information technology and electrical engineering, university of queenslandold, australia", 
"dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
), instGuess = c(" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales"
)), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))


Answer 1

Pdu*_*bbs 5

You need to include a group_by for your syntax to work:

df %>%
  group_by(institute) %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])

Produces:

# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                  instGuess              
  <chr>                                                                      <chr>                  
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india                                 NA                     
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …
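
As a hedged illustration of why the grouping matters, the base R sketch below reproduces the two behaviours outside of dplyr. It assumes a three-row subset of the sample data; the names `all_parts`, `pooled`, and `per_row` are made up for the example. Without grouping, the expression sees the whole column at once, so `unlist()` pools the pieces of every row and `[1]` picks a single value that mutate then recycles down the column.

```r
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india",
  "dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
)

# Ungrouped: strsplit() returns one list element per row, and unlist()
# flattens them into a single vector covering every row at once.
all_parts <- unlist(strsplit(institute, ","))
length(all_parts)   # 13 pieces in total (4 + 4 + 5)

# grep() searches that pooled vector, so [1] is the first hit overall.
pooled <- all_parts[grep("univ", all_parts)][1]

# One row at a time (what a single-row group sees) gives a per-row answer.
per_row <- vapply(institute, function(s) {
  p <- strsplit(s, ",")[[1]]
  p[grep("univ", p)][1]   # NA when no piece matches
}, character(1), USE.NAMES = FALSE)

pooled    # " university of new south wales"
per_row   # " university of new south wales", NA, " national university of singapore"
```

Grouping by institute has the same effect as the per-row loop here because each group contains a single distinct string.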

Answer 2

r2e*_*ans 3

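I think @Pdubbs' answer is the first-best one: it uses group_by to mimic @www's answer, which uses rowwise(), but the difference (a clear advantage in my mind) is that when values of $institute are repeated, you gain efficiency by making this guess only once per institute.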

This takes it a step further and does not re-run strsplit for every instance. I'll duplicate the first row:

df <- df[c(1,1:6),]

Define a function that does the work, instead of repeating strsplit:

find_univ <- function(x) {
  message('*', appendLF=FALSE)
  y <- strsplit(x[[1]], ',')[[1]]
  y[grep('univ', y)][1]
}

(with a message call inserted to show how many times it is called; do not include that in production), and then the sequence:

df %>%
  group_by(institute) %>%
  mutate(instGuess = find_univ(institute)) %>%
  ungroup() %>%
  select(instGuess) # for display purposes only
# ******  <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
#                           instGuess
#                               <chr>
# 1     university of new south wales
# 2     university of new south wales
# 3    friedrich-alexander-university
# 4                              <NA>
# 5       university of queenslandqld
# 6       university of queenslandold
# 7  national university of singapore

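I don't know whether de-duplicating the strsplit calls like this has a measurable impact; it will only be useful if you have a lot of data. Otherwise it is just an obsessive level of efficiency, if not outright "premature optimization".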
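
For comparison, here is a sketch of a variant that needs no grouping at all, since the helper itself is vectorised over the column. It is not taken from either answer: `first_match` is a hypothetical name, and trimming the leading space with `trimws()` is an added assumption about what the final value should look like.

```r
# Hypothetical helper: for each string, return the first comma-separated
# piece that matches `pattern`, or NA when no piece matches.
first_match <- function(x, pattern = "univ") {
  pieces <- strsplit(x, ",", fixed = TRUE)      # one list element per input string
  vapply(pieces, function(p) {
    hit <- grep(pattern, p, value = TRUE)       # matching pieces, in order
    if (length(hit) == 0) NA_character_ else trimws(hit[1])
  }, character(1))
}

x <- c("a, univ b, c", "no match here", "univ first, univ second")
first_match(x)
# [1] "univ b"     NA           "univ first"
```

With dplyr loaded, this would slot in as df %>% mutate(instGuess = first_match(institute)). It calls strsplit once for the whole column, though unlike the group_by approach it does not skip repeated institutes.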


Archived:	7 years, 6 months ago
Viewed:	3481 times
Last activity:	5 years, 3 months ago