Extract in-text citations (strings) from text in R

Question

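I am trying to write a function that lets me paste in written text and returns a list of the in-text citations used in that writing. For example, this is what I have so far: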

pull_cites <- function(text) {
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}

pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

For this example, it returns:

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would like it to return something like this:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!

Answer 1

ben*_*n23 7

With a regex that is not too difficult, we can do the following:

library(tidyverse)

pull_cites <- function(text) {
  str_extract_all(text, "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*", simplify = TRUE) %>%
    gsub("\\(", "", .) %>%
    str_split(., "; ") %>%
    unlist()
}

pull_cites("This is a test. I only want to select the (cites) in parenthesis. 
            I do not want it to return words in parenthesis that do not have years attached, 
            such as abbreviations (abbr). For example, citing (Smith 2010) is something I would 
            want to be returned. I would also want multiple citations returned separately such 
            as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned 
            as Cooper 2015, and not just 2015. other aspects of life 
            history (Nye et al. 2010; Runge et al. 2010; Lesser 2016). In the Gulf of Maine, 
            annual sea surface temperature (SST) averages have increased a total of roughly 1.6 °C 
            since 1895 (Fernandez et al. 2020)")

[1] "Smith 2010"            "Smith 2010"
[3] "Jones 2001"            "Brown 2020"
[5] "Cooper 2015"           "Nye et al. 2010"
[7] "Runge et al. 2010"     "Lesser 2016"
[9] "Fernandez et al. 2020"

Explanation of the regex passed to str_extract_all():

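(?<=\\() is a positive lookbehind: the match must be immediately preceded by an opening parenthesis ( (the backslash is doubled, \\, because the pattern sits inside an R string)
[A-Z][a-z][^()]* matches an uppercase letter, then a lowercase letter, then zero or more characters that are not parentheses ([^()]* was contributed by @WiktorStribiżew)
(?=\\)) is a positive lookahead: the match must be immediately followed by a closing parenthesis )
[12][0-9]{3} matches a year; I assume a year starts with 1 or 2 followed by three digits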
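
To see what the two lookarounds contribute, the first alternative can be tried on its own. This is only a minimal sketch: it uses base R's regmatches() and gregexpr() instead of the stringr call in the answer, and demo_text, paren_pattern and paren_hits are names made up for the illustration. perl = TRUE is passed because base R's default regex engine does not support lookarounds.

```r
# First alternative only: citations that sit entirely inside parentheses.
# (demo_text, paren_pattern and paren_hits are illustrative names.)
paren_pattern <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))"

demo_text <- "An abbreviation (abbr), one cite (Smith 2010), and two (Nye et al. 2010; Lesser 2016)."

# perl = TRUE switches to the PCRE engine, which understands (?<=...) and (?=...)
paren_hits <- regmatches(demo_text, gregexpr(paren_pattern, demo_text, perl = TRUE))[[1]]

print(paren_hits)
# [1] "Smith 2010"                   "Nye et al. 2010; Lesser 2016"
```

The parentheses are not part of the result because lookarounds only test the surrounding text without consuming it. The two-citation group is still a single string at this point, which is why the answer splits on "; " afterwards.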

The following regex handles the special case of the pattern Cooper (2015):

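[A-Z][a-z]+ \\([12][0-9]{3}[^()]* matches an uppercase letter followed by one or more lowercase letters, then a space, then an opening parenthesis (, then the "year" defined above, then any further characters that are not parentheses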
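
The second alternative can be checked in isolation in the same way. Again this is a hedged sketch in base R rather than the answer's own pipeline, and narrative_text, narrative_pattern, narrative_hits and narrative_clean are hypothetical names chosen for the example.

```r
# Second alternative only: an author name followed by a parenthesised year.
# (All variable names here are illustrative.)
narrative_pattern <- "[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"

narrative_text <- "As Cooper (2015) showed, and later Brown (2020a) confirmed, SST (2015) rose."

narrative_hits <- regmatches(narrative_text,
                             gregexpr(narrative_pattern, narrative_text, perl = TRUE))[[1]]

# The match still contains the opening parenthesis, so drop it,
# mirroring the gsub("\\(", "", .) step in the answer.
narrative_clean <- sub("(", "", narrative_hits, fixed = TRUE)

print(narrative_clean)
# [1] "Cooper 2015" "Brown 2020a"
```

"SST (2015)" is skipped because the pattern requires at least one lowercase letter directly before the space, and the trailing [^()]* is what lets a suffix such as the "a" in 2020a through.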

Archived:	3 years, 8 months ago
Views:	260
Last activity:	3 years, 7 months ago