How to extract a phrase with a limited number of words after a specific word?

Roy*_*Roy 6 regex string r

I have the following text, and I want to extract the 5 words that follow specific words from a string vector:

my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remains fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."

my_teams <- tolower(c("Brazil", "Argentina"))

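I want to extract the next 5 words after the word Brazil or Argentina and then combine them into a single string.

I used the following script to get the exact words, but not the phrases that follow them: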

library(stringr)  # provides str_extract_all()

pattern <- paste(my_teams, collapse = "|")

v <- unlist(str_extract_all(tolower(my_text), pattern))

paste(v, collapse=' ')

Any suggestions would be appreciated. Thanks!

Wik*_*żew 5

You can use

library(stringr)
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remain fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
pattern <- paste0("(?i)\\b(?:", paste(my_teams, collapse = "|"), ")\\s+(\\S+(?:\\s+\\S+){4})")
res <- lapply(str_match_all(my_text, pattern), function (m) m[,2])
v <- unlist(res)
paste(v, collapse=' ')
# => [1] "from the top of the won the final within 90"

See the R demo. You can also check the regex demo. Note the use of str_match_all, which keeps the captured text.
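To make the point about captured text concrete, here is a minimal sketch comparing the two stringr functions. The input string `x` and the shortened pattern (two words after the team name instead of five) are made up for illustration and are not part of the answer.

```r
library(stringr)

# Hypothetical short input, used only for illustration.
x <- "Brazil won the cup and Argentina lost the game"

# Same shape as the answer's pattern, but capturing two words instead of five.
pattern <- "(?i)\\b(?:brazil|argentina)\\s+(\\S+(?:\\s+\\S+){1})"

# str_extract_all() returns whole matches, team name included.
str_extract_all(x, pattern)[[1]]
# [1] "Brazil won the"     "Argentina lost the"

# str_match_all() returns one matrix per input string: column 1 holds the
# whole match, column 2 holds capture group 1 (the words after the team).
m <- str_match_all(x, pattern)[[1]]
m[, 2]
# [1] "won the"  "lost the"
```

Selecting column 2 is what lets the answer drop the team name without a second pass over the matches.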


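Details

  • (?i) - case-insensitive matching
  • \b - a word boundary
  • (?:Brazil|Argentina) - one of the countries
  • \s+ - one or more whitespace characters
  • (\S+(?:\s+\S+){4}) - Group 1: one or more non-whitespace characters, then four repetitions of one or more whitespace characters followed by one or more non-whitespace characters (five words in total).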
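The pattern described above can also be parameterised and run without stringr. The following is only a sketch under a few assumptions: `words_after()` is a hypothetical helper name, the targets are assumed to be plain words with no regex metacharacters, and the PCRE `\K` operator is swapped in for the capture group so that base R's `regmatches()` returns just the following words.

```r
# Hypothetical helper (not from the answer): return the `n` words that
# follow any of `targets` in `text`, using base R only.
words_after <- function(text, targets, n = 5) {
  stopifnot(n >= 1)
  # Assumes targets contain no regex metacharacters.
  alternation <- paste(targets, collapse = "|")
  # \K drops everything matched so far (team name and whitespace) from the
  # reported match; {n - 1} repeats "whitespace + word" after the first word.
  pattern <- sprintf("(?i)\\b(?:%s)\\s+\\K\\S+(?:\\s+\\S+){%d}",
                     alternation, n - 1)
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up input for illustration.
x <- "Had Argentina won the final within 90 minutes, so Brazil stayed top."

words_after(x, c("brazil", "argentina"))
# [1] "won the final within 90"

words_after(x, c("brazil", "argentina"), n = 2)
# [1] "won the"     "stayed top."
```

As with the original pattern, a target followed by fewer than `n` words, or followed directly by punctuation, produces no match.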


Ben*_*ker 5

Maybe not the best, but:

Split into a vector of words, remove non-word characters, and lowercase (to match the targets):

words <- strsplit(my_text,'\\s', perl= TRUE)[[1]] |>
    gsub(pattern = "\\W", replacement = "", perl = TRUE) |>
    tolower()

Find the target positions, get the strings, and paste them back together:

loc <- which(words %in% my_teams)
sapply(loc, \(i) words[(i+1):(i+5)], simplify= FALSE) |>
    sapply(paste, collapse=" ")

[1] "have failed to dislodge brazil"    "from the top of the"              
[3] "won the final within 90"           "in the last eight tournaments"    
[5] "the 1998 finalists getting beyond"

Maybe you also want a paste(., collapse = " ") at the end?
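One thing to watch with the indexing approach: `words[(i+1):(i+5)]` yields `NA` entries when a target sits fewer than five words from the end of the text. A possible way to wrap the steps, including the final `paste()`, is sketched below. `next_words()` is a hypothetical name and the bounds check is an addition, not part of the answer.

```r
# Hypothetical wrapper around the split-and-index approach, with a guard
# against indexing past the end of the word vector.
next_words <- function(text, targets, n = 5) {
  # Split on whitespace, strip non-word characters, lowercase.
  words <- strsplit(text, "\\s+", perl = TRUE)[[1]]
  words <- tolower(gsub("\\W", "", words, perl = TRUE))
  loc <- which(words %in% tolower(targets))
  phrases <- vapply(loc, function(i) {
    idx <- i + seq_len(n)
    idx <- idx[idx <= length(words)]  # drop positions past the end
    paste(words[idx], collapse = " ")
  }, character(1))
  # Skip targets that had nothing after them, then join into one string.
  paste(phrases[nzchar(phrases)], collapse = " ")
}

# Made-up input for illustration.
next_words("Brazil won again, and Argentina lost", c("Brazil", "Argentina"), n = 3)
# [1] "won again and lost"
```

Unlike the regex answer, this version still matches targets that are followed by punctuation, because punctuation is stripped before the lookup.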