Regex to match strings whose first few characters are repeated twice

Question

Lam*_*ard 2 regex string r regular-language

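I am facing a problem in R: from a vector of strings, I need to find every string whose first few (>= 2) characters occur twice in the string.
For example: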

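Strings that should be selected:
(1) allochirally ------> the first 3 characters 'all' occur twice in the string
(2) froufrou ------> the first 4 characters 'frou' occur twice in the string
(3) undergrounder ------> the first 5 characters 'under' occur twice in the string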

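Strings that should NOT be selected:
(1) gummage ------> the first character 'g' does occur twice, but it is only 1 character, so the condition of >= 2 leading characters is not met
(2) hypergoddess ------> no leading characters occur twice
(3) kgashga ------> 'ga' occurs twice, but it does not include the first character 'k', so the condition that the repeated part must start at the first character is not met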

I have heard that backreferences (e.g. \b or \w) might help, but I still cannot figure it out. Could you help?

Note: I have seen an approach that uses xmatch <- str_extract_all(x, regex) == x, where str_extract_all comes from library(stringr).

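library(stringr)

x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
regex <- "as described details here"  # placeholder for the pattern being asked about
# select_matched is an illustrative name; the original function was anonymous
select_matched <- function(x, regex) {
  # TRUE where the extracted match equals the whole string
  xmatch <- str_extract_all(x, regex) == x
  x[xmatch]
}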

A concise solution would be preferred. Thanks!

Answer 1

Tim*_*sen 5

Use grepl:

x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
grepl("^(.{2,}).*\\1.*$", x)

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

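The regex pattern matches and captures the first two or more characters, then uses the backreference \\1 to assert that the same characters appear again later in the string.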
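
As a rough illustration of what the capture group ends up holding, the sketch below pulls out the captured prefix with sub(). The names pattern, hit and prefix are illustrative only, and perl = TRUE is an assumption added here (it is not part of the answer above) so that greedy matching reports the longest repeating prefix.

```r
x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
pattern <- "^(.{2,}).*\\1.*$"

# Which strings match at all (perl = TRUE is an assumption, see above)
hit <- grepl(pattern, x, perl = TRUE)

# For matching strings, replace the whole match with group 1 (the prefix);
# non-matching strings get NA instead of being returned unchanged
prefix <- ifelse(hit, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
prefix
# [1] "all"   "frou"  "under" NA      NA      NA
```

The group is tried from the longest candidate downwards, so for froufrou it settles on 'frou' rather than 'fr'.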

If you want to use the logic from my answer to get a vector of the matching strings, simply use:

x[grepl("^(.{2,}).*\\1.*$", x)]

[1] "allochirally"  "froufrou"      "undergrounder"
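
If the minimum prefix length needs to vary, the same idea could be wrapped in a small helper. This is only a sketch: select_repeated_prefix and min_len are hypothetical names, perl = TRUE is an assumption, and the trailing .*$ is dropped because grepl only needs to find the repeated prefix somewhere after the start.

```r
# Hypothetical helper: keep strings whose first `min_len` or more characters
# occur again later in the same string
select_repeated_prefix <- function(x, min_len = 2) {
  # Build e.g. "^(.{2,}).*\\1" for min_len = 2
  pattern <- sprintf("^(.{%d,}).*\\1", min_len)
  x[grepl(pattern, x, perl = TRUE)]
}

x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
select_repeated_prefix(x)
# [1] "allochirally"  "froufrou"      "undergrounder"
select_repeated_prefix(x, min_len = 4)
# [1] "froufrou"      "undergrounder"
```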
