Sim*_*lon 13 regex pcre r strsplit
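A comment on my answer to this question points out that the strsplit call that should give the desired result does not, even though the regex appears to correctly match the first and last commas in a character vector. This can be demonstrated with gregexpr and regmatches.
So why does strsplit split on every comma in this example, even though regmatches returns only two matches for the same regex?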
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh? What is going on here?
Cas*_*yte 10
@Aprillion's theory is exactly right. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
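In other words, at each iteration ^ matches the beginning of a new string (one that no longer contains the preceding items), so the first alternative of the regex can match again after every split.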
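The documented loop can be sketched in R to make this visible. This is only an illustration of the pseudocode above, not the actual implementation of strsplit: the helper name `split_sketch` is hypothetical, and the sketch deliberately refuses zero-length matches rather than guessing how strsplit treats them.

```r
# Hypothetical helper that mimics the documented strsplit loop using regexpr().
# Assumption: zero-length matches are out of scope and raise an error.
split_sketch <- function(x, pattern) {
  out <- character(0)
  repeat {
    if (!nzchar(x)) break                      # the string is empty
    m <- regexpr(pattern, x, perl = TRUE)      # search the REMAINING string only
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start != -1L) {
      if (len == 0L) stop("zero-length matches are not handled in this sketch")
      out <- c(out, substr(x, 1, start - 1))   # text to the left of the match
      x <- substr(x, start + len, nchar(x))    # drop the match and everything before it
    } else {
      out <- c(out, x)                         # no match: keep the remainder
      break
    }
  }
  out
}

res_commas <- split_sketch("123,34,56,78,90", "^\\w+\\K,|,(?=\\w+$)")
res_anchor <- split_sketch("12345", "^.")
print(res_commas)
print(res_anchor)
```

Because the pattern is re-applied to the shortened string each time, `^\\w+\\K,` finds a "first" comma on every pass, which is why the sketch, like strsplit, ends up splitting on all four commas.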
A simple illustration of this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here you can see the consequences of this behavior when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
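If the goal is still to split only on the first and last comma, one possible workaround (not part of the original answer) is to skip strsplit and let gregexpr match against the whole string once, then take the text between the matches with `regmatches(..., invert = TRUE)`. The variable names below are illustrative.

```r
x <- "123,34,56,78,90"
pattern <- "^\\w+\\K,|,(?=\\w+$)"

# gregexpr() scans the complete string in one pass, so ^ and $ keep
# referring to the original start and end of x.
m <- gregexpr(pattern, x, perl = TRUE)

# invert = TRUE returns the pieces between the matches instead of the matches.
pieces <- regmatches(x, m, invert = TRUE)[[1]]
print(pieces)
#[1] "123"      "34,56,78" "90"
```

This works here because the anchors are evaluated once against the full input, rather than against a string that shrinks after every split.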