Why does strsplit use positive lookahead and lookbehind assertion matches differently?

Question

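Common sense and a sanity check using gregexpr() indicate that the lookbehind and lookahead assertions below should each match at exactly one location in testString: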

testString <- "text XX text"
BB  <- "(?<= XX )"
FF  <- "(?= XX )"

as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5

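strsplit(), however, uses those match locations differently: it splits testString at one location when given the lookbehind assertion, but at two locations when given the lookahead assertion, and the second of those looks incorrect.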

strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"    

strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text"    " "       "XX text"

I have two questions: (Q1) What is going on here? And (Q2) how can strsplit() be made to behave better?

Update: Theodore Lytras' excellent answer explains what is going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).

Answer 1

The*_*ras 27

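I am not sure whether this qualifies as a bug, because I believe it is the expected behaviour according to the R documentation. From ?strsplit: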

The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
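
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed.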
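
The two edge cases in that note can be checked with an ordinary one-character pattern. This is only a small illustration; the input "abc" and the variable names are made up for the example.

```r
# A match at the very start of the string yields a leading empty string.
at_start <- strsplit("abc", "a")[[1]]
print(at_start)
# [1] ""   "bc"

# A match at the very end of the string yields no trailing empty string.
at_end <- strsplit("abc", "c")[[1]]
print(at_end)
# [1] "ab"
```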

The problem is that lookahead (and lookbehind) assertions are zero-length. For example, in this case:

FF <- "(?=funky)"
testString <- "take me to funky town"

gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"

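What happens is that the lone lookahead (?=funky) matches at position 12. So the first split consists of the string up to position 11 (everything left of the match), and that text is removed from the string together with the match, which, however, has zero length.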

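Now the remaining string is funky town, and the lookahead matches at position 1. But there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length. Taken literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when strsplit is called with an empty regex (argument split=""). After this the remaining string is unky town, which is returned as the last piece because it contains no match.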

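Lookbehinds are no problem: the text to the left of each match is split off and removed from the remaining string, and the lookbehind cannot match again at the start of what is left, so the algorithm never gets stuck.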
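
The walkthrough above can be sketched in R itself. This is a minimal emulation of the documented loop plus the single-character fallback described in this answer, not R's actual implementation. The function name split_sketch is hypothetical, and the rule "take one character when a zero-length match sits at position 1" is an assumption inferred from the outputs shown here.

```r
# Hypothetical emulation of the ?strsplit loop for a single string.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No match: the rest of the string is the last piece.
      out <- c(out, x)
      break
    }
    # Index of the last matched character (start - 1 for a zero-length match).
    end <- start + attr(m, "match.length") - 1L
    if (end > 0L) {
      # Emit the text left of the match, then drop it together with the match.
      out <- c(out, substr(x, 1, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Zero-length match at position 1: nothing can be removed, so
      # (assumed fallback) split off one character to make progress.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    }
  }
  out
}

ahead  <- split_sketch("take me to funky town", "(?=funky)")
# c("take me to ", "f", "unky town"), as in the strsplit() output above
behind <- split_sketch("text XX text", "(?<= XX )")
# c("text XX ", "text")
```

Because each search runs only on the remaining string, a lookbehind cannot see the text that was already removed, which is why the lookbehind case ends after a single split.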

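Admittedly, this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption that lookaheads have zero length. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.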

Answer 2

Jos*_*ien 16

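Based on Theodore Lytras' careful explication of strsplit()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character: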

testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"

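If this workaround is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration: split_before is a hypothetical name, not a base R function, and it assumes that pattern is a valid PCRE fragment. Note also that "." does not match a newline by default in PCRE, so no split would occur immediately after a newline.

```r
# Hypothetical helper (not part of base R): split x just before each
# occurrence of `pattern`, keeping the pattern text in the output.
split_before <- function(x, pattern) {
  # (?<=.) demands one character to the left of the split point, so the
  # assertion can never match at the very start of the remaining string.
  strsplit(x, paste0("(?<=.)(?=", pattern, ")"), perl = TRUE)
}

q_example <- split_before("text XX text", " XX ")[[1]]
# c("text", " XX text")

repeated <- split_before("funky take me to funky funky town", "funky")[[1]]
# c("funky take me to ", "funky ", "funky town")
```

Applied to the string from the question, this gives a single split in front of " XX ", and repeated occurrences no longer produce stray one-character pieces.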

Answer 3

fem*_*gon 5

Looks like a bug to me. It does not appear to be related to spaces specifically, but rather to any lone lookahead (positive or negative):

FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"  

FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f"                "unky take me to " "f"                "unky "           
# [5] "f"                "unky town"       


FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx"       "y"       "xxxxxxx"

It seems to work fine if the pattern is given something to consume along with the zero-width assertion, such as:

FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

Perhaps something like that could serve as a workaround.
