Split a string after a pattern occurs

all*_*nvc 6 regex r stringr

I have the following character vector. It contains two elements. Each element consists of two phrases collapsed together.

strings <- c("This is a phrase with a NameThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")

I want to split each element of the vector into its individual phrases. I have been trying something like this:

library(stringr)

str_split(strings, "\\B(?=[a-z|0-9][A-Z])")

which gives me almost what I am looking for:

[[1]]
[1] "This is a phrase with a Nam" "eThis is another phrase"

[[2]]
[1] "This is a phrase with the number 201" "9This is another phrase"

I want to split after the pattern, but cannot figure out how to do that.

I think I am close to a solution and would appreciate any help.

Cer*_*nce 4

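You need to match the position just before the uppercase letter, not the position before the last letter of the initial phrase (which is one character before the position you need). You can simply match a non-word boundary followed by a lookahead for an uppercase letter: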

str_split(strings, "\\B(?=[A-Z])")
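
As a rough check of the pattern above, the sketch below applies it to the vector from the question. It assumes the stringr package is installed; the variable name `parts` is only illustrative.

```r
library(stringr)

strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Combined with the lookahead, \\B only matches when the character before the
# capital is also a word character, so the "N" in " Name" (preceded by a
# space) and the "T" at the start of the string do not trigger a split.
parts <- str_split(strings, "\\B(?=[A-Z])")
parts
# [[1]]
# [1] "This is a phrase with a Name" "This is another phrase"
#
# [[2]]
# [1] "This is a phrase with the number 2019" "This is another phrase"
```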

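If the phrases can contain leading uppercase letters, but no uppercase letters once the lowercase letters have started, you can instead split them with a lookbehind for a digit or lowercase letter. No non-word boundary is needed this time: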

strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")
str_split(strings, "(?<=[a-z0-9])(?=[A-Z])")
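
To illustrate why the lookbehind matters for the all-caps case, here is a small comparison of the two patterns on the first string. This is only a sketch: the names `headline`, `by_boundary` and `by_lookbehind` are made up for the example, and it assumes stringr is installed.

```r
library(stringr)

headline <- "SHOCKING NEWS: someone did somethingThis is another phrase"

# The \\B pattern splits before every capital that follows a word character,
# so the all-caps words are broken apart as well.
by_boundary <- str_split(headline, "\\B(?=[A-Z])")[[1]]
length(by_boundary) > 2

# The lookbehind only allows a split right after a lowercase letter or digit.
by_lookbehind <- str_split(headline, "(?<=[a-z0-9])(?=[A-Z])")[[1]]
by_lookbehind
# [1] "SHOCKING NEWS: someone did something" "This is another phrase"
```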