I have the following character vector. It contains two elements, and each element consists of two phrases collapsed together.
strings <- c("This is a phrase with a NameThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")
I want to split these phrases apart within each element of the vector. I have been trying something like this:
library(stringr)
str_split(strings, "\\B(?=[a-z|0-9][A-Z])")
which almost gives me what I am looking for:
[[1]]
[1] "This is a phrase with a Nam" "eThis is another phrase"
[[2]]
[1] "This is a phrase with the number 201" "9This is another phrase"
I want to split after the pattern, but I cannot figure out how to do that.
I think I am close to a solution and would appreciate any help.
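You need to match the position right before the uppercase letter, not the position before the last letter of the first phrase (which is one character earlier than the position you need). You can simply match a non-word boundary followed by a lookahead for an uppercase letter: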
str_split(strings, "\\B(?=[A-Z])")
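To see why the original attempt lands one character early, the match positions of both patterns can be compared. This is only a sketch: it uses base R's `gregexpr()` with `perl = TRUE` (not part of the original answer) to locate the zero-width matches, and the variable names `attempt_pos` and `fixed_pos` are made up for illustration.

```r
strings <- c("This is a phrase with a NameThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Start position of the zero-width match for each pattern
# (perl = TRUE is required for lookaheads in base R).
attempt_pos <- unlist(gregexpr("\\B(?=[a-z|0-9][A-Z])", strings, perl = TRUE))
fixed_pos   <- unlist(gregexpr("\\B(?=[A-Z])", strings, perl = TRUE))

# Character sitting immediately after each split point
substr(strings, attempt_pos, attempt_pos)  # last character of the first phrase
substr(strings, fixed_pos, fixed_pos)      # uppercase letter that starts the second phrase

# The corrected split point is exactly one character later than the attempted one
fixed_pos - attempt_pos
```

The lookahead `(?=[a-z|0-9][A-Z])` has to see the last character of the first phrase plus the uppercase letter, so the match must sit before that last character. Looking ahead for `[A-Z]` alone moves the split point to where it belongs.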
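If the phrases can start with uppercase words, but contain no uppercase letters once the lowercase text has begun, you can instead split on a lookbehind for a digit or lowercase letter combined with the lookahead for an uppercase letter. No non-word boundary is needed this time: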
strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")
str_split(strings, "(?<=[a-z0-9])(?=[A-Z])")
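The difference between the two patterns shows up on the all-caps input. The sketch below uses base R's `strsplit()` with `perl = TRUE` as a stand-in for `str_split()` so that it runs without extra packages; the two functions can treat some zero-width patterns differently, so this is an illustration rather than a drop-in replacement. The names `by_lookbehind` and `by_boundary` are hypothetical.

```r
strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

# Lookbehind + lookahead: split only where a lowercase letter or digit
# is directly followed by an uppercase letter.
by_lookbehind <- strsplit(strings, "(?<=[a-z0-9])(?=[A-Z])", perl = TRUE)

# Non-word boundary + lookahead: also splits inside all-caps words
# such as "SHOCKING", because every inner capital follows a word character.
by_boundary <- strsplit(strings, "\\B(?=[A-Z])", perl = TRUE)

by_lookbehind          # two phrases per element
lengths(by_boundary)   # the first element is broken into many more pieces
```

So the non-word-boundary pattern is enough when capitals appear only at the start of words, while the lookbehind version is the safer choice when a phrase may begin with fully capitalized words.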