R: regex in strsplit (finding ", " followed by a capital letter)

Dav*_*vid 4 regex r strsplit

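Suppose I have a vector containing some strings that I would like to split based on a regular expression.

More precisely, I want to split the strings on a comma, followed by a space, followed by an uppercase letter (as I understand it, the regex looks like this: /(, [A-Z])/g, which works fine when I try it here).

When I try to do this in R, the regex doesn't seem to work, for example: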

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It doesn't find any splits. What am I doing wrong here?

Any help is greatly appreciated!

Wik*_*żew 8

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl = TRUE)

See the IDEONE demo.

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

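The regex ", (?=[A-Z])" contains a lookahead, (?=[A-Z]), which checks for an uppercase letter but does not consume it, so the letter stays at the start of the next piece. In R, a regex that contains lookarounds has to be used with perl=TRUE.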
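To make the difference concrete, here is a small sketch on a made-up string (the vector `s` and the result names are hypothetical, not from the question). It also shows why the original attempt found nothing: R takes the pattern as a plain string, so the `/` delimiters and the trailing `g` of a JavaScript-style literal are matched as literal characters.

```r
# Hypothetical toy input, not taken from the question
s <- "alpha, Beta, gamma, Delta"

# The slashes and the trailing g are matched literally, so nothing splits
literal <- strsplit(s, "/(, [A-Z])/g")[[1]]

# No lookahead: the capital letter is part of the match and gets dropped
consumed <- strsplit(s, ", [A-Z]", perl = TRUE)[[1]]

# Lookahead: only ", " is consumed, the capital letter is kept
kept <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

print(literal)   # "alpha, Beta, gamma, Delta"
print(consumed)  # "alpha" "eta, gamma" "elta"
print(kept)      # "alpha" "Beta, gamma" "Delta"
```

strsplit splits on every match it finds, so no equivalent of the `g` flag is needed.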

If the space is optional, or if there can be a double space between the comma and the uppercase letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl = TRUE)
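As a rough illustration of what `\\s*` buys you, the sketch below uses a made-up string (`messy` is a hypothetical name) in which the whitespace after the commas is inconsistent, and compares the fixed-space pattern with the flexible one.

```r
# Hypothetical toy input: no space, two spaces, and one space after commas
messy <- "Loans,Assets,  Bonds, cash"

# Exactly one space required: no comma here is followed by
# a single space and then a capital letter, so nothing splits
strict <- strsplit(messy, ", (?=[A-Z])", perl = TRUE)[[1]]

# Any amount of whitespace (including none) before the capital letter
flexible <- strsplit(messy, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(strict)    # "Loans,Assets,  Bonds, cash"
print(flexible)  # "Loans" "Assets" "Bonds, cash"
```

The last comma is left alone in both cases because it is followed by a lowercase letter.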

There is also a variant that supports Unicode letters (using \\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl = TRUE)
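A minimal sketch of where the two character classes differ, on a made-up string (`fr` is a hypothetical name). The accented capitals are written with `\u` escapes so the string is marked as UTF-8 regardless of the file encoding; this assumes the PCRE library behind your R build has Unicode property support, which is typical but not guaranteed.

```r
# Hypothetical toy input: "Énergie, Électricité, Österreich, eau"
fr <- "\u00c9nergie, \u00c9lectricit\u00e9, \u00d6sterreich, eau"

# [A-Z] covers only the ASCII capitals, so the accented ones are not matched
ascii_only <- strsplit(fr, ", (?=[A-Z])", perl = TRUE)[[1]]

# \p{Lu} matches any uppercase letter known to Unicode
unicode_aware <- strsplit(fr, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_only))     # 1 (no split)
print(length(unicode_aware))  # 3
print(unicode_aware)
```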