RegEx to split camelCase or TitleCase (advanced)

Jmi*_*ini 76 java regex title-case camelcasing

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example, in Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
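
The two failing cases are easy to reproduce. The snippet below is a minimal sketch that only wraps the pattern from the question; the class name `OriginalSplitDemo` and the helper `split` are made up for illustration.

```java
import java.util.Arrays;

class OriginalSplitDemo {
    // Pattern from the question: split before every uppercase letter,
    // except an uppercase letter at the very start of the string.
    static final String ORIGINAL = "(?<!^)(?=[A-Z])";

    static String[] split(String s) {
        return s.split(ORIGINAL);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("VALUE")));         // [V, A, L, U, E]
        System.out.println(Arrays.toString(split("eclipseRCPExt"))); // [eclipse, R, C, P, Ext]
    }
}
```

Every uppercase letter after the first position becomes a split point, which is why runs of capitals fall apart.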

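In my opinion, the result should be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase characters:

  • If the n characters are followed by lowercase characters, the groups should be: (n-1 characters) / (n-th character + lowercase characters)
  • If the n characters are at the end, the group should be: (n characters).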
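
To make the rule concrete, here is one way it could be written as a plain character scan, without any regex. This is only a sketch of the two conditions stated above: the names `CaseRuleSketch` and `splitWords` are hypothetical, and it uses `Character.isUpperCase` / `Character.isLowerCase` rather than a literal A-Z test.

```java
import java.util.ArrayList;
import java.util.List;

class CaseRuleSketch {
    static List<String> splitWords(String s) {
        List<String> words = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char cur = s.charAt(i);
            boolean nextIsLower = i + 1 < s.length()
                    && Character.isLowerCase(s.charAt(i + 1));
            // A lowercase letter followed by an uppercase letter starts a new word.
            boolean lowerToUpper = Character.isLowerCase(prev)
                    && Character.isUpperCase(cur);
            // Inside an uppercase run, the last capital belongs to the
            // following lowercase letters: (n-1 chars) / (n-th char + lowercase).
            boolean endOfUpperRun = Character.isUpperCase(prev)
                    && Character.isUpperCase(cur) && nextIsLower;
            if (lowerToUpper || endOfUpperRun) {
                words.add(s.substring(start, i));
                start = i;
            }
        }
        words.add(s.substring(start)); // trailing word, or the whole string
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("VALUE"));         // [VALUE]
        System.out.println(splitWords("eclipseRCPExt")); // [eclipse, RCP, Ext]
    }
}
```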

Any ideas on how to improve this regex?

NPE*_*NPE 106

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

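It works by forcing the negative lookbehind to ignore not only matches at the start of the string, but also matches where the uppercase letter is preceded by another uppercase letter. This handles cases such as "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". That is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every uppercase letter that is followed by a lowercase letter, except at the start of the string.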
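
As a quick check, the pattern can be run against all five inputs from the question. This is a small sketch around the regex above; the class name `LookbehindSplitDemo` and the helper `split` are invented for the example.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // First clause: before an uppercase letter that is neither at the start
    // nor preceded by another uppercase letter.
    // Second clause: before an uppercase letter followed by a lowercase one,
    // except at the start of the string.
    static final String REGEX =
            "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";

    static String[] split(String s) {
        return s.split(REGEX);
    }

    public static void main(String[] args) {
        String[] inputs = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String input : inputs) {
            System.out.println(input + " -> " + Arrays.toString(split(input)));
        }
        // value -> [value]
        // camelValue -> [camel, Value]
        // TitleValue -> [Title, Value]
        // VALUE -> [VALUE]
        // eclipseRCPExt -> [eclipse, RCP, Ext]
    }
}
```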

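  • @Igoru: Regex flavors vary. The question is about Java, not PHP, and so is the answer. (12 upvotes)
  • @Igoru: A "generic regex" is a fictional concept. (6 upvotes)
  • @igorsantos07: No, built-in regex implementations vary wildly between platforms. Some try to be Perl-like, some try to be POSIX-like, and some are somewhere in between or something else entirely. (3 upvotes)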

rid*_*ner 71

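It seems you are making this more complicated than it needs to be. For camelCase, the split position is simply anywhere an uppercase letter immediately follows a lowercase letter: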

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

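The only difference from your desired output is eclipseRCPExt, which I would argue is split correctly here.

Addendum - improved version

Note: this answer recently got an upvote, and I realized that there is a better way...

By adding a second alternative to the regex above, all of the OP's test cases are split correctly.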

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit 2013-08-24: added the improved version to handle the RCPExt -> RCP / Ext case.
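
For comparison, both patterns from this answer can be tried side by side in Java. The snippet is a sketch only; the class and constant names are made up for the example.

```java
import java.util.Arrays;

class TwoAlternativeSplitDemo {
    // Simple version: split only between a lowercase and an uppercase letter.
    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";
    // Improved version: also split inside an uppercase run, just before
    // the capital that starts a lowercase word.
    static final String IMPROVED =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    public static void main(String[] args) {
        String s = "eclipseRCPExt";
        System.out.println(Arrays.toString(s.split(SIMPLE)));   // [eclipse, RCPExt]
        System.out.println(Arrays.toString(s.split(IMPROVED))); // [eclipse, RCP, Ext]
        System.out.println(Arrays.toString("VALUE".split(IMPROVED))); // [VALUE]
    }
}
```

Because both alternatives use a positive lookbehind, neither can match at position 0, so no explicit start-of-string guard is needed.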

  • Thanks for your help; I modified your regex to add a couple of options that take digits in the string into account: `(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[0-9])(?=[A-Z][a-z])|(?<=[a-zA-Z])(?=[0-9])` (2 upvotes)
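
A quick check of the digit-aware pattern from the comment, taken here as four lookaround alternatives (this exact form is an assumption about what the commenter intended). The class name `DigitSplitDemo` is hypothetical.

```java
import java.util.Arrays;

class DigitSplitDemo {
    // Assumed reading of the commenter's pattern:
    //   lower|Upper, Upper|Upper+lower, digit|Upper+lower, letter|digit
    static final String WITH_DIGITS =
            "(?<=[a-z])(?=[A-Z])"
            + "|(?<=[A-Z])(?=[A-Z][a-z])"
            + "|(?<=[0-9])(?=[A-Z][a-z])"
            + "|(?<=[a-zA-Z])(?=[0-9])";

    public static void main(String[] args) {
        System.out.println(Arrays.toString("my2Variables".split(WITH_DIGITS)));  // [my, 2, Variables]
        System.out.println(Arrays.toString("myVariable123".split(WITH_DIGITS))); // [my, Variable, 123]
    }
}
```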

YMo*_*omb 29

Another solution is to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase


dea*_*dog 10

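I couldn't get aix's solution to work (it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for: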

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

Here is an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

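Here I am separating each word with a space, so here are some examples of how strings are transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE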
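
Since the question is about Java, the same matching approach can be ported from the script above. The following is a rough sketch, not part of the original answer: `WordSpacer` and `spaceWords` are made-up names, and `String.replaceAll` with `trim` stands in for the `RegExReplace` and `Trim` calls.

```java
class WordSpacer {
    // Same pattern as in the answer: each match is one word.
    static final String WORD =
            "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))";

    // Appends a space after every matched word, then trims the trailing one.
    static String spaceWords(String s) {
        return s.replaceAll(WORD, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("ThisIsATitleCASEString")); // This Is A Title CASE String
        System.out.println(spaceWords("andThisOneIsCamelCASE"));  // and This One Is Camel CASE
    }
}
```

Unlike the lookaround-only patterns, this one consumes characters, so it fits `replaceAll` or a `Matcher.find` loop rather than `String.split`.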

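The solution above does what the original post asks for, but I also needed a regex to find camel and pascal case strings that include numbers, so I came up with this variation to include numbers: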

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

And an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

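Here are some examples of how strings with numbers are transformed using this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too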
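
The number-aware variant can be tried in Java in the same way. Again this is just a sketch with made-up names (`NumberWordSpacer`, `spaceWords`), using `replaceAll` in place of `RegExReplace`. Note that lowercase letters directly after a digit match none of the alternatives, so they are left untouched, which is where "rdVariable" comes from.

```java
class NumberWordSpacer {
    // Pattern from the answer, with digit runs treated as words of their own.
    static final String WORD =
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)"
            + "|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Appends a space after every matched word, then trims the trailing one.
    static String spaceWords(String s) {
        return s.replaceAll(WORD, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("myVariable123"));        // my Variable 123
        System.out.println(spaceWords("my2Variables"));         // my 2 Variables
        System.out.println(spaceWords("The3rdVariableIsHere")); // The 3 rdVariable Is Here
    }
}
```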


Chr*_*röm 6

To handle more letters than just A-Z:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

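Either:

  • Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

  • Split after any letter that is followed by an uppercase letter and a lowercase letter.

E.g. XMLParser -> XML, Parser.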
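
The difference from the A-Z patterns only shows up with non-ASCII letters. The sketch below compares the two on the German word pair "straße" + "Übergang", written with Unicode escapes so the source stays ASCII; the class and constant names are invented for the example.

```java
import java.util.Arrays;

class UnicodeSplitDemo {
    // ASCII-only character classes.
    static final String ASCII_ONLY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";
    // Unicode letter categories: Ll = lowercase, Lu = uppercase, L = any letter.
    static final String UNICODE =
            "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    public static void main(String[] args) {
        // U+00DF is sharp s (lowercase), U+00DC is U with diaeresis (uppercase).
        String s = "stra\u00dfe\u00dcbergang";
        // The ASCII pattern finds no boundary: one element.
        System.out.println(Arrays.toString(s.split(ASCII_ONLY)));
        // The Unicode pattern splits before U+00DC: two elements.
        System.out.println(Arrays.toString(s.split(UNICODE)));
    }
}
```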


In a more readable form:

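// Imports assume JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test
// and org.junit.jupiter.api.Assertions.assertEquals instead.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;
import org.junit.Test;

public class SplitCamelCaseTest {

    // Position between a lowercase letter and an uppercase letter.
    static final String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    // Position after any letter, before an uppercase letter followed by a lowercase letter.
    static final String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static final Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER + "|" + BEFORE_UPPER_AND_LOWER
    );

    // Splits at the word boundaries and joins the words with single spaces.
    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }
}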