How to split CamelCase in Python

Apl*_*nus 43 python regex camelcasing

What I am trying to achieve is something like this:

>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']

So I searched and found this perfect regular expression:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

As the next logical step I tried:

>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']

Why does this not work, and how do I get the result from the linked question in Python?

Edit: Solution summary

I tested all the provided solutions with a few test cases:

string:                 ''
AplusKminus:            ['']
casimir_et_hippolyte:   []
two_hundred_success:    []
kalefranz:              string index out of range # with modification: either [] or ['']

string:                 ' '
AplusKminus:            [' ']
casimir_et_hippolyte:   []
two_hundred_success:    [' ']
kalefranz:              [' ']

string:                 'lower'
all algorithms:         ['lower']

string:                 'UPPER'
all algorithms:         ['UPPER']

string:                 'Initial'
all algorithms:         ['Initial']

string:                 'dromedaryCase'
AplusKminus:            ['dromedary', 'Case']
casimir_et_hippolyte:   ['dromedary', 'Case']
two_hundred_success:    ['dromedary', 'Case']
kalefranz:              ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']

string:                 'CamelCase'
all algorithms:         ['Camel', 'Case']

string:                 'ABCWordDEF'
AplusKminus:            ['ABC', 'Word', 'DEF']
casimir_et_hippolyte:   ['ABC', 'Word', 'DEF']
two_hundred_success:    ['ABC', 'Word', 'DEF']
kalefranz:              ['ABCWord', 'DEF']

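In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, which violates the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string for an empty-string input, while the solution by @200_success returns an empty list. I don't know where the Python community stands on that issue, so I am fine with either one. Since 200_success's solution is simpler, I accepted it as the correct answer.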

200*_*ess 34

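As @nfs explained, re.split() never splits on an empty pattern match (this holds for Python versions before 3.7). Therefore, instead of splitting, you should try to find the components you are interested in.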
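For context, here is a minimal sketch of how the question's pattern behaves on newer interpreters. It assumes Python 3.7 or later, where re.split() was changed to also split on empty matches. The names BOUNDARY and split_on_boundaries are made up for this illustration and do not come from the question or the answer.

```python
import re

# Zero-width boundaries: lowercase followed by uppercase,
# or an acronym followed by a capitalized word.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_on_boundaries(identifier):
    """Split at zero-width matches (hypothetical helper; assumes Python 3.7+)."""
    return BOUNDARY.split(identifier)


print(split_on_boundaries("CamelCaseXYZ"))  # ['Camel', 'Case', 'XYZ']
print(split_on_boundaries("XYZCamelCase"))  # ['XYZ', 'Camel', 'Case']
```

On older versions the same call returns the input unsplit, which is the behavior shown in the question.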

Here is a solution that uses re.finditer() to emulate splitting:

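from re import finditer

def camel_case_split(identifier):
    # Lazily consume characters up to the next case boundary or the end of the string.
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]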
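As a rough check of the function above against a few cases from the question's summary, here is a self-contained sketch. The function is repeated so the snippet runs on its own, and the sample inputs are only illustrative.

```python
from re import finditer


def camel_case_split(identifier):
    # Each match is the shortest run of characters that ends at a case
    # boundary (or at the end of the string), so the pieces tile the input.
    pattern = r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)'
    return [m.group(0) for m in finditer(pattern, identifier)]


for sample in ['', ' ', 'dromedaryCase', 'ABCWordDEF']:
    print(repr(sample), '->', camel_case_split(sample))
```

The empty string yields an empty list because `.+?` needs at least one character to match.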


Jos*_*ush 19

Use re.sub() and split()

import re

name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()

Result

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']

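  • Nice. Even just `re.sub('([A-Z]+)', r' \1', name).split()` works for simple cases where you have no inputs like `'XYZCamelCase'` and `'IPAddress'` (or if you are fine with getting `['XYZCamel', 'Case']` and `['IPAddress']` for them). The other `re.sub` accounts for these cases too (it makes each sequence of lowercase letters attach to only one preceding uppercase letter). (4 upvotes)
  • By far the best answer IMHO, elegant and effective; it should be the accepted answer. (2 upvotes)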
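To make the comment above concrete, here is a small sketch that compares the single substitution with the two nested ones. The helper names one_pass and two_pass are invented for this illustration.

```python
import re


def one_pass(name):
    # Put a space before every run of uppercase letters, then split on whitespace.
    return re.sub('([A-Z]+)', r' \1', name).split()


def two_pass(name):
    # Second substitution: put a space before each capital that is followed by
    # lowercase letters, which detaches it from a preceding acronym.
    spaced = re.sub('([A-Z]+)', r' \1', name)
    return re.sub('([A-Z][a-z]+)', r' \1', spaced).split()


for sample in ['CamelCase', 'XYZCamelCase', 'IPAddress']:
    print(sample, one_pass(sample), two_pass(sample))
```

Both helpers agree on 'CamelCase'; they differ only when an acronym runs directly into a capitalized word.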

Set*_*top 8

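Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE, but I try to avoid them in programs.

Here is a quite straightforward solution in pure Python: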

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", the index of "U" is appended twice, so empty slices have to be filtered out
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]


And some tests

def test():
    TESTS = [
        ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
        ("Ta", ['Ta']),
        ("aT", ['a', 'T']),
        ("a", ['a']),
        ("T", ['T']),
        ("", []),
        ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
        ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
        ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ]
    for (q,a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()

  • Thank you, this is readable, it works, and it has tests! Much better than the regexp solutions, in my opinion. (2 upvotes)

Cas*_*yte 7

Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):

re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')

returns

['Camel', 'Case', 'XYZ']

To deal with dromedaryCase as well, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')

Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
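A quick way to convince yourself that the two lookaheads behave the same is to run both forms over a handful of identifiers. This is only a sketch: the names EXPLICIT, SHORTENED and SAMPLES are invented here, and the comparison assumes plain identifiers without newlines.

```python
import re

# Pattern from the answer, with the explicit "uppercase or end" lookahead.
EXPLICIT = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
# The same pattern written with the double-negation shortcut.
SHORTENED = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

SAMPLES = ['camelCaseXYZ', 'XYZCamelCase', 'CamelCase', 'UPPER', 'lower']

for sample in SAMPLES:
    # Both forms should agree on these inputs.
    assert EXPLICIT.findall(sample) == SHORTENED.findall(sample)
    print(sample, '->', SHORTENED.findall(sample))
```

The two forms can differ when the string ends in a newline, because `$` also matches just before a trailing newline while `(?![^A-Z])` does not.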


emy*_*ler 6

I just stumbled upon this case and wrote a regular expression to solve it. It should actually work for any group of words.

import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)

The key here is the lookahead in the first alternative. It matches (and preserves) all-uppercase words that come before capitalized ones:

assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
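A few more inputs illustrate how the alternatives interact, including digits and separators. This is a self-contained sketch that repeats the pattern above; the sample strings are chosen only for illustration.

```python
import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)

for sample in ['XYZCamelCase', 'CamelCaseTest123', 'snake_case']:
    # Characters that match no alternative (such as "_") are simply skipped.
    print(sample, '->', RE_WORDS.findall(sample))
```

Unlike the splitting approaches, findall() drops anything that is not a letter or digit, so the pieces do not necessarily join back into the original string.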


end*_*sol 6

import re

re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result

# '(?<=[a-z])'         --> means preceding lowercase char (group A)
# '(?=[A-Z])'          --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on