Split a string at uppercase letters

Fed*_*oni 81 python regex string

What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at every occurrence of an uppercase letter (possibly except the first) and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I would like to obtain ['A', 'B', 'C'].

Mar*_*ers 120

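Unfortunately, it is not possible to split on a zero-width match in Python (this holds for versions before 3.7). But you can use re.findall instead: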

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']

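  • Note that this drops any characters before the first uppercase character. 'theLongAndWindingRoad' would result in ['Long', 'And', 'Winding', 'Road'] (10 upvotes)
  • @MarcSchulder: If you need that case, just use `'[a-zA-Z][^A-Z]*'` as the regex. (10 upvotes)
  • To split lowerCamelCase words: `print(re.findall('^[a-z]+|[A-Z][^A-Z]*', 'theLongAndWindingRoad'))` (4 upvotes)
  • "ThatLeadsToYourDooooor" <3 (2 upvotes)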
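
The patterns from the comments above can be folded into one small helper. This is only a sketch: the name `split_camel` is hypothetical, and it assumes ASCII letters. Anything before the first uppercase letter is kept as its own piece instead of being dropped.

```python
import re

def split_camel(text):
    """Split before each ASCII uppercase letter, keeping any leading prefix."""
    # '^[^A-Z]+' keeps whatever precedes the first uppercase letter;
    # '[A-Z][^A-Z]*' matches one uppercase letter plus everything up to the next one.
    return re.findall(r'^[^A-Z]+|[A-Z][^A-Z]*', text)

print(split_camel('theLongAndWindingRoad'))  # ['the', 'Long', 'And', 'Winding', 'Road']
print(split_camel('TheLongAndWindingRoad'))  # ['The', 'Long', 'And', 'Winding', 'Road']
print(split_camel('ABC'))                    # ['A', 'B', 'C']
```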

Dav*_*rby 27

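Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":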

>>> import re
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most of the other solutions do not.
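
To make that trade-off concrete, here is a minimal sketch of the same idea wrapped in a function (the name `split_with_spaces` is hypothetical). It keeps a lowercase prefix and digits, but since the final step is a plain `str.split()`, any whitespace already in the input also becomes a split point.

```python
import re

def split_with_spaces(text):
    # Insert a space before every ASCII uppercase letter, then split on whitespace.
    return re.sub(r'([A-Z])', r' \1', text).split()

print(split_with_spaces('theLongAndWindingRoad'))  # ['the', 'Long', 'And', 'Winding', 'Road']
print(split_with_spaces('A123B45'))                # ['A123', 'B45']
print(split_with_spaces('hello world'))            # ['hello', 'world']
```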


Joh*_*ooy 18

>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']

>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split into ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*".


End*_*nis 11

Use a lookahead:

In Python 3.7 and later, you can do this:

re.split('(?=[A-Z])', 'theLongAndWindingRoad')

It yields:

['the', 'Long', 'And', 'Winding', 'Road']
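
One thing to watch with the bare lookahead: if the string starts with an uppercase letter, the zero-width match at position 0 yields an empty first element. The sketch below (the helper name `split_before_upper` is hypothetical) adds a negative lookbehind for the start of the string to avoid that. It assumes Python 3.7+ and ASCII uppercase letters.

```python
import re

def split_before_upper(text):
    # (?<!^) refuses to match at the very start of the string, so a leading
    # uppercase letter does not produce an empty first element.
    # (?=[A-Z]) is the zero-width lookahead that marks each split point.
    return re.split(r'(?<!^)(?=[A-Z])', text)

print(re.split(r'(?=[A-Z])', 'TheLong'))            # ['', 'The', 'Long']
print(split_before_upper('TheLongAndWindingRoad'))  # ['The', 'Long', 'And', 'Winding', 'Road']
print(split_before_upper('theLongAndWindingRoad'))  # ['the', 'Long', 'And', 'Winding', 'Road']
print(split_before_upper('ABC'))                    # ['A', 'B', 'C']
```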


use*_*088 8

A Pythonic way could be:

>>> "".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']

It works well with Unicode and avoids re/re2.

>>> "".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']


Gab*_*abe 6

import re
# list() is needed on Python 3, where filter() returns a lazy iterator
list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))

or

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]


pwd*_*son 6

A variation on @ChristopheD's solution:

s = 'TheLongAndWindingRoad'

# Index of every uppercase character; the appended 'A' acts as an end sentinel.
pos = [i for i,e in enumerate(s+'A') if e.isupper()]
parts = [s[pos[j]:pos[j+1]] for j in range(len(pos)-1)]

print(parts)

  • Nice one - it also works with non-Latin characters. The regex solutions shown here do not. (2 upvotes)
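
As written, the index-based variant above drops anything before the first uppercase character (for example, 'theLong' gives only ['Long']). A possible adjustment, sketched here under the assumption that the prefix should be kept, is to always start a piece at index 0. The function name `split_on_upper_unicode` is hypothetical. It relies on `str.isupper`, so it is not limited to A-Z.

```python
def split_on_upper_unicode(text):
    # Indices of every uppercase character, as judged by str.isupper (Unicode-aware).
    cuts = [i for i, ch in enumerate(text) if ch.isupper()]
    # Always start a piece at index 0 so a lowercase prefix is kept.
    if not cuts or cuts[0] != 0:
        cuts.insert(0, 0)
    # The end of the string closes the last piece.
    cuts.append(len(text))
    # Slice between consecutive cut points, skipping empty slices.
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

print(split_on_upper_unicode('theLongAndWindingRoad'))  # ['the', 'Long', 'And', 'Winding', 'Road']
print(split_on_upper_unicode('СуперМаркеты'))           # ['Супер', 'Маркеты']
print(split_on_upper_unicode('ABC'))                    # ['A', 'B', 'C']
```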

shr*_*use 6

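I think a better answer might be to split the string into words that do not end in an uppercase letter. This handles the case where the string does not start with an uppercase letter.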

 re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')

Example:

>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']


use*_*655 5

src = 'TheLongAndWindingRoad'
glue = ' '

result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)


Tot*_*oro 5

Another one without regex, with the option to keep contiguous uppercase characters together if wanted:

def split_on_uppercase(s, keep_contiguous=False):
    """

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to 
                                keep contiguous uppercase chars together

    Returns:
        list of str: the parts of s, in their original order
    """

    string_length = len(s)
    is_lower_around = (lambda: s[i-1].islower() or 
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])

    return parts

>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']