TIM*_*MEX 338 python regex string
Suppose this is the string:
The   fox jumped   over    the log.
This should result in:
The fox jumped over the log.
What is the simplest 1-2 liner that can do this, without splitting and going into lists?
Tay*_*ese 493
With foo as your string:
" ".join(foo.split())
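Note, though, that this removes "all whitespace characters (space, tab, newline, return, formfeed)" (thanks to hhsaffar, see comments). That is, "this is \t a test\n" will effectively end up as "this is a test".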
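To make the difference concrete, here is a minimal sketch that compares the split/join idiom with a pattern that matches literal spaces only. The variable names (`sample`, `joined`, `spaces_only`) are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical sample containing runs of spaces, a tab and a trailing newline.
sample = "this   is  \t a   test\n"

# split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace, so the tab and the newline disappear as well.
joined = " ".join(sample.split())

# A pattern made of literal spaces only collapses runs of U+0020;
# the tab and the newline are left in place.
spaces_only = re.sub(" +", " ", sample)

print(repr(joined))
print(repr(spaces_only))
```

If tabs and newlines must be preserved, the space-only pattern is probably the safer of the two.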
Jos*_*Lee 448
>>> import re
>>> re.sub(' +', ' ', 'The     quick brown    fox')
'The quick brown fox'
Nas*_*sir 80
import re
s = "The   fox jumped   over    the log."
re.sub(r"\s\s+" , " ", s)
or
re.sub(r"\s\s+", " ", s)
since the space before the comma is listed as a pet peeve in PEP 8, as mentioned by moose in the comments.
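One thing worth noting about this pattern: it needs at least two whitespace characters to match, so a lone tab or newline is left alone. The sketch below illustrates this with a made-up string; `two_or_more` and `one_or_more` are illustrative names only.

```python
import re

# Hypothetical input: a double space, a single tab, and a double newline.
s = "one  two\tthree\n\nfour five"

# r"\s\s+" matches runs of two or more whitespace characters,
# so the single tab between "two" and "three" is not replaced.
two_or_more = re.sub(r"\s\s+", " ", s)

# r"\s+" matches runs of one or more, so the single tab becomes a space too.
one_or_more = re.sub(r"\s+", " ", s)

print(repr(two_or_more))
print(repr(one_or_more))
```

Which behaviour is wanted depends on whether single tabs and newlines should survive.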
pyt*_*rry 48
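Using regexes with "\s" and doing simple string.split()'s will also remove other whitespace, such as newlines, carriage returns, and tabs. Unless this is desired, to collapse only multiple spaces, I present these examples.
EDIT: As I'm wont to do, I slept on this, and besides correcting a typo in the last results (v3.3.3 @ 64-bit, not 32-bit), the obvious hit me: the test string was rather trivial.
So, I got ... 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum to get more realistic time tests. I then added random-length extra spaces throughout: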
original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
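I also corrected the "proper join"; if one cares, the one-liner will essentially strip any leading/trailing spaces, whereas this corrected version preserves a leading/trailing space (but only ONE ;-). (I found this because the randomly spaced lorem_ipsum got extra spaces on the end and thus failed the assert.)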
# setup = '''
import re

def while_replace(string):
    # Keep halving runs of spaces until no double space is left
    while '  ' in string:
        string = string.replace('  ', ' ')
    return string

def re_replace(string):
    return re.sub(r' {2,}', ' ', string)

def proper_join(string):
    split_string = string.split(' ')
    # To account for leading/trailing spaces that would simply be removed
    beg = ' ' if not split_string[ 0] else ''
    end = ' ' if not split_string[-1] else ''
    # versus simply ' '.join(item for item in string.split(' ') if item)
    return beg + ' '.join(item for item in split_string if item) + end

original_string = """Lorem ipsum ... no, really, it kept going... malesuada enim feugiat. Integer imperdiet erat."""

assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)
#'''
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string
# re_replace_test
new_string = original_string[:]
new_string = re_replace(new_string)
assert new_string != original_string
# proper_join_test
new_string = original_string[:]
new_string = proper_join(new_string)
assert new_string != original_string
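NOTE: The "while version" made a copy of original_string, as I believe that once it was modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times show the difference only in logic. Keep in mind that on timeit instances the setup is executed only once per timing run, while the main stmt is repeated; the original way I did this, the while loop worked on the same label, original_string, so on the second run there would be nothing left to do. The way it's set up now, calling a function and using two different labels, that isn't a problem. I've added assert statements to all the workers to verify that we change something on every iteration (for those who may be dubious). E.g., change to this and it breaks: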
# while_replace_test
new_string = original_string[:]
new_string = while_replace(new_string)
assert new_string != original_string # will break the 2nd iteration
while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')
Tests run on a laptop with an i5 processor running Windows 7 (64-bit).
timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)
test_string = 'The fox jumped over\n\t the log.' # trivial
Python 2.7.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001066   | 0.001260   | 0.001128   | 0.001092
re_replace_test      | 0.003074   | 0.003941   | 0.003357   | 0.003349
proper_join_test     | 0.002783   | 0.004829   | 0.003554   | 0.003035

Python 2.7.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001025   | 0.001079   | 0.001052   | 0.001051
re_replace_test      | 0.003213   | 0.004512   | 0.003656   | 0.003504
proper_join_test     | 0.002760   | 0.006361   | 0.004626   | 0.004600

Python 3.2.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001350   | 0.002302   | 0.001639   | 0.001357
re_replace_test      | 0.006797   | 0.008107   | 0.007319   | 0.007440
proper_join_test     | 0.002863   | 0.003356   | 0.003026   | 0.002975

Python 3.3.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001444   | 0.001490   | 0.001460   | 0.001459
re_replace_test      | 0.011771   | 0.012598   | 0.012082   | 0.011910
proper_join_test     | 0.003741   | 0.005933   | 0.004341   | 0.004009
test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"
Python 2.7.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.342602   | 0.387803   | 0.359319   | 0.356284
re_replace_test      | 0.337571   | 0.359821   | 0.348876   | 0.348006
proper_join_test     | 0.381654   | 0.395349   | 0.388304   | 0.388193

Python 2.7.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.227471   | 0.268340   | 0.240884   | 0.236776
re_replace_test      | 0.301516   | 0.325730   | 0.308626   | 0.307852
proper_join_test     | 0.358766   | 0.383736   | 0.370958   | 0.371866

Python 3.2.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.438480   | 0.463380   | 0.447953   | 0.446646
re_replace_test      | 0.463729   | 0.490947   | 0.472496   | 0.468778
proper_join_test     | 0.397022   | 0.427817   | 0.406612   | 0.402053

Python 3.3.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.284495   | 0.294025   | 0.288735   | 0.289153
re_replace_test      | 0.501351   | 0.525673   | 0.511347   | 0.508467
proper_join_test     | 0.422011   | 0.448736   | 0.436196   | 0.440318
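For the trivial string, it would seem that a while loop is the fastest, followed by the Pythonic string split/join, with regex bringing up the rear.
For non-trivial strings, there seems to be a bit more to consider. 32-bit 2.7? It's regex to the rescue! 2.7 64-bit? A while loop is best, by a decent margin. 32-bit 3.2, go with the "proper" join. 64-bit 3.3, go for a while loop. Again.
In the end, one can improve performance if/where/when needed, but it's always best to remember the mantra:
IANAL, YMMV, Caveat Emptor!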
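For anyone who wants to repeat the comparison on their own interpreter, the following is a rough sketch of a timing harness in the same spirit, not the exact script used for the numbers above. The test data (`sample`), the `split_join` variant, and the repeat/number counts are assumptions chosen for illustration. Because each function returns a new string and never modifies its input, every timed call starts from the same data.

```python
import re
import timeit


def while_replace(string):
    # Repeatedly replace double spaces until none remain.
    while "  " in string:
        string = string.replace("  ", " ")
    return string


def re_replace(string):
    # Collapse runs of two or more spaces with a regex.
    return re.sub(r" {2,}", " ", string)


def split_join(string):
    # Unlike the two above, this also drops leading/trailing whitespace
    # and treats tabs and newlines as separators.
    return " ".join(string.split())


# Hypothetical test data: no leading/trailing spaces, spaces only.
sample = "The   fox jumped   over    the log." * 200

candidates = {
    "while_replace": while_replace,
    "re_replace": re_replace,
    "split_join": split_join,
}

expected = while_replace(sample)
for name, func in candidates.items():
    # Check that every approach agrees on this input before timing it.
    assert func(sample) == expected
    timings = timeit.repeat(lambda: func(sample), repeat=5, number=200)
    print(f"{name:15s} min={min(timings):.6f}s")
```

The minimum of the repeats is usually the most stable figure to compare, and the ranking may well differ between Python versions, as the tables above suggest.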
Kev*_*tle 41
Have to agree with Paul McGuire's comment above. To me,
' '.join(the_string.split())
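is vastly preferable to whipping out a regex.
My measurements (Linux, Python 2.5) show split-then-join to be almost five times faster than doing the "re.sub(...)", and still three times faster if you precompile the regex once and do the operation multiple times. And it is by any measure easier to understand, and much more Pythonic.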
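For reference, "precompiling the regex once" could look like the sketch below. The names `MULTI_WS`, `collapse_regex`, `collapse_split`, and the sample `lines` are hypothetical; the answer does not show its own benchmark code, and whether the compiled pattern is faster on a given interpreter is something to measure rather than assume.

```python
import re

# Compiled once at import time and reused for every call.
MULTI_WS = re.compile(r"\s+")


def collapse_regex(text):
    # Replace each whitespace run with one space, then trim the ends
    # so the result is comparable to the split/join version.
    return MULTI_WS.sub(" ", text).strip()


def collapse_split(text):
    # The split-then-join idiom from the answer.
    return " ".join(text.split())


# Hypothetical inputs used only to show that both give the same result here.
lines = [
    "The   fox jumped   over    the log.",
    "  leading and trailing  ",
    "tabs\tand\nnewlines",
]

for line in lines:
    print(repr(collapse_regex(line)), repr(collapse_split(line)))
```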
小智 18
import re
Text = " You can select below trims for removing white space!! BR Aliakbar "
# trims all white spaces
print('Remove all space:', re.sub(r"\s+", "", Text), sep='')
# trims left space
print('Remove leading space:', re.sub(r"^\s+", "", Text), sep='')
# trims right space
print('Remove trailing spaces:', re.sub(r"\s+$", "", Text), sep='')
# trims both
print('Remove leading and trailing spaces:', re.sub(r"^\s+|\s+$", "", Text), sep='')
# replace more than one white space in the string with one white space
print('Remove more than one space:', re.sub(' +', ' ', Text), sep='')
Result:
"Remove all space:Youcanselectbelowtrimsforremovingwhitespace!!BRAliakbar"
"Remove leading space:You can select below trims for removing white space!! BR Aliakbar"
"Remove trailing spaces: You can select below trims for removing white space!! BR Aliakbar"
"Remove leading and trailing spaces:You can select below trims for removing white space!! BR Aliakbar"
"Remove more than one space: You can select below trims for removing white space!! BR Aliakbar"
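The leading/trailing trims in this answer can also be done with the built-in string methods instead of regexes. This is an alternative sketch, not what the answer itself uses; the sample `text` and the variable names are made up.

```python
# Hypothetical sample with leading, trailing and repeated inner spaces.
text = "  You can select   below trims  "

no_leading = text.lstrip()             # removes leading whitespace only
no_trailing = text.rstrip()            # removes trailing whitespace only
no_both = text.strip()                 # removes both ends, keeps inner runs
no_whitespace = "".join(text.split())  # removes every whitespace character

for value in (no_leading, no_trailing, no_both, no_whitespace):
    print(repr(value))
```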
Pet*_*ter 12
Similar to the previous solutions, but more specific: replace two or more whitespace characters with one space:
>>> import re
>>> s = "The   fox jumped   over    the log."
>>> re.sub(r'\s{2,}', ' ', s)
'The fox jumped over the log.'
HMS*_*HMS 11
A simple solution:
>>> import re
>>> s = "The   fox jumped   over    the log."
>>> print(re.sub(r'\s+', ' ', s))
The fox jumped over the log.
小智 11
I tried the following method, and it even works for extreme cases such as:
str1='          I   live    on    earth           '
' '.join(str1.split())
However, if you prefer a regular expression, it can be done as:
re.sub(r'\s+', ' ', str1)
Some extra processing is still needed to remove the single leading and trailing space that this leaves behind.
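One way to handle that remaining leading and trailing space is to call `strip()` on the result. A small sketch, assuming that trimming both ends is what is wanted; `collapsed` and `cleaned` are illustrative names.

```python
import re

str1 = "    I   live    on    earth    "

# The regex alone turns each whitespace run into one space,
# which still leaves one space at each end of the string.
collapsed = re.sub(r"\s+", " ", str1)

# strip() removes those two remaining spaces.
cleaned = collapsed.strip()

print(repr(collapsed))
print(repr(cleaned))
```

With the extra `strip()`, the regex route gives the same result as `' '.join(str1.split())` for this input.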
Cha*_*uad 11
Solution for Python developers:
import re
text1 = 'Python      Exercises    Are   Challenging   Exercises'
print("Original string: ", text1)
print("Without extra spaces: ", re.sub(' +', ' ', text1))
Output:
Original string:  Python      Exercises    Are   Challenging   Exercises
Without extra spaces:  Python Exercises Are Challenging Exercises
小智 9
This regex does the trick in Python 3.11:
re.sub(r'\s+', ' ', text)
The accepted answer in this thread did not work for me in Python 3.11 on a Mac:
re.sub(' +', ' ', 'The quick brown fox') # does not work for me
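The answer does not say why the space-only pattern failed. One possible explanation, offered here purely as an assumption, is that the gaps in the real input were not plain spaces but tabs or non-breaking spaces (U+00A0), which `' +'` does not match while `\s` does for `str` patterns in Python 3. The input below is made up to illustrate that case.

```python
import re

# Hypothetical input: the gaps are two tabs, two non-breaking spaces,
# and three ordinary spaces.
text = "The\t\tquick\u00a0\u00a0brown   fox"

# Only the run of ordinary spaces is collapsed.
space_only = re.sub(" +", " ", text)

# \s matches Unicode whitespace for str patterns, so every gap is collapsed.
any_whitespace = re.sub(r"\s+", " ", text)

print(repr(space_only))
print(repr(any_whitespace))
```

Printing `repr()` of the problematic string is a quick way to check which whitespace characters it actually contains.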
小智 6
This does exactly what you want:
old_string = 'The   fox jumped   over    the log. '
new_string = " ".join(old_string.split())
print(new_string)
This results in:
The fox jumped over the log.
小智 5
import re
string = re.sub('[ \t\n]+', ' ', 'The quick brown \n\n \t fox')
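This replaces every run of tabs, newlines, and multiple spaces with a single space.
You can also use the string-splitting technique on a Pandas DataFrame without needing .apply(..), which is useful if you need to perform the operation quickly on a large number of strings. Here it is in one line: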
df['message'] = (df['message'].str.split()).str.join(' ')