Remove all forms of URLs from a given string in Python

Pre*_*ter 6 python regex

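I'm new to Python and wondering whether there is a better solution for matching all forms of URLs that might be found in a given string. From Googling, there seem to be plenty of solutions that extract the domain, replace URLs with links, and so on, but none that remove them from the string. I've included an example below for reference. Thanks!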

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

dor*_*oru 7

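There is an error in your code (two, actually):

1. You should escape the second-to-last single quote by putting a backslash before it; otherwise it ends the string literal early: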

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

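2. You shouldn't use str as a variable name: it is the name of a built-in type (not a reserved keyword, but assigning to it shadows the built-in), so call it thestring or something else.

For example: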

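# -*- coding: utf-8 -*-
# The encoding line is needed on Python 2 because the pattern contains non-ASCII characters.
import re

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'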

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

Result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
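
The fix above can also be wrapped in a small helper. The following is only a sketch, assuming Python 3 (the thread itself targets Python 2.7); the names URL_PATTERN and remove_urls are hypothetical, and the whitespace cleanup is an addition that the answer does not perform.

```python
import re

# Same pattern as in the answer above, with the single quote inside the
# character class escaped. Compiled once so it can be reused.
URL_PATTERN = re.compile(
    r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)


def remove_urls(text):
    """Delete every URL match, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub('', text)
    # Removing a URL leaves the spaces on both sides of it; squeeze them.
    return re.sub(r'\s+', ' ', without_urls).strip()


sample = 'for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
print(remove_urls(sample))
```

Because the pattern nests one repeated group inside another, it may backtrack heavily on long runs of punctuation after a URL prefix, so it is probably best kept to short inputs.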


ker*_*rim 6

Include an encoding line at the top of your source file (the regex string contains non-ASCII symbols such as »), for example:

# -*- coding: utf-8 -*-
import re
...

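Also surround your regex string with triple single (or double) quotes, ''' or """, instead of single quotes, because the string itself already contains quote characters (' and ").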

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
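
To illustrate the quoting point on a much smaller pattern, here is a minimal sketch, assuming Python 3. The variable names and the sample text are made up for the illustration. It compares a character class written with a backslash-escaped quote against the same class inside a triple-quoted raw string.

```python
import re

# Single-quoted raw string: the inner ' must be written as \' so that it
# does not end the literal. The backslash stays in the pattern text.
escaped = r'[\'"]'

# Triple-quoted raw string: both quote characters can appear unescaped.
triple = r'''['"]'''

sample = 'it\'s a "test"'

# The pattern texts differ by one backslash, but both match the same characters.
print(len(escaped), len(triple))
print(re.findall(escaped, sample) == re.findall(triple, sample))
```

Either spelling gives the regex engine a class that matches ' and "; the triple-quoted form just avoids the escaping that was missing in the question's code.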

  • All you *need* is `#coding:utf-8`. Unless you are writing `-*- encoding: utf-8 -*-` (note the `en`), it does no good to decorate it with the Emacs-style `-*-`. (2 upvotes)
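
A way to sidestep the encoding declaration altogether is to keep the source file pure ASCII and spell the non-ASCII characters as escapes. This is not what the answers above propose; it is a sketch assuming Python 3, where `re` accepts `\xhh` and `\uXXXX` escapes in string patterns. The variable names and sample text are hypothetical.

```python
import re

# Literal non-ASCII characters, as in the thread's pattern. On Python 2
# this form requires an encoding declaration at the top of the file.
literal_class = re.compile(r'[«»“”‘’]')

# The same six characters written as escapes, so the pattern text is ASCII only:
# U+00AB, U+00BB, U+201C, U+201D, U+2018, U+2019.
escaped_class = re.compile(r'[\xab\xbb\u201c\u201d\u2018\u2019]')

sample = 'He said “hi” «there» and it’s ‘ok’'

# Both classes pick out the same characters from the sample.
print(literal_class.findall(sample) == escaped_class.findall(sample))
```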