我希望写一些似乎应该很容易的东西,但无论出于什么原因,我都很难理解它.
我正在寻找一个python函数,当传递一个字符串时,将传递该字符串与HTML编码围绕URL.
unencoded_string = "This is a link - http://google.com"
def encode_string_with_links(unencoded_string):
# some sort of regex magic occurs
return encoded_string
print encoded_string
'This is a link - <a href="http://google.com">http://google.com</a>'
Run Code Online (Sandbox Code Playgroud)
谢谢!
你需要的"正则表达式魔术"只是sub(它做了替代):
def encode_string_with_links(unencoded_string):
return URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)
Run Code Online (Sandbox Code Playgroud)
URL_REGEX 可能是这样的:
URL_REGEX = re.compile(r'''((?:mailto:|ftp://|http://)[^ <>'"{}|\\^`[\]]*)''')
Run Code Online (Sandbox Code Playgroud)
这是一个非常宽松的URL正则表达式:它允许使用mailto,http和ftp方案,之后几乎只会继续运行直到它遇到"不安全"的字符(除了你想要允许转义的百分比).如果需要,你可以更严格.例如,您可以要求百分比后跟有效的十六进制转义,或者只允许一个磅符号(对于片段)或强制查询参数和片段之间的顺序.不过,这应该足以让你入门.
谷歌搜索解决方案:
#---------- find_urls.py----------#
# Functions to identify and extract URLs and email addresses
import re
def fix_urls(text):
pat_url = re.compile( r'''
(?x)( # verbose identify URLs within text
(http|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
(\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r"]+ # or stuff then space, newline, tab, quote
[\w/]) # resource name ends in alphanumeric or slash
(?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
) # end of match group
''')
pat_email = re.compile(r'''
(?xm) # verbose identify URLs in text (and multiline)
(?=^.{11} # Mail header matcher
(?<!Message-ID:| # rule out Message-ID's as best possible
In-Reply-To)) # ...and also In-Reply-To
(.*?)( # must grab to email to allow prior lookbehind
([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
[A-Za-z0-9-]+ # definitely some local user: MERTZ@gnosis.cx
@ # ...needs an at sign in the middle
(\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
(?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
) # end of match group
''')
for url in re.findall(pat_url, text):
text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})
for email in re.findall(pat_email, text):
text = text.replace(email[1], '<a href="mailto:%(email)s">%(email)s</a>' % {"email" : email[1]})
return text
if __name__ == '__main__':
print fix_urls("test http://google.com asdasdasd some more text")
Run Code Online (Sandbox Code Playgroud)
编辑:调整您的需求