I have a list of links such as:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I'm trying to iterate over them to remove everything after html, so I have:
import re

cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
Which returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
Confused as to why it is including html in the capture group. Thanks for any help.
html is part of the matched text too, not just the (...) group. re.sub() replaces all of the matched text.
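A quick sketch (using one of the sample links above) shows the difference between the full match, which is what gets replaced, and the group:

import re

link = 'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>'
m = re.search(r'html(.*)', link)
print(m.group(0))  # 'html</guid>' -- the whole match, which re.sub() replaces
print(m.group(1))  # '</guid>'     -- only the (...) group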
Include the literal text html in the replacement:
cleanitems.append(re.sub(r'html(.*)', 'html', item))
Or, alternatively, capture that part in a group:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
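Either form leaves the .html suffix in place; a minimal sketch with two of the sample links:

import re

links = [
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
]
cleanitems = [re.sub(r'(html).*', r'\1', item) for item in links]
# ['http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html',
#  'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html']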
You may want to consider using a non-greedy match and the $ end-of-string anchor to avoid cutting off URLs that contain html more than once in the path, and to include the . dot so you really only match the .html extension:
cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
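To illustrate with a made-up URL (not from your list) that happens to contain html earlier in the path, the anchored pattern only strips what follows the .html extension:

import re

url = 'http://example.com/html-guides/page.html?partner=rss'  # hypothetical URL for illustration
print(re.sub(r'html(.*)', 'html', url))     # 'http://example.com/html' -- cut off too early
print(re.sub(r'\.html.*?$', r'.html', url)) # 'http://example.com/html-guides/page.html'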
However, if your goal is to remove the query string from the URLs, consider parsing the URL with urllib.parse.urlparse() and re-building it without the query string or fragment identifier:
from urllib.parse import urlparse
cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
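For example, a sketch over two of the sample links; note that the stray </guid> in the second path survives, which is what the caveat below refers to:

from urllib.parse import urlparse

links = [
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
]
cleanitems = [urlparse(item)._replace(query='', fragment='').geturl() for item in links]
# ['http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html',
#  'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>']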
However, this won't remove the erroneous chunks of HTML; if you are parsing these URLs out of an HTML document, consider using a real HTML parser rather than regular expressions.