I have a list of links such as:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I'm trying to iterate over them to remove everything after html, so I have:
import re

cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
Which returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
Confused as to why it is including html in the capture group. Thanks for any help.
html is part of the matched text too, not just the (...) group. re.sub() replaces all of the matched text.
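A quick sketch (using one of the sample links above) shows the difference between the full match, which is what gets replaced, and the group:

import re

link = 'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>'
m = re.search(r'html(.*)', link)
print(m.group(0))  # 'html</guid>' -- the whole match, which re.sub() replaces
print(m.group(1))  # '</guid>'     -- only the (...) group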
Include the literal text html in the replacement:
cleanitems.append(re.sub(r'html(.*)', 'html', item))
Or, alternatively, capture that part in a group:
cleanitems.append(re.sub(r'(html).*', r'\1', item))
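Either form leaves the .html suffix in place; a minimal sketch with two of the sample links:

import re

links = [
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
]
cleanitems = [re.sub(r'(html).*', r'\1', item) for item in links]
# ['http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html',
#  'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html']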
You may want to consider using a non-greedy match and the $ end-of-string anchor to avoid cutting off URLs that contain html more than once in the path, and to include the . dot so you really only match the .html extension:
cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
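To illustrate with a made-up URL (not from your list) that happens to contain html earlier in the path, the anchored pattern only strips what follows the .html extension:

import re

url = 'http://example.com/html-guides/page.html?partner=rss'  # hypothetical URL for illustration
print(re.sub(r'html(.*)', 'html', url))     # 'http://example.com/html' -- cut off too early
print(re.sub(r'\.html.*?$', r'.html', url)) # 'http://example.com/html-guides/page.html'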
However, if your goal is to remove the query string from the URLs, consider parsing the URL with urllib.parse.urlparse() and re-building it without the query string or fragment identifier:
from urllib.parse import urlparse
cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
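For example, a sketch over two of the sample links; note that the stray </guid> in the second path survives, which is what the caveat below refers to:

from urllib.parse import urlparse

links = [
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
]
cleanitems = [urlparse(item)._replace(query='', fragment='').geturl() for item in links]
# ['http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html',
#  'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>']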
However, this won't remove the erroneous chunks of HTML; if you are parsing these URLs out of an HTML document, consider using a real HTML parser rather than regular expressions.