kyr*_*nia 3 html python beautifulsoup
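I want to identify the URLs in an HTML file that request external resources.
I currently use the src attribute of the img and script tags, and the href attribute of the link tag (to identify CSS).
Should I be checking other tags to identify other resources?
For reference, my Python code is currently: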
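from bs4 import BeautifulSoup

html = read_in_file(file)  # read_in_file: helper defined elsewhere (not shown)
soup = BeautifulSoup(html, "html.parser")
# src=True / href=True skip tags that lack the attribute, so no KeyError is raised
image_src = [x['src'] for x in soup.find_all('img', src=True)]
css_link = [x['href'] for x in soup.find_all('link', href=True)]
script_src = []  # script tags often have no 'src' attribute, hence the try/except
for x in soup.find_all('script'):
    try:
        script_src.append(x['src'])
    except KeyError:
        pass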
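I have updated my code to capture the most common resources in HTML. Obviously, this does not account for resources requested from within CSS or JavaScript. Please comment if I am missing any tags.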
from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    """Return the value of `attribute` for every `tag` element that has it."""
    resources = []  # renamed from `list` to avoid shadowing the builtin
    for x in soup.find_all(tag):
        try:
            resources.append(x[attribute])
        except KeyError:
            pass  # this tag does not carry the attribute
    return resources

html = read_in_file(file)  # read_in_file: helper defined elsewhere (not shown)
soup = BeautifulSoup(html, "html.parser")

image_src = find_list_resources("img", "src", soup)
script_src = find_list_resources("script", "src", soup)
css_link = find_list_resources("link", "href", soup)
video_src = find_list_resources("video", "src", soup)
audio_src = find_list_resources("audio", "src", soup)
iframe_src = find_list_resources("iframe", "src", soup)
embed_src = find_list_resources("embed", "src", soup)
object_data = find_list_resources("object", "data", soup)
source_src = find_list_resources("source", "src", soup)
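As noted, resources requested from inside CSS are not covered by the tag/attribute lookup. One possible way to start on that gap is a regex pass over CSS text, for example the contents of a style tag or a style attribute. This is only a rough sketch using the standard library: the names `CSS_URL_RE`, `CSS_IMPORT_RE`, `find_css_urls`, and `sample_css` are made up for illustration, and a regex is not a real CSS parser, so it will also match `url(...)` inside CSS comments and will not handle escaped quotes.

```python
import re

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# Matches the quoted form: @import "file.css";
# The @import url(...) form is already caught by CSS_URL_RE.
CSS_IMPORT_RE = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)

def find_css_urls(css_text):
    """Return URLs referenced in a CSS string, skipping inline data: URIs."""
    urls = [m.group(2) for m in CSS_URL_RE.finditer(css_text)]
    urls += [m.group(2) for m in CSS_IMPORT_RE.finditer(css_text)]
    # data: URIs embed the resource itself, so they trigger no external request
    return [u for u in urls if u and not u.lower().startswith("data:")]

# Hypothetical CSS used only to demonstrate the function
sample_css = """
@import "base.css";
body { background: url('img/bg.png'); }
.icon { background-image: url(https://example.com/icon.svg); }
"""
print(find_css_urls(sample_css))
```

With BeautifulSoup, the input could come from something like `tag.get_text()` for each style tag, or from the `style` attribute collected with `find_list_resources`. Resources loaded by JavaScript at runtime would still be missed.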