Fir*_*ame 2 python beautifulsoup web-crawler
I am trying to parse a web page with Python and build a sitemap from it. I wrote the code below:
import urllib2
from bs4 import BeautifulSoup
mypage = "http://example.com/"
page = urllib2.urlopen(mypage)
soup = BeautifulSoup(page,'html.parser')
all_links = soup.find_all('a')
for link in all_links:
    print link.get('href')
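The code above prints every link (both external and internal) found on example.com.
"example.com" links to "example.com/page1", which in turn links to "example.com/page3". What is the ideal way to build a map of this kind of flow? I am looking for a library or some logic that shows "example.com" -> "example.com/page1" -> "example.com/page3" or something similar.
小智 5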
I wrote a piece of code that generates a sitemap.xml file in a Python Flask application:
import datetime
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

def registerSiteMaps():
    root = ET.Element('urlset')
    root.attrib['xmlns:xsi'] = "http://www.w3.org/2001/XMLSchema-instance"
    root.attrib['xsi:schemaLocation'] = "http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
    root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

    # `db` is the application's own database handle; it is not defined in this snippet.
    q = db.result
    for doc in q.results:
        uid = doc['uid']
        # Turn the stored uid into a URL path: '__' becomes '/', '_' becomes '-'.
        site_root = uid.replace('__', '/').replace('_', '-')
        dt = datetime.datetime.now().strftime("%Y-%m-%d")
        # One <url> element per document (named `url` so it does not shadow `doc`).
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = "https://www.example.com/" + site_root
        ET.SubElement(url, "lastmod").text = dt
        ET.SubElement(url, "changefreq").text = "weekly"
        ET.SubElement(url, "priority").text = "1.0"

    # Write the finished tree once, after all entries have been added.
    tree = ET.ElementTree(root)
    tree.write('sitemap.xml', encoding='utf-8', xml_declaration=True)
For more details, see this link.
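To get the "example.com" -> "example.com/page1" -> "example.com/page3" view the question asks for, one option is a breadth-first crawl that records which internal pages each page links to. The following is only a minimal sketch under some assumptions: `LinkCollector`, `crawl`, `fetch`, and `max_pages` are hypothetical names, and it uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs on Python 3 without extra dependencies. `fetch` is assumed to be any function that takes a URL and returns that page's HTML as a string.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; returns {page_url: [internal links found on it]}."""
    host = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        if url in graph:
            continue  # already visited
        parser = LinkCollector()
        parser.feed(fetch(url))
        links = []
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            absolute = urldefrag(urljoin(url, href)).url
            # Keep only links on the same host, without duplicates.
            if urlparse(absolute).netloc == host and absolute not in links:
                links.append(absolute)
        graph[url] = links
        queue.extend(link for link in links if link not in graph)
    return graph
```

For a live site, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`. Each key of the returned dictionary is a page and each value is the list of internal pages it links to, so printing `page -> link` for every pair gives the chain described in the question, and the keys themselves could be used as the `loc` entries of a sitemap.xml.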