Mir*_*ach 5 python beautifulsoup xml-parsing
I am using Beautiful Soup to parse HTML from an RSS page. How can I preserve the link tags?
The most promising code I have so far is:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc = uh.read()
soup = BeautifulSoup(html_doc, 'html.parser')
I tried adding import lxml and switching the code to:
soup = BeautifulSoup(html_doc, 'xml')
but this gave me an error:
ModuleNotFoundError: No module named 'lxml'
I expected the result to be
<link>https://someurl.org</link> but the output is <link/>someurl.org
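The ModuleNotFoundError means the lxml package is not installed. Beautiful Soup's xml parser depends on it, so install it first (for example with pip install lxml). With html.parser, <link> is treated as an HTML void element, which is why it comes out as <link/>. Changing the parser to xml fixes the <link> tags: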
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc = uh.read()
soup = BeautifulSoup(html_doc, 'xml') # <-- changing to 'xml'
for link in soup.select('link'):
    print(link.get_text(strip=True))
Prints:
https://advisories.ncsc.nl/rss/advisories
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0098
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0584
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0511
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0583
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0560
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0546
...and so on.
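If installing lxml is not an option, one possible alternative (not the approach used in the answer above) is to parse the feed with the standard library's xml.etree.ElementTree instead of Beautiful Soup. The following is only a minimal sketch: SAMPLE_RSS and extract_links are hypothetical names made up for illustration, and the inline feed stands in for the real RSS response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the real RSS response.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.org/rss</link>
    <item>
      <title>First item</title>
      <link>https://example.org/item?id=1</link>
    </item>
    <item>
      <title>Second item</title>
      <link>https://example.org/item?id=2</link>
    </item>
  </channel>
</rss>
"""


def extract_links(rss_text):
    """Return the text of every un-namespaced <link> element, in document order."""
    root = ET.fromstring(rss_text)
    # iter("link") walks the whole tree; elements with no text are skipped.
    return [el.text.strip() for el in root.iter("link") if el.text]


links = extract_links(SAMPLE_RSS)
for link in links:
    print(link)
```

For the real feed, the bytes returned by uh.read() could presumably be passed to extract_links directly, since ET.fromstring accepts bytes as well as str. Namespaced elements such as atom:link would not be matched by this sketch.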