BeautifulSoup HTML parsing breaks <link> tags

Mir*_*ach 5 python beautifulsoup xml-parsing

I am using Beautiful Soup to parse the markup of an RSS page. How can I preserve the <link> tags?

The most promising code so far is:

import urllib.request, urllib.parse, urllib.error 
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc= uh.read()
soup = BeautifulSoup(html_doc, 'html.parser')

I tried adding import lxml and switching the code to soup = BeautifulSoup(html_doc, 'xml'), but that gave me an error:

ModuleNotFoundError: No module named 'lxml'

I expect the result to be <link>https://someurl.org</link>, but the output is <link/>someurl.org

And*_*ely 1

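Changing the parser to 'xml' fixes the <link> tags. With 'html.parser', <link> is treated as an HTML void element and closed immediately, so the URL ends up outside the tag. The 'xml' parser depends on the lxml package, so the ModuleNotFoundError goes away once lxml is installed (for example with pip install lxml):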

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc= uh.read()
soup = BeautifulSoup(html_doc, 'xml')    # <-- changing to 'xml'

for link in soup.select('link'):
    print(link.get_text(strip=True))

Prints:

https://advisories.ncsc.nl/rss/advisories
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0098
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0584
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0511
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0583
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0560
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0546

...and so on.
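
To see where the <link/> form in the question comes from, here is a small offline sketch. The one-item fragment in sample is made up for illustration and is not the real feed. It parses the fragment with 'html.parser' and, as a possible workaround when lxml cannot be installed, reads the text node that follows the emptied tag.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed fragment, used here instead of the live URL.
sample = "<item><link>https://someurl.org</link></item>"

soup = BeautifulSoup(sample, "html.parser")
link = soup.find("link")

# In HTML, <link> is a void element, so html.parser closes it right away
# and the URL is left behind as a text node after the tag.
rendered = str(soup)
print(rendered)

# Possible workaround without lxml: take the text node following the tag.
url_after_link = link.next_sibling.strip()
print(url_after_link)
```

This relies on the URL being the very next sibling of the tag, which holds for simple feeds like the fragment above but is more fragile than parsing the feed as XML.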
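
If installing lxml is not an option, a different route is the standard library's xml.etree.ElementTree, which needs no extra packages. This swaps Beautiful Soup out for another parser rather than showing a Beautiful Soup feature. The sketch below uses a made-up feed (sample_rss) and a hypothetical helper (extract_links), and assumes the <link> elements are plain, non-namespaced RSS 2.0 elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS document standing in for the downloaded feed bytes.
sample_rss = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.org/rss</link>
    <item>
      <title>First</title>
      <link>https://example.org/advisory?id=1</link>
    </item>
    <item>
      <title>Second</title>
      <link>https://example.org/advisory?id=2</link>
    </item>
  </channel>
</rss>
"""


def extract_links(feed_bytes):
    """Return the text of every non-namespaced <link> element, in document order."""
    root = ET.fromstring(feed_bytes)
    # Skip elements with no text (for example empty or attribute-only links).
    return [el.text.strip() for el in root.iter("link") if el.text]


links = extract_links(sample_rss)
for url in links:
    print(url)
```

With the real feed, the bytes returned by uh.read() would be passed to extract_links in place of sample_rss. Namespaced elements such as atom:link are not matched by this lookup.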