Why can't BeautifulSoup correctly read/parse this RSS (XML) document?

jbr*_*aud 8 python xml rss beautifulsoup

YCombinator is nice enough to provide an RSS feed containing the top items on HackerNews. I am trying to write a python script to access the RSS feed document and then parse out certain pieces of information using BeautifulSoup. However, I'm getting some strange behavior when BeautifulSoup tries to get the contents of each of the items.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. Why is that?

As I dug into the contents of soup, I realized that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something wacky is happening with the link tag. It just picks up the close tag, and then the tag's text appears after it. This is very strange behavior, especially in contrast to title and comments, which are parsed without a problem.

This seems to be a problem with BeautifulSoup, because what is actually read in by requests looks fine. I don't think it's limited to BeautifulSoup, though, because I tried using the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this is happening, or how I can still use BeautifulSoup without this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but that doesn't seem to be a highly recommended library. I would like to continue using BeautifulSoup if possible.
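For reference, a minimal sketch of the xml.dom.minidom approach mentioned above (the inline sample string is illustrative, standing in for the full feed):

```python
from xml.dom.minidom import parseString

rss = """<rss version="2.0"><channel>
<item>
    <title>Example Story</title>
    <link>http://example.com/story</link>
    <comments>http://news.ycombinator.com/item?id=1</comments>
</item>
</channel></rss>"""

dom = parseString(rss)
for item in dom.getElementsByTagName('item'):
    # firstChild is the text node inside each element
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)
```

Because minidom is a real XML parser, the link element keeps its text content here.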

Update: I'm on OSX 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml, installed with pip. It is version 3.0.2. As for libxml, I checked /usr/lib and it shows libxml2.2.dylib. Not sure when or how that was installed.

jdo*_*dot 5

Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't access the link using soup.find_all('item').link is that when you first load the html into BeautifulSoup, it does something odd to the HTML:

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links
for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'No
tch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-d
ollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</ti
tle>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_
050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

Look closely - it has actually changed the first <link> tag to <link/> and then removed the </link> tag. (Likely this is because HTML parsers treat <link> as a void element that cannot contain content, so the text after it is pushed outside the tag.) Without fixing this in the BeautifulSoup.BeautifulSoup class initialization, you won't be able to use it for now.

Update:

I think your best (albeit hack-y) option for now is to use the following for link:

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'
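An alternative that avoids the next_sibling hack entirely (a sketch, assuming lxml is installed - the question's Update 2 suggests it is) is to tell BeautifulSoup to use an XML parser instead of an HTML one, so <link> is no longer treated as an HTML void element:

```python
from bs4 import BeautifulSoup

rss = """<rss version="2.0"><channel>
<item>
    <title>Example Story</title>
    <link>http://example.com/story</link>
    <comments>http://news.ycombinator.com/item?id=1</comments>
</item>
</channel></rss>"""

# Passing 'xml' selects lxml's XML parser, which preserves
# <link>...</link> instead of collapsing it to <link/>
soup = BeautifulSoup(rss, 'xml')
for item in soup.find_all('item'):
    print(item.title.text + ' - ' + item.link.text + ' - ' + item.comments.text)
```

With the XML parser the link element keeps its text, so the original loop from the question works unchanged.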