Inconsistent behavior of BeautifulSoup

Spa*_*ade 2 python beautifulsoup html-parsing web-scraping python-2.7

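I am completely baffled by the behavior of the following HTML-scraping code, which gives different results in two different environments, and I need help finding the root cause of the discrepancy.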

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On Machine 1, the run returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630  

On Machine 2, the very same code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

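The number of contigs counted differs: 630 on Machine 1 versus 462 on Machine 2. Note that the same code parses the same HTML (the MD5 sums match) and still produces different results in two environments that do not differ significantly from each other, which has unfortunately turned into a production nightmare. Manual inspection confirms that the result returned on Machine 2 is incorrect, but so far I cannot explain why.

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?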

ale*_*cxe 5

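You are seeing the differences between the parsers that BeautifulSoup automatically selects for the "html" markup type you specified. Which parser is chosen depends on which modules are available in the current Python environment:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

To get consistent behavior across platforms, be explicit: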
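As a rough illustration of the ranking quoted above, the sketch below checks which of the three parsers can be imported in the current environment. The names `PARSER_PREFERENCE`, `installed_parsers` and `likely_default_parser` are made up for this example and are not part of BeautifulSoup. The order simply mirrors the quoted documentation, so treat the output as a hint about what may be picked, not as a guarantee.

```python
import importlib

# Preference order as quoted from the documentation: lxml, html5lib, built-in.
# Each entry is (name passed to BeautifulSoup, module that must be importable).
# These names are hypothetical helpers, not BeautifulSoup API.
PARSER_PREFERENCE = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", None),  # ships with Python, so no import check is needed
]


def installed_parsers():
    """Return the parser names usable here, in the documented preference order."""
    found = []
    for parser_name, module_name in PARSER_PREFERENCE:
        if module_name is None:
            found.append(parser_name)
            continue
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # this third-party parser is not installed
        found.append(parser_name)
    return found


def likely_default_parser():
    """Best guess at the parser chosen when none is named explicitly."""
    return installed_parsers()[0]


if __name__ == "__main__":
    print("Installed parsers: %s" % ", ".join(installed_parsers()))
    print("Likely default: %s" % likely_default_parser())
```

Running this on both machines should show whether they have different sets of parsers installed, which would be consistent with the differing contig counts.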

soup = BeautifulSoup(html, "html.parser")
soup = BeautifulSoup(html, "html5lib")
soup = BeautifulSoup(html, "lxml")
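To check whether the parser choice explains a discrepancy like the one in the question, one option is to run the same row count under each explicitly named parser and compare the numbers. The following is only a sketch: `count_rows` and `SAMPLE_HTML` are hypothetical names, and the sample table is made-up data standing in for the downloaded page. It returns `None` when bs4 or the requested parser is missing, so that the script can run unchanged on every machine.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 itself is not installed in this environment
    BeautifulSoup = None

# Made-up sample standing in for the real page; replace with the downloaded HTML.
SAMPLE_HTML = """
<table class="zebra">
  <tr><td>1</td><td>contig-a</td><td>100</td></tr>
  <tr><td>2</td><td>contig-b</td><td>200</td></tr>
</table>
"""


def count_rows(html, parser):
    """Count <tr> rows with more than two <td> cells in the 'zebra' table.

    Returns None if bs4 or the requested parser is unavailable.
    """
    if BeautifulSoup is None:
        return None
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None  # the named parser is not installed here
    table = soup.find("table", {"class": "zebra"})
    if table is None:
        return 0
    return sum(1 for row in table.find_all("tr")
               if len(row.find_all("td")) > 2)


if __name__ == "__main__":
    for parser in ("html.parser", "html5lib", "lxml"):
        print("%s: %s" % (parser, count_rows(SAMPLE_HTML, parser)))
```

If the counts differ between parsers on the real page, pin the parser that gives the correct result and make sure it is installed on every machine.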

See also: Installing a parser