python lxml profiling beautifulsoup html-parsing
I'm migrating some parsers from BeautifulSoup3 to BeautifulSoup4, and I figured profiling would be a good idea, given that lxml is supposed to be super fast and it's the parser I use with BS4. Here are the profile results:

For BS3:
43208 function calls (42654 primitive calls) in 0.103 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:2(<module>)
18 0.000 0.000 0.000 0.000 <string>:8(__new__)
1 0.000 0.000 0.072 0.072 <string>:9(parser)
32 0.000 0.000 0.000 0.000 BeautifulSoup.py:1012(__init__)
1 0.000 0.000 0.000 0.000 BeautifulSoup.py:1018(buildTagMap)
...
For BS4 using lxml:
164440 function calls (163947 primitive calls) in 0.244 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.040 0.040 0.069 0.069 <string>:2(<module>)
18 0.000 0.000 0.000 0.000 <string>:8(__new__)
1 0.000 0.000 0.158 0.158 <string>:9(parser)
1 0.000 0.000 0.008 0.008 HTMLParser.py:1(<module>)
1 0.000 0.000 0.000 0.000 HTMLParser.py:54(HTMLParseError)
...
Why does BS4 make four times as many function calls? And why is it using HTMLParser at all, when I told it to use lxml?

The most notable changes I made going from BS3 to BS4 were these:
BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES) --->
BeautifulSoup(html, 'lxml')
[x.getText('**SEP**') for x in i.findChildren('font')[:2]] --->
[x.getText('**SEP**', strip=True) for x in i.findChildren('font')[:2]]
Everything else was just name changes (like findParent -> find_parent).
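As a sanity check (this snippet is not part of my original code), the soup object keeps a reference to the tree builder it was constructed with, so inspecting it can confirm whether lxml was actually selected:

```python
from bs4 import BeautifulSoup as BS4

soup = BS4('<p>test</p>', 'lxml')
# the soup object stores the tree builder chosen for the 'lxml' feature
print(type(soup.builder).__name__)  # e.g. LXMLTreeBuilder
```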
Edit:

My environment:
python 2.7.3
beautifulsoup4==4.1.0
lxml==2.3.4
Edit 2:

Here is a small code sample to try it out:
from cProfile import Profile
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup as BS4
import urllib2
def parse(html):
    soup = BS4(html, 'lxml')
    hl = soup.find_all('span', {'class': 'mw-headline'})
    return [x.get_text(strip=True) for x in hl]

def parse3(html):
    soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
    hl = soup.findAll('span', {'class': 'mw-headline'})
    return [x.getText() for x in hl]

if __name__ == "__main__":
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    html = ''.join(opener.open('http://en.wikipedia.org/wiki/Price').readlines())
    profiler = Profile()
    print profiler.runcall(parse, html)
    profiler.print_stats()
    profiler2 = Profile()
    print profiler2.runcall(parse3, html)
    profiler2.print_stats()
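For what it's worth, sorting the profiler output by cumulative time (instead of the default standard-name ordering shown above) makes the hot spots much easier to compare between the two runs. A minimal stdlib sketch, where `work` is just a stand-in for the `parse` call being profiled:

```python
import cProfile
import pstats

def work():
    # stand-in for the parse() call being profiled
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.runcall(work)

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')  # heaviest call chains first
stats.print_stats(10)           # show only the top 10 entries
```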