Python: using Beautiful Soup to extract specific content from HTML

Eri*_*ric 5 html python parsing beautifulsoup

So I've decided to parse the contents of a website. For example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients out into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

Within that, each individual ingredient is stored in:

<li class="plaincharacterwrap">

Someone was kind enough to provide code that uses regex, but it gets confusing when you move from site to site. So I'd like to use Beautiful Soup, since it has a lot of built-in features, except I'm confused about how to actually do it.

Code:

import re
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)

try:

        ingrdiv = soup.find('div', attrs={'class': 'ingredients'})

except IOError: 
        print 'IO error'

Is this the way to get started? I'd like to find the actual div class and then parse out all of the ingredients located within the li elements.

Any help would be appreciated! Thanks!

Hug*_*ell 4

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

The result is:

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
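For readers on Python 3, urllib2 and the BeautifulSoup 3 package above are no longer available; the same extraction can be sketched with the bs4 package (`pip install beautifulsoup4`). This is a minimal sketch run against an inline HTML snippet modeled on the page's structure, since the live page layout may well have changed since this answer was written:

```python
# Python 3 sketch of the same approach, using bs4 (BeautifulSoup 4).
# The HTML below is an inline stand-in for the live page, which may
# have changed; only the div/li structure from the question is assumed.
from bs4 import BeautifulSoup

html = """
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap">1/4 cup olive oil</li>
    <li class="plaincharacterwrap">1 cup chicken broth</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the ingredients container, then collect the text of each <li>.
div = soup.find("div", attrs={"class": "ingredients"})
ingredients = [li.get_text().strip() for li in div.find_all("li")]
print(ingredients)
```

For the live page you would fetch the HTML with `urllib.request.urlopen(url).read()` first; the parsing step is unchanged.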


A follow-up reply to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

which gives:

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

The regex is much faster (except when it's wrong); but if you consider loading the page and parsing it as one operation, BeautifulSoup still accounts for only about 20% of the total runtime. If you care greatly about speed, I'd recommend lxml.
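The lxml XPath line from the benchmark can be sketched on its own against an inline snippet (again a stand-in for the live page, assuming the lxml package is installed). The `//div[@class="ingredients"]//li/text()` expression selects the text node of every li anywhere under the matching div:

```python
# Self-contained sketch of the lxml extraction from the benchmark above.
# The inline HTML is an assumed stand-in for the live page's structure.
import lxml.html

html = """
<div class="ingredients">
  <ul>
    <li class="plaincharacterwrap">1 teaspoon dried oregano</li>
    <li class="plaincharacterwrap">salt and pepper to taste</li>
  </ul>
</div>
"""

tree = lxml.html.fromstring(html)
# XPath: every <li> text node under the div with class "ingredients".
items = [s.strip() for s in tree.xpath('//div[@class="ingredients"]//li/text()')]
print(items)
```

Note that `@class="ingredients"` is an exact string match; if the real page uses multiple classes on the div, a `contains(@class, "ingredients")` predicate would be needed instead.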