Eri*_*ric 5 html python parsing beautifulsoup
So I've decided to try to parse a website's content - for example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I want to parse the ingredients out into a text file. The ingredients are located in:
<div class="ingredients" style="margin-top: 10px;">
Inside that, each individual ingredient is stored in
<li class="plaincharacterwrap">
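Putting those two fragments together, the relevant markup presumably looks something like this (the list wrapper and the ingredient text here are assumptions for illustration, not copied from the page):

```html
<div class="ingredients" style="margin-top: 10px;">
    <ul>
        <li class="plaincharacterwrap">1/4 cup olive oil</li>
        <li class="plaincharacterwrap">1 cup chicken broth</li>
        <!-- ...one <li> per ingredient... -->
    </ul>
</div>
```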
Someone was kind enough to provide code using regular expressions, but that gets confusing as soon as you move from site to site. So I'd like to use Beautiful Soup, since it has a lot of built-in functionality - except I'm confused about how to actually do it.
Code:
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
soup = BeautifulSoup(html)
try:
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
except IOError:
    print 'IO error'
Is this how you would get started? I'd like to find the actual div class and then parse out all of the ingredients located inside the li elements.
Any help would be appreciated! Thanks!
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
The result is:
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
Follow-up in response to @eyquem:
from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html
start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"
# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"
# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2==res1)
# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s - same =", (res3==res1)
which gives:
Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s - same = True
lxml parse took 0.0100940499505 s - same = True
Run Code Online (Sandbox Code Playgroud)
The regex is much faster (unless it is wrong) - but if you consider loading the page and parsing it together, BeautifulSoup still only accounts for about 20% of the total runtime. If you really care about speed, I recommend lxml.
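That "unless it is wrong" caveat is the whole trade-off: the regex from the timing test hard-codes CRLF line endings and the exact attribute spacing, so a harmless change in the markup makes it silently return nothing instead of failing loudly. A minimal sketch (the HTML fragments here are made up for illustration):

```python
import re

# The pattern from the timing test above - note the literal \r\n it expects.
pat = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

# Same ingredient, two equally valid serializations of the markup.
crlf_html = '<li class="plaincharacterwrap">\r\n    1 cup chicken broth</li>\r\n'
lf_html = '<li class="plaincharacterwrap">\n    1 cup chicken broth</li>\n'

print(pat.findall(crlf_html))  # ['1 cup chicken broth']
print(pat.findall(lf_html))    # [] - same content, LF line endings, match lost
```

BeautifulSoup and lxml extract the ingredient from both fragments, because they match the document structure rather than the exact bytes.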
Viewed 10759 times.