Extracting the contents of a <script> tag with BeautifulSoup

lai*_*o b 8 python beautifulsoup python-2.7

1/ I am trying to extract part of a script using Beautiful Soup, but it prints nothing. What is wrong?

URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

for script in soup("script"):
        script.extract()

list_of_scripts = soup.findAll("script")
print list_of_scripts

2/ The goal is to extract the value of the "transcript" attribute:

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",  
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"Immediately my whole mouth was on fire.\"               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               \"Bottom line, so no one ever has to go through this again.\"               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

And*_*rds 61

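From the documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be "text", since those tags are not part of the human-visible content of the page.

So the accepted answer from falsetru above is basically all good, but use .string instead of .text with newer versions of Beautiful Soup, or you will be puzzled, as I was, by .text always returning None for <script> tags.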
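To illustrate the point about .string, here is a minimal Python 3 sketch. The inline HTML is a made-up stand-in for the real page, and html.parser is chosen only so the snippet runs without extra dependencies; neither comes from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Reuters HTML).
html = (
    '<html><head>'
    '<script type="application/ld+json">{"a": 1}</script>'
    '</head></html>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the single string child of the tag, which for a
# <script> element is its raw source text.
raw = tag.string
print(raw)
```

Reading the tag through .string avoids depending on how a given Beautiful Soup version decides what counts as "text".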

  • Thank you for an answer that covers the latest version of bs4 (2 upvotes)

fal*_*tru 23

extract removes the tag from the DOM. That is why you get an empty list.
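The effect of extract can be reproduced offline with a small Python 3 sketch. The HTML string below is invented for illustration, and find_all is the current spelling of the findAll call used in the question.

```python
from bs4 import BeautifulSoup

# Made-up document with a single <script> tag, for illustration only.
html = "<html><body><script>var x = 1;</script><p>hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Count the script tags before touching the tree.
before = len(soup.find_all("script"))

# Same loop as in the question: extract() detaches each tag from the tree.
for script in soup("script"):
    script.extract()

# Searching again finds nothing, because the tags are no longer in the tree.
after = len(soup.find_all("script"))
print(before, after)
```

Dropping the loop, or keeping the tags that extract() returns, leaves the script contents available for further processing.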


Find the script with the type="application/ld+json" attribute and decode it using json.loads. Then you can access the data like a Python data structure (a dict for the given data).

import json
import urllib2

from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']
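For readers on Python 3, the same approach might look like the sketch below. It is not the answerer's code: extract_transcript is a hypothetical helper name, SAMPLE_HTML is an abbreviated stand-in for the page so the example runs without network access, and .string is used in place of .text so it also works with Beautiful Soup 4.9.0 and later. To fetch the live page, urllib.request.urlopen(URL).read() is the Python 3 counterpart of the urllib2 call.

```python
import json

from bs4 import BeautifulSoup

# Abbreviated sample mirroring the structure shown in the question.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life."
    }
}
</script>
</head><body></body></html>
"""


def extract_transcript(html):
    """Return the transcript from the JSON-LD script tag, or None if absent.

    Hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    if tag is None or tag.string is None:
        return None
    # The script body is plain JSON, so json.loads yields a dict.
    data = json.loads(tag.string)
    return data.get("video", {}).get("transcript")


transcript = extract_transcript(SAMPLE_HTML)
print(transcript)
```

Returning None when the tag or key is missing is a design choice for this sketch; raising an exception may suit stricter pipelines better.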

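  • @laihob, that is a different question, isn't it? Anyway, try: `print ''.join(soup.find('span', id='articleText').strings)` (3 upvotes)
  • I had to make some modifications for my specific case, but this answer really helped me get most of the way there. (2 upvotes)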