Tag: beautifulsoup

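How do I find specific video HTML tags using Beautiful Soup?

Does anyone know how to do this with BeautifulSoup in Python?

I have a search engine with a list of different URLs.

I only want to get the HTML tags that contain a video embed URL, and extract the link.

Example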

import BeautifulSoup

html = '''https://archive.org/details/20070519_detroit2'''
# or this: html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
# or this: html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''

soup = BeautifulSoup.BeautifulSoup(html)


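What should I do next to get the HTML tag for the video or object, or the exact link to the video?

I need to put it in my iframe. I am integrating Python into my PHP, so I want to get the video link and output it with Python, then echo it into my iframe.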
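One possible direction, as a minimal sketch rather than a definitive answer: the code in the question passes a URL string to BeautifulSoup, whereas the parser needs the page's HTML (for example `requests.get(url).text`). The sketch below uses the `bs4` package (BeautifulSoup 4) instead of the BeautifulSoup 3 import in the question. The sample markup, the choice of tag names, and the helper name `find_video_links` are assumptions for illustration; real pages may embed video differently or build the player with JavaScript.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; in practice this would be the downloaded HTML,
# e.g. requests.get(url).text, not the URL string itself.
html = """
<html><body>
<iframe src="https://www.youtube.com/embed/fI3zBtE_S_k"></iframe>
<video><source src="/media/clip.mp4" type="video/mp4"></video>
<object data="/media/player.swf"></object>
<embed src="/media/movie.swf">
<p>no video here</p>
</body></html>
"""


def find_video_links(html):
    """Collect src/data URLs from tags commonly used to embed video."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # find_all accepts a list of tag names and returns matches in document order.
    for tag in soup.find_all(["iframe", "video", "source", "embed", "object"]):
        url = tag.get("src") or tag.get("data")
        if url:
            links.append(url)
    return links


video_links = find_video_links(html)
```

Printing `video_links[0]` from the Python script would give PHP a single URL to echo into the iframe's `src`.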

python beautifulsoup

Vin*_*ent

2013-12-04

1
vote

1
answer

6243
views

How do I find first-level descendants?

Please help me fix this script.

import pprint
import requests

import bs4


def get_catalog(url):
    req = requests.get(url)
    if req.status_code != requests.codes.ok:
        print('Error: ', req.status_code)
    else:
        soup = bs4.BeautifulSoup(req.text, 'html.parser')
        #print(soup)
        catalogMenu = soup.find('section', {'class': 'catalog'})
        catalogMenuList = catalogMenu.find('ul', {'class': 'topnav'})
        #print(catalogMenuList)

        return catalogMenuList


def parse_catalog_categories(catalogMenuList):
    catalogNames = []
    #li = catalogMenuList.findNext('li', limit=1)   #?????????????????
    pprint.pprint(li)


if __name__ == "__main__":
    url = 'http://first-store.ru/'
    catalogMenuList = get_catalog(url)
    if not catalogMenuList:
        print('Get catalog error')
    else:
        parse_catalog_categories(catalogMenuList)


The problem is that I cannot get all the li descendants at the first level of nesting, i.e.:

iphone, ipad, ipod, imac, etc...


but not:

iphone, iphone 5s, iphone 5s …
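A minimal sketch of one way to restrict the search to direct children: `find_all` accepts `recursive=False`, which only looks at the immediate children of the tag it is called on. The markup below is an assumed stand-in for the store's menu, not the real page.

```python
from bs4 import BeautifulSoup

# Assumed menu structure: top-level categories with a nested sub-list.
html = """
<section class="catalog">
<ul class="topnav">
  <li><a href="/iphone">iphone</a>
    <ul><li><a href="/iphone-5s">iphone 5s</a></li></ul>
  </li>
  <li><a href="/ipad">ipad</a></li>
  <li><a href="/ipod">ipod</a></li>
</ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
menu = soup.find("section", class_="catalog").find("ul", class_="topnav")

# recursive=False keeps only <li> tags that are direct children of the <ul>.
top_items = menu.find_all("li", recursive=False)
names = [li.a.get_text(strip=True) for li in top_items]
```

Here `names` holds only the top-level categories; the nested "iphone 5s" item is skipped because its `li` is a grandchild of the `topnav` list.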


python beautifulsoup python-3.x

Ser*_*gey

2014-02-26

1
vote

1
answer

1059
views

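Python BeautifulSoup: text from the first level only

I looked at another question about getting elements at the same level with BeautifulSoup. My situation seems to be slightly different.

This is the site: http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31

I am trying to get the table on the right. Notice how the first row of the table expands into a detailed breakdown of that data. I don't want that breakdown; I only want the top-level data. You can also see that other rows can be expanded too, although in this case they are not. So simply looping and skipping tr[2] probably won't work. I tried this: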

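import re
import requests
from bs4 import BeautifulSoup
page = 'http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31'
r = requests.get(page)
r.encoding = 'gb2312'
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', class_='right1').findAll('tr', {"class": re.compile('list.*')})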


But there are more list* rows nested at other levels. How do I get only the first level?
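A minimal sketch of one approach, assuming the breakdown rows live in a table nested inside an outer row: call `find_all` on the outer table with `recursive=False`, so only its direct child rows are considered. The markup is invented for illustration; the real page may wrap rows in a `tbody` (in which case the call would go on that element) or nest the details differently.

```python
import re
from bs4 import BeautifulSoup

# Assumed structure: the second top-level row contains a nested detail table.
html = """
<div class="right1">
  <table>
    <tr class="list1"><td>top A</td></tr>
    <tr class="list2"><td>
      <table><tr class="list3"><td>nested detail</td></tr></table>
    </td></tr>
    <tr class="list1"><td>top B</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="right1").find("table")

# Only rows that are direct children of the outer table are matched.
top_rows = outer.find_all("tr", class_=re.compile(r"^list"), recursive=False)
top_classes = [row["class"] for row in top_rows]
```

The nested `list3` row is left out because it is a descendant, not a direct child, of the outer table.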

python beautifulsoup

jas*_*son


1
vote

1
answer

2459
views

Understanding lambda functions in Python

I was looking at this post:

Python BeautifulSoup: wildcard attribute/id search

The answer gives this solution:

dates = soup.findAll("div", {"id" : lambda L: L and L.startswith('date')})

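I thought I understood lambda functions in Python. However, when I see lambda L: L and L.startswith('date'), I understand that it ends up matching ids whose value contains 'date'. But why is it written as L and L.startswith('date')? It looks as if the lambda returns both a string and a boolean.

Can someone explain the logic behind this?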
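A small sketch of what the expression evaluates to, which may help. In Python, `a and b` returns `a` if `a` is falsy and otherwise returns `b`, so the lambda returns one value, not two. As far as I understand, BeautifulSoup can call the filter with `None` for tags that have no `id` attribute, and the `L and` part guards against calling `.startswith` on `None`. The name `starts_with_date` and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# The filter from the question, given a name so it can be called directly.
starts_with_date = lambda L: L and L.startswith('date')

# `and` short-circuits: a falsy left side is returned as-is,
# otherwise the right side is evaluated and returned.
no_id = starts_with_date(None)          # left side is None, so None comes back
match = starts_with_date('date-2017')   # right side runs and gives a bool
miss = starts_with_date('footer')

html = '<div id="date-2017">a</div><div id="footer">b</div><div>c</div>'
soup = BeautifulSoup(html, 'html.parser')
dates = soup.find_all('div', {'id': starts_with_date})
```

Without the guard, `None.startswith('date')` would raise an `AttributeError` whenever the function receives `None`.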

python lambda beautifulsoup

myn*_*EFF

2017-05-23

1
vote

1
answer

242
views

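Removing certain tags with BeautifulSoup and Python

Question

I am trying to remove style tags such as <h2> and <div class=...> from an HTML file downloaded with BeautifulSoup. I do want to keep what the tags contain (such as the text), but this does not seem to work.

What I've tried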

import requests
from bs4 import BeautifulSoup

# urls and headers are defined elsewhere in the script
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("div", {"class": "product_specifications bottom_l js_readmore_content"})
    print("<hr style='border-width:5px;'>")
    for style in table.find_all('style'):
        if 'style' in style.attrs:
            del style.attrs['style']
    print(table)


URLs I've tried

Python HTML parsing with Beautiful Soup and filtering stop words

Remove class attribute from HTML using Python and lxml

BeautifulSoup tag removal
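A minimal sketch of one way to drop tags while keeping their contents, using `Tag.unwrap()`. Note that `find_all('style')` in the attempt above searches for `<style>` elements, not for tags that carry a `style` attribute, which may be why nothing changes. The sample markup and the choice of which tags to unwrap are assumptions.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the product specification block.
html = (
    '<div class="product_specifications">'
    '<h2 style="color:red">Specifications</h2>'
    '<div class="row"><b style="font-size:9px">Weight</b>: 2 kg</div>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", class_="product_specifications")

# unwrap() removes the tag itself but leaves its children in place.
for tag in table.find_all(["h2", "div"]):
    tag.unwrap()

# Drop inline style attributes from whatever tags remain.
for tag in table.find_all(True):
    tag.attrs.pop("style", None)

result = str(table)
```

`find_all` returns a plain list, so unwrapping tags while looping over it does not disturb the iteration.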

html python strip beautifulsoup

use*_*459

2017-05-23

1
vote

1
answer

6910
views

BeautifulSoup: TypeError: 'NoneType' object is not subscriptable

I need to get the "href" attribute from a link (a) tag.

I run

label_tag = row.find(class_='Label')
print(label_tag)


and I get (sorry, for privacy reasons I cannot show the link and text)

<a class="Label" href="_link_">_text_</a>


of type

<class 'bs4.element.Tag'>


But when I run (as shown in "BeautifulSoup getting href")

tag_link = label_tag['href']
print(tag_link)


I get the following error (on the first command)

TypeError: 'NoneType' object is not subscriptable


Any clues? Thanks in advance.

[SOLVED] Edit: I made a mistake (I was looping over elements with a heterogeneous structure).
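A minimal sketch of the situation described in the edit, assuming rows where only some contain a `Label` link: `find` returns `None` when nothing matches, and subscripting `None` raises the `TypeError` above. Checking for `None` before reading `href` avoids it. The sample rows are invented.

```python
from bs4 import BeautifulSoup

# Assumed heterogeneous rows: the middle one has no Label link.
html = """<table>
<tr><td><a class="Label" href="/a">A</a></td></tr>
<tr><td>no link here</td></tr>
<tr><td><a class="Label" href="/b">B</a></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
links = []
for row in soup.find_all("tr"):
    label_tag = row.find(class_="Label")
    if label_tag is None:
        # find() found nothing in this row; label_tag['href'] would fail here.
        continue
    # get() returns None instead of raising if the attribute is missing.
    links.append(label_tag.get("href"))
```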

html python beautifulsoup html-parsing

dra*_*mnl

2017-05-23

1
vote

1
answer

5079
views

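How do I sanitize input to avoid malicious attributes in Django?

I want to allow users to post images, so I need to add |safe to the template tag and use BeautifulSoup with this snippet to whitelist certain tags.

However, I would like to know how to avoid potentially malicious attributes like the following?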

<img src="puppy.png" onload="(function(){/* do bad stuff */}());" />


Update: please note that the snippet linked above has some XSS vulnerabilities, as mentioned here
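A rough sketch of an attribute whitelist with BeautifulSoup, offered as an illustration of the idea and not as a complete XSS defence; a maintained sanitizing library is likely the safer choice for production. The names `ALLOWED_ATTRS`, `SAFE_SCHEMES`, `URL_ATTRS`, and `strip_unsafe_attributes` are hypothetical, and the sketch only filters attributes; it does not whitelist tags.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical policy: attributes allowed per tag; everything else is dropped.
ALLOWED_ATTRS = {"img": {"src", "alt"}, "a": {"href"}}
SAFE_SCHEMES = {"", "http", "https"}   # "" covers relative URLs
URL_ATTRS = {"src", "href"}


def strip_unsafe_attributes(html):
    """Remove attributes that are not explicitly allowed for their tag."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        allowed = ALLOWED_ATTRS.get(tag.name, set())
        cleaned = {}
        for name, value in tag.attrs.items():
            if name not in allowed:
                continue  # drops onload, onclick, style, ...
            if name in URL_ATTRS and urlparse(value).scheme.lower() not in SAFE_SCHEMES:
                continue  # drops javascript: and similar schemes
            cleaned[name] = value
        tag.attrs = cleaned
    return str(soup)


cleaned_img = strip_unsafe_attributes(
    '<img src="puppy.png" onload="(function(){/* do bad stuff */}());" />'
)
```

Allowing only named attributes, and dropping everything else, tends to be easier to reason about than trying to list every dangerous attribute.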

django beautifulsoup django-templates

Jan*_*and

2017-05-23

1
vote

1
answer

185
views

How do I use requests.post() with proxy authentication in Python?

from bs4 import BeautifulSoup
import requests
from requests.auth import HTTPProxyAuth

url = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2" 
proxies = {"http":"xxx.xxx.x.xxx: port"}
auth = HTTPProxyAuth("username", "password")
r = requests.get(url, proxies=proxies, auth=auth)
soup = BeautifulSoup(r.text,"html.parser") 
viewstate_element = soup.find(id = "__VIEWSTATE").attrs 
viewstate = viewstate_element["value"]
eventvalidation_element = soup.find(id="__EVENTVALIDATION").attrs
eventvalidation = eventvalidation_element["value"]


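# Form fields, including the hidden ASP.NET state values scraped above.
data = {'AirportList': "BOS", 'CarrierList': "VX", 'Submit': 'Submit', "__EVENTTARGET": "", "__EVENTARGUMENT": "", "__EVENTVALIDATION": eventvalidation, "__VIEWSTATE": viewstate}
r = requests.post(url, proxies, auth, data)
print(r)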


This code works fine when I use requests.get(url, proxies=proxies, auth=auth), but what should I do when some data has to be sent through requests.post() with proxy authentication?
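A minimal sketch, assuming the same proxy and credentials work for POST: `requests.post` takes `data`, `proxies`, and `auth` as keyword arguments, just like the working `requests.get` call. Passing them positionally, as in `requests.post(url, proxies, auth, data)`, does not match the function's signature. The proxy address and the helper name `post_form` are hypothetical; the last lines build the request without sending it, so the form body and proxy header can be inspected offline.

```python
import requests
from requests.auth import HTTPProxyAuth

url = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2"
proxies = {"http": "http://proxy.example.com:8080"}  # hypothetical proxy address
auth = HTTPProxyAuth("username", "password")
data = {"AirportList": "BOS", "CarrierList": "VX", "Submit": "Submit"}


def post_form(url, data, proxies, auth):
    """Send the form through the proxy; every option is passed by keyword."""
    return requests.post(url, data=data, proxies=proxies, auth=auth)


# Build (but do not send) the request to see what would go over the wire.
prepared = requests.Request("POST", url, data=data, auth=auth).prepare()
form_body = prepared.body
proxy_header = prepared.headers["Proxy-Authorization"]
```

In the real script, `data` would also carry the `__VIEWSTATE` and `__EVENTVALIDATION` values scraped from the initial GET response.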

python proxy http-get beautifulsoup http-post

ken*_*way

2015-06-07

1
vote

1
answer

10k
views

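What is needed to get BeautifulSoup4 + lxml working with cx_freeze?

Summary:

I have a wxPython/bs4 application that I am building into an exe with cx_freeze.

The build succeeds without errors, but trying to run the EXE results in a FeatureNotFound error from BeautifulSoup4. It complains that I do not have the lxml library installed.

I have stripped the program down to its minimal state and still get the error.

Has anyone else successfully built a bs4 application with cx_freeze?

Please take a look at the details below and let me know any ideas you may have.

Thanks,

Details

Full error traceback:

I have reduced the application to its most basic state and still get the error. I get the same error on Python 3.4 as well.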

Traceback (most recent call last):
  File "C:\WinPython27\python-2.6.7\lib\site-packages\cx_Freeze\initscripts\Console.py", line 27, in <module>
    exec(code, m.__dict__)
  File "test.py", line 6, in <module>
  File "C:\WinPython27\python-2.6.7\lib\site-packages\bs4\__init__.py", line 152, in __init__
    % ",".join(features))
FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?


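What I have already tried:

I found some people saying I need to include lxml and its dependencies in the build script: http://sourceforge.net/p/cx-freeze/mailman/message/27973651/ (sorry for the SF link). I tried this, but still no luck.

Commenting out the line soup = BeautifulSoup("<tag>value</tag>", 'xml') makes the error go away.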
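A hedged sketch of two things that may help narrow this down. The first part is a small runtime check that can be run both from source and inside the frozen executable to confirm whether the 'xml' tree builder (which depends on lxml) is visible to bs4. The second part shows `build_exe` options that name lxml explicitly; whether this particular module list resolves the error is an assumption, not something the question confirms. The names `xml_parser_available` and `build_exe_options` are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def xml_parser_available():
    """Return True if bs4 can find a tree builder for the 'xml' feature."""
    try:
        BeautifulSoup("<tag>value</tag>", "xml")
    except FeatureNotFound:
        return False
    return True


# Assumed cx_Freeze build options: "packages" and "includes" ask the freezer
# to bundle modules it may not detect on its own (lxml is a compiled extension).
build_exe_options = {
    "packages": ["bs4", "lxml"],
    "includes": ["lxml.etree", "lxml._elementpath"],
}

# In setup.py these would be passed along the lines of:
#   setup(..., options={"build_exe": build_exe_options},
#         executables=[Executable("test.py")])
```

If `xml_parser_available()` is true when run from source and false in the frozen build, the problem is likely in what the freezer bundled and not in the installed libraries.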

Versions and files:

Versions:

lxml 3.4.4
BeautifulSoup4 4.3.2
Python 2.7.6 (32-bit) and Python …

python lxml wxpython beautifulsoup cx-freeze

dth*_*hor


1
vote

1
answer

905
views

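BeautifulSoup: how to extract text after a tag

I don't know how to reach the following paragraph with BeautifulSoup, or how to extract the specific text I want, as I am new to Python and BS4.

My HTML is as follows: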

<div class="inner-content">
  <div class="bred"></div>
  <div class="clrbth"></div>
  <h1></h1>
  <h4></h4>
  ...
  ...
  ...
  <p></p>
  <p></p>
  <p>

<!--This text I don't want -->

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
    <br></br>


<!-- The text I want to extract using BeautifulSoup-->

    It is a long established …
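A minimal sketch of one way to reach the text that follows the `<br>`, under the assumption that the wanted text is a bare string sitting after the `<br>` inside the last `<p>` of `inner-content`. The sample is a shortened, closed-off version of the markup above (with `<br/>` in place of `<br></br>`), so the real page may need adjustments.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Shortened, well-formed stand-in for the HTML in the question.
html = """<div class="inner-content">
  <p></p>
  <p>
    <!--This text I don't want -->
    Lorem Ipsum is simply dummy text.
    <br/>
    <!-- The text I want to extract using BeautifulSoup-->
    It is a long established fact.
  </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
last_p = soup.find("div", class_="inner-content").find_all("p")[-1]
br = last_p.find("br")

wanted = None
# Walk the nodes after <br>, skipping comments and whitespace-only strings.
for node in br.next_siblings:
    if isinstance(node, Comment):
        continue
    if isinstance(node, NavigableString) and node.strip():
        wanted = node.strip()
        break
```

`Comment` is checked first because it is a subclass of `NavigableString`; otherwise the HTML comment itself would be picked up as the text.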


html python beautifulsoup

Rah*_*ava


1
vote

1
answer

5037
views