Tag: beautifulsoup

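Beautiful Soup cannot find any table tags in the given HTML file, although it can find other tags

I am trying to parse the given HTML file to find all tables. It actually comes from an Android API diff report.

Here is the Python code; I pasted all of the content into the script by hand: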

from bs4 import BeautifulSoup 

input='''
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "//www.w3.org/TR/html4/strict.dtd">
<HTML style="overflow:auto;">
<HEAD>
<meta name="generator" content="JDiff v1.1.0">
<!-- Generated by the JDiff Javadoc doclet -->
<!-- (http://www.jdiff.org) -->
<meta name="description" content="JDiff is a Javadoc doclet which generates an HTML report of all the packages, classes, constructors, methods, and fields which have been removed, added or changed in any way, including their documentation, when two APIs are compared.">
<meta name="keywords" content="diff, jdiff, javadiff, java diff, java difference, …


python beautifulsoup

api*_*ang

2014 02-18

0
votes

1
answers

111
views

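Searching HTML text in Python

I use urllib2 to fetch a web page, and I need to find a specific value in the returned data.

Is it best to use Beautiful Soup with its find method, or to search the data with a regular expression?

Below is a very basic example of the text returned by the request: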

<html>
<body>
<table> 
   <tbody> 
      <tr>
         <td>
            <div id="123" class="services">
               <table>
                  <tbody>
                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" …
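One possible approach with Beautiful Soup is sketched below. It assumes Beautiful Soup 4 (`bs4`) and uses a shortened, hypothetical copy of the markup above; the search term "Example" is also only an illustration. It is not the only way to do this, but a parser is usually less fragile than a regular expression on nested tables.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modeled on the markup in the question.
html = """
<html><body>
<div id="123" class="services">
  <table><tbody>
    <tr>
      <td class="style8"> Example BLAB BLAB BLAB </td>
      <td class="style8"> BLAB BLAB BLAB </td>
    </tr>
  </tbody></table>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container by its id, then collect the text of each cell inside it.
services = soup.find("div", id="123")
cells = [td.get_text(strip=True) for td in services.find_all("td", class_="style8")]

# Keep only the cells that contain the value being searched for.
matches = [text for text in cells if "Example" in text]
print(matches)
```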


python regex beautifulsoup

Cia*_*ran

2014 02-19

0
votes

1
answers

79
views

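Python Beautiful Soup: 'ResultSet' object has no attribute 'get'

I am trying to grab some links from a website and, after cleaning them up, write them to a file. The links on the site look like this: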

<a href="javascript:changeChannel('http://dr01-lh.akamaihd.net/i/dr01_0@147054/index_1700_av-b.m3u8', 20);">DR1</a><br>
<a href="javascript:changeChannel('http://dr02-lh.akamaihd.net/i/dr02_0@147055/index_1700_av-b.m3u8', 21);">DR2</a><br>
<a href="javascript:changeChannel('http://dr03-lh.akamaihd.net/i/dr03_0@147056/index_1700_av-b.m3u8', 701);">DR3</a><br>
<a href="javascript:changeChannel('http://dr06-lh.akamaihd.net/i/dr06_0@147059/index_1700_av-b.m3u8', 31);">DR Ultra</a><br>
<a href="javascript:changeChannel('http://dr04-lh.akamaihd.net/i/dr04_0@147057/index_1700_av-b.m3u8', 38);">DR K</a><br>
<a href="javascript:changeChannel('http://dr05-lh.akamaihd.net/i/dr05_0@147058/index_1700_av-b.m3u8', 50);">DR Ramasjang</a><br>


I can grab them with this:

links = soup.findAll(href=re.compile("javascript"))


which gives me this output:

[<a href="javascript:changeChannel('http://dr01-lh.akamaihd.net/i/dr01_0@147054/index_1700_av-b.m3u8', 20);">DR1</a>, <a href="javascript:changeChannel('http://dr02-lh.akamaihd.net/i/dr02_0@147055/index_1700_av-b.m3u8', 21);">DR2</a>, <a href="javascript:changeChannel('http://dr03-lh.akamaihd.net/i/dr03_0@147056/index_1700_av-b.m3u8', 701);">DR3</a>, <a href="javascript:changeChannel('http://dr06-lh.akamaihd.net/i/dr06_0@147059/index_1700_av-b.m3u8', 31);">DR Ultra</a>, <a href="javascript:changeChannel('http://dr04-lh.akamaihd.net/i/dr04_0@147057/index_1700_av-b.m3u8', 38);">DR K</a>, <a href="javascript:changeChannel('http://dr05-lh.akamaihd.net/i/dr05_0@147058/index_1700_av-b.m3u8', 50);">DR Ramasjang</a>]


Now I want to clean this up so that I only get the http:// part between the single quotes, and this is where it goes wrong.

I tried

fullink = links.get('href')


which gives me the error:

'ResultSet' object has no attribute 'get'


So how do I get the links out of it?
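The error arises because `findAll` returns a `ResultSet`, which behaves like a list of tags, while `get` is a method of an individual tag. A minimal sketch of one way around it, assuming Beautiful Soup 4 and using two of the links above as sample input: loop over the result and call `get('href')` on each element. The regular expression `url_pattern` is an assumption about the shape of the `href` values shown in the question.

```python
import re
from bs4 import BeautifulSoup

# Two of the links from the question, used here as sample input.
html = """
<a href="javascript:changeChannel('http://dr01-lh.akamaihd.net/i/dr01_0@147054/index_1700_av-b.m3u8', 20);">DR1</a><br>
<a href="javascript:changeChannel('http://dr02-lh.akamaihd.net/i/dr02_0@147055/index_1700_av-b.m3u8', 21);">DR2</a><br>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all(href=re.compile("javascript"))

# Assumed pattern: the URL is the single-quoted string that starts with http://
url_pattern = re.compile(r"'(http://[^']+)'")

urls = []
for link in links:  # each element is a Tag, which does have .get()
    match = url_pattern.search(link.get("href"))
    if match:
        urls.append(match.group(1))

print(urls)
```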

python beautifulsoup

use*_*151

2014 03-20

0
votes

1
answers

2689
views

Beautifulsoup not working on a specific website

I am trying to parse this website and, for reasons I cannot understand, nothing happens.

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
response = urllib2.urlopen(url).read()
doc = BeautifulSoup(response)
divs = doc.findAll('div')
print len(divs)  # prints 0


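The website is a real estate listings site for Rio de Janeiro, Brazil. I cannot find anything in the HTML source that would prevent Beautifulsoup from working. Is it the size?

I am using Enthought Canopy Python 2.7.6, IPython Notebook 2.0, and Beautifulsoup 4.3.2.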
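The cause is not established in the question, but one thing worth checking is the parser: Beautiful Soup delegates to an underlying parser, and different parsers can build different trees from the same malformed markup. The sketch below (Python 3, Beautiful Soup 4) compares the number of `div` tags each available parser finds. The helper name `count_divs_per_parser` and the sample markup are hypothetical, and `lxml` and `html5lib` are optional packages that are skipped if not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_divs_per_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <div> tags found} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The optional parser is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("div"))
    return counts


# Hypothetical stand-in for the downloaded page.
sample = "<html><body><div><div>listing</div></div></body></html>"
print(count_divs_per_parser(sample))
```

If the counts differ widely on the real page, passing the better-behaved parser name as the second argument to `BeautifulSoup` may be enough.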

html python beautifulsoup html-parsing python-2.7

sro*_*uex

2014 09-20

0
votes

1
answers

407
views

findAll not working properly

This is how I am trying to get all the links:

soup.find("div", attrs={"class": "vl-article-title"}).find("h3").find("span").find("a")


This finds only the first one, but as I said, I need all of them.

Why does this not work:

soup.findAll("div", attrs={"class": "vl-article-title"}).find("h3").find("span").find("a")


I get an error:

'ResultSet' object has no attribute 'find'
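`findAll` returns a list-like `ResultSet`, so `find` has to be called on each element rather than on the set itself. A minimal sketch, assuming Beautiful Soup 4 and hypothetical sample markup shaped like the structure described above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the same nesting as in the question.
html = """
<div class="vl-article-title"><h3><span><a href="/a">First</a></span></h3></div>
<div class="vl-article-title"><h3><span><a href="/b">Second</a></span></h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Iterate over every matching div and look up the link inside each one.
links = []
for div in soup.find_all("div", attrs={"class": "vl-article-title"}):
    a = div.find("a")
    if a is not None:
        links.append(a)

hrefs = [a["href"] for a in links]
print(hrefs)

# A CSS selector expresses the same nesting in a single call.
selected = [a["href"] for a in soup.select("div.vl-article-title h3 span a")]
print(selected)
```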


python beautifulsoup

Irm*_*nis

lucky-day

0
votes

1
answers

158
views

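Getting images from a URL using BeautifulSoup

I am trying to get the main images from a Wikipedia page, not the thumbnails or other GIFs, using the following code. However, the length of "imgs" is 0. Any suggestions on how to fix it?

Code: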

import urllib
import urllib2
from bs4 import BeautifulSoup
import os

html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")

soup = BeautifulSoup(html)

imgs = soup.findAll("div",{"class":"image"})


Also, it would be great if someone could explain in detail how to use findAll based on the elements seen in the page source.
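`findAll(name, attrs)` matches on the tag name first and then on the attributes, so `findAll("div", {"class": "image"})` returns nothing if the `image` class sits on a different tag. The sketch below assumes Beautiful Soup 4 and hypothetical markup in which the class is on an `<a>` element wrapping the `<img>`; the real page source should be inspected to confirm which tag carries the class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "image" class is on the <a>, not on a <div>.
html = """
<div id="content">
  <a class="image" href="/wiki/File:Example.jpg"><img src="//upload.example.org/thumb/Example.jpg" width="120"></a>
  <img src="//upload.example.org/icon.gif" width="16">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Match <a class="image"> elements, then read the src of the <img> inside each.
image_links = soup.find_all("a", class_="image")
sources = [a.img["src"] for a in image_links if a.img is not None]
print(sources)
```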

python url urllib beautifulsoup web-scraping

Lon*_*oul

2014 06-23

0
votes

1
answers

6407
views

Extract the URL from style: background-url: with beautifulsoup and without regex?

I have:

<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"


I want to get the URL, but I cannot do it without using a regular expression. Is it possible?

So far, my solution using a regular expression is:

url = re.findall(r"\('(.*?)'\)", soup['style'])[0]
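Beautiful Soup exposes the `style` attribute only as a plain string, so some string handling is still needed, but it does not have to be a regular expression. A minimal sketch with `str.partition`, assuming Beautiful Soup 4 and a style value that contains a single `url(...)` with no closing parenthesis inside the URL (the closing `>` and `</div>` are added here so the sample parses cleanly):

```python
from bs4 import BeautifulSoup

html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"></div>"""
soup = BeautifulSoup(html, "html.parser")
style = soup.find("div", class_="image")["style"]

# Take the text between "url(" and the next ")", then drop quotes and spaces.
inside = style.partition("url(")[2].partition(")")[0]
url = inside.strip(" '\"")
print(url)
```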


python string beautifulsoup web-scraping

Gra*_*rus

2014 07-27

0
votes

1
answers

5324
views

'NoneType' object is not callable when using 'find_all' in BeautifulSoup

Here is my code using find_all; it does work with .find(), though:

import requests
from BeautifulSoup import BeautifulSoup

r = requests.get(URL_DEFINED)
print r.status_code

soup = BeautifulSoup(r.text)
print soup.find_all('ul')


This is what I get:

Traceback (most recent call last):


File "scraper.py", line 19, in <module>
    print soup.find_all('ul')
TypeError: 'NoneType' object is not callable
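A likely cause is the import: `from BeautifulSoup import BeautifulSoup` loads Beautiful Soup 3, which provides `findAll` but not `find_all`. In that version an unknown attribute is treated as a search for a child tag of that name, so `soup.find_all` evaluates to `None`, and calling it raises the error. A minimal sketch of the same lookup with Beautiful Soup 4 (the `bs4` package), written for Python 3 and using hypothetical sample markup in place of the downloaded page:

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4, not the older BeautifulSoup module

# Hypothetical stand-in for r.text from the question.
html = "<html><body><ul><li>one</li></ul><ul><li>two</li></ul></body></html>"

soup = BeautifulSoup(html, "html.parser")
lists = soup.find_all("ul")  # find_all exists in bs4
print(len(lists))
```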


python beautifulsoup

Geo*_*Jor

2014 10-10

0
votes

1
answers

5040
views

BeautifulSoup4 - python: how do I merge two bs4.element.ResultSet objects and get a single list?

I have two

bs4.element.ResultSet


objects.

Let's call them

rs1
rs2


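I want a single result set (let's call it rs) containing all the results from both result sets.

I also need to figure out:

Whether an array, a list, or a dictionary is better for going through a (potentially) large list of results, since each element will be an object made up of 7 attributes of different types
How to convert the merged result set into an array/list/dictionary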
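Since a `ResultSet` is a subclass of `list`, one simple option is to convert both to lists and concatenate them. The sketch below assumes Beautiful Soup 4; the sample markup and the two attribute names in `records` are hypothetical stand-ins for the 7 attributes mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used to produce two separate result sets.
html = "<p class='a'>1</p><p class='b'>2</p><p class='a'>3</p>"
soup = BeautifulSoup(html, "html.parser")
rs1 = soup.find_all("p", class_="a")
rs2 = soup.find_all("p", class_="b")

# Concatenate into one ordinary list: all of rs1 first, then all of rs2.
rs = list(rs1) + list(rs2)

# Convert each tag into a dictionary of the attributes of interest.
records = [{"text": tag.get_text(), "classes": tag.get("class")} for tag in rs]
print(records)
```

A list of dictionaries (or of small objects) is usually convenient for sequential browsing; a dictionary keyed by some unique attribute only pays off if elements need to be looked up by that key.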

python beautifulsoup python-3.x

dra*_*mnl

2014 11-04

0
votes

1
answers

2228
views

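Python and BeautifulSoup encoding issue from UTF-8

I am new to Python and am currently writing an application that scrapes data from the web. It is mostly done; only a small encoding problem remains. The site is encoded in ISO-8859-1, but when I try html.decode('iso-8859-1'), it does nothing. If you run the program with 50000 and 50126 as the PLZs (postal codes), you will see what I mean in the output. It would be great if someone could help me.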

import urllib.request
import time
import csv
import operator

from bs4 import BeautifulSoup


#Performs a HTTP-'POST' request, passes it to BeautifulSoup and returns the result
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    soup = BeautifulSoup(requestResult)
    return soup


#Returns all the result links from the given search parameters
def getLinksFromSearch(plz_von, plz_bis):
    database = []
    links = []

    #The search parameters
    params = {
    'name_ff': '',
    'strasse_ff': '',
    'plz_ff': plz_von,
    'plz_ff2': plz_bis,
    'ort_ff': '',
    'bundesland_ff': '', …
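One thing to try, sketched here under assumptions: when Beautiful Soup receives raw bytes it guesses the encoding, and the `from_encoding` argument overrides that guess. The sample bytes below are a hypothetical stand-in for the response body of the site in question, which is not reproduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical response body, encoded as ISO-8859-1 like the site in the question.
raw = "<html><body><p>Köln, Straße</p></body></html>".encode("iso-8859-1")

# Tell Beautiful Soup which encoding the bytes use instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
text = soup.p.get_text()

print(text)
print(soup.original_encoding)  # the encoding Beautiful Soup ended up using
```

If the text is correct at this stage but still looks wrong in the output, the problem may instead lie in how the result is written out, for example the encoding used when opening the CSV file.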


python beautifulsoup

Fre*_*nce

lucky-day

0
votes

1
answers

1076
views