Tag: beautifulsoup

Using Python BeautifulSoup to scrape elements that have no id or class from a web page

I know how to scrape data from a web page when the elements have an id or a class.

For example, here soup is a BeautifulSoup object.

for item in soup.findAll('a',{"class":"class_name"}):
    title = item.string
    print(title+"\n")


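How do we do this if the elements have no id or class? For example, a paragraph element without an id or class.

Or, in the worse case, what if we only need to scrape some plain text like the following?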

<body>
<p>YO!</p>
hello world!!
</body>


For example, how do I print only the text hello world!! from the page source above? It has no id or class.
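
A minimal sketch of one way to approach both cases, assuming the page source is exactly the snippet above and that the built-in html.parser is used. Tags without an id or class can still be matched by tag name, and bare text is stored as a NavigableString child of its parent tag. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: the page source is exactly the snippet from the question.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Elements without id or class can be matched by tag name alone.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Bare text sits directly under <body> as NavigableString nodes;
# keep only the ones that are not pure whitespace.
loose_text = [
    child.strip()
    for child in soup.body.children
    if isinstance(child, NavigableString) and child.strip()
]

print(paragraphs)
print(loose_text)
```

Iterating over soup.body.children looks only at direct children, so text nested inside the p tag is not picked up a second time.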

python beautifulsoup

Rav*_*310

2015 12-19

1
Score

1
Answers

9371
Views

How can I get the href of a tag in BeautifulSoup?

I am using Python's BeautifulSoup.

<div class="test1">
   <a href="www.google.com" blur blur~> text </a>
</div>

<div class="test2">
   <a href="www.stackoverflow.com" blur blur~> text </a>
</div>

<div class="test3">
   <a href="www.msn.com" blur blur~> text </a>
</div>

<div class="test4">
   <a href="www.naver.com" blur blur~> text </a>
</div>

<div class="test5">
   <a href="www.ios.com" blur blur~> text </a>
</div>


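In a case like this, I want to get one specific href. For example, when I need href='www.ios.com', how can I get it using the class name?

The HTML file has more than 1000 "a" elements, and the URLs they contain are dynamic.

How can I get this? Please answer me TT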
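
One possible approach, sketched under the assumption that each link sits inside a div whose class is known in advance: find the div by class, then read the href attribute of the anchor inside it. The helper name href_for_class is hypothetical, and the markup is a trimmed copy of the question's HTML without the "blur" placeholders.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the question's markup.
html = """
<div class="test4"><a href="www.naver.com"> text </a></div>
<div class="test5"><a href="www.ios.com"> text </a></div>
"""
soup = BeautifulSoup(html, "html.parser")


def href_for_class(soup, class_name):
    """Return the href of the first <a> inside <div class=class_name>, or None."""
    div = soup.find("div", class_=class_name)
    if div is None:
        return None
    link = div.find("a", href=True)
    return link["href"] if link else None


print(href_for_class(soup, "test5"))
```

Because the lookup goes through the div's class and not through the URL, it keeps working when the URL itself changes.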

python parsing beautifulsoup

Kye*_*Kim

2016 01-22

1
Score

1
Answers

10k
Views

Python: extracting the text after </span> and before <br/>

This is the HTML I want to process:

<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>


This is my Python code:

from bs4 import BeautifulSoup

# html holds the markup shown above
record = []
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all('span')
for span in spans:
    record.append(span.text)


What I end up with is:

Countries: Language:


The result is missing some important information: "USA" and "English". How can I get that text?
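
The values are not inside the span tags; they are text nodes that follow each span. A minimal sketch, assuming the markup is exactly the two lines above: read each span's next_sibling and strip the surrounding whitespace. Storing the result in a dict is a choice made here for illustration; the question uses a list.

```python
from bs4 import BeautifulSoup

html = """<span class="pl">Countries:</span> USA <br/>
<span class="pl">Language:</span> English <br/>"""
soup = BeautifulSoup(html, "html.parser")

record = {}
for span in soup.find_all("span", class_="pl"):
    label = span.get_text(strip=True).rstrip(":")
    value = span.next_sibling  # the text node right after </span>
    # Guard against a span that is followed by a tag or by nothing at all.
    record[label] = value.strip() if isinstance(value, str) else ""

print(record)
```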

html python beautifulsoup

Ken*_*awa


1
Score

1
Answers

1769
Views

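Getting an HTTPError when trying to parse information from a web page

I have just started learning Python and ran into this problem. I tried to parse a price from Amazon and print it to the console.

Here is my code: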

import requests, bs4

def getAmazonPrice(productUrl):
    res = requests.get(productUrl)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('#addToCart > a > h5 > div > div.a-column.a-span7.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price')
    return elems[0].text.strip()


price = getAmazonPrice('http://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/ref=sr_1_2?ie=UTF8&qid=1460386052&sr=8-2&keywords=python+book')
print('The price is ' + price)


Error message:

Traceback (most recent call last):
  File "D:/Code/Python/Basic/webBrowser-Module.py", line 37, in <module>
    price = getAmazonPrice('http://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/ref=sr_1_2?ie=UTF8&qid=1460386052&sr=8-2&keywords=python+book')
  File "D:/Code/Python/Basic/webBrowser-Module.py", line 30, in getAmazonPrice
    res.raise_for_status()
  File "C:\Python33\lib\requests\models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/ref=sr_1_2?ie=UTF8&qid=1460386052&sr=8-2&keywords=python+book

Process finished with exit code 1
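
The traceback shows that raise_for_status() raised before any parsing happened, so the selector is not involved; the server itself answered with status 503. A commonly reported cause when scraping large sites is that the default requests User-Agent gets rejected, but that is an assumption here, not something the traceback proves. The sketch below sends a browser-like header and reports the status code on failure. The name fetch_html, the header string, and the injectable get parameter are all illustrative.

```python
import requests

# Assumption: a browser-like User-Agent; the exact string is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def fetch_html(url, get=requests.get):
    """Return (html, status); html is None when the server replies with an error status."""
    res = get(url, headers=HEADERS, timeout=10)
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        # exc.response is the response object that carried the error status.
        return None, exc.response.status_code
    return res.text, res.status_code
```

A caller can then check whether html is None before handing it to BeautifulSoup, and decide whether to retry or give up based on the status code.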

python beautifulsoup request

Vik*_*tor


1
Score

1
Answers

3432
Views

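'NoneType' object is not callable in BeautifulSoup findall()

I am very new to the wonderful world of Python. The scraper below produces an "object is not callable" error, and I really don't understand why. Any help is much appreciated.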

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.maxxim.de/lte-mini-sms1?maxxim=7hs6q1jfl95fip6qumcum4rfh4")
bsObj = BeautifulSoup(html,"html.parser")
nameList = bsObj.findall("h2")
for name in nameList:
     print (name.get_text())
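
A likely explanation, sketched on a small inline document instead of the live URL: in bs4, accessing an unknown attribute on a soup or tag object is treated as a search for a child tag with that name. bsObj.findall therefore looks for a <findall> tag, finds none, and evaluates to None, and calling None raises the error. The method is spelled find_all (findAll is the older alias).

```python
from bs4 import BeautifulSoup

# Assumption: stand-in markup; the question fetches a live page.
html = "<h2>First</h2><h2>Second</h2>"
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute: bs4 looks for a <findall> tag and returns None.
print(soup.findall)

# The search method is find_all.
names = [h2.get_text() for h2 in soup.find_all("h2")]
print(names)
```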


beautifulsoup python-3.x

Dan*_*l P

2016 09-01

1
Score

1
Answers

2856
Views

Extracting a URL from a td element with Beautiful Soup

I am trying to extract URLs from an HTML table. The URL is inside an anchor tag within a td cell. The HTML looks like this:

<table width="100%" border="0" cellspacing="0" cellpadding="0" name="TabName" id="Tab" class="common-table">
    <tr>
        <td>Acme Company</a><br/><span class="f-10">07-11-2016</span></td>
        <td><span>Vendor</span><br>
        <td><a href="http://URL" title="Report Details">Details</a></td>
    </tr>
</table>


This is the Python code I wrote:

from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://SourceURL')
soup = BeautifulSoup(r.content,"html.parser")
# Find table
table = soup.find("table",{"class": "common-table"})
# Find all tr rows
tr = table.find_all("tr")

for each_tr in tr:
    td = each_tr.find_all('td')
    # In each tr row find each td cell
    for each_td in td:
        print(each_td.text) …
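
each_td.text only returns the visible text of a cell; the URL lives in the href attribute of the anchor. A minimal sketch, assuming a tidied copy of the table above (the stray closing tag removed and the second cell closed) and parsing an inline string so it runs without the network:

```python
from bs4 import BeautifulSoup

# Assumption: a cleaned-up copy of the question's table.
html = """
<table class="common-table">
    <tr>
        <td>Acme Company<br/><span class="f-10">07-11-2016</span></td>
        <td><span>Vendor</span><br/></td>
        <td><a href="http://URL" title="Report Details">Details</a></td>
    </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "common-table"})

# href=True keeps only anchors that actually carry an href attribute.
urls = [a["href"] for a in table.find_all("a", href=True)]
print(urls)
```

Searching the whole table for anchors avoids walking rows and cells by hand; if the row is needed as well, the same find_all call can be made on each tr.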


python beautifulsoup

Ram*_*Ram


1
Score

1
Answers

1305
Views

Python - Unicode and double backslashes

I scraped a web page with BeautifulSoup. The output is fine, except that part of the list looks like this after getting the text:

list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']


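My question now is how to get rid of these double backslashes, or replace them with the special characters they stand for.

If I print the first element of the example list, the output looks like this: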

print list[0]
that\u2019s


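I have read a lot of other questions and threads on this topic, but I only ended up more confused, since I am a beginner when it comes to unicode, encoding, and decoding.

I hope someone can help me with this.

Thanks! MG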
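
The strings contain a literal backslash followed by u2019, not the character itself, which is why printing shows \u2019. One way to convert them, sketched in Python 3 (the question's code is Python 2) and assuming the input is pure ASCII: encode to bytes and decode with the unicode_escape codec. The helper name unescape is illustrative.

```python
# Python 3 sketch; a shortened copy of the question's list.
words = ['that\\u2019s', 'it\\u2019ll', '\\u2013']


def unescape(text):
    """Turn literal backslash-u escape sequences into the characters they name.

    Assumes ASCII-only input; encode() raises UnicodeEncodeError otherwise.
    """
    return text.encode("ascii").decode("unicode_escape")


decoded = [unescape(w) for w in words]
print(decoded)
```

If the escapes come from text that was originally JSON, parsing it with the json module at the source is usually the cleaner fix than unescaping afterwards.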

python unicode beautifulsoup backslash unicode-escapes

mgr*_*ber


1
Score

1
Answers

6886
Views

Removing `\n` from a list

I have a list that contains data scraped from a website. The list looks like this:

list1 = ['\nJob Description\n\nDESCRIPTION: Interacts with users and technical team members to analyze requirements and develop technical design specifications.  Troubleshoot complex issues and make recommendations to improve efficiency and accuracy. Interpret complex data, analyze results using statistical techniques and provide ongoing reports. Identify, analyze, and interpret trends or patterns in complex data sets. Filter and "clean data, review reports, and performance indicators to locate and correct code problems. Work closely with management to prioritize business and information …
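
A minimal sketch of one common way to do this, using a shortened, made-up stand-in for the list above: str.split() with no arguments splits on any run of whitespace, including \n, so joining the pieces with a single space removes the newlines and collapses repeated spaces.

```python
# Assumption: shortened sample data in the same shape as the question's list.
list1 = [
    "\nJob Description\n\nDESCRIPTION: Interacts with users",
    "Second\nitem\n",
]

# split() drops every run of whitespace; join() puts single spaces back.
cleaned = [" ".join(item.split()) for item in list1]
print(cleaned)
```

If only the newlines should go and other spacing must stay untouched, item.replace("\n", " ") is the narrower alternative.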


python beautifulsoup web-scraping python-3.x

Moh*_*han

2017 04-09

1
Score

1
Answers

30k
Views

Getting all the links of a website

Hi, I want to create a mini crawler without using Scrapy.

I came up with something like this:

import requests
from bs4 import BeautifulSoup

# url is the site's start page (not shown in the question)
response = requests.get(url)
homepage_link_list = []
soup = BeautifulSoup(response.content, 'lxml')
for link in soup.find_all("a"):
    if link.get("href"):
        homepage_link_list.append(link.get("href"))


link_list = []
for item in homepage_link_list:
    response = requests.get(item)
    soup = BeautifulSoup(response.content, 'lxml')
    for link in soup.find_all("a"):
        if link.get("href"):
            link_list.append(link.get("href"))


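The problem I have is that it only gets the links found on the pages linked from the home page. How can I make it get all the links from every page of the website?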
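
The code above is fixed at two levels because it has two hard-coded loops. A common alternative is a breadth-first crawl with a queue and a set of already seen URLs, sketched below. Everything named here (extract_links, crawl, the fetch callback, max_pages) is illustrative; fetch is passed in so the sketch can run without a network, and with requests it could be something like lambda u: requests.get(u, timeout=10).text.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(base_url, html):
    """Return the absolute URLs of all <a href> links in html, fragments removed."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative hrefs such as "/about" against the page URL.
        links.add(urljoin(base_url, a["href"]).split("#")[0])
    return links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to start_url's host.

    fetch(url) must return the page's HTML as a string, or None on failure.
    Returns the set of same-host URLs discovered.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            # Stay on the same host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

The seen set is what prevents endless loops when pages link back to each other, and max_pages caps the number of pages fetched on large sites.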

python beautifulsoup web-scraping python-requests

Bry*_*Bry


1
Score

1
Answers

4893
Views

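BeautifulSoup cannot find a table with a specific class

Essentially, I am trying to extract text from a table that has the class given below. I have already written the rest of the code that extracts the text from each row, so I don't need any help with that part. I can't seem to figure out why I am getting this error: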

"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?


The code is:

from bs4 import BeautifulSoup

import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}

url …
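
The error message describes the likely cause: find_all() returns a ResultSet, which behaves like a list of tags, and .find() exists on individual tags, not on the list. Since the question's code is truncated, the sketch below uses made-up markup and a hypothetical class name data to show the two usual fixes: loop over the ResultSet, or call find() when only the first match is needed.

```python
from bs4 import BeautifulSoup

# Assumption: made-up markup; "data" is a hypothetical class name.
html = (
    '<table class="data"><tr><td>1</td></tr></table>'
    '<table class="data"><tr><td>2</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet; calling tables.find(...) would raise AttributeError.
tables = soup.find_all("table", class_="data")

# Fix 1: iterate and call find() on each individual table.
cells = [table.find("td").get_text() for table in tables]

# Fix 2: use find() to get a single Tag (the first match) directly.
first_table = soup.find("table", class_="data")
first_cell = first_table.find("td").get_text()

print(cells, first_cell)
```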


python beautifulsoup

rah*_*f23

2017 06-19

1
Score

1
Answers

7776
Views