Tag: beautifulsoup

Beautiful Soup only extracts one tag, while all the other tags are visible in the HTML code

Trying to understand how web scraping works:

import requests
from bs4 import BeautifulSoup as soup
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")
items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
print(price.string)

However, when I run this, all that comes back is the last price on the site ($1,799.00). Why does it skip all the other h4 tags and only return the last one?

Any help would be greatly appreciated!

Let me know if you need more information.
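
For reference, a minimal sketch of the likely cause: print(price.string) sits outside the for loop, so only the value assigned on the last iteration is ever printed. Moving the print inside the loop emits one price per laptop:

for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
    # printing inside the loop shows every price, not just the last one
    print(price.string)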

python lxml beautifulsoup html-parsing web-scraping

2 votes · 1 answer · 97 views

How to bypass Cloudflare using Python

I can't scrape this website https://www.mentalhealthforum.net/; I get a 403 status code even though I have already tried every solution available on the internet. Cloudflare has h-captcha protection on it, which makes bypassing it more complicated.

Here is my code:

import cloudscraper

def scrape():
    baseurl = 'https://www.mentalhealthforum.net/'
    # api_key (the 2captcha key) is defined elsewhere in my project
    scraper = cloudscraper.create_scraper(delay=10,
                                          browser={
                                              'browser': 'chrome',
                                              'platform': 'android',
                                              'desktop': False
                                          },
                                          debug=True,
                                          captcha={'provider': '2captcha',
                                                   'api_key': api_key})
    response = scraper.get(baseurl)
    return response.status_code

print(scrape())

Output:

< GET / HTTP/1.1
< Host: www.mentalhealthforum.net
< User-Agent: Mozilla/5.0 (Linux; Android 4.3; SM-G710 Build/JLS36C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.111 Mobile Safari/537.36
< Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
< Accept-Language: en-US,en;q=0.9
< Accept-Encoding: gzip, deflate
<

> HTTP/1.1 403 Forbidden
> Date: Thu, 04 Aug 2022 …

python beautifulsoup web-scraping anti-bot cloudflare

2 votes · 1 answer · 20k views

How do I programmatically get the CSV link behind a JavaScript page?

I'm using Python, and I'm trying to get the link that the CSV comes from when I click the DATA V CSV button at the bottom of this page.

I tried BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# Find the link to the CSV file
csv_link = soup.find('a', string='DATA V CSV').get('href')

I also tried:

soup.find("button", {"id":"DATA V CSV"})

but it does not find the link behind DATA V CSV.
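
For reference, a hedged sketch of one common approach: if the link only appears after the page's JavaScript runs, requests and BeautifulSoup will never see it, and a Selenium-driven browser can render the page first. The locator below assumes the control is an anchor whose visible text is exactly DATA V CSV, which may not match the real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR')

# hypothetical locator: assumes an <a> element whose visible text is "DATA V CSV"
element = driver.find_element(By.LINK_TEXT, 'DATA V CSV')
print(element.get_attribute('href'))

driver.quit()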

python beautifulsoup web-scraping

2 votes · 1 answer · 86 views

Installing easy_install, not so easy

I'm trying to install easy_install so I can use BeautifulSoup... but I don't know what my PATH directory is... When I run easy_install BeautifulSoup... I get

error: Not a recognized archive type: C:\docume~1\tom\locals~1\temp\weasy_install-w6haxs\BeautifulSoup-3.2.1.tar.gz

I'm guessing this has something to do with PATH not being set in my environment variables... but I don't know what my path should be... Any help would be appreciated... I'm very new to all of this, so plain English rather than programmer-speak would be appreciated, lol.

python path beautifulsoup easy-install

1 vote · 1 answer · 4434 views

Scraping a site encoded as iso-8859-1 instead of utf-8: how do I store the correct Unicode in my database?

I want to scrape, with Python, a website that is full of horrible problems, one of which is the wrong encoding declared at the top:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

This is wrong, because the page contains text like:

Nell’ambito

instead of

Nell'ambito (note the ’ replacing the ')

If I understand correctly, this happens because utf-8 bytes (probably the database encoding) are being interpreted as iso-8859-1 bytes (forced by the charset in the meta tag). I found some initial explanation at this link: http://www.i18nqa.com/debug/utf8-debug.html

I'm using BeautifulSoup to navigate the page and Google App Engine's urlfetch to make the requests, but what I need is to understand the correct way to store, in my database, strings that have been fixed by re-encoding them so that the ’ becomes a ' again.
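
For reference, a minimal sketch of the usual round-trip repair, assuming the text really is UTF-8 that was decoded as iso-8859-1/cp1252. The â€™ example comes from the debug page linked above; the exact garbled characters depend on the site:

# hedged sketch: re-encode the mojibake with cp1252, then decode it as UTF-8
mojibake = u"Nellâ€™ambito"
fixed = mojibake.encode("cp1252").decode("utf-8")
print(fixed)    # Nell’ambito -- now a single U+2019 character, safe to store as UTF-8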

python unicode beautifulsoup utf-8

1 vote · 1 answer · 2211 views

Problem using json.loads() in Python

I'm trying to get data out of a nested JavaScript structure. I want to use json.loads() for this. However, I get an error saying "No JSON object could be decoded".

Here is the code I tried, followed by the JavaScript structure.

Code

import urllib2
from bs4 import BeautifulSoup

page_us = urllib2.urlopen('http://www.verizonwireless.com/smartphones-2.shtml')

soup_us = BeautifulSoup(page_us)
scripts_us = soup_us.findAll('script')

script = []
for s in scripts_us:
    if s.string and "$j('#module_1_Tile" in s.string:
        script.append(s.text.split('data')[1].replace("\n", "").replace("(", "").replace(")", "").replace(";", "").replace("\t", ""))

Data structure

script[1] = u'{"phones":{"id5986":{"id":"5986","rating":"stars_4","colorName":"White","colorCode":"#FFFFFF","capacity":"16 GB","price":"$149.99","fullPrice":"$599.99","addToCartQty":"0","image":"http://s7.vzw.com/is/image/VerizonWireless/Motorola%5Fdroid%5Frazr%5Fhd%5Fwhite?$device%2Dmed$","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","BAGX":"false","priceRange":"150","rating":"4","OOS":"","freeShipping":"freeOvernightShippingHTML","bagxGetPhone":"","badges":{"lteBadge","vzwExclusiveBadge","globalReadyBadge"},"vPrice":"$221.96","vFullPrice":"$671.96","vBundleName":"DROID RAZR HD by Motorola in White Bluetooth&reg Pack","vBundleImage":"http://s7.vzw.com/is/image/VerizonWireless/moto%5Fdroid%5Frazr%5Fhd%5Fwht%5Fbluetooth%5Fvirt%5Fbndl?$device%2Dmed$","vBundleDescription":"<ul><li>Bluetooth&reg Headset</li><li>Clear Hard Cover</li><li>Vehicle Charger</li></ul>"},"id5985":{"id":"5985","rating":"stars_4_5","colorName":"Black","colorCode":"#000000","capacity":"16 GB","price":"$149.99","fullPrice":"$599.99","addToCartQty":"0","image":"http://s7.vzw.com/is/image/VerizonWireless/Motorola%5Fdroid%5Frazr%5Fhd%5Fblack?$device%2Dmed$","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","BAGX":"false","priceRange":"150","rating":"4_5","OOS":"","freeShipping":"freeOvernightShippingHTML","bagxGetPhone":"","badges":{"lteBadge","vzwExclusiveBadge","globalReadyBadge"},"vPrice":"$221.96","vFullPrice":"$671.96","vBundleName":"DROID RAZR HD by Motorola Bluetooth&reg Pack","vBundleImage":"http://s7.vzw.com/is/image/VerizonWireless/moto%5Fdroid%5Frazr%5Fhd%5Fblk%5Fbluetooth%5Fvirt%5Fbndl?$device%2Dmed$","vBundleDescription":"<ul><li>Bluetooth&reg Headset</li><li>Silicone Cover</li><li>Vehicle Charger</li></ul>"}},"options":{"colorName":"Black","colorCode":"#000000","capacity":"16 GB"},"info":{"brand":"486","os":"10351","features":{502,569,501,568,247,318,85,503,497,431,458,11,49,150,44,17,141,145,165,20,58,15,24,172,186,184,159,187,185,199,156,249,157,189,142,168,211,13,188,239,14,167,321,41,25,357,443,441,442,444,459,418,416,12,413,5,61,7,446,504,362,573,202,522,"hasVB"},"priceRange":"150","phoneId":"id5985","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","bagxGetPhone":"","BAGX":"false"}}'

json.loads works fine on script[0], but gives the error above for script[1]. Please let me know what I'm missing here. Thanks.
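
For reference, one hedged observation: script[1] contains constructs such as "badges":{"lteBadge","vzwExclusiveBadge",...}, a brace-delimited list of bare values, which is not valid JSON, so json.loads would reject the string. A minimal illustration:

import json

broken = '{"badges": {"lteBadge", "vzwExclusiveBadge"}}'   # braces used as a bare list
try:
    json.loads(broken)
except ValueError as exc:    # JSON decoding errors subclass ValueError
    print(exc)

fixed = '{"badges": ["lteBadge", "vzwExclusiveBadge"]}'    # square brackets make it a JSON array
print(json.loads(fixed))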

javascript python json beautifulsoup python-2.7

1 vote · 1 answer · 157 views

Web scraping to build a news database

I'm creating a web scraper for different news outlets. I'm trying to build one for The Hindu newspaper.

I want to fetch the news from the various links listed in its archive. Let's say I want the news from the links listed for this day: http://www.thehindu.com/archive/web/2010/06/19/ , that is, June 19, 2010.

Now I have written the following lines of code:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext

But I can't get the desired result. I'm basically stuck. Can someone help me sort this out?
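
For reference, a hedged debugging sketch (the archive's markup is an assumption here): first check whether the li/data-section selector matches anything at all, and pull the text with get_text() instead of contents[0], which may be a nested tag rather than a string:

# hedged sketch, reusing the soup object from above
matches = soup.findAll('li', attrs={"data-section": "Business"})
print len(matches)    # 0 means the selector assumption does not fit this page

articletext = ""
for tag in matches:
    articletext += tag.get_text()    # flattens nested tags into plain text
print articletext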

python beautifulsoup python-2.7 python-3.x

1 vote · 1 answer · 1003 views

TypeError in Python - Beautiful Soup

I'm scraping this page http://www.crmz.com/Directory/Industry806.htm, and I should get all of the

  • #
  • Company Name
  • Country
  • State/Province

But there is an RSS link next to the company name, so I don't get the results and a TypeError is raised.

Here is my code:

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.crmz.com/Directory/Industry806.htm"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table", {"border":"0", "cellspacing":"1", "cellpadding":"2"})

rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print

Here is my output:

LRI$ python scrape.py

#| Company Name| Country| State/Province|
1.| 1300 Smiles Limited|

Traceback (most recent call last):
  File "scrape.py", …
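
For reference, a hedged guess at the failure: for the cell that holds only the RSS image link, td.find(text=True) returns None, and ''.join(None) raises the TypeError. Guarding against None lets the loop continue:

for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        node = td.find(text=True)    # None for cells with no text node (e.g. the RSS icon)
        text = ''.join(node) if node else ''
        print text + "|",
    print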

python beautifulsoup typeerror

1 vote · 1 answer · 243 views

Finding and replacing a string in HTML

Starting from this HTML code:

<p class="description" dir="ltr">Name is a fine man. <br></p>

I'm looking to replace "Name" using the following code:

target = soup.find_all(text="Name")
for v in target:
    v.replace_with('Id')

The output I want is:

<p class="description" dir="ltr">Id is a fine man. <br></p>

But when I run:

print target
[]

Why doesn't it find "Name"?

Thanks!
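
For reference, a hedged sketch of why the result is empty: text="Name" only matches text nodes whose entire content is exactly "Name", while the node here is "Name is a fine man. ". A substring match via a regular expression is one way around that:

import re
from bs4 import BeautifulSoup

html = '<p class="description" dir="ltr">Name is a fine man. <br></p>'
soup = BeautifulSoup(html, 'html.parser')

# match any text node that merely contains "Name", then rewrite it in place
for node in soup.find_all(text=re.compile('Name')):
    node.replace_with(node.replace('Name', 'Id'))

print(soup)    # the paragraph now reads "Id is a fine man."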

html python beautifulsoup python-2.7

1 vote · 1 answer · 3937 views

Python: counting the number of characters or letters on a page

I'm making a request in Python with requests.

Then I use bs4 to select the div I want. I now want to count the length of the text in that div, but the string I get from it also includes all the tags, for example:

<div><a class="some_class">Text here!</a></div>

I want to count only Text here!, without all the div and a tags.

Does anyone know how I can do this?
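
For reference, a minimal sketch, assuming the div has already been selected as described: get_text() strips the markup, and len() then counts only the visible characters:

from bs4 import BeautifulSoup

html = '<div><a class="some_class">Text here!</a></div>'
div = BeautifulSoup(html, 'html.parser').find('div')

text = div.get_text()    # "Text here!" -- the tags are gone
print(len(text))         # 10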

html python beautifulsoup bs4

1 vote · 1 answer · 535 views