I'm trying to understand how web scraping works:
import requests
from bs4 import BeautifulSoup as soup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")

items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
print(price.string)
However, when I run this, all it returns is the last price on the site ($1799.00). Why does it skip all the other h4 tags and only return the last one?
Any help would be much appreciated!
Let me know if you need more information.
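A hedged guess at the cause, with a sketch continuing from the question's own variables: the symptom (only the final price, printed once) is the classic sign that print(price.string) sits outside the for loop, so it runs a single time after the loop has finished, when price still refers to the last item. Indenting it into the loop body prints every price:

for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
    print(price.string)  # runs once per item instead of once after the loop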
I can't scrape the site https://www.mentalhealthforum.net/: I get a 403 status code even though I've already tried every solution I could find on the internet. The site sits behind Cloudflare with h-captcha protection, so bypassing it is more complicated.
Here is my code:
import cloudscraper

def scrape():
    baseurl = 'https://www.mentalhealthforum.net/'
    scraper = cloudscraper.create_scraper(
        delay=10,
        browser={
            'browser': 'chrome',
            'platform': 'android',
            'desktop': False
        },
        debug=True,
        captcha={'provider': '2captcha',
                 'api_key': api_key})  # api_key: my 2captcha key, defined elsewhere
    response = scraper.get(baseurl)
    return response.status_code

print(scrape())
Output:
< GET / HTTP/1.1
< Host: www.mentalhealthforum.net
< User-Agent: Mozilla/5.0 (Linux; Android 4.3; SM-G710 Build/JLS36C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.111 Mobile Safari/537.36
< Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
< Accept-Language: en-US,en;q=0.9
< Accept-Encoding: gzip, deflate
<
> HTTP/1.1 403 Forbidden
> Date: Thu, 04 Aug 2022 …
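A hedged diagnostic sketch (not a bypass): before trying more workarounds, it can help to confirm that the 403 body really is Cloudflare's h-captcha interstitial rather than a plain IP block. This only inspects the response; the 'h-captcha' marker is an assumption about the challenge page's markup.

import cloudscraper

scraper = cloudscraper.create_scraper()
resp = scraper.get('https://www.mentalhealthforum.net/')
print(resp.status_code)                  # 403 in the failing case
print(resp.headers.get('Server'))        # typically "cloudflare" for Cloudflare-fronted sites
print('h-captcha' in resp.text.lower())  # True if an h-captcha widget is embedded in the body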
I'm using Python, and I'm trying to get the link the CSV comes from when I click the DATA V CSV button at the bottom of this page: https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR. I tried BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Find the link to the CSV file
csv_link = soup.find('a', string='DATA V CSV').get('href')
I also tried:
soup.find("button", {"id":"DATA V CSV"})
but neither finds the link behind DATA V CSV.
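Note that when find() matches nothing it returns None, so the .get('href') chain raises AttributeError rather than failing quietly. A hedged diagnostic sketch: list what the static HTML actually contains, since the export button is most likely wired up with JavaScript and the CSV URL may never appear as a plain <a href> in the fetched page:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ceps.cz/en/all-data#AktualniSystemovaOdchylkaCR'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Dump every anchor and button so we can see what the server-side HTML actually offers.
for tag in soup.find_all(['a', 'button']):
    print(tag.name, tag.get('href'), tag.get_text(strip=True)[:40])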
I'm trying to install easy_install so that I can use BeautifulSoup... but I don't know what my PATH directory is... When I run easy_install BeautifulSoup I get:
error: Not a recognized archive type: C:\docume~1\tom\locals~1\temp\weasy_install-w6haxs\BeautifulSoup-3.2.1.tar.gz
I'm guessing this has something to do with PATH not being set in my environment variables... but I don't know what my path should be... Any help would be appreciated... I'm very new to all of this, so plain English rather than programmer-speak would be much appreciated lol...
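A hedged side note: the PATH question may be avoidable altogether. BeautifulSoup 3.2.1 is the legacy Python 2 line, and easy_install has long been superseded; on any reasonably recent Python, pip can install the maintained bs4 package instead:

# Install the maintained BeautifulSoup line with pip (run in a shell):
#     pip install beautifulsoup4
# then import it in Python as:
from bs4 import BeautifulSoup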
I want to scrape a website, full of horrible problems, with Python. One of those problems is the wrong encoding declared at the top:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This is wrong, because the page contains text like:
Nell’ambito
instead of
Nell'ambito (note the ’ in place of the ').
If I understand correctly, this happens because UTF-8 bytes (probably the database encoding) are being interpreted as iso-8859-1 bytes (forced by the charset in the meta tag). I found some preliminary explanation at this link: http://www.i18nqa.com/debug/utf8-debug.html
I'm using BeautifulSoup to navigate the page and Google App Engine's urlfetch to make the requests, but what I need is to understand the correct way to repair strings like ’, turning them back into ' by re-encoding, before storing them in the database.
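A hedged repair sketch (Python 3 syntax): ’ is what the UTF-8 bytes for ’ look like when decoded as Windows-1252, a superset of iso-8859-1 that these bytes actually match (€ and ™ do not exist in pure iso-8859-1). Reversing the wrong decode and redoing the right one restores the text:

garbled = 'Nell’ambito'
fixed = garbled.encode('cp1252').decode('utf-8')  # undo the cp1252 decode, redo as UTF-8
print(fixed)  # Nell'ambito, with a proper right single quote (U+2019)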
I'm trying to get data out of a nested JavaScript structure, and I want to use json.loads() for this. However, I get an error saying "No JSON object could be decoded".
Below are the code I tried and the JavaScript structure.

Code
import urllib2
from bs4 import BeautifulSoup

page_us = urllib2.urlopen('http://www.verizonwireless.com/smartphones-2.shtml')
soup_us = BeautifulSoup(page_us)
scripts_us = soup_us.findAll('script')

script = []
for s in scripts_us:
    if s.string and "$j('#module_1_Tile" in s.string:
        script.append(s.text.split('data')[1].replace("\n", "").replace("(", "").replace(")", "").replace(";", "").replace("\t", ""))
Data structure
script[1] = u'{"phones":{"id5986":{"id":"5986","rating":"stars_4","colorName":"White","colorCode":"#FFFFFF","capacity":"16 GB","price":"$149.99","fullPrice":"$599.99","addToCartQty":"0","image":"http://s7.vzw.com/is/image/VerizonWireless/Motorola%5Fdroid%5Frazr%5Fhd%5Fwhite?$device%2Dmed$","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","BAGX":"false","priceRange":"150","rating":"4","OOS":"","freeShipping":"freeOvernightShippingHTML","bagxGetPhone":"","badges":{"lteBadge","vzwExclusiveBadge","globalReadyBadge"},"vPrice":"$221.96","vFullPrice":"$671.96","vBundleName":"DROID RAZR HD by Motorola in White Bluetooth® Pack","vBundleImage":"http://s7.vzw.com/is/image/VerizonWireless/moto%5Fdroid%5Frazr%5Fhd%5Fwht%5Fbluetooth%5Fvirt%5Fbndl?$device%2Dmed$","vBundleDescription":"<ul><li>Bluetooth® Headset</li><li>Clear Hard Cover</li><li>Vehicle Charger</li></ul>"},"id5985":{"id":"5985","rating":"stars_4_5","colorName":"Black","colorCode":"#000000","capacity":"16 GB","price":"$149.99","fullPrice":"$599.99","addToCartQty":"0","image":"http://s7.vzw.com/is/image/VerizonWireless/Motorola%5Fdroid%5Frazr%5Fhd%5Fblack?$device%2Dmed$","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","BAGX":"false","priceRange":"150","rating":"4_5","OOS":"","freeShipping":"freeOvernightShippingHTML","bagxGetPhone":"","badges":{"lteBadge","vzwExclusiveBadge","globalReadyBadge"},"vPrice":"$221.96","vFullPrice":"$671.96","vBundleName":"DROID RAZR HD by Motorola Bluetooth® Pack","vBundleImage":"http://s7.vzw.com/is/image/VerizonWireless/moto%5Fdroid%5Frazr%5Fhd%5Fblk%5Fbluetooth%5Fvirt%5Fbndl?$device%2Dmed$","vBundleDescription":"<ul><li>Bluetooth® Headset</li><li>Silicone Cover</li><li>Vehicle Charger</li></ul>"}},"options":{"colorName":"Black","colorCode":"#000000","capacity":"16 GB"},"info":{"brand":"486","os":"10351","features":{502,569,501,568,247,318,85,503,497,431,458,11,49,150,44,17,141,145,165,20,58,15,24,172,186,184,159,187,185,199,156,249,157,189,142,168,211,13,188,239,14,167,321,41,25,357,443,441,442,444,459,418,416,12,413,5,61,7,446,504,362,573,202,522,"hasVB"},"priceRange":"150","phoneId":"id5985","ATCST":"submitThisPhone","MAST":"false","CIL":"0","IRURL":"https://preorder.verizonwireless.com/iconic/","bagxGetPhone":"","BAGX":"false"}}'
json.loads works fine on script[0], but gives the error above on script[1]. Please let me know what I'm missing here. Thanks.
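A hedged observation with a sketch: script[1] is not valid JSON. Blocks like "badges":{"lteBadge","vzwExclusiveBadge","globalReadyBadge"} and "features":{502,...} use JavaScript-style braces around plain value lists, but a JSON object requires key:value pairs, so json.loads rejects the whole string. One targeted workaround (specific to this data, not general-purpose) is to rewrite any brace group that contains no colon as a JSON array before parsing, continuing from the question's script list:

import json
import re

def fix_brace_lists(text):
    # Rewrite {...} groups that contain no ':' (hence no key:value pairs) as [...] arrays.
    return re.sub(r'\{([^{}:]*)\}', r'[\1]', text)

data = json.loads(fix_brace_lists(script[1]))
print(data['phones']['id5986']['price'])  # $149.99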
I'm creating web scrapers for different news outlets, and right now I'm trying to build one for The Hindu newspaper.
I want to pull news from the various links listed in its archive. Say I want the news from the links on this page: http://www.thehindu.com/archive/web/2010/06/19/, which is June 19, 2010.
I've written the following lines of code:
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section": "Business"}):
    articletext += tag.contents[0]
print articletext
But I can't get the result I want, and I'm basically stuck. Can someone help me sort this out?
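A hedged diagnostic sketch: a frequent cause of an empty result here is that the fetched <li> tags simply don't carry a data-section attribute (archive pages often differ from the live site's markup). Dumping the attributes that actually come back shows what to filter on; get_text() is also usually safer than contents[0] for collecting text:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
html = br.open("http://www.thehindu.com/archive/web/2010/06/19/").read()
soup = BeautifulSoup(html)

# Print the attribute dict of the first few <li> tags to see what is really there.
for li in soup.findAll('li')[:20]:
    print(li.attrs)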
I'm scraping this page, http://www.crmz.com/Directory/Industry806.htm, and I should be getting all of the company names. But there is an RSS link next to the company name, so I don't get the full result and a TypeError shows up.
Here is my code:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.crmz.com/Directory/Industry806.htm"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table", {"border": "0", "cellspacing": "1", "cellpadding": "2"})
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
Here is my output:
LRI$ python scrape.py
#| Company Name| Country| State/Province|
1.| 1300 Smiles Limited|
Traceback (most recent call last):
File "scrape.py", …
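A hedged fix sketch (Python 2, matching the question's code and variables): cells that hold only the RSS-icon link contain no text node, so td.find(text=True) returns None, and ''.join(None) raises the TypeError. Joining all text nodes instead degrades gracefully to an empty string for those cells:

for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.findAll(text=True))  # empty list -> '' instead of None
        print text + "|",
    print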
From this HTML code:
<p class="description" dir="ltr">Name is a fine man. <br></p>
I'm looking to replace "Name" using the following code:
target = soup.find_all(text="Name")
for v in target:
    v.replace_with('Id')
The output I'm after is:
<p class="description" dir="ltr">Id is a fine man. <br></p>
But when I print target, it comes back empty:
print target
[]
Why doesn't it find "Name"?
Thanks!
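A hedged explanation with a sketch: find_all(text="Name") only matches text nodes whose entire string equals "Name". The node here is "Name is a fine man. ", so nothing matches and target is []. Matching with a regular expression and rewriting each matching node does what the question asks:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="description" dir="ltr">Name is a fine man. <br></p>', 'html.parser')
for node in soup.find_all(text=re.compile('Name')):
    node.replace_with(node.replace('Name', 'Id'))
print(soup)  # <p class="description" dir="ltr">Id is a fine man. <br/></p>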
I'm making a request in Python with requests.
Then I use bs4 to select the div I want. Now I want to count the length of the text in that div, but the string I get back also includes all the tags, for example:
<div><a class="some_class">Text here!</a></div>
I want to count only Text here!, without all the div and a tags.
Does anyone know how I can do that?
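A hedged sketch: bs4's get_text() strips the markup and returns only the concatenated text, whose length can then be measured directly:

from bs4 import BeautifulSoup

html = '<div><a class="some_class">Text here!</a></div>'
div = BeautifulSoup(html, 'html.parser').div
print(len(div.get_text()))  # 10, the length of "Text here!"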