I'm familiar with using BeautifulSoup and urllib2 to scrape data from web pages. But what if I need to enter parameters into the page before it returns the results I want to scrape?
I'm trying to get the geographic distance between two addresses using this site: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter the two addresses, click "Show", then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a web page using Python?
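In general, yes: find out what request the form actually sends (browser dev tools, Network tab) and replicate it. A minimal sketch with the requests library, assuming a plain GET/POST form; the field names below are hypothetical. If the page builds its results with JavaScript, as this one appears to, you would drive a real browser with something like Selenium instead.

import requests
from bs4 import BeautifulSoup

# Hypothetical field names -- inspect the form's real request in the
# browser's Network tab before relying on these.
resp = requests.post("http://www.freemaptools.com/how-far-is-it-between.htm",
                     data={"address1": "London", "address2": "Manchester"})
soup = BeautifulSoup(resp.text, "html.parser")
# ...then locate the two distance values in soup and store them in a dict.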
I have this code that fetches a page's HTML source:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);
I want to scrape some content from it. For example, suppose the page's source contains:
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />
Is there a way to scrape this out of the source and store it in a variable, so it looks like this:
technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.
The page is dynamic, which is why I'm running into trouble. I could search the source for each site, but then how do I get the result that comes after it (Connection failed / Done)?
Any help is much appreciated!
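One way to pair each host with its status is a single regular expression over the raw HTML, since every <strong>hostname</strong> tag is followed by its result. A sketch of the matching logic, written in Python for illustration (the PHP equivalent is preg_match_all with the same pattern):

import re

# page holds the HTML shown above
pattern = r"<strong>(.*?)</strong><br />\s*(Connection failed|Done)"
for host, status in re.findall(pattern, page):
    print("%s %s" % (host, status))  # e.g. "technorati.com Connection failed"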
I'm trying to fetch the images from a Google Image search for a particular query. But the page I download contains no images; it redirects me to Google's plain homepage. Here is my code:
import subprocess
import urllib

class GoogleImageScraper:
    AGENT_ID = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
    GOOGLE_URL = "https://www.google.com/images?source=hp&q={0}"
    _myGooglePage = ""

    def scrape(self, theQuery):
        # Shell out to curl: -L follows redirects, -A sets the user agent
        self._myGooglePage = subprocess.check_output(
            ["curl", "-L", "-A", self.AGENT_ID,
             self.GOOGLE_URL.format(urllib.quote(theQuery))],
            stderr=subprocess.STDOUT)
        print self.GOOGLE_URL.format(urllib.quote(theQuery))
        print self._myGooglePage
        f = open('./../../googleimages.html', 'w')
        f.write(self._myGooglePage)
        f.close()  # the original never closed the file
What am I doing wrong?
Thanks
Using the HTML Agility Pack is great for getting descendants, whole tables, and so on... but how do you use it in a case like the following?
...Html Code above...
<dl>
<dt>Location:</dt>
<dd>City, London</dd>
<dt style="padding-bottom:10px;">Distance:</dt>
<dd style="padding-bottom:10px;">0 miles</dd>
<dt>Date Issued:</dt>
<dd>26/10/2010</dd>
<dt>type:</dt>
<dd>cement</dd>
</dl>
...HTML Code below....
How can you find out whether the miles value here is less than 15? I understand you can do things with the elements, but do you have to fetch all the elements, find the right one, and then pull out the number to check its value? Or is there a way to use regular expressions with the Agility Pack to achieve this in a better way...
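The Agility Pack does support XPath (via SelectNodes), so one clean approach is to select the <dd> that immediately follows the <dt> labelled "Distance:" and parse the number out of it, no regex needed. Here is that XPath logic sketched in Python with lxml for brevity; the same expression should work in SelectNodes:

from lxml import html

doc = html.fromstring(page_source)  # page_source: the HTML shown above
# the <dd> that immediately follows the <dt> labelled "Distance:"
hits = doc.xpath('//dt[normalize-space()="Distance:"]/following-sibling::dd[1]/text()')
if hits:
    miles = float(hits[0].split()[0])  # "0 miles" -> 0.0
    if miles < 15:
        print("less than 15 miles")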
I have the following HTML, and I'm trying to figure out how to tell BeautifulSoup to extract a <td> that comes after a certain element. In this case, I want the data in the <td> that follows <td>Color Digest</td>:
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
Here is the entire HTML:
<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
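A minimal sketch with BeautifulSoup: find the label cell by its text, then take the next <td>. find_next walks the rest of the document in order, so it also copes with all the unclosed <tr> tags above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the page above
label = soup.find("td", text=re.compile("Color Digest"))
if label is not None:
    digest = label.find_next("td").get_text(strip=True)
    print(digest)  # "2,36,156,38,25,0,..."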
I've installed Tor + Privoxy on my server and it works fine! (Tested.) But now, when I try to use the proxy with urllib2 (Python) to scrape Google Shopping results, I always get blocked by Google (sometimes a 503 error, sometimes a 403 error). Does anyone have a solution to help me get around this? It would be much appreciated!
The source code I'm using:
import urllib2

_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'deflate',
    'Connection': 'close',
    'DNT': '1'
}

request = urllib2.Request("https://www.google.com/#q=iphone+5&tbm=shop",
                          headers=_HEADERS)
# Route traffic through Privoxy (port 8118), which forwards to Tor
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
try:
    response = urllib2.urlopen(request)
    html = response.read()
    print html
except urllib2.HTTPError as e:
    print e.code
    print e.reason
Please note: when I don't use the proxy, it works fine!
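One thing worth checking: the request URL is https, but the ProxyHandler only maps the http scheme, so urllib2 sends HTTPS requests directly, bypassing Privoxy and Tor entirely, and Google sees the server's real IP. A sketch of the fix is to map both schemes:

proxy_support = urllib2.ProxyHandler({
    "http":  "127.0.0.1:8118",
    "https": "127.0.0.1:8118",  # without this, https:// URLs skip the proxy
})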
I'm trying to iterate over each <ul> and get the value of each <li>. The problem is that it only picks up the first <ul> and skips the rest.
HTML
<div id="browse-results">
<ul class="tips cf">
<li>tip11</li>
<li>tip12</li>
<li>tip13</li>
</ul>
<ul class="tips cf">
<li>tip21</li>
<li>tip22</li>
<li>tip23</li>
</ul>
<ul class="tips cf">
<li>tip31</li>
<li>tip32</li>
<li>tip33</li>
</ul>
<ul class="tips cf">
<li>tip41</li>
<li>tip42</li>
<li>tip43</li>
</ul>
</div>
Cheerio parsing
$('#browse-results').find('.tips.cf').each(function(i, elm) {
console.log($(this).text()) // for testing do text()
});
$('#browse-results').children().find('.tips').each(function(i, elm) {
console.log($(this).text())
});
I've tried many more variations.
The output is only the values of the first <ul>:
tip11
tip12
tip13
Note that this is just an example snippet with the same structure as what I'm actually trying to parse.
I've spent almost two hours on this and can't find a way.
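For what it's worth, the intended nesting (an outer loop over the <ul> elements, an inner loop over each one's <li> children) looks like this in Python with BeautifulSoup; the cheerio equivalent is to iterate the li nodes inside the each callback rather than calling text() on a whole list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the snippet above
for ul in soup.select("#browse-results ul.tips.cf"):
    for li in ul.find_all("li"):
        print(li.get_text(strip=True))  # tip11, tip12, ... tip43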
I'm trying to scrape a website using Node.js, and it works perfectly on sites that don't require any authentication. But whenever I try to scrape a site behind a form that requires a username and password, I only get the HTML of the authentication page itself (that is, the HTML you'd see from "View Page Source" on the login page). I can get the HTML I want using curl:
curl -d "username=myuser&password=mypw&submit=Login" URL
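For comparison, the flow that curl command performs, sketched in Python with requests (the form field names are taken from the curl command; the URLs are hypothetical): POST the credentials once, then reuse the session cookies for the pages you actually want.

import requests

session = requests.Session()  # keeps the auth cookies across requests
payload = {"username": "myuser", "password": "mypw", "submit": "Login"}
session.post("https://example.com/login", data=payload)      # hypothetical URL
html = session.get("https://example.com/members-page").text  # now authenticated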
Here's my code...
var express = require('express');
var fs = require('fs'); //access to file system
var request = require('request');
var cheerio = require('cheerio');
var app = express();
app.get('/scrape', function(req, res){
url = 'myURL'
request(url, function(error, response, html){
// check errors
if(!error){
// Next, we'll utilize the cheerio library on the returned html which will essentially give us jQuery functionality
var $ = cheerio.load(html);
var title, release, rating;
var json = { title : "", release : …

I'm trying to scrape a list of events from the site http://www.cityoflondon.gov.uk/events/, but when scraping it with import.io I can only extract the first page.
How can I extract all the pages at once?
Image: http://i.imgur.com/OigSBjF.png
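When a tool only grabs the first page, the usual fallback is to iterate over the paginated URLs yourself. A generic sketch in Python; the page parameter name and range here are hypothetical, so check how the site's pagination links are actually formed:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.cityoflondon.gov.uk/events/"
for page in range(1, 6):  # hypothetical page count and parameter name
    r = requests.get(BASE, params={"page": page})
    soup = BeautifulSoup(r.text, "html.parser")
    # ...extract the event entries from each page here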
__CODE__ is not magic; it is essentially the same as __CODE__. So __CODE__ can't be searched with a placeholder; the left-hand argument is evaluated in full. Instead, use:
import requests
from bs4 import BeautifulSoup

r = requests.get("xxxxxxxxx")
soup = BeautifulSoup(r.content)
links = soup.find_all(src=True)  # assumed: 'links' was undefined in the original
for link in links:
    if "http" in link.get('src'):
        print link.get('src')