I'm familiar with using BeautifulSoup and urllib2 to scrape data from web pages. But what if I need to enter parameters into the page before it returns the results I want to scrape?
I'm trying to get the geographic distance between two addresses using this site: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter the two addresses, click "Show", then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a web page using Python?
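In general, yes: find out what request the form actually sends (browser dev tools, Network tab) and replicate it. A minimal sketch with the requests library, assuming a plain GET/POST form; the field names below are hypothetical. If the page builds its results with JavaScript, as this one appears to, you would drive a real browser with something like Selenium instead.

import requests
from bs4 import BeautifulSoup

# Hypothetical field names -- inspect the form's real request in the
# browser's Network tab before relying on these.
resp = requests.post("http://www.freemaptools.com/how-far-is-it-between.htm",
                     data={"address1": "London", "address2": "Manchester"})
soup = BeautifulSoup(resp.text, "html.parser")
# ...then locate the two distance values in soup and store them in a dict.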
I have this code that fetches a page's HTML source:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);
I want to scrape some content from it. For example, suppose the page's source contains:
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />
Is there a way to scrape this out of the source and store it in a variable, so it looks like this:
technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.
The page is dynamic, which is why I'm running into trouble. I could search the source for each site, but then how do I get the result that comes after it (Connection failed / Done)?
Any help is much appreciated!
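One way to pair each host with its status is a single regular expression over the raw HTML, since every <strong>hostname</strong> tag is followed by its result. A sketch of the matching logic, written in Python for illustration (the PHP equivalent is preg_match_all with the same pattern):

import re

# page holds the HTML shown above
pattern = r"<strong>(.*?)</strong><br />\s*(Connection failed|Done)"
for host, status in re.findall(pattern, page):
    print("%s %s" % (host, status))  # e.g. "technorati.com Connection failed"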
I'm trying to fetch the images from a Google Image search for a particular query. But the page I download contains no images; it redirects me to Google's plain homepage. Here is my code:
import subprocess
import urllib

class GoogleImageScraper:
    AGENT_ID = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
    GOOGLE_URL = "https://www.google.com/images?source=hp&q={0}"
    _myGooglePage = ""

    def scrape(self, theQuery):
        # Shell out to curl: -L follows redirects, -A sets the user agent
        self._myGooglePage = subprocess.check_output(
            ["curl", "-L", "-A", self.AGENT_ID,
             self.GOOGLE_URL.format(urllib.quote(theQuery))],
            stderr=subprocess.STDOUT)
        print self.GOOGLE_URL.format(urllib.quote(theQuery))
        print self._myGooglePage
        f = open('./../../googleimages.html', 'w')
        f.write(self._myGooglePage)
        f.close()  # the original never closed the file
What am I doing wrong?
Thanks
Using the HTML Agility Pack is great for getting descendants, whole tables, and so on... but how do you use it in a case like the following?
...Html Code above...
<dl>
<dt>Location:</dt>
<dd>City, London</dd>
<dt style="padding-bottom:10px;">Distance:</dt>
<dd style="padding-bottom:10px;">0 miles</dd>
<dt>Date Issued:</dt>
<dd>26/10/2010</dd>
<dt>type:</dt>
<dd>cement</dd>
</dl>
...HTML Code below....
How can you find out whether the miles value here is less than 15? I understand you can do things with the elements, but do you have to fetch all the elements, find the right one, and then pull out the number to check its value? Or is there a way to use regular expressions with the Agility Pack to achieve this in a better way...
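The Agility Pack does support XPath (via SelectNodes), so one clean approach is to select the <dd> that immediately follows the <dt> labelled "Distance:" and parse the number out of it, no regex needed. Here is that XPath logic sketched in Python with lxml for brevity; the same expression should work in SelectNodes:

from lxml import html

doc = html.fromstring(page_source)  # page_source: the HTML shown above
# the <dd> that immediately follows the <dt> labelled "Distance:"
hits = doc.xpath('//dt[normalize-space()="Distance:"]/following-sibling::dd[1]/text()')
if hits:
    miles = float(hits[0].split()[0])  # "0 miles" -> 0.0
    if miles < 15:
        print("less than 15 miles")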
I have the following HTML, and I'm trying to figure out how to tell BeautifulSoup to extract a <td> that comes after a certain element. In this case, I want the data in the <td> that follows <td>Color Digest</td>:
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
Here is the entire HTML:
<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
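A minimal sketch with BeautifulSoup: find the label cell by its text, then take the next <td>. find_next walks the rest of the document in order, so it also copes with all the unclosed <tr> tags above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the page above
label = soup.find("td", text=re.compile("Color Digest"))
if label is not None:
    digest = label.find_next("td").get_text(strip=True)
    print(digest)  # "2,36,156,38,25,0,..."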
I've installed Tor + Privoxy on my server and it works fine! (Tested.) But now, when I try to use the proxy with urllib2 (Python) to scrape Google Shopping results, I always get blocked by Google (sometimes a 503 error, sometimes a 403 error). Does anyone have a solution to help me get around this? It would be much appreciated!
The source code I'm using:
import urllib2

_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'deflate',
    'Connection': 'close',
    'DNT': '1'
}

request = urllib2.Request("https://www.google.com/#q=iphone+5&tbm=shop",
                          headers=_HEADERS)
# Route traffic through Privoxy (port 8118), which forwards to Tor
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
try:
    response = urllib2.urlopen(request)
    html = response.read()
    print html
except urllib2.HTTPError as e:
    print e.code
    print e.reason
Please note: when I don't use the proxy, it works fine!
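One thing worth checking: the request URL is https, but the ProxyHandler only maps the http scheme, so urllib2 sends HTTPS requests directly, bypassing Privoxy and Tor entirely, and Google sees the server's real IP. A sketch of the fix is to map both schemes:

proxy_support = urllib2.ProxyHandler({
    "http":  "127.0.0.1:8118",
    "https": "127.0.0.1:8118",  # without this, https:// URLs skip the proxy
})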
I'm trying to iterate over each <ul> and get the value of each <li>. The problem is that it only picks up the first <ul> and skips the rest.
HTML
<div id="browse-results">
<ul class="tips cf">
<li>tip11</li>
<li>tip12</li>
<li>tip13</li>
</ul>
<ul class="tips cf">
<li>tip21</li>
<li>tip22</li>
<li>tip23</li>
</ul>
<ul class="tips cf">
<li>tip31</li>
<li>tip32</li>
<li>tip33</li>
</ul>
<ul class="tips cf">
<li>tip41</li>
<li>tip42</li>
<li>tip43</li>
</ul>
</div>
Cheerio parsing
$('#browse-results').find('.tips.cf').each(function(i, elm) {
console.log($(this).text()) // for testing do text()
});
$('#browse-results').children().find('.tips').each(function(i, elm) {
console.log($(this).text())
});
I've tried many more variations.
The output is only the values of the first <ul>:
tip11
tip12
tip13
Note that this is just an example snippet with the same structure as what I'm actually trying to parse.
I've spent almost two hours on this and can't find a way.
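For what it's worth, the intended nesting (an outer loop over the <ul> elements, an inner loop over each one's <li> children) looks like this in Python with BeautifulSoup; the cheerio equivalent is to iterate the li nodes inside the each callback rather than calling text() on a whole list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: the snippet above
for ul in soup.select("#browse-results ul.tips.cf"):
    for li in ul.find_all("li"):
        print(li.get_text(strip=True))  # tip11, tip12, ... tip43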
I'm trying to scrape a website using Node.js, and it works perfectly on sites that don't require any authentication. But whenever I try to scrape a site behind a form that requires a username and password, I only get the HTML of the authentication page itself (that is, the HTML you'd see from "View Page Source" on the login page). I can get the HTML I want using curl:
curl -d "username=myuser&password=mypw&submit=Login" URL
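For comparison, the flow that curl command performs, sketched in Python with requests (the form field names are taken from the curl command; the URLs are hypothetical): POST the credentials once, then reuse the session cookies for the pages you actually want.

import requests

session = requests.Session()  # keeps the auth cookies across requests
payload = {"username": "myuser", "password": "mypw", "submit": "Login"}
session.post("https://example.com/login", data=payload)      # hypothetical URL
html = session.get("https://example.com/members-page").text  # now authenticated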
Here's my code...
var express = require('express');
var fs = require('fs'); //access to file system
var request = require('request');
var cheerio = require('cheerio');
var app = express();
app.get('/scrape', function(req, res){
url = 'myURL'
request(url, function(error, response, html){
// check errors
if(!error){
// Next, we'll utilize the cheerio library on the returned html which will essentially give us jQuery functionality
var $ = cheerio.load(html);
var title, release, rating;
var json = { title : "", release : …

I'm trying to scrape a list of events from the site http://www.cityoflondon.gov.uk/events/, but when scraping it with import.io I can only extract the first page.
How can I extract all the pages at once?
Image: http://i.imgur.com/OigSBjF.png
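When a tool only grabs the first page, the usual fallback is to iterate over the paginated URLs yourself. A generic sketch in Python; the page parameter name and range here are hypothetical, so check how the site's pagination links are actually formed:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.cityoflondon.gov.uk/events/"
for page in range(1, 6):  # hypothetical page count and parameter name
    r = requests.get(BASE, params={"page": page})
    soup = BeautifulSoup(r.text, "html.parser")
    # ...extract the event entries from each page here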
__CODE__ is not magic; it is essentially the same as __CODE__. So __CODE__ can't be searched with a placeholder; the left-hand argument is evaluated in full. Instead, use:
import requests
from bs4 import BeautifulSoup

r = requests.get("xxxxxxxxx")
soup = BeautifulSoup(r.content)
links = soup.find_all(src=True)  # assumed: 'links' was undefined in the original
for link in links:
    if "http" in link.get('src'):
        print link.get('src')