Tag: scrape

HTML Agility Pack - accessing siblings?

HTML Agility Pack is great for getting descendants, whole tables and so on... but how do you use it in the following situation?

...Html Code above...

<dl>
<dt>Location:</dt>
<dd>City, London</dd>
<dt style="padding-bottom:10px;">Distance:</dt>
<dd style="padding-bottom:10px;">0 miles</dd>
<dt>Date Issued:</dt>
<dd>26/10/2010</dd>
<dt>type:</dt>
<dd>cement</dd>
</dl>

...HTML Code below....


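How can you find out whether the miles value in this case is less than 15? I gather you could do something with the elements, but do you have to get all the elements, find the right one, and then find the number to check its value? Or is there a way to use regular expressions with the Agility Pack to achieve this in a better way...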
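
The lookup described above (find the `Distance:` label, step to its next sibling, pull the number out with a regular expression) can be sketched in Python 3 with the standard library. This is only an illustration of the logic, not HTML Agility Pack code; the function name `distance_in_miles` is made up for the example, and the snippet relies on the `<dl>` fragment being well-formed enough to parse as XML. In HTML Agility Pack itself the same idea would presumably go through an XPath such as `//dt[normalize-space()='Distance:']/following-sibling::dd[1]`, which is not tested here.

```python
import re
import xml.etree.ElementTree as ET

# The <dl> fragment from the question.
SNIPPET = """<dl>
<dt>Location:</dt>
<dd>City, London</dd>
<dt style="padding-bottom:10px;">Distance:</dt>
<dd style="padding-bottom:10px;">0 miles</dd>
<dt>Date Issued:</dt>
<dd>26/10/2010</dd>
<dt>type:</dt>
<dd>cement</dd>
</dl>"""


def distance_in_miles(dl_markup):
    """Return the number in the <dd> that follows the 'Distance:' <dt>, or None."""
    children = list(ET.fromstring(dl_markup))
    # Walk adjacent pairs so each element is compared with its next sibling.
    for label, value in zip(children, children[1:]):
        if label.tag == "dt" and (label.text or "").strip() == "Distance:":
            match = re.search(r"\d+(?:\.\d+)?", value.text or "")
            return float(match.group()) if match else None
    return None


miles = distance_in_miles(SNIPPET)
print(miles is not None and miles < 15)
```

The regular expression is only used on the text of the one `<dd>` that was located structurally, which tends to be more robust than running a pattern over the whole page.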

.net html html-content-extraction scrape html-agility-pack

Jay*_*Jay

2011 05-08

5
votes

1
answer

1941
views

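BeautifulSoup: how to extract data after a specific HTML tag

I have the following HTML and I am trying to figure out how to tell BeautifulSoup to extract the td that comes after a certain HTML element. In this case, I want to get the data in the <td> after <td>Color Digest</td>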

<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>


Here is the whole HTML:

<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
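
One way to picture "the cell after the `Color Digest` cell" is as a next-sibling lookup inside the same row. The sketch below is Python 3 and uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not BeautifulSoup code; the class name `CellAfterLabel` and the shortened sample row are assumptions made for the example. In BeautifulSoup 4 the corresponding step would normally be `find_next_sibling("td")` on the matched cell.

```python
from html.parser import HTMLParser


class CellAfterLabel(HTMLParser):
    """Collect the text of the <td> that directly follows a <td> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_td = False
        self.buffer = []
        self.grab_next = False  # True right after the label cell has closed
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "tr":
            self.grab_next = False  # never carry a match over into the next row
        if tag != "td" or not self.in_td:
            return
        self.in_td = False
        text = "".join(self.buffer).strip()
        if self.grab_next and self.result is None:
            self.result = text
        self.grab_next = (text == self.label)


# Shortened version of the row from the question (the real row has many more zeros).
ROW = "<tr><td> Color Digest </td><td> 2,36,156,38,25,0,0, </td></tr>"

parser = CellAfterLabel("Color Digest")
parser.feed(ROW)
values = [int(v) for v in parser.result.split(",") if v.strip()]
print(values)
```

Stripping whitespace before comparing matters here, because the label cell in the page is ` Color Digest ` with padding on both sides.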


html python beautifulsoup scrape

add*_*ons

2012 07-24

5
votes

1
answer

4619
views

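Scraping Google Shopping results with Tor + Privoxy: how to avoid being blocked?

I have installed Tor + Privoxy on my server and it works fine (tested). But now, when I try to scrape Google Shopping results with urllib2 (Python) through the proxy, I always get blocked by Google, of course (sometimes a 503 error, sometimes a 403). Does anyone have a solution that would help me avoid this problem? It would be much appreciated!

The source code I am using: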

_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'deflate',
    'Connection': 'close',
    'DNT': '1'
}

request = urllib2.Request("https://www.google.com/#q=iphone+5&tbm=shop", headers=self._HEADERS)

proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

try:
    response = urllib2.urlopen(request)
    html = response.read()
    print html

except urllib2.HTTPError as e:
    print e.code
    print e.reason


Please note: when I do not use the proxy, it works fine!
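
Two details of the request setup can be checked without touching the network. The sketch below uses Python 3's `urllib.request` (the successor of `urllib2`) and makes no request; it only builds an opener and inspects the URL. It does not claim to stop Google from blocking Tor exit nodes. The point is narrower: a proxy mapping with only an `"http"` key is not applied to `https://` URLs, and everything after `#` in a URL is a fragment that is not sent to the server.

```python
from urllib.parse import urlsplit
import urllib.request

PRIVOXY = "127.0.0.1:8118"  # Privoxy address taken from the question

# Map both schemes so https:// requests are routed through Privoxy as well.
proxy_handler = urllib.request.ProxyHandler({"http": PRIVOXY, "https": PRIVOXY})
opener = urllib.request.build_opener(proxy_handler)

url = "https://www.google.com/#q=iphone+5&tbm=shop"
parts = urlsplit(url)
print(repr(parts.query))     # the query string is empty
print(repr(parts.fragment))  # the search terms sit in the fragment
```

If the search terms are meant to reach the server, they would have to be in the query string (after `?`) rather than in the fragment.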

python tor scrape

Đôn*_*yễn

lucky-day

5
votes

1
answer

2530
views

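How to download images with BeautifulSoup?

Image: http://i.imgur.com/OigSBjF.png

__CODE__ is not magic; it is basically the same as ...__CODE__. So __CODE__ cannot search with placeholders; the left argument is evaluated in full. Instead, use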

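import requests
from bs4 import BeautifulSoup

r = requests.get("xxxxxxxxx")
soup = BeautifulSoup(r.content)

# Assumption: `links` (undefined in the original excerpt) is meant to hold the <img> tags.
links = soup.find_all('img')
for link in links:
    if "http" in link.get('src'):
        print link.get('src')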
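
Printing the `src` values is only half of the job; the other half is resolving each value against the page URL and saving the bytes. The sketch below is Python 3 and standard library only, with `html.parser` standing in for BeautifulSoup. The names `ImageSources` and `download_all`, the base URL and the sample markup are all assumptions for illustration. `download_all` is defined but not called, since calling it would hit the network.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve


class ImageSources(HTMLParser):
    """Collect absolute URLs for every <img> that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src attribute
                self.urls.append(urljoin(self.base_url, src))


def download_all(urls, folder):
    """Save each URL into `folder`, naming files after the last path segment."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        filename = os.path.basename(url.split("?")[0]) or "image"
        urlretrieve(url, os.path.join(folder, filename))


# Hypothetical page: one relative src, one <img> without src, one absolute src.
PAGE = ('<p><img src="/pics/a.png"><img alt="no source">'
        '<img src="http://i.imgur.com/OigSBjF.png"></p>')

finder = ImageSources("http://example.com/gallery/")
finder.feed(PAGE)
print(finder.urls)
```

Resolving with `urljoin` keeps relative paths usable, whereas a check like `"http" in src` silently drops them.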


python beautifulsoup scrape python-2.7

Fis*_*art

2016 05-11

5
votes

1
answer

6534
views

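Parsing selectively with SoupStrainer

I am trying to parse a list of video games from a shopping site. However, the item list is all stored inside a tag.

This section of the documentation supposedly explains how to parse only part of a document, but I can't work it out. My code: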

from BeautifulSoup import BeautifulSoup
import urllib
import re

url = "Some Shopping Site"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for a in soup.findAll('a',{'title':re.compile('.+') }):
    print a.string


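Currently this prints the string of any tag that has a non-empty title attribute. But it also picks up the "Specials" in the sidebar. If I could grab just the product-list div, I would kill two birds with one stone.

Thanks a lot.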
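
The effect being asked for, looking only inside the product-list `div` and ignoring the sidebar, can be sketched in Python 3 with the standard library. This is a stand-in for what `SoupStrainer` does, not BeautifulSoup code: the class name `TitledLinksInContainer`, the `id="products"` container and the sample markup are assumptions, since the real page is not shown. In BeautifulSoup 4 a strainer is passed through the `parse_only` argument (`parseOnlyThese` in BeautifulSoup 3).

```python
from html.parser import HTMLParser


class TitledLinksInContainer(HTMLParser):
    """Collect the text of <a title=...> links found inside one <div id=...>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0    # greater than 0 while inside the target <div>
        self.current = None   # text pieces of the <a> currently being read
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the target div, then count nested divs too.
            if self.div_depth or attrs.get("id") == self.container_id:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("title"):
            self.current = []

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a" and self.current is not None:
            self.names.append("".join(self.current).strip())
            self.current = None


# Hypothetical page with a sidebar link that should be ignored.
PAGE = """
<div id="sidebar"><a title="Specials" href="/s">Specials</a></div>
<div id="products">
  <div class="item"><a title="Game A" href="/a">Game A</a></div>
  <div class="item"><a title="Game B" href="/b">Game B</a></div>
</div>
"""

parser = TitledLinksInContainer("products")
parser.feed(PAGE)
print(parser.names)
```

Only `div` tags are counted for nesting, so void elements such as `<br>` or `<img>` inside the container do not throw the depth counter off.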

python beautifulsoup scrape

Scr*_*per

lucky-day

4
votes

2
answers

10k
views

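How to screen-scrape Ajax sites in Java?

I want to screen-scrape several Ajax-based websites, simulating the clicks that refresh part of the page and then reading the updated HTML. Is there a Java library that can do this?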

java screen-scraping web-scraping scrape

Zub*_*air

2014 01-18

4
votes

1
answer

2775
views

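How many results does Google allow a single request to scrape?

The PHP code below works fine, but when it is used to scrape 1000 Google results for a given keyword, it returns only 100 results. Does Google limit the number of results returned, or is there some other problem?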

<?php
require_once ("header.php");
$data2 = getContent("http://www.google.de/search?q=auch&hl=de&num=100&gl=de&ix=nh&sourceid=chrome&ie=UTF-8");
$dom = new DOMDocument();
@$dom->loadHtml($data2);
$xpath = new DOMXPath($dom);

$hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/@href");
$j = 0;

foreach ($hrefs as $href)
{
    $url = "http://www.google.de/" . $href->value . "";
    echo "<b>";

    echo "$j ";
    echo $url = get_string_between($url, "http://www.google.de//url?q=", "&sa=");
    echo "<br/>";

    $j++;
}
?>
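
The URL in the code asks for `num=100`, so a single request cannot be expected to carry more than one page of that size. A commonly used way to go further is to request successive pages with the `start` offset parameter. The Python 3 sketch below only builds such URLs; the function name `paged_search_urls` is an assumption, and the sketch does not establish how many pages Google will actually serve for a query.

```python
from urllib.parse import urlencode


def paged_search_urls(query, total, per_page=100):
    """Build one search URL per page, stepping the `start` offset by `per_page`."""
    base = "http://www.google.de/search"
    urls = []
    for offset in range(0, total, per_page):
        params = {"q": query, "hl": "de", "gl": "de", "num": per_page, "start": offset}
        urls.append(base + "?" + urlencode(params))
    return urls


urls = paged_search_urls("auch", 1000)
print(len(urls))
print(urls[1])
```

Each URL would then go through the same fetch-and-XPath step as in the PHP code, one page at a time.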


php scrape

Zei*_*vic

2013 01-23

4
votes

1
answer

8592
views

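Python - saving a requests or BeautifulSoup object locally

I have some very long code, so it takes a long time to run. I just want to save the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that I can save time on the next run. Here is the code: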

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)
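
A simple option is to cache the raw bytes (`name.content`) in a file and rebuild the soup from that file on later runs, rather than serializing the objects themselves. The Python 3 sketch below shows the caching step only. The function name `cached_content` and the `fetch` callable are assumptions; in the code above `fetch` would be something like `lambda u: requests.get(u).content`. The demo uses a fake fetcher and a temporary directory so that it runs without network access.

```python
import tempfile
from pathlib import Path


def cached_content(url, cache_file, fetch):
    """Return the page bytes for `url`, downloading only when no cached copy exists."""
    path = Path(cache_file)
    if path.exists():
        return path.read_bytes()
    content = fetch(url)
    path.write_bytes(content)
    return content


calls = []


def fake_fetch(url):
    """Stand-in for a real download; records each call."""
    calls.append(url)
    return b"<html><body>hello</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.html"
    first = cached_content("SOMEURL", target, fake_fetch)   # downloads
    second = cached_content("SOMEURL", target, fake_fetch)  # served from disk

print(first == second, len(calls))
```

Parsing the cached bytes again with `BeautifulSoup(...)` still costs some time, but it avoids the network round trip, which is usually the slow part.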


python file beautifulsoup scrape

bil*_*999

lucky-day

4
votes

1
answer

3170
views

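Can SPARQL handle blank results for a specific cell?

I am writing a SPARQL query and can't figure out how to allow blank results for a specific column.

My current query is: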

select * where {
?game a dbpedia-owl:Game ;
dbpprop:name ?name ; 
dbpedia-owl:publisher ?publisher . }


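Some games have a publisher property (dbpedia-owl:publisher), while others do not. The query above filters out the games that have no publisher. I would like to get the games with a publisher and the games without one together in the same CSV.

I tried to write an isset-style statement for the publisher property, but can't seem to get the blanks right.

I want the publisher cell to come back blank, rather than having games filtered out when they have no publisher.

Any suggestions?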
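
SPARQL's `OPTIONAL` keyword is the usual tool for this: a pattern wrapped in `OPTIONAL { ... }` binds its variables when it matches and leaves them unbound otherwise, without dropping the row. The Python 3 sketch below holds such a query as a string and shows how unbound variables turn into blank CSV cells. The sample bindings and the `example.org` IRIs are made up and only mimic the shape of the SPARQL JSON results format, where an unbound variable is simply absent from the row. No endpoint is contacted.

```python
import csv
import io

# The query from the question with the publisher pattern made optional.
QUERY = """
select * where {
  ?game a dbpedia-owl:Game ;
        dbpprop:name ?name .
  OPTIONAL { ?game dbpedia-owl:publisher ?publisher . }
}
"""

# Hypothetical rows: the second game has no publisher binding at all.
SAMPLE_BINDINGS = [
    {"game": {"value": "http://example.org/GameA"},
     "name": {"value": "Game A"},
     "publisher": {"value": "http://example.org/PublisherX"}},
    {"game": {"value": "http://example.org/GameB"},
     "name": {"value": "Game B"}},
]


def bindings_to_csv(bindings, columns):
    """Write one CSV row per binding, leaving unbound variables as empty cells."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in bindings:
        writer.writerow([row.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()


csv_text = bindings_to_csv(SAMPLE_BINDINGS, ["game", "name", "publisher"])
print(csv_text)
```

The second data row ends with an empty publisher cell, which is the blank the question asks for.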

sparql web-scraping scrape dbpedia

Rya*_*anf

lucky-day

4
votes

1
answer

67
views

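rvest - scraping 2 classes in 1 tag

I am new to rvest. How can I extract the elements that have 2 class names in the tag, or only 1 class name?

Here is my code and the problem: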

doc <- paste("<html>",
             "<body>",
             "<span class='a1 b1'> text1 </span>",
             "<span class='b1'> text2 </span>",
             "</body>",
             "</html>"
            )
library(rvest)
read_html(doc) %>% html_nodes(".b1")  %>% html_text()
#output: text1, text2
#what i want: text2

#I also want to extract only elements with 2 class names
read_html(doc) %>% html_nodes(".a1 .b1") %>% html_text()
# Output that i want: text1


Here are my machine specs:

Operating system: Windows 10.

rvest version: 0.3.2

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)


Can anyone help?
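
In CSS, `.a1 .b1` (with a space) means "a `.b1` element somewhere inside an `.a1` element", which is why it does not match here. An element carrying both classes is written `.a1.b1` with no space, and "has `b1` but not `a1`" is `.b1:not(.a1)`, assuming the selector engine supports `:not()`. The Python 3 sketch below is not rvest code; it spells out the same two conditions as set tests on the class attribute, using the standard library and the document from the question. The class name `SpanCollector` is made up for the example.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Record (set of classes, text) for every <span> in the document."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self.open_classes = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_classes = set((dict(attrs).get("class") or "").split())
            self.text = []

    def handle_data(self, data):
        if self.open_classes is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_classes is not None:
            self.spans.append((self.open_classes, "".join(self.text).strip()))
            self.open_classes = None


DOC = ("<html><body>"
       "<span class='a1 b1'> text1 </span>"
       "<span class='b1'> text2 </span>"
       "</body></html>")

collector = SpanCollector()
collector.feed(DOC)

# Same condition as the selector ".a1.b1": both classes on one element.
both = [t for classes, t in collector.spans if {"a1", "b1"} <= classes]
# Same condition as ".b1:not(.a1)": b1 present, a1 absent.
only_b1 = [t for classes, t in collector.spans
           if "b1" in classes and "a1" not in classes]
print(both, only_b1)
```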

html r web-scraping scrape rvest

add*_*ted

2017 08-02

4
votes

1
answer

3404
views

Tag statistics

scrape ×10

python ×5

beautifulsoup ×4

html ×3

web-scraping ×3

.net ×1

dbpedia ×1

file ×1

html-agility-pack ×1

html-content-extraction ×1

java ×1

php ×1

python-2.7 ×1

r ×1

rvest ×1

screen-scraping ×1

sparql ×1

tor ×1
