ini*_*nix 15 html python web-crawler scrapy web-scraping
For example:
scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content
Then I get the following raw HTML:
<div id="content">
<h2>Welcome to Scrapy</h2>
<h3>What is Scrapy?</h3>
<p>Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.</p>
<h3>Features</h3>
<dl>
<dt>Simple</dt>
<dt>
</dt>
<dd>Scrapy was designed with simplicity in mind, by providing the features
you need without getting in your way
</dd>
<dt>Productive</dt>
<dd>Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you
</dd>
<dt>Fast</dt>
<dd>Scrapy is used in production crawlers to completely scrape more than
500 retailer sites daily, all in one server
</dd>
<dt>Extensible</dt>
<dd>Scrapy was designed with extensibility in mind and so it provides
several mechanisms to plug new code without having to touch the framework
core
</dd>
<dt>Portable, open-source, 100% Python</dt>
<dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>
<dt>Batteries included</dt>
<dd>Scrapy comes with lots of functionality built in. Check <a
href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
section</a> of the documentation for a list of them.
</dd>
<dt>Well-documented & well-tested</dt>
<dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
with <a href="http://static.scrapy.org/coverage-report/">very good code
coverage</a></dd>
<dt><a href="/community">Healthy community</a></dt>
<dd>
1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
200 messages per month on mailing list (<a
href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
</dd>
<dt><a href="/support">Commercial support</a></dt>
<dd>A few companies provide Scrapy consulting and support</dd>
<p>Still not sure if Scrapy is what you're looking for?. Check out <a
href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
glance</a>.
</p>
<h3>Companies using Scrapy</h3>
<p>Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>
<h3>Where to start?</h3>
<p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
then <a href="/download/">download Scrapy</a> and follow the <a
href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.
</p></dl>
</div>
But what I want is the plain text, straight from Scrapy:
Welcome to Scrapy
What is Scrapy?
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Features
- Simple
- Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way
- Productive
- Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
- Fast
- Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all in one server
- Extensible
- Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
- Portable, open-source, 100% Python
- Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD
- Batteries included
- Scrapy comes with lots of functionality built in. Check this section of the documentation for a list of them.
- Well-documented & well-tested
- Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
- Healthy community
- 1,500 watchers, 350 forks on Github (link)
  700 followers on Twitter (link)
  850 questions on StackOverflow (link)
  200 messages per month on mailing list (link)
  40-50 users always connected to IRC channel (link)
- Commercial support
- A few companies provide Scrapy consulting and support
Still not sure if Scrapy is what you're looking for? Check out Scrapy at a glance.
Companies using Scrapy
Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.
Where to start?
Start by reading Scrapy at a glance, then download Scrapy and follow the Tutorial.
I don't want to use XPath selectors to extract the p, h2, h3, etc. tags one by one, because I am crawling a site whose main content is embedded in tables and tbody elements, nested recursively, and finding all those XPaths would be tedious work. Can this be done with a built-in function in Scrapy, or do I need an external tool for the conversion? I have read through all of Scrapy's documentation and found nothing. Here is a sample site that can convert raw HTML to plain text: http://beaker.mailchimp.com/html-to-text
ale*_*cxe 22
Scrapy does not have such functionality built in. html2text is what you are looking for.

Here is a sample spider that scrapes Wikipedia's Python page, gets the first paragraph with XPath, and converts the HTML to plain text using html2text:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample))  # Python 3 print syntax
This prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
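Note that HtmlXPathSelector and BaseSpider come from older Scrapy releases and have since been deprecated. Here is a minimal sketch of the same idea against the modern API, assuming Scrapy 1.8 or later (where response.xpath(...).get() is available):

import html2text
import scrapy

class WikiSpider(scrapy.Spider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # .get() returns the first match as a raw HTML string, or None
        sample = response.xpath("//div[@id='mw-content-text']/p[1]").get()
        if sample is not None:  # guard in case the XPath matches nothing
            converter = html2text.HTML2Text()
            converter.ignore_links = True  # drop markdown-style link targets
            yield {"text": converter.handle(sample)}

Yielding a dict instead of printing lets Scrapy's feed exports write the converted text out to JSON or CSV.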
pau*_*rth 14
Another solution uses lxml.html's tostring() with the argument method="text". lxml is used internally by Scrapy. (Passing encoding="unicode" is usually what you want, so that you get text rather than bytes.)

See http://lxml.de/api/lxml.html-module.html for details.
from scrapy.spider import BaseSpider
import lxml.etree
import lxml.html

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # optionally remove tags that are not usually rendered in browsers
        # (javascript, HTML/HEAD, comments); add the tag names you don't want at the end
        lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head")

        # complete text
        print(lxml.html.tostring(root, method="text", encoding="unicode"))

        # or, as in alecxe's example spider,
        # pinpoint a part of the document using XPath
        #for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
        #    print(lxml.html.tostring(p, method="text", encoding="unicode"))
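When you already hold the element you care about, lxml also offers text_content(), which concatenates all text under an element without going through tostring(). A small self-contained sketch (the HTML snippet here is made up for illustration):

import lxml.html

# hypothetical markup, standing in for a scraped page fragment
snippet = "<div id='content'><h2>Welcome to Scrapy</h2><p>Scrapy is a fast framework.</p></div>"
root = lxml.html.fromstring(snippet)

# text_content() returns the text of the element and all of its descendants
print(root.text_content())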