Does Solr do web crawling?

mur*_*ali 16 solr web-crawler

I am interested in web crawling. I was looking at Solr.

Does Solr do web crawling, and if not, what are the steps to crawl the web?

Jon*_*Jon 20

Actually, Solr 5+ DOES do web crawling now! http://lucene.apache.org/solr/

Older Solr versions do not do web crawling on their own, because historically Solr is a search server that provides full-text search capabilities. It is built on top of Lucene.

If you need to crawl web pages for your Solr project, you have a number of options.

If you want to use the search facilities provided by Lucene or Solr, you will need to build an index from the web crawl results.

See this:

Lucene crawler (it needs to build a Lucene index)
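The indexing step described above boils down to turning each fetched page into a document with searchable fields. A minimal sketch of that transformation (the `toSolrDoc` helper and the `id`/`title`/`body` field names are illustrative assumptions, not a Solr or Lucene API; in practice you would POST the resulting JSON to Solr's `/update` handler, or hand the fields to a Lucene `IndexWriter`):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: turn a fetched page into a JSON document that could be sent to
// Solr's /update handler. Field names and the helper are illustrative only.
public class CrawlToSolrDoc {

    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String toSolrDoc(String url, String html) {
        Matcher m = TITLE.matcher(html);
        String title = m.find() ? m.group(1).trim() : "";
        // Crude tag stripping; a real pipeline would use Tika or jsoup,
        // and would also JSON-escape the field values.
        String body = html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        return "{\"id\":\"" + url + "\",\"title\":\"" + title + "\",\"body\":\"" + body + "\"}";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head><body><p>Some text.</p></body></html>";
        System.out.println(toSolrDoc("http://example.com/", html));
    }
}
```

Whatever crawler you choose, this is the shape of the hand-off: the crawler fetches and extracts, the search engine only indexes.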

  • Can you elaborate on «Solr 5+ DOES do web crawling now»? I don't see any crawling functionality anywhere in the documentation. (7 upvotes)

mjv*_*mjv 9

Solr itself does not have a web crawling feature.

Nutch is the "de facto" crawler for Solr (and then some).


B.M*_*.W. 5

Solr 5 started supporting simple web crawling (see the Java doc). If you want search, Solr is the tool; if you want to crawl, Nutch/Scrapy is a better fit :)

To get it up and running, you can take a detailed look here. However, here is how to get it up and running in one command:

java \
  -classpath <pathtosolr>/dist/solr-core-5.4.1.jar \
  -Dauto=yes \
  -Dc=gettingstarted \
  -Ddata=web \
  -Drecursive=3 \
  -Ddelay=0 \
  org.apache.solr.util.SimplePostTool \
  http://datafireball.com/

Where:

  • -Dc=gettingstarted : the collection to index into
  • -Ddata=web : web crawling and indexing mode
  • -Drecursive=3 : follow links 3 levels deep
  • -Ddelay=0 : seconds between requests; for the impatient only, use 10+ in production
  • org.apache.solr.util.SimplePostTool : the class that does the crawling and posting
  • http://datafireball.com/ : a test WordPress blog used as the seed URL

The crawler here is very "naive"; you can find all of its code in the Apache Solr GitHub repo.
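To illustrate what "naive" means, the level-by-level behavior in the log below ("Entering crawl at level N") amounts to a breadth-first traversal with a depth cap. A self-contained sketch of that idea, using an in-memory link map in place of real HTTP fetching (SimplePostTool's actual code additionally filters content types, honors a delay, and POSTs each page to Solr):

```java
import java.util.*;

// Offline sketch of a depth-limited, breadth-first crawl, the pattern that
// SimplePostTool's "Entering crawl at level N" log lines suggest.
// The in-memory "web" map stands in for real network fetching.
public class NaiveCrawl {

    static List<String> crawl(Map<String, List<String>> web, String seed, int maxDepth) {
        List<String> posted = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(seed));
        List<String> level = List.of(seed);
        for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : level) {
                posted.add(url);  // here the real tool would POST the page to Solr
                for (String link : web.getOrDefault(url, List.of())) {
                    if (seen.add(link)) next.add(link);  // dedupe already-seen URLs
                }
            }
            level = next;  // descend one level, SimplePostTool-style
        }
        return posted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(crawl(web, "a", 1)); // [a, b, c] : seed plus its direct links
    }
}
```

Note there is no politeness policy, no robots.txt handling, and no retry logic here, which is exactly why Nutch or Scrapy is recommended for serious crawling.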

Here is what the output looks like:

SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=3, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://datafireball.com (depth: 0)
Entering crawl at level 1 (52 links total, 51 new)
POSTed web resource http://datafireball.com/2015/06 (depth: 1)
...
Entering crawl at level 2 (266 links total, 215 new)
...
POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2)
...
Entering crawl at level 3 (846 links total, 656 new)
POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3)
SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302
423 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract...
Time spent: 0:05:55.059

In the end, you can see that all the data is indexed properly. [screenshot of the indexed results]
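If you want to double-check the "423 web pages indexed" figure, you can query the collection (e.g. `http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0`) and read `numFound` from the JSON response. A small sketch of parsing that field; the sample response below is hand-written for illustration, not captured from a live server:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull numFound out of a Solr select response to confirm
// how many documents ended up in the index.
public class NumFound {

    static long numFound(String solrJson) {
        Matcher m = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)").matcher(solrJson);
        if (!m.find()) throw new IllegalArgumentException("no numFound in response");
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Hand-written sample of a Solr select response body.
        String sample = "{\"responseHeader\":{\"status\":0},"
                + "\"response\":{\"numFound\":423,\"start\":0,\"docs\":[]}}";
        System.out.println(numFound(sample)); // 423
    }
}
```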