Nutch 1.13 crawl script not working

Question

I installed Nutch 1.10, configured it, and used the crawl script successfully. After upgrading to Nutch 1.13, I can't get the crawl script to work.

This normally works with v1.10:

bin/crawl -i -D elastic.server.url=http://localhost:9300/search-index/ urls/ searchcrawl/  2

However, when I run the same command with v1.13, I get the usage message:

Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
-i|--index  Indexes crawl results into a configured indexer
-D      A Java property to pass to Nutch calls
-w|--wait   NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
        are scheduled for fetching. Suffix can be: s for second,
        m for minute, h for hour and d for day. If no suffix is
        specified second is used by default.
-s Seed Dir Path to seeds file(s)
Crawl Dir   Directory where the crawl/link/segments dirs are saved
Num Rounds  The number of rounds to run this crawl for

I don't see anything different in the documentation. Am I missing something? How do I get the crawl script to work with v1.13?

Answer 1

use*_*823 5

Found the answer after a better search.

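It seems that as of 1.13 the bin/crawl script expects the path to the seeds to be preceded by -s, which matches the `[-s <Seed Dir>]` entry in the usage message printed by v1.13.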

This works: bin/crawl -i -D elastic.server.url=http://localhost:9300/search-index/ -s urls/ searchcrawl/ 2
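
To make the argument order explicit, here is a minimal sketch of a shell helper that prints a crawl command in the form the v1.13 usage message describes. The function name `crawl_cmd` is hypothetical and not part of Nutch; it only assembles the command string and does not run a crawl. A `-D key=value` option such as the `elastic.server.url` property from the question would go before `-s`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints a bin/crawl command in the
# argument order shown by the v1.13 usage message.
crawl_cmd() {
    seed_dir=$1    # directory holding the seed file(s), formerly the first positional argument
    crawl_dir=$2   # directory where the crawl/link/segments dirs are saved
    rounds=$3      # number of rounds to run
    # The seed directory is now passed with -s; <Crawl Dir> and <Num Rounds>
    # remain positional and come last.
    printf 'bin/crawl -i -s %s %s %s\n' "$seed_dir" "$crawl_dir" "$rounds"
}

crawl_cmd urls/ searchcrawl/ 2
```

With the old v1.10 ordering, `urls/` is no longer recognized as the seed directory, so the positional argument count is wrong and the script prints its usage message instead of starting a crawl.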
