Kir*_*ril · 7 · java, url, performance, web-crawler
I'm trying to write a fast HTML scraper, and at this point I'm only focused on maximizing throughput without doing any parsing. I've cached the IP addresses of the URLs:
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.ArrayList;

public class Data {
    private static final ArrayList<String> sites = new ArrayList<String>();
    public static final ArrayList<URL> URL_LIST = new ArrayList<URL>();
    public static final ArrayList<InetAddress> ADDRESSES = new ArrayList<InetAddress>();

    static {
        /*
         * add all the URLs to the sites array list
         */
        // Resolve the DNS prior to testing the throughput
        for (int i = 0; i < sites.size(); i++) {
            try {
                URL tmp = new URL(sites.get(i));
                InetAddress address = InetAddress.getByName(tmp.getHost());
                ADDRESSES.add(address);
                // Rebuild the URL against the resolved IP so no lookup happens at fetch time
                URL_LIST.add(new URL("http", address.getHostAddress(), tmp.getPort(), tmp.getFile()));
                System.out.println(tmp.getHost() + ": " + address.getHostAddress());
            } catch (MalformedURLException e) {
                // skip malformed entries
            } catch (UnknownHostException e) {
                // skip hosts that fail to resolve
            }
        }
    }
}
My next step is to test throughput by fetching 100 URLs from the internet, reading the first 64 KB of each and moving on to the next one. I created a thread pool of FetchTaskConsumers and tried running it with different thread counts (16 to 64 on an i7 quad-core machine); here's what each consumer looks like:
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchTaskConsumer implements Runnable {
    // shared progress counter: starts at the total URL count and is decremented after every fetch
    private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());

    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTaskConsumer(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // read byte by byte until EOF or the first 64 KB
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                System.out.println(numBytes + " bytes for url index " + urlIndexes[i]
                        + "; remaining: " + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) { /* eat it */ }
                }
            }
        }
        latch.countDown();
    }
}
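The question doesn't show the harness that spins up the consumers; for illustration, here is a minimal sketch of what it might look like, assuming the URL indexes are split into contiguous chunks, one per consumer (the thread count, ExecutorService usage, and timing code are my assumptions, not the author's code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FetchBenchmark {
    public static void main(String[] args) throws InterruptedException {
        final int threadCount = 64;             // assumption: 16-64 threads as mentioned above
        final int total = Data.URL_LIST.size();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        CountDownLatch latch = new CountDownLatch(threadCount);

        long begin = System.currentTimeMillis();

        // split the URL indexes into roughly equal contiguous chunks, one per consumer
        int chunk = (total + threadCount - 1) / threadCount;
        for (int t = 0; t < threadCount; t++) {
            int start = t * chunk;
            int end = Math.min(start + chunk, total);
            int[] indexes = new int[Math.max(end - start, 0)];
            for (int j = 0; j < indexes.length; j++) {
                indexes[j] = start + j;
            }
            pool.submit(new FetchTaskConsumer(indexes, latch));
        }

        latch.await();                          // wait until every consumer has finished
        long elapsed = System.currentTimeMillis() - begin;
        System.out.println(total + " URLs in " + elapsed + " ms");
        pool.shutdown();
    }
}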
At best I can get through 100 URLs in about 30 seconds, but the literature suggests I should be able to get through 150 URLs per second (I originally wrote 300 here; see the update below). Note that I have access to Gigabit Ethernet, although I'm currently running the test on my 20 Mbit connection... in either case, the connection never really gets fully utilized.
I've also tried using Socket connections directly, but I must be doing something wrong, because that was even slower! Any suggestions on how to improve throughput?
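For reference, a raw-Socket fetch of the first 64 KB would look roughly like the sketch below. This is my own minimal illustration of the approach (the request format, buffer size, and lack of keep-alive are assumptions), not the code the question's author actually tried:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawSocketFetch {
    // Fetch up to the first 64 KB of a URL over a plain socket (HTTP only, no redirects).
    public static int fetchFirst64K(URL url) throws IOException {
        int port = url.getPort() == -1 ? 80 : url.getPort();
        try (Socket socket = new Socket(url.getHost(), port)) {
            socket.setSoTimeout(2000); // read timeout, same 2-second budget as the fetch tasks
            OutputStream out = socket.getOutputStream();
            String path = url.getFile().isEmpty() ? "/" : url.getFile();
            String request = "GET " + path + " HTTP/1.1\r\n"
                    + "Host: " + url.getHost() + "\r\n"
                    + "User-Agent: Mozilla/5.0\r\n"
                    + "Connection: close\r\n\r\n";
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[8192];
            int total = 0;
            int read;
            // read in chunks instead of byte by byte, until EOF or 64 KB
            while (total < 65536 && (read = in.read(buffer)) != -1) {
                total += read;
            }
            return total;
        }
    }
}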
P.S. I have a list of about 1 million popular URLs, so if 100 isn't enough for a benchmark, I can add more.
Update: the literature I'm referring to is the set of papers related to the Najork web crawler, in which Najork states:

Processed 891 million URLs over 17 days, i.e. 606 downloads per second, [on] 4 Compaq DS20E Alpha servers [with] 4 GB main memory[,] 650 GB of disk space[,] [and] 100 MBit/sec Ethernet. The ISP rate-limits bandwidth to 160 Mbit/sec.

So it's actually about 150 pages per second per machine (606 downloads/sec spread across 4 servers is roughly 150 each), not 300. My machine is a Core i7 with 4 GB of RAM, and I'm nowhere near that. I haven't seen anything stating specifically what they used.
Update: OK, time to tally up... the final results are in! It turns out that 100 URLs was a bit too low for a benchmark. I bumped it up to 1024 URLs and 64 threads, set a 2-second timeout on each fetch, and got up to 21 pages per second (in reality my connection is about 10.5 Mbps, and 21 pages/sec × 64 KB/page × 8 bits/byte works out to about 10.5 Mbit/s, so the connection is saturated). Here's what the fetcher looks like:
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchTask implements Runnable {
    // shared progress counter, decremented after every fetch
    private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());

    private final int timeoutMS = 2000;
    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTask(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setConnectTimeout(timeoutMS);
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // read until EOF or the first 64 KB
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                System.out.println(numBytes + "," + urlIndexes[i] + "," + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) { /* eat it */ }
                }
            }
        }
        latch.countDown();
    }
}
Are you sure about your sums?

300 URLs per second, each URL read at 64 KB, requires: 300 × 64 = 19,200 kilobytes per second.

Converting to bits: 19,200 kilobytes/sec = 8 × 19,200 = 153,600 kilobits per second.

Converting to megabits: 153,600 / 1024 = 150 Mbit/sec.

...but you only have a 20 Mbit/sec channel.

However, I imagine many of the URLs you're fetching are under 64 KB in size, so the throughput appears faster than your channel would otherwise allow. You're not slow, you're fast!
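As a quick sanity check of that arithmetic, a throwaway calculation along these lines (my own illustration, with the answer's figures hard-coded) reproduces the numbers:

public class BandwidthCheck {
    public static void main(String[] args) {
        int urlsPerSecond = 300;   // claimed fetch rate
        int kbPerUrl = 64;         // kilobytes read per URL

        int kilobytesPerSecond = urlsPerSecond * kbPerUrl;       // 19,200 KB/s
        int kilobitsPerSecond = kilobytesPerSecond * 8;          // 153,600 Kbit/s
        double megabitsPerSecond = kilobitsPerSecond / 1024.0;   // 150 Mbit/s

        System.out.println(kilobytesPerSecond + " KB/s = "
                + kilobitsPerSecond + " Kbit/s = "
                + megabitsPerSecond + " Mbit/s required");
    }
}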