Hzm*_*zmy 5 javascript webkit scraper web-scraping phantomjs
Overview
I am trying to create a very basic scraper using PhantomJS and the pjscrape framework.
My Code
pjs.config({
    timeoutInterval: 6000,
    timeoutLimit: 10000,
    format: 'csv',
    csvFields: ['productTitle', 'price'],
    writer: 'file',
    outFile: 'D:\\prod_details.csv'
});
pjs.addSuite({
    title: 'ChainReactionCycles Scraper',
    url: productURLs, // This is an array of URLs; two example arrays are defined below
    scrapers: [
        function() {
            var results = [];
            var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
            var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
            // Keep only the first match for each selector: one CSV row per page
            results.push([linkTitle[0], linkPrice[0]]);
            return results;
        }
    ]
});
URL Arrays Used
This first array DOES NOT WORK: it fails after the 3rd or 4th URL.
var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for (var i = 0; i < productURLs.length; ++i) {
    productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}
This second array WORKS and does not fail, even though it comes from the same site.
var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for (var i = 0; i < categoriesURLs.length; ++i) {
    categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}
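The Problem
When iterating through productURLs, PhantomJS's optional page.open callback automatically assumes failure, even though the page has not finished loading.
I know this because I started the script while running an HTTP debugger, and the HTTP requests were still in progress after PhantomJS had already reported a page load failure.
However, the code works fine when run with categoriesURLs.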
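One way to see the same thing from inside PhantomJS, rather than from an external HTTP debugger, is to log the page-level events around the failure reported by page.open. The following is only a rough sketch: attachDiagnostics and its log format are hypothetical names made up for illustration, and it assumes a PhantomJS version that provides the onResourceError, onResourceTimeout and onLoadFinished page callbacks. How to reach the page object that pjscrape creates internally is not shown here.

```javascript
// Hypothetical helper (not part of PhantomJS or pjscrape): attaches logging
// handlers to a PhantomJS WebPage-like object and returns that object.
function attachDiagnostics(page, log) {
    // Called when a resource, including the main document, fails to load.
    page.onResourceError = function (resourceError) {
        log('resource error: ' + resourceError.url +
            ' (' + resourceError.errorCode + ': ' + resourceError.errorString + ')');
    };
    // Called when a request times out (only relevant if a resource timeout is set).
    page.onResourceTimeout = function (request) {
        log('resource timeout: ' + request.url);
    };
    // Called with 'success' or 'fail' when the page load is considered finished.
    page.onLoadFinished = function (status) {
        log('load finished: ' + status);
    };
    return page;
}

// Usage in a standalone PhantomJS script (assumed setup, not executed here):
//   var page = attachDiagnostics(require('webpage').create(),
//                                function (msg) { console.log(msg); });
//   page.open(productURLs[0], function (status) {
//       console.log('page.open status: ' + status);
//       phantom.exit();
//   });
```

If a resource error for the main document is logged just before the 'fail' status, its error code and message may help tell a cancelled or redirected request apart from a timeout.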
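Assumptions
Possible Solutions
These are the solutions I have tried so far.
Setting page.options.loadImages = false.
Changing timeoutInterval in pjs.config. This was not useful, apparently because the error being produced is a page.open failure and NOT a timeout failure.
Any ideas?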