cli*_*ray 4 wikipedia wikipedia-api web-scraping
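I'm trying to find or build a web scraper that can go through and find every state/national park in the US, along with its GPS coordinates and land area. I've looked into some frameworks like Scrapy, and then I saw that there are some sites built specifically around Wikipedia, such as http://wiki.dbpedia.org/About. Is there any specific advantage to either of these, or would one of them work better for loading the information into an online database?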
Osc*_*ros 11
Let's assume you want to parse pages like this Wikipedia page. The following code should work.
using System;
using System.Linq;
using System.Web;        // HttpUtility
using HtmlAgilityPack;   // HtmlDocument

var doc = new HtmlDocument();
// Load the document here. See doc.Load(..), doc.LoadHtml(..), etc.

// We get all the rows from the table (except the header)
var rows = doc.DocumentNode.SelectNodes("//table[contains(@class, 'sortable')]//tr").Skip(1);
foreach (var row in rows) {
    // Park name: the linked title in the first cell
    var name = HttpUtility.HtmlDecode(row.SelectSingleNode("./*[1]/a[@href and @title]").InnerText);
    // Coordinates in decimal degrees, e.g. "44.35°N 68.21°W"
    var loc = HttpUtility.HtmlDecode(row.SelectSingleNode(".//span[@class='geo-dec']").InnerText);
    // Area: the text of the fifth cell, skipping its first child node
    var areaNodes = row.SelectSingleNode("./*[5]").ChildNodes.Skip(1);
    string area = "";
    foreach (var a in areaNodes) {
        area += HttpUtility.HtmlDecode(a.InnerText);
    }
    Console.WriteLine("{0,-30} {1,-20} {2,-10}", name, loc, area);
}
I tested it, and it produces the following output:
Acadia                         44.35°N 68.21°W      47,389.67 acres (191.8 km2)
American Samoa                 14.25°S 170.68°W     9,000.00 acres (36.4 km2)
Arches                         38.68°N 109.57°W     76,518.98 acres (309.7 km2)
Badlands                       43.75°N 102.50°W     242,755.94 acres (982.4 km2)
Big Bend                       29.25°N 103.25°W     801,163.21 acres (3,242.2 km2)
Biscayne                       25.65°N 80.08°W      172,924.07 acres (699.8 km2)
Black Canyon of the Gunnison   38.57°N 107.72°W     32,950.03 acres (133.3 km2)
Bryce Canyon                   37.57°N 112.18°W     35,835.08 acres (145.0 km2)
Canyonlands                    38.2°N 109.93°W      337,597.83 acres (1,366.2 km2)
Capitol Reef                   38.20°N 111.17°W     241,904.26 acres (979.0 km2)
Carlsbad Caverns               32.17°N 104.44°W     46,766.45 acres (189.3 km2)
Channel Islands                34.01°N 119.42°W     249,561.00 acres (1,009.9 km2)
Congaree                       33.78°N 80.78°W      26,545.86 acres (107.4 km2)
Crater Lake                    42.94°N 122.1°W      183,224.05 acres (741.5 km2)
Cuyahoga Valley                41.24°N 81.55°W      32,860.73 acres (133.0 km2)
Death Valley                   36.24°N 116.82°W     3,372,401.96 acres (13,647.6 km2)
Denali                         63.33°N 150.50°W     4,740,911.72 acres (19,185.8 km2)
Dry Tortugas                   24.63°N 82.87°W      64,701.22 acres (261.8 km2)
Everglades                     25.32°N 80.93°W      1,508,537.90 acres (6,104.8 km2)
Gates of the Arctic            67.78°N 153.30°W     7,523,897.74 acres (30,448.1 km2)
Glacier                        48.80°N 114.00°W     1,013,572.41 acres (4,101.8 km2)
(...)
I think that's a start. If some page fails, you'll have to check whether its layout has changed, etc.
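One way to notice a layout change early is to parse the `loc` and `area` strings into numbers and treat a failed parse as a warning sign. Below is a minimal sketch, assuming the strings always look like `44.35°N 68.21°W` and `47,389.67 acres (191.8 km2)`, as in the output above. `ParkFields`, `TryParseLocation`, and `TryParseAcres` are hypothetical names of my own, not part of HtmlAgilityPack.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: turns the scraped strings into numbers.
static class ParkFields
{
    // Assumed format: "<number>°<N|S> <number>°<E|W>", e.g. "44.35°N 68.21°W".
    private static readonly Regex LocPattern = new Regex(
        @"^\s*(\d+(?:\.\d+)?)\s*°\s*([NS])\s+(\d+(?:\.\d+)?)\s*°\s*([EW])\s*$");

    // Assumed format: "<number with commas> acres ...", e.g. "47,389.67 acres (191.8 km2)".
    private static readonly Regex AcresPattern = new Regex(
        @"^\s*(\d[\d,]*(?:\.\d+)?)\s*acres");

    // Returns signed decimal degrees: south and west are negative.
    public static bool TryParseLocation(string text, out double lat, out double lon)
    {
        lat = 0;
        lon = 0;
        var m = LocPattern.Match(text ?? "");
        if (!m.Success) return false;

        lat = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        lon = double.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture);
        if (m.Groups[2].Value == "S") lat = -lat;
        if (m.Groups[4].Value == "W") lon = -lon;
        return true;
    }

    // Returns the area in acres, ignoring the km2 figure in parentheses.
    public static bool TryParseAcres(string text, out double acres)
    {
        acres = 0;
        var m = AcresPattern.Match(text ?? "");
        if (!m.Success) return false;

        // Drop the thousands separators before parsing.
        var digits = m.Groups[1].Value.Replace(",", "");
        acres = double.Parse(digits, CultureInfo.InvariantCulture);
        return true;
    }
}
```

Inside the `foreach` loop you could call `ParkFields.TryParseLocation(loc, out lat, out lon)` and log the row whenever it returns `false`, instead of silently storing a string that no longer matches the expected shape.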
Of course, you'll also have to find a way to get all the links you want to parse.
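As a rough illustration of that step, the sketch below collects the article links from the first column of a sortable list table, reusing the same XPath idea as the code above. It rests on assumptions: that the pages you want are linked from such a table, and that article links start with `/wiki/`. `ParkLinks.Extract` is a hypothetical helper name; the HtmlAgilityPack calls it relies on are `LoadHtml`, `SelectNodes`, `GetAttributeValue`, and `HtmlEntity.DeEntitize`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: gathers candidate article links from a list page.
static class ParkLinks
{
    public static List<Uri> Extract(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First cell of every row in a sortable table, linked titles only.
        var anchors = doc.DocumentNode.SelectNodes(
            "//table[contains(@class, 'sortable')]//tr/*[1]/a[@href and @title]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null) return new List<Uri>();

        return anchors
            .Select(a => HtmlEntity.DeEntitize(a.GetAttributeValue("href", "")))
            // Assumption: article links are site-relative and start with /wiki/.
            .Where(href => href.StartsWith("/wiki/", StringComparison.Ordinal))
            // Resolve against the page URL to get an absolute link.
            .Select(href => new Uri(pageUri, href))
            .Distinct()
            .ToList();
    }
}
```

An empty result is worth treating the same way as a failed row: it usually means the table class or cell order on that page differs from what the XPath expects.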
One important thing: do you know whether scraping Wikipedia is allowed? I don't know, but you should check before doing it... ;)
小智 5
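Although this question is a bit old, another approach that is available now is to avoid scraping altogether and get the raw data directly from protectedplanet.net. It contains data from the World Database on Protected Areas and the UN List of Protected Areas. (Disclosure: I have worked for UNEP-WCMC, the organization that produces and maintains the database and the website.)
It is free for non-commercial use, but you need to register to download. For example, this page lets you download the 22,600 protected areas in the USA as KMZ, CSV, and SHP (containing latitude, longitude, boundaries, IUCN category, and a lot of other metadata).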