Bob*_*ith -2 java asp.net url web-scraping
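I'm trying to extract data from a site about routes between different airports. The user picks two airports, and the program then shows them all the different routes on a given day. However, once you search for a route on the site, the URL changes to the same .asp page no matter which route you are viewing. Is there a way to scrape the data for a specific route without knowing its URL, or is it possible to get the real URL?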
Tar*_*ani 10
I suggest using JSoup. To do that, add the following to your pom.xml:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.2</version>
</dependency>
Then make a first request to obtain the cookies:
Connection.Response initialPage = Jsoup.connect("https://www.flightview.com/flighttracker/")
.headers(headers)
.method(Connection.Method.GET)
.userAgent(userAgent)
.execute();
Map<String, String> initialCookies = initialPage.cookies();
Then use those cookies to send the next request:
Connection.Response flights = Jsoup.connect("https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp")
.userAgent(userAgent)
.headers(headers)
.data(postData)
.cookies(initialCookies)
.method(Connection.Method.POST)
.execute();
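The postData and headers in this case are as follows. Both maps must be populated before the requests above are executed. userAgent, which these snippets do not define, is a String holding the User-Agent value to send.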
HashMap<String, String> postData = new HashMap<String, String>();
HashMap<String, String> headers = new HashMap<String, String>();
headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
headers.put("Accept-Encoding", "gzip, deflate, br");
headers.put("Accept-Language", "en-US,en;q=0.9");
headers.put("Cache-Control", "no-cache");
headers.put("DNT", "1");
headers.put("Pragma", "no-cache");
headers.put("Upgrade-Insecure-Requests", "1");
postData.put("qtype", "cpi");
postData.put("sfw", "/FV/FlightTracker/Main");
postData.put("namdep", "DFW Dallas, TX (Dallas/Ft Worth) - Dallas Fort Worth International");
postData.put("depap", "DFW");
postData.put("namarr", "JFK New York, NY (Kennedy) - John F Kennedy International");
postData.put("arrap", "JFK");
postData.put("namal2", "Enter name or code");
postData.put("al", "");
postData.put("whenArrDep", "dep");
postData.put("whenHour", "all");
postData.put("whenDate", "20180321");
postData.put("input", "Track Flight");
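Since the question is about letting the user pick the airports and the day, the fields that vary per search can be filled in from parameters instead of being hard-coded. The following is a minimal sketch using only the standard library. The class name `FlightQueryForm` and the helper `routeFields` are hypothetical, and the `yyyyMMdd` format for `whenDate` is an assumption inferred from the sample value `20180321` above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of JSoup or of the site's API.
class FlightQueryForm {

    // Assumption: the site expects whenDate as yyyyMMdd, like the sample "20180321".
    static String formatWhenDate(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Returns only the form fields that change from one search to the next.
    static Map<String, String> routeFields(String depCode, String arrCode, LocalDate date) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("depap", depCode);
        fields.put("arrap", arrCode);
        fields.put("whenDate", formatWhenDate(date));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> postData = new LinkedHashMap<>();
        // Overlay the per-search values on top of the fixed fields shown above.
        postData.putAll(routeFields("DFW", "JFK", LocalDate.of(2018, 3, 21)));
        System.out.println(postData);
    }
}
```

The display-name fields (`namdep`, `namarr`) in the map above would presumably need to change along with the airport codes. Whether the server actually checks them is not something this sketch can confirm.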
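Once you have the response, you can parse it and print whatever you need:
String page = flights.body();
System.out.println(page); // dump the raw HTML for inspection
Document doc = Jsoup.parse(page);
// Each flight is a table row with one of these two alternating classes.
Elements elems = doc.select("tr.FlightTrackerListRowOdd, tr.FlightTrackerListRowEven");
for (Element element : elems) {
    Elements childElems = element.select("td");
    if (childElems.size() < 2) continue; // skip rows without both cells
    String text1 = childElems.get(0).text(); // airline name
    String text2 = childElems.get(1).text(); // flight number
    System.out.println(text1 + " " + text2);
}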
The output is:
Aeroflot Airlines 3453
Aeroflot Airlines 3455
AeroMexico 4966
AeroMexico 4935
Air France 2535
Alitalia 3403
American Airlines 1294
British Airways 1880
China Eastern Airlines 8804
Delta Air Lines 3869
Delta Air Lines 3789
Etihad Airways 3040
Finnair 5726
Gulf Air 4139
Iberia Airlines 4043
Jet Airways 7692
KLM Royal Dutch Airlines 6597
KLM Royal Dutch Airlines 8117
Korean Air 7326
Malaysia Airlines 9442
Qatar Airways 5107
TAM Brazilian Airlines 8379
Virgin Atlantic 4620
Virgin Atlantic 3471
The rest you can adapt to your needs. This is just an example showing how to go about it.
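Open the developer tools in your browser, enter the departure and destination in the search box, and submit the form.
If you then inspect the requests the browser sent to the server, you will see that the form data you just submitted went out as a POST request to https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp
If you want to scrape this data, you can use the Python requests module to send a POST request to that URL.
Note: since you are using Java, you can still send a simple POST request. See here for how to send a POST request.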
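For a Java version of that POST without any third-party library, a rough sketch using `java.net.http.HttpClient` (Java 11+) might look like the following. The class name `FormPost` and its helpers are hypothetical. The form fields are a subset of those seen in the browser's request, and it is an assumption that the site accepts this request as-is. It may require the cookies and headers from a prior GET, as in the JSoup answer.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class illustrating a form-encoded POST.
class FormPost {

    // Encodes the fields as application/x-www-form-urlencoded (key=value pairs joined by '&').
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Field names copied from the form data visible in the developer tools.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("qtype", "cpi");
        fields.put("depap", "DFW");
        fields.put("arrap", "JFK");
        fields.put("whenArrDep", "dep");
        fields.put("whenHour", "all");
        fields.put("whenDate", "20180321");

        // A CookieManager lets the client keep any cookies the server sets.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
        HttpRequest request = buildRequest(
                "https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp", fields);
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

The response body is plain HTML, so it still has to be parsed, for example with JSoup as in the other answer.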