Vic*_*Vic 2 java parsing jsoup
import java.io.IOException;
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class listGrabber {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free").get();
            ArrayList<String> list = new ArrayList<>();
            // Visit every anchor whose class starts with "title" in the static HTML.
            // Iterating the selection avoids indexing past the last element.
            for (Element element : doc.select("a[class^=title]")) {
                String url = element.attr("abs:title").replaceAll("https://play.google.com/store/apps/category/GAME_ACTION/collection/", "");
                url = url.replaceAll("®|™", "");        // strip trademark symbols
                url = url.replaceAll("[(](.*)[)]", ""); // strip parenthesized text
                if (url.isEmpty()) { // compare string content, not references (was url != "")
                    break;
                }
                list.add(url);
                System.out.println(url);
            }
            // String divContents =
            // doc.select(".id-app-orig-desc").first().text();
            // elements.remove("div");
        } catch (IOException e) {
            e.printStackTrace(); // do not swallow connection errors silently
        }
    }
}
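As shown above, I am trying to fetch a list of words (the app titles) from https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free
Every time you scroll to the bottom of the page, the Google Play store page loads more elements.
My program grabs the first 40 elements that are displayed, but because jsoup does not load the rest of the page, which is loaded dynamically, I cannot grab anything beyond the first 40.
In addition, if you scroll down to game #300 on the page, a "Show more" button appears, and I also want to parse the elements that come after that button.
Is there a way to make Jsoup parse all of the elements that are loaded onto the page dynamically?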
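Edit: After some comments from the OP, I fully understand what he wants to achieve. I changed the original solution a little and tested it.
You can do it with jsoup. After the first page, fetching each subsequent page requires you to send a POST request with some request parameters. The parameters include (among others) the starting number and the number of records to fetch. If you send an illegal number (for example, you ask for the page containing game number 700 when the results contain only 600 games), you get the first page again. You can loop through the pages until you receive results you already have.
Sometimes the server returns 600 results and sometimes only 540; I could not figure out why.
The code is: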
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HelloWorld {
    public static void main(String[] args) {
        Connection.Response res = null;
        Document doc = null;
        boolean ok = true;
        int start = 0;
        String query;
        ArrayList<String> tempList = new ArrayList<>();
        ArrayList<String> games = new ArrayList<>();
        Pattern r = Pattern.compile("title=\"(.*)\" a");
        try { // first connection with GET request
            res = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free")
                    .method(Method.GET)
                    .execute();
            doc = res.parse();
        } catch (Exception ex) {
            // Without the first page and its cookies there is nothing to continue with
            ex.printStackTrace();
            return;
        }
        for (int i = 1; i <= 60; i++) { // parse the result and add it to the list
            query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
            tempList.add(doc.select(query).toString());
        }
        while (ok) { // loop until you get the same results again
            start += 60;
            System.out.println("now at number " + start);
            try { // send post request for each new page
                doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free?authuser=0")
                        .cookies(res.cookies())
                        .data("start", String.valueOf(start))
                        .data("num", "60")
                        .data("numChildren", "0")
                        .data("ipf", "1")
                        .data("xhr", "1")
                        .post();
            } catch (Exception ex) {
                // A failed request leaves no new page to parse, so stop paging
                ex.printStackTrace();
                break;
            }
            for (int i = 1; i <= 60; i++) { // parse the result and add it to the list
                query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
                String anchor = doc.select(query).toString();
                if (!tempList.contains(anchor)) {
                    tempList.add(anchor);
                } else { // we've seen these games before, time to quit
                    ok = false;
                    break;
                }
            }
        }
        for (int i = 0; i < tempList.size(); i++) { // remove all redundant info
            Matcher m = r.matcher(tempList.get(i));
            if (m.find()) {
                games.add(m.group(1));
                // Number by the games list so a non-matching entry cannot shift the index
                System.out.println(games.size() + " " + m.group(1));
            }
        }
    }
}
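The stopping rule used above (keep requesting pages until one of them repeats a title that was already collected) can be tried out without touching the network. What follows is a minimal sketch, not the answer's code: `PagingSketch`, `collectAll`, and `fakeServer` are hypothetical names, and `fakeServer` only simulates the behaviour described above, where an out-of-range start offset returns the first page again.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

public class PagingSketch {

    /**
     * Collects items page by page until a page repeats an item already seen.
     * fetchPage is a hypothetical stand-in for the POST request: it maps a
     * start offset to the list of items on that page.
     */
    public static List<String> collectAll(IntFunction<List<String>> fetchPage, int pageSize) {
        Set<String> seen = new LinkedHashSet<>(); // keeps insertion order, rejects duplicates
        int start = 0;
        while (true) {
            List<String> page = fetchPage.apply(start);
            if (page.isEmpty()) {
                break; // nothing returned at all
            }
            boolean repeated = false;
            for (String item : page) {
                if (!seen.add(item)) { // add() returns false for an item we already have
                    repeated = true;
                    break;
                }
            }
            if (repeated) {
                break; // the server wrapped around to results we already hold
            }
            start += pageSize;
        }
        return new ArrayList<>(seen);
    }

    /** Simulated server (assumption): an out-of-range start yields the first page again. */
    public static List<String> fakeServer(int start, int total, int pageSize) {
        int from = start >= total ? 0 : start;
        List<String> page = new ArrayList<>();
        for (int i = from; i < Math.min(from + pageSize, total); i++) {
            page.add("Game " + (i + 1));
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> games = collectAll(s -> fakeServer(s, 150, 60), 60);
        System.out.println(games.size()); // prints 150
    }
}
```

Using a set for the "already seen" check makes each lookup cheap compared with `ArrayList.contains`, which scans the whole list every time. In a real run, the lambda passed to `collectAll` would wrap the jsoup POST request and the title extraction.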
The code can be improved further (for example, by handling all the lists in separate methods), so that part is up to you.
I hope this helps.
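As one possible way to move the list handling into its own method, the title extraction could be isolated as in the sketch below. The class and method names (`TitleExtractor`, `extractTitle`, `extractTitles`) are hypothetical, and the pattern `title="([^"]*)"` is a swapped-in assumption: it stops at the first closing quote and does not rely on which attribute follows `title`, unlike the pattern in the code above. If the anchors are kept as jsoup `Element` objects instead of strings, calling `attr("title")` on each element would avoid the regular expression entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Captures the value of a title="..." attribute up to the first closing quote.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    /** Returns the title attribute value found in the anchor's HTML, or null if there is none. */
    public static String extractTitle(String anchorHtml) {
        Matcher m = TITLE.matcher(anchorHtml);
        return m.find() ? m.group(1) : null;
    }

    /** Converts raw anchor HTML strings into titles, skipping anchors without a title. */
    public static List<String> extractTitles(List<String> anchors) {
        List<String> titles = new ArrayList<>();
        for (String anchor : anchors) {
            String title = extractTitle(anchor);
            if (title != null) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample markup, for illustration only
        List<String> raw = Arrays.asList(
                "<a class=\"title\" href=\"/store/apps/details?id=a\" title=\"Game One\" aria-hidden=\"true\">1. Game One</a>",
                "<a class=\"title\" href=\"/store/apps/details?id=b\">no title attribute</a>");
        System.out.println(extractTitles(raw)); // prints [Game One]
    }
}
```

With a helper like this, the final loop in `main` shrinks to a single call, and the extraction can be tested on sample strings without a network connection.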