Scraping Google search results using JSoup

Question

I am trying to scrape Google's search results using JSoup. This is my code so far.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleOptimization {
    public static void main(String[] args) {
        Document doc;
        try {
            // timeout(0) tells JSoup to wait indefinitely for the response
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("what should i put here?");
            for (Element link : links) {
                System.out.println("\n" + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

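I am only trying to get the title of each search result and the snippet below the title. The problem is that I don't know which elements to select in order to scrape them. If anyone has a better way to scrape Google with Java, I would love to hear it.

Thanks.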

Answer 1

Col*_*lin 11

Here you go.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScanWebSO {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();

            // Each search result is an <li class="g"> element
            Elements links = doc.select("li[class=g]");
            for (Element link : links) {
                // The title is in the <h3 class="r"> inside the result
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();

                // The snippet is in the <span class="st"> inside the result
                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();

                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

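Also, to work this out yourself, I suggest using Chrome. Right-click the thing you want to scrape and choose Inspect Element. It takes you to the exact place in the HTML where that element lives. In this case, you first want to find the root of the whole list of results. Once you have found it, you want to identify the element, preferably by a unique attribute you can search on. In this case the root element is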

<ol eid="" id="rso">

Beneath it you will see a bunch of list items that start with

<li class="g">

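These are what you want to put into your initial Elements collection. Then, for each element, you need to find where the title and the body are. In this case, I found that the title was in the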

<h3 class="r" style="white-space: normal;">

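element. So you search for that element within each list item. The same goes for the body: I found that it was in a <span class="st"> element, so I selected that and called the .text() method, which returns all the text under that element. The key is to always try to find an element with a unique attribute (a class name is ideal). If you don't, and just search for something like "div", the selector matches every div on the whole page and returns all of them, so you get far more results than you want. I hope this explains it well enough. If you have any other questions, let me know.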
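The difference between a scoped query and a broad one can be shown offline. This is only a rough sketch under a few assumptions: the markup in SAMPLE_HTML is made up to mimic the structure described above, the class and method names are hypothetical, and it uses the JDK's built-in XML parser and XPath as a stand-in for JSoup so that it runs without a network call or extra libraries. The XPath expressions correspond to the JSoup selectors `li[class=g]`, `h3[class=r]`, and `span[class=st]` from the answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not part of the original answer.
class ScopedSelectSketch {

    // Made-up, well-formed markup that mimics the structure described above.
    static final String SAMPLE_HTML =
          "<div id=\"page\">"
        + "<div class=\"header\">Navigation</div>"
        + "<ol id=\"rso\">"
        + "<li class=\"g\"><h3 class=\"r\">First result</h3>"
        + "<div><span class=\"st\">First snippet</span></div></li>"
        + "<li class=\"g\"><h3 class=\"r\">Second result</h3>"
        + "<div><span class=\"st\">Second snippet</span></div></li>"
        + "</ol>"
        + "</div>";

    // Parses the markup into a DOM tree with the JDK's XML parser.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Markup could not be parsed", e);
        }
    }

    // Step 1: select every result container.
    // Step 2: look up the title and body inside each container only.
    static List<String[]> extractResults(String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//li[@class='g']", parse(markup), XPathConstants.NODESET);
            List<String[]> results = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                // The leading "." keeps the search inside the current <li>.
                String title = xpath.evaluate(".//h3[@class='r']", item);
                String body = xpath.evaluate(".//span[@class='st']", item);
                results.add(new String[] {title, body});
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("Query failed", e);
        }
    }

    // Counts how many nodes an expression matches across the whole document.
    static int countMatches(String markup, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, parse(markup), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Query failed", e);
        }
    }

    public static void main(String[] args) {
        for (String[] result : extractResults(SAMPLE_HTML)) {
            System.out.println("Title: " + result[0]);
            System.out.println("Body: " + result[1] + "\n");
        }
        // A broad query also matches the header and wrapper divs.
        System.out.println("Broad //div matches: " + countMatches(SAMPLE_HTML, "//div"));
    }
}
```

On this sample, the scoped query returns exactly the two title and snippet pairs, while the broad `//div` query matches four elements, none of which is a search result by itself.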
