Extracting Google search results

use*_*829 4 python regex subdomain extract

I'd like to periodically check which subdomains Google is listing for our domain.

To get a list of subdomains, type "site:example.com" into the Google search box - this lists all subdomain results (more than 20 pages for our domain).

What's the best way to extract just the URLs of the addresses returned by the "site:example.com" search?

I was thinking of writing a small Python script that performs the search above and regexes the URLs out of the search results (repeating across all result pages). Is this a good start? Is there a better approach?

Cheers.

dan*_*neu 16

Regular expressions are a bad idea for parsing HTML. They're cryptic to read, and they rely on the HTML being well-formed.

Try BeautifulSoup for Python. Here's an example script that returns the URLs from the first 10 result pages of a site:domain.com Google query.

import sys # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0,10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        for cite in soup.findAll('cite'):
            print cite.text

Output:

stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
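(The parsing step itself doesn't depend on the old urllib2 setup; on Python 3 the same idea can be sketched with the bs4 package. This is a minimal sketch, assuming `pip install beautifulsoup4`, and the sample HTML below is a hypothetical stand-in for one results page, not real Google markup.)

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one Google results page,
# where each result URL sits inside a <cite> tag.
html = """
<div><cite>stackoverflow.com/</cite></div>
<div><cite>meta.stackoverflow.com/</cite></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <cite> tag, i.e. the displayed URLs.
urls = [cite.get_text() for cite in soup.find_all("cite")]
print(urls)
```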

Of course, you could append each result to a list and then parse it for subdomains. I only got into Python and scraping a few days ago, but this should get you started.
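(For that last step, one way to go from the collected results to a deduplicated subdomain list is to drop the path portion of each URL and keep the first occurrence of each host. A sketch; the helper name and sample list are mine, not from the answer.)

```python
def subdomains(results):
    # Keep only the host part (everything before the first "/"),
    # deduplicating while preserving the order results were seen in.
    seen = []
    for result in results:
        host = result.split("/")[0]
        if host not in seen:
            seen.append(host)
    return seen

# Sample results like those printed by the script above.
results = [
    "stackoverflow.com/",
    "stackoverflow.com/questions",
    "meta.stackoverflow.com/",
    "blog.stackoverflow.com/",
]
print(subdomains(results))
```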