Parsing robots.txt in Python

Rit*_*ari 2 python robots.txt

I want to parse a robots.txt file in Python. I have already looked at robotsParser and robotsExclusionParser, but nothing really meets my criteria. I want to fetch all the disallowed and allowed URLs in one shot rather than manually checking whether each URL is allowed. Is there a library that can do this?

J. *_*Doe 9

Why do you have to check the URLs manually? With Python 3 you can use urllib.robotparser and do something like this:

import urllib.robotparser as urobot
import urllib.request
import urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup


url = "https://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")

    my_list = soup.find_all("a", href=True)
    for link in my_list:
        href = link["href"]
        # rather than skipping "#" here, you can filter your list before looping over it
        if href == "#":
            continue
        newurl = urljoin(site.geturl(), href)
        try:
            if rp.can_fetch("*", newurl):
                page = urllib.request.urlopen(newurl)
                # do what you want on each authorized webpage
        except (urllib.error.URLError, ValueError):
            pass
else:
    print("cannot scrape")
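If you really do want every Allow/Disallow rule in one shot rather than testing individual URLs, urllib.robotparser does not expose the rule list as documented API, so one option is to fetch the robots.txt file and parse the rules yourself. Below is a minimal sketch under a few assumptions: the file lives at https://example.com/robots.txt (a hypothetical URL), you only care about the group addressed to User-agent: *, and a simplified reading of the grouping rules is good enough for your purposes.

import urllib.request

# assumed location of the robots.txt you want to inspect
robots_url = "https://example.com/robots.txt"

allowed, disallowed = [], []
group_agents = []        # User-agent values of the group currently being read
reading_rules = False    # becomes True once Allow/Disallow lines start in a group

with urllib.request.urlopen(robots_url) as resp:
    lines = resp.read().decode("utf-8", errors="ignore").splitlines()

for raw in lines:
    line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
    if not line or ":" not in line:
        continue
    field, value = (part.strip() for part in line.split(":", 1))
    field = field.lower()
    if field == "user-agent":
        if reading_rules:                 # a User-agent line after rules starts a new group
            group_agents, reading_rules = [], False
        group_agents.append(value)
    elif field in ("allow", "disallow"):
        reading_rules = True
        if "*" in group_agents and value: # keep only non-empty rules for User-agent: *
            (allowed if field == "allow" else disallowed).append(value)

print("Allowed:", allowed)
print("Disallowed:", disallowed)

Note that these are path patterns (possibly with wildcards), not full URLs, so you would still join them with the site's base URL if you want absolute addresses.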