Recursively scrape all sublinks of a website in Python with Beautiful Soup

Pol*_*Dot 3 python recursion for-loop beautifulsoup web-scraping

Latest update: I have simplified my question to how to recursively get all the links from a site, including the sublinks of each page, and so on.

I think I know how to get all the sublinks of a single page:

from bs4 import BeautifulSoup
import requests
import re

def get_links(site, filename):
    f=open(filename, 'w')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        f.write(str(links.get('href'))+"\n")
    f.close()

r="https://en.wikipedia.org/wiki/Main_Page"
filename="wiki"
get_links(r,filename)

How can I recursively make sure that all of the links on the site are also collected and written to the same file?

So I tried the following, and it doesn't even run.

def is_url(link):
    #checks using regex if 'link' is a valid url
    url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\\,() ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
    return (" ".join(url)==link)

def get_links(site, filename):
    f=open(filename, 'a')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        if is_url(links):
            f.write(str(links.get('href'))+"\n")
            get_links(links, filename)
    f.close()

bla*_*bla 5

To answer your question, this is how I would get all the links of a page using BeautifulSoup and save them to a file:

from bs4 import BeautifulSoup
import requests


def get_links(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')

    links = []
    for link in soup.find_all('a'):
        link_url = link.get('href')

        # keep only absolute http(s) links; skip anchors, relative paths and <a> tags with no href
        if link_url is not None and link_url.startswith('http'):
            links.append(link_url + '\n')

    write_to_file(links)
    return links


def write_to_file(links):
    # open in append mode so links from every page end up in the same file
    with open('data.txt', 'a') as f:
        f.writelines(links)


def get_all_links(url):
    # recurse into every link found on the current page
    for link in get_links(url):
        get_all_links(link)


r = 'https://en.wikipedia.org/wiki/Main_Page'
write_to_file([r])
get_all_links(r)

However, this does nothing to prevent loops, which will lead to infinite recursion. To avoid that, you can use a set to store the links that have already been visited and skip them from then on, as sketched below.
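A minimal sketch of that idea, reusing the same requests/BeautifulSoup approach as above (the crawl name, the timeout value and the data.txt filename are my own placeholders, not part of the original answer):

from bs4 import BeautifulSoup
import requests


def crawl(url, visited=None):
    # remember every url we have already requested, so each page is fetched at most once
    if visited is None:
        visited = set()
    if url in visited:
        return
    visited.add(url)

    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return  # skip pages that fail to load

    soup = BeautifulSoup(response.text, 'lxml')
    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is not None and href.startswith('http'):
            links.append(href)

    with open('data.txt', 'a') as f:
        f.writelines(link + '\n' for link in links)

    # recurse only into links we have not visited yet
    for href in links:
        if href not in visited:
            crawl(href, visited)


crawl('https://en.wikipedia.org/wiki/Main_Page')

Keep in mind that on a site the size of Wikipedia this will still run for a very long time and can hit Python's recursion limit, so in practice an explicit queue (for example collections.deque) combined with the same visited set is usually a better fit than recursion.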

That said, you should really consider using something like Scrapy for this kind of task. I think a CrawlSpider is what you want to look at.

To extract urls from the wikipedia.org domain, you could do something like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field


class UrlItem(Item):
    url = Field()


class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page/']

    # follow every link the LinkExtractor finds and pass the response to parse_url
    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url

        return item

And run it with:

scrapy crawl wiki -o wiki.csv -t csv

And you will get the urls in csv format in the wiki.csv file.