Latest update: I've boiled my question down to how to recursively get all the links from a site, including each page's sub-links, and so on.
I think I know how to get all the sub-links of a single page:
from bs4 import BeautifulSoup
import requests
import re

def get_links(site, filename):
    f = open(filename, 'w')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        f.write(str(links.get('href')) + "\n")
    f.close()

r = "https://en.wikipedia.org/wiki/Main_Page"
filename = "wiki"
get_links(r, filename)
How can I recursively make sure that every link on the site is also collected and written to the same file?
So I tried this, and it doesn't even compile.
def is_url(link):
    # checks using regex if 'link' is a valid url
    url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\\,() ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', link)
    return (" ".join(url) == link)

def get_links(site, filename):
    f = open(filename, 'a')
    url = requests.get(site)
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    for links in soup.find_all('a'):
        if is_url(links):
            f.write(str(links.get('href')) + "\n")
            get_links(links, filename)
    f.close()
To answer your question, here is how I would get all the links on a page with BeautifulSoup and save them to a file:
from bs4 import BeautifulSoup
import requests

def get_links(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
    links = []
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url is not None and link_url.startswith('http'):
            links.append(link_url + '\n')
    write_to_file(links)
    return links

def write_to_file(links):
    with open('data.txt', 'a') as f:
        f.writelines(links)

def get_all_links(url):
    for link in get_links(url):
        get_all_links(link)

r = 'https://en.wikipedia.org/wiki/Main_Page'
write_to_file([r])
get_all_links(r)
However, this doesn't prevent cycles, which would lead to infinite recursion. To handle that, you can use a set to store the links you've already visited and skip them the next time they come up.
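A minimal sketch of that idea, reusing the get_links function above; the visited set and the max_depth cap are illustrative additions, not part of the original code:
def get_all_links(url, visited=None, max_depth=2):
    # 'visited' remembers URLs we have already crawled, so each one is fetched at most once
    if visited is None:
        visited = set()
    if url in visited or max_depth == 0:
        return
    visited.add(url)
    for link in get_links(url):
        # get_links() appends '\n' to every URL, so strip it before recursing
        get_all_links(link.strip(), visited, max_depth - 1)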
You should really consider using something like Scrapy for this kind of task. I think a CrawlSpider is what you want to look into.
To extract URLs from the wikipedia.org domain, you could do something like this:
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item
from scrapy import Field

class UrlItem(Item):
    url = Field()

class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page/']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url
        return item
and run it with
scrapy crawl wiki -o wiki.csv -t csv
and you'll get the URLs in CSV format in the wiki.csv file.
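If you then want to work with the results in Python, here is a small sketch for reading that file back and de-duplicating the URLs; the file name and the url column follow from the spider above, and the snippet itself is just an illustration:
import csv

# Read wiki.csv produced by the spider and keep each URL only once.
with open('wiki.csv', newline='') as f:
    unique_urls = {row['url'] for row in csv.DictReader(f)}

print(len(unique_urls), 'unique urls collected')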