I've written a scraper in Python using the BeautifulSoup library to parse all the names spread across the different pages of a website. I could manage it if all the URLs were paginated the same way, but they aren't: some URLs have no pagination at all because their content is so small.

My question is: how can I handle all of them in a single function, whether they are paginated or not?

My initial attempt (it can only parse the content from the first page of each URL):
import requests
from bs4 import BeautifulSoup

urls = {
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
}

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # Each listing sits in a td of this class; the name is in an h2 > a inside it
    for items in soup.select("td[class='table-row-price']"):
        name = items.select_one("h2 a").text
        print(name)

if __name__ == '__main__':
    for url in urls:
        get_names(url)
If every URL were paginated like the one below, I could have managed the whole thing:
from bs4 import BeautifulSoup
import requests

page_no = 0
page_link = "https://www.mobilehome.net/mobile-home-park-directory/new-hampshire/all/page/{}"

while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select("td[class='table-row-price']")
    # Stop once a page no longer yields any listings
    if len(container) <= 1:
        break
    for content in container:
        title = content.select_one("h2 a").text
        print(title)
But not all of the URLs have pagination. So, how can I grab all of them, whether there is pagination or not?
It seems I've found a very robust solution to this problem. The approach is iterative. It first checks whether a next page URL is available on the current page. If it finds one, it follows that URL and repeats the process. If a link doesn't have any pagination, however, the scraper breaks out of the loop and tries the next link.

Here it is:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

urls = [
    'https://www.mobilehome.net/mobile-home-park-directory/alaska/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/vermont/all'
]

def get_names(link):
    while True:
        r = requests.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        for items in soup.select("td[class='table-row-price']"):
            name = items.select_one("h2 a").text
            print(name)
        nextpage = soup.select_one(".pagination a.next_page")
        if not nextpage:
            break  # No pagination link on this page, so move on to the next URL
        # Resolve the (possibly relative) href against the current link
        link = urljoin(link, nextpage.get("href"))

if __name__ == '__main__':
    for url in urls:
        get_names(url)
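One design note: the urljoin call is what makes this work whether the next_page href is absolute or relative. If you want to harden the same loop for longer runs, here is a minimal sketch under a few assumptions of my own — a shared requests.Session, a request timeout, a polite delay between pages, and a page cap, none of which are in the original answer; the selectors are the same ones used above:

import time
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_names(link, delay=1.0, max_pages=200):
    # delay and max_pages are illustrative additions, not part of the original answer
    session = requests.Session()      # reuse one connection across pages
    for _ in range(max_pages):        # hard cap in case the next-page links ever loop
        r = session.get(link, timeout=10)
        r.raise_for_status()          # fail loudly on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(r.text, "lxml")
        for items in soup.select("td[class='table-row-price']"):
            print(items.select_one("h2 a").text)
        nextpage = soup.select_one(".pagination a.next_page")
        if not nextpage:
            break                     # no pagination on this page; done with this URL
        link = urljoin(link, nextpage.get("href"))
        time.sleep(delay)             # be polite to the server between requests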