How to scrape data from multiple pages of a website in a loop using Python and BeautifulSoup4

Question

Gon*_*o68 6 python csv loops beautifulsoup web-scraping

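I am trying to scrape data from the PGA.com website to build a table of all the golf courses in the United States. In my CSV table I want to include each course's name, address, ownership, website, and phone number. I would then like to geocode this data, put it on a map, and keep a local copy on my computer.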

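I am using Python and Beautiful Soup 4 to extract the data. I have gotten as far as extracting the data and writing it to a CSV file, but I am now stuck on scraping data from multiple pages of the PGA website. I want to extract all the golf courses, but my script is limited to a single page. I want to loop it so that it captures the data for every golf course on every page of the PGA site. There are about 18,000 golf courses and 900 pages to capture.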

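I need help writing code that captures all the data from the PGA website, not from just one page but from all of them, so that I end up with data for every golf course in the United States.

Here is my script: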

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, "html.parser")

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []

for item in g_data2:
    # Each field may be missing for a given course; fall back to an empty string
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except (IndexError, AttributeError):
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except (IndexError, AttributeError):
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except (IndexError, AttributeError):
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except (IndexError, AttributeError):
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except (IndexError, AttributeError):
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

# Write the file once, after the loop, instead of rewriting it for every course.
# On Python 3 the csv module needs text mode with newline='' (Python 2 used 'wb').
with open('filename5.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2:
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   #except:
      #pass


This script only captures 20 courses at a time. I want a single script that captures all 18,000 golf courses across the 900 pages.

Answer 1

lia*_*ose 8

The PGA website's search results span multiple pages, and the URL follows this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here


This means you can read the contents of one page, increase the page value by 1, read the next page, and so on.
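As a side note on the URL pattern, the query string can be generated rather than edited by hand. The sketch below uses the standard library's `urllib.parse.urlencode`. The helper name `page_url` and the constants are hypothetical, and the filter values are simply copied from the URL in the question.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pga.com/golf-courses/search"

# Search filters copied from the URL in the question; only `page` changes.
SEARCH_PARAMS = {
    "searchbox": "Course Name",
    "searchbox_zip": "ZIP",
    "distance": 50,
    "price_range": 0,
    "course_type": "both",
    "has_events": 0,
}


def page_url(page):
    """Return the search URL for one results page (hypothetical helper)."""
    params = {"page": page}
    params.update(SEARCH_PARAMS)
    # urlencode escapes the space in "Course Name" as "+"
    return BASE_URL + "?" + urlencode(params)


print(page_url(1))
```

Building the URL from a dictionary keeps the page number separate from the fixed filters, which makes it harder to break the query string by accident.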

import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # Your code for each individual page here
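For the per-page part, one possible shape is a function that takes the HTML of a single results page and returns one row per course. This is only a sketch: the CSS class names come from the question's script, while `parse_courses`, `FIELDS`, and the sample HTML are made up for illustration and may not match the live site's markup.

```python
from bs4 import BeautifulSoup

# CSS class suffixes taken from the question's script, in CSV column order.
FIELDS = ["title", "address", "city-state-zip", "website", "work-phone"]


def parse_courses(html):
    """Return one [name, address, city/state/zip, website, phone] row per course."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        row = []
        for field in FIELDS:
            cell = item.find("div", {"class": "views-field-" + field})
            # A missing field becomes an empty string instead of raising
            row.append(cell.get_text(strip=True) if cell else "")
        rows.append(row)
    return rows


# Invented sample markup, used only to show the function running offline.
SAMPLE = """
<div class="views-field-nothing">
  <div>
    <div class="views-field-title">Example Golf Club</div>
    <div class="views-field-address">1 Fairway Dr</div>
    <div class="views-field-city-state-zip">Springfield, IL 62701</div>
    <div class="views-field-work-phone">(555) 010-0000</div>
  </div>
</div>
"""
print(parse_courses(SAMPLE))
```

Checking whether `find` returned `None` replaces the five separate `try`/`except` blocks, and a course without a website still produces a complete row.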

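Hard-coding the page count works until the site adds or removes courses. A variation, sketched below under some assumptions, keeps requesting pages until one comes back with no rows and writes everything to a single CSV file. The names `scrape_all`, `fetch_page`, and `parse_page` are hypothetical, page numbering is assumed to start at 0 as in the `range` loop above, and the demo uses stand-in functions so it runs without network access.

```python
import csv
import os
import tempfile
import time


def scrape_all(fetch_page, parse_page, out_path, max_pages=1000, delay=0.0):
    """Write rows from consecutive pages to out_path and return the row count.

    fetch_page(page_number) returns the page text; parse_page(text) returns
    a list of rows. Stops at the first page with no rows, or after max_pages.
    """
    total = 0
    # newline="" is what the csv module expects for files it writes (Python 3)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for page in range(max_pages):
            rows = parse_page(fetch_page(page))
            if not rows:
                break
            writer.writerows(rows)
            total += len(rows)
            time.sleep(delay)  # pause between requests to go easy on the server
    return total


# Demo with stand-in data instead of real HTTP requests.
fake_site = {0: "Course A,Course B", 1: "Course C"}
demo_path = os.path.join(tempfile.mkdtemp(), "courses_demo.csv")
count = scrape_all(
    fetch_page=lambda page: fake_site.get(page, ""),
    parse_page=lambda text: [[name] for name in text.split(",") if name],
    out_path=demo_path,
)
print(count, "rows written to", demo_path)
```

For the real site, `fetch_page` could be something like `lambda page: requests.get(url.format(page), timeout=30).text`, with `parse_page` being whatever function extracts the rows from one page. Passing both in as arguments keeps the loop testable offline, and a non-zero `delay` is worth setting when requesting around 900 pages.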
