Posts by Gon*_*o68

How to loop through multiple pages of a website to scrape data using Python and BeautifulSoup4

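I am trying to pull data from the PGA.com website to build a table of all the golf courses in the United States. In my CSV table I want to include each course's name, address, ownership, website, and phone number. With this data I want to geocode the courses, put them on a map, and keep a local copy on my computer.

I am using Python and Beautiful Soup 4 to extract the data. I have gotten as far as extracting the data and writing it to a CSV, but I am now stuck on scraping multiple pages of the PGA website. I want to extract all of the golf courses, but my script is limited to a single page. I want it to loop so that it captures the data for every golf course on every page of the PGA website. There are about 18,000 golf courses spread over roughly 900 pages.

I need help writing code that captures all of the data from the PGA website, not just from one page but from all of them, so that I end up with data for every golf course in the United States.

Here is my script: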

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

soup = BeautifulSoup(r.content)

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

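# Write the CSV once, after the loop has collected every row.
# 'wb' is the Python 2 mode for csv; on Python 3 use open('filename5.csv', 'w', newline='').
with open('filename5.csv', 'wb') as file:
     writer = csv.writer(file)
     for row in courses_list:
          writer.writerow(row)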

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2: …

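A rough sketch of one way to loop over the result pages, in Python 3. The `page` query parameter and its zero-based numbering are assumptions, since the post does not show how the PGA site pages its results; `BASE_URL`, `FIELDS`, `parse_courses`, and `scrape_all` are illustrative names, not anything from the site or from the original script.

```python
import time

import requests
from bs4 import BeautifulSoup

# Same search URL as in the question, split across lines for readability.
BASE_URL = ("http://www.pga.com/golf-courses/search?searchbox=Course+Name"
            "&searchbox_zip=ZIP&distance=50&price_range=0"
            "&course_type=both&has_events=0")

# CSS classes used in the question, in CSV column order.
FIELDS = ["views-field-title", "views-field-address",
          "views-field-city-state-zip", "views-field-website",
          "views-field-work-phone"]


def parse_courses(html):
    """Return one [name, address1, address2, website, phone] row per course."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        row = []
        for css_class in FIELDS:
            cell = item.find("div", {"class": css_class})
            # A missing field becomes an empty string instead of an exception.
            row.append(cell.get_text(strip=True) if cell else "")
        rows.append(row)
    return rows


def scrape_all(num_pages, delay=1.0):
    """Fetch pages 0..num_pages-1 and collect the rows from all of them."""
    rows = []
    for page in range(num_pages):
        # ASSUMPTION: the site accepts a zero-based "page" parameter.
        response = requests.get(BASE_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        rows.extend(parse_courses(response.text))
        time.sleep(delay)  # pause between requests to go easy on the server
    return rows
```

If the paging assumption holds, `scrape_all(900)` would walk the roughly 900 pages mentioned above, and the returned list can be written out with `csv.writer` once at the end. Keeping the parsing in its own function makes it possible to check it against a saved page before starting a long crawl.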

python csv loops beautifulsoup web-scraping

Gon*_*o68

2016-03-17

Score: 6 | Answers: 1 | Views: 30k

Geocoding with Geopy and Python

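I am trying to geocode a CSV file that contains location names and parsed addresses, each made up of the address number, street name, city, ZIP code, and country. I want to use the ArcGIS geocoding service through Geopy. I would like code that loops through my CSV of 5,000+ entries and writes the latitude and longitude into separate columns of the CSV. Can anyone give me some starter code? Thanks!

Here is my script: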

import csv
from geopy.geocoders import ArcGIS


geolocator = ArcGIS()     # here some parameters are needed

with open('C:/Users/v-albaut/Desktop/Test_Geo.csv', 'rb') as csvinput:
    with open('output.csv', 'w') as csvoutput:
        output_fieldnames = ['Name','Address', 'Latitude', 'Longitude']
        writer = csv.DictWriter(csvoutput, delimiter=',', fieldnames=output_fieldnames)
        reader = csv.DictReader(csvinput)

        for row in reader:
            # here you have to replace the dict item by your csv column names
            query = ','.join(str(x) for x in (row['Name'], row['Address']))
            Address, (latitude, longitude) = geolocator.geocode(query)

            # here is the writing section
            output_row = …

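A minimal Python 3 sketch of the loop described above. To keep it testable without network access, the geocoder is passed in as a plain callable; `geocode_rows` and `geocode_csv` are hypothetical helper names, and the `Name` and `Address` columns are taken from the script in the question. The sketch relies on the callable returning either `None` or an object with `.latitude` and `.longitude`, which is how geopy's `geocode` methods behave.

```python
import csv


def geocode_rows(rows, geocode):
    """Yield each input row with Latitude and Longitude keys added.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None when nothing matches.
    """
    for row in rows:
        query = ", ".join(part for part in (row["Name"], row["Address"]) if part)
        location = geocode(query)
        out = dict(row)
        # Leave the coordinates blank when the geocoder finds no match.
        out["Latitude"] = location.latitude if location else ""
        out["Longitude"] = location.longitude if location else ""
        yield out


def geocode_csv(in_path, out_path, geocode):
    """Copy in_path to out_path, appending Latitude and Longitude columns."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames) + ["Latitude", "Longitude"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for out in geocode_rows(reader, geocode):
            writer.writerow(out)
```

With geopy installed, something like `geocode_csv('Test_Geo.csv', 'output.csv', ArcGIS(timeout=10).geocode)` should work. For 5,000+ rows it is probably worth wrapping the call in `geopy.extra.rate_limiter.RateLimiter` so the service is not hit too quickly.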

python csv arcgis geopy

Gon*_*o68

2016-05-18

Score: 6 | Answers: 1 | Views: 4857

Writing and saving a CSV file from scraped data using Python and BeautifulSoup4

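I am using Python and Beautiful Soup 4 to extract my data. I have managed to extract the data from the website, but I am having trouble writing the part of the script that exports it to a CSV file with the fields I need.

I need help writing code that turns the extracted data into a CSV file and saves it to my desktop.

Here is my script: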

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)

soup = BeautifulSoup(r.content)

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})


for item in g_data1:
     try:
          print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     except:
          pass  
     try:
          print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     except:
          pass

for item in g_data2:
   try:
      print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   except:
      pass
   try:
      print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   except:
      pass
   try:
      print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   except:
      pass


This is what I currently get when I run the script. I want to turn this data into a CSV table so that I can geocode it later.

1801 Merrimac Trl
Williamsburg, Virginia 23185-5905

12551 Glades Rd
Boca Raton, Florida 33498-6830
Preserve Golf Club 
13601 SW 115th Ave
Dunnellon, Florida …

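One possible way to get from printed values to a CSV file, sketched in Python 3 with only the standard library. The idea is to append each record to a list inside the scraping loop instead of printing it, then write the list once at the end. `HEADER`, `save_rows`, and the `~/Desktop` location are assumptions for illustration.

```python
import csv
from pathlib import Path

HEADER = ["Name", "Address", "City/State/ZIP"]  # hypothetical column names


def save_rows(rows, path):
    """Write a header line followed by one CSV line per course."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


# In the scraping loop, append [name, address, city_state_zip] to a list
# instead of printing each value. The two rows below are copied from the
# sample output above, where no course name appears for either record.
rows = [
    ["", "1801 Merrimac Trl", "Williamsburg, Virginia 23185-5905"],
    ["", "12551 Glades Rd", "Boca Raton, Florida 33498-6830"],
]

# ASSUMPTION: the desktop folder is ~/Desktop; this varies by OS and locale.
desktop_csv = Path.home() / "Desktop" / "courses.csv"
# save_rows(rows, desktop_csv)  # uncomment to actually write the file
```

Opening the file with `newline=""` is what the `csv` module documentation recommends on Python 3. Without it, Windows tends to produce a blank line after every row.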

python csv screen-scraping export beautifulsoup

Gon*_*o68

2015-06-26

Score: 5 | Answers: 1 | Views: 10k

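I am trying to download multiple files from a website and save them to a folder. I am after highway data, and the site (http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/SR5.htm) lists the files as PDF links. I would like to write code that extracts the many PDFs found on the site, perhaps a loop that goes through the page, pulls each file, and saves it to a local folder on my desktop. Does anyone know how I could do this?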
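A hedged sketch of how this could be done with `requests` and Beautiful Soup in Python 3. It assumes the PDFs are linked through ordinary `<a href="...">` tags whose paths end in `.pdf`, which has not been checked against the actual page. `find_pdf_links` and `download_pdfs` are made-up helper names.

```python
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://www.wsdot.wa.gov/mapsdata/tools/InterchangeViewer/SR5.htm"


def find_pdf_links(html, base_url):
    """Return absolute URLs of every distinct link whose path ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])  # resolve relative links
        if urlparse(url).path.lower().endswith(".pdf") and url not in links:
            links.append(url)
    return links


def download_pdfs(page_url, folder):
    """Save every PDF linked from page_url into folder."""
    os.makedirs(folder, exist_ok=True)
    page = requests.get(page_url, timeout=30)
    page.raise_for_status()
    for url in find_pdf_links(page.text, page_url):
        filename = os.path.basename(urlparse(url).path)
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with open(os.path.join(folder, filename), "wb") as f:
            f.write(response.content)
```

Calling `download_pdfs(PAGE_URL, os.path.expanduser("~/Desktop/SR5_pdfs"))` would then fetch the files into a desktop folder. The folder name is only an example.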

python

Gon*_*o68

2015-07-07

Score: 5 | Answers: 1 | Views: 7068

<br> tag messes up my data when scraping with Beautiful Soup and Python

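I am trying to get a detailed list of golf courses from a given website. I built a scraper that collects the names and addresses of golf courses across the United States.

My problem is with the addresses I scrape. When they are written to my CSV file, there is no space between the first line of text and the second. In the HTML, the two lines are separated by a <br> tag.

How can I handle this in my code so that the two lines come through with a space between them in the CSV?

Here is the HTML I am trying to scrape: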

<div class="location">10924 Verterans Memorial Dr<br>Abbeville, Louisiana, United States</div>


The output of my code looks like this:

10924 Verterans Memorial DrAbbeville, Louisiana, United States


Note that there is no space between "Memorial Dr" and "Abbeville". How can I change my code so that a space is inserted when scraping?

Here is my code:

import csv
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import ArcGIS  # needed for the ArcGIS() call below

courses_list = []
geolocator = ArcGIS()

for i in range(1):
    url="http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    #print soup
    g_data2 = soup.find_all("div",{"class":"result"})
    #print g_data2
    for item in g_data2:
        try:
            name = item.find_all("div",{"class":"name"})[0].text
            print name
        except: …

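One way to handle the `<br>` case, as a small Python 3 sketch using the HTML snippet from the question: Beautiful Soup's `get_text()` accepts a `separator` argument that is placed between the text fragments on either side of a tag. The variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# The HTML fragment quoted in the question.
html = ('<div class="location">10924 Verterans Memorial Dr<br>'
        'Abbeville, Louisiana, United States</div>')

soup = BeautifulSoup(html, "html.parser")
location = soup.find("div", {"class": "location"})

# Join the text fragments around <br> with a single space and strip
# leading/trailing whitespace from each fragment.
address = location.get_text(separator=" ", strip=True)
print(address)
# 10924 Verterans Memorial Dr Abbeville, Louisiana, United States
```

In the scraper this would replace the `.text` lookup for the address field. Using `separator=", "` instead would give a comma-separated single-line address, which may be friendlier to a geocoder.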

python csv screen-scraping beautifulsoup

Gon*_*o68

2021-06-09

Score: 3 | Answers: 1 | Views: 1563