What is the best way to scrape data from Zillow?

Chr*_*ice -2 python beautifulsoup zillow web-scraping

My attempts to collect data from Zillow have been unsuccessful.

Example:

url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy

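I want to extract the address, price, Zestimate, location, and similar information for every home listed in Los Angeles.

I have tried HTML scraping with packages such as BeautifulSoup, and I have also tried using json. I am almost certain that Zillow's API will not help: my understanding is that the API is best suited to retrieving information about one specific property.

I have been able to pull information from other sites, but Zillow seems to use dynamic IDs (they change on every refresh), which makes the information harder to reach.

Update: I tried the following code, but it still produces no results.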

import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'

page = requests.get(url)
data = page.content

soup = BeautifulSoup(data, 'html.parser')

for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # The list contains sponsored links, which may need special handling.
        # Missing fields are not checked here: find() returns None for them,
        # so accessing .text raises AttributeError.
        print(card.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(card.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(card.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(card.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except AttributeError:
        print('An error occurred')

小智 6

This is probably because you are not passing any request headers.
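
To see what a bare `requests.get(url)` sends, you can print the library's default headers. This is only a minimal sketch using `requests.utils.default_headers()`; the exact version string depends on the installed release, and whether a site reacts to it is an assumption rather than something this snippet verifies.

```python
import requests

# Headers that requests attaches when you do not pass any of your own.
default_headers = requests.utils.default_headers()

for name, value in default_headers.items():
    print(f"{name}: {value}")

# The default User-Agent identifies the client as python-requests,
# which is easy for a server to tell apart from a real browser.
default_user_agent = default_headers["User-Agent"]
```

Anything passed through the `headers=` argument overrides these defaults for that request.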

If you look at the Network tab in Chrome's developer tools, these are the headers the browser sends:

:authority:www.zillow.com
:method:GET
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36

However, if you try to send all of them, the request fails, because requests does not let you send headers whose names start with a colon ':'.
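
The names that start with a colon are HTTP/2 pseudo-headers; the URL and the HTTP method already carry the same information. If you paste a header list from the developer tools, a small filter can drop them. The helper name `drop_pseudo_headers` below is hypothetical, and the dictionary is a shortened copy of the list above.

```python
def drop_pseudo_headers(headers):
    """Return a copy of headers without names that start with ':'."""
    return {
        name: value
        for name, value in headers.items()
        if not name.startswith(":")
    }


# Shortened copy of the headers shown in the browser's Network tab.
copied_from_devtools = {
    ":authority": "www.zillow.com",
    ":method": "GET",
    ":scheme": "https",
    "accept-language": "en-US,en;q=0.8",
    "upgrade-insecure-requests": "1",
}

# Only the regular headers remain and can be passed to requests.
req_headers = drop_pseudo_headers(copied_from_devtools)
print(req_headers)
```

The resulting dictionary can be passed directly as `headers=req_headers`.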

I skipped just those four and used the other five in this script, and it worked. Try this:

from bs4 import BeautifulSoup
import requests

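# Browser-like headers copied from Chrome, without the ':'-prefixed ones.
# Note: advertising 'br' means the server may reply with Brotli, which
# requests only decodes when a Brotli library (brotli or brotlicffi) is installed.
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}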

with requests.Session() as s:
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
    r = s.get(url, headers=req_headers)

After that, you can use BeautifulSoup to extract the information you need:

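# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(r.content, 'lxml')
# find() returns only the first matching element on the page.
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text
address = soup.find('span', {'itemprop': 'address'}).text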
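
Since `find()` stops at the first match, collecting every listing means looping over the cards with `find_all()` and guarding against missing fields. Here is a rough sketch under stated assumptions: `sample_html` is hypothetical markup that only mimics the class names used above (the real page structure may differ and can change), the "2 bds" value is invented for illustration, and `text_or_none` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; not copied from the real page.
sample_html = """
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$615,000</span>
  <span class="zsg-photo-card-info">2 bds</span>
  <span itemprop="address">121 S Hope St APT 435 Los Angeles CA 90012</span>
</div>
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$595,000</span>
  <span itemprop="address">6427 Klump Ave North Hollywood CA 91606</span>
</div>
"""


def text_or_none(card, attrs):
    """Return the stripped text of the first matching span, or None if absent."""
    tag = card.find("span", attrs)
    return tag.get_text(strip=True) if tag is not None else None


# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(sample_html, "html.parser")

listings = []
for card in soup.find_all("div", {"class": "zsg-photo-card-caption"}):
    listings.append({
        "price": text_or_none(card, {"class": "zsg-photo-card-price"}),
        "info": text_or_none(card, {"class": "zsg-photo-card-info"}),
        "address": text_or_none(card, {"itemprop": "address"}),
    })

for listing in listings:
    print(listing)
```

Returning `None` for a missing field keeps one incomplete or sponsored card from aborting the whole loop.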

Here is a sample of the data extracted from that page:

+--------------+-----------------------------------------------------------+
| Price        |  Address                                                  |
+--------------+-----------------------------------------------------------+
| $615,000     |  121 S Hope St APT 435 Los Angeles CA 90012               |
| $330,000     |  4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423  |
| $3,495,000   |  13446 Valley Vista Blvd Sherman Oaks CA 91423            |
| $1,199,000   |  6241 Crescent Park W UNIT 410 Los Angeles CA 90094       |
| $771,472+    |  Chase St. And Woodley Ave # HGS0YX North Hills CA 91343  |
| $369,000     |  8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293        |
| $595,000     |  6427 Klump Ave North Hollywood CA 91606                  |
+--------------+-----------------------------------------------------------+