Unable to scrape customized property links from Zillow using requests

MIT*_*THU 2 python json web-scraping python-3.x python-requests

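I'm trying to parse the different property links that get populated after selecting options in two dropdowns on Zillow. Once the options are selected, I can see the result in JSON format in dev tools. However, when I do the same thing with the script below, I get some strange text instead.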

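Manual steps:

  1. Navigate to the website
  2. Select an option from the first dropdown
  3. Select an option from the second dropdown

This is how I tried to automate it: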

import json
import requests
from pprint import pprint

link = 'https://www.zillow.com/search/GetSearchPageState.htm?'

params = {
    'searchQueryState': {"pagination":{},"usersSearchTerm":"Vista, CA","mapBounds":{"west":-117.44051346728516,"east":-116.99488053271484,"south":33.126944633035116,"north":33.27919773006566},"regionSelection":[{"regionId":41517,"regionType":6}],"isMapVisible":True,"filterState":{"doz":{"value":"6m"},"isForSaleByAgent":{"value":False},"isForSaleByOwner":{"value":False},"isNewConstruction":{"value":False},"isForSaleForeclosure":{"value":False},"isComingSoon":{"value":False},"isAuction":{"value":False},"isPreMarketForeclosure":{"value":False},"isPreMarketPreForeclosure":{"value":False},"isRecentlySold":{"value":True},"isAllHomes":{"value":True},"hasPool":{"value":True},"hasAirConditioning":{"value":True},"isApartmentOrCondo":{"value":False}},"isListVisible":True,"mapZoom":11},
    'wants': {"cat1":["listResults","mapResults"]},
    'requestId': 2
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link,params=json.dumps(params))
    pprint(res.content)

This is the output it produces:

b'<!-- This page outputs JSON instead of anything written here. -->'

How can I parse the customized property links from Zillow using requests?

bad*_*ker 5

You have to URL-encode the query string so that it matches the form in which it appears in the request URL.

For that, you need:

urllib.parse.urlencode()

Here's a working example:

urllib.parse.urlencode()
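As a minimal sketch of the encoding step: the version below assumes that each nested value should be serialized to a JSON string before being percent-encoded, which is an assumption on my part rather than something confirmed above. The helper name `build_search_url` is hypothetical, and the parameters are a shortened version of those in the question.

```python
import json
import urllib.parse

BASE_URL = "https://www.zillow.com/search/GetSearchPageState.htm"


def build_search_url(base_url, params):
    """Return base_url with params appended as an encoded query string.

    Assumption: every value is sent as compact JSON text, so nested dicts
    become strings such as {"cat1":[...]} before percent-encoding.
    """
    encoded = urllib.parse.urlencode(
        {key: json.dumps(value, separators=(",", ":")) for key, value in params.items()}
    )
    return f"{base_url}?{encoded}"


# Shortened, illustrative parameters taken from the question.
params = {
    "searchQueryState": {
        "usersSearchTerm": "Vista, CA",
        "isMapVisible": True,
        "filterState": {"isRecentlySold": {"value": True}},
    },
    "wants": {"cat1": ["listResults", "mapResults"]},
    "requestId": 2,
}

url = build_search_url(BASE_URL, params)
print(url)
```

The resulting URL can then be fetched with the session from the question, for example `s.get(url)`. Whether the endpoint still answers with JSON for such a request is not guaranteed.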

Output:

{
  "user": {
    "isLoggedIn": false,
    "hasHousingConnectorPermission": false,
    "savedSearchCount": 0,
    "savedHomesCount": 0,
    "personalizedSearchGaDataTag": null,
    "personalizedSearchTraceID": "607a9ecb5aabe489c361c1d91f368b37",
    "searchPageRenderedCount": 0,
    "guid": "33b7add3-bfd3-4d85-a88a-d9d99256d2a2",
    "zuid": "",
    "isBot": false,
    "userSpecializedSEORegion": false
  },
  "mapState": {
    "customRegionPolygonWkt": null,
    "schoolPolygonWkt": null,
    "isCurrentLocationSearch": false,
    "userPosition": {
      "lat": null,
      "lon": null
    },
    "regionBounds": {
      "north": 33.275284,
      "east": -117.145153,
      "south": 33.130865,
      "west": -117.290241
    }
  },

and much much more ...

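Note: Be careful with this site. They have very sensitive anti-bot measures, and they will throw CAPTCHAs at you if you keep requesting data too quickly.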
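One way to keep the request rate down is to pause for a random interval between calls. The sketch below is illustrative only: `fetch_politely`, its delay bounds, and the injected `fetch` callable are hypothetical names, and the default delays are guesses, not thresholds published by the site.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls.

    No pause happens before the first call. The delay bounds are assumed
    values; tune them to whatever the target site tolerates.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with a stub in place of a real HTTP call and zero-length delays.
demo = fetch_politely(
    ["page-1", "page-2"],
    fetch=lambda url: f"fetched {url}",
    min_delay=0.0,
    max_delay=0.0,
)
print(demo)
```

With a real session, `fetch` could be something like `lambda url: s.get(url).json()`. Spacing out requests lowers the chance of triggering a CAPTCHA but does not rule it out.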