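I built a very simple scraper to look at Airbnb listings. The goal is to go through the listings on a given search page (namely the one requested below).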
import requests
from bs4 import BeautifulSoup

first_page = BeautifulSoup(requests.get("https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1").text, 'html.parser')
listings = first_page.find_all('div', 'listing-card-wrapper')
for listing in listings:
    print(listing.select("#listing-15616363 > div.infoContainer_v72lrv > a > div.ellipsized_1iurgbx > div > span:nth-child(1) > span:nth-child(1)"))
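The code correctly loops over the 18 elements on the page. However, it prints 18 empty lists, which suggests the listing.select call is not matching anything. I got the CSS selector from the "Copy selector" feature in Chrome DevTools.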
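This is because listing-15616363 is specific to a single listing (note the format listing-{listing_id}), so the other listings in the loop contain no element with id = 'listing-15616363'. Drop the id from the selector and query each listing by class instead.
For example, if you want to get the URL, you can do this: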
listing.find('a', class_ = "linkContainer_55zci1")['href']
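To make the difference concrete, here is a minimal sketch on a made-up HTML snippet. The markup, the ids listing-111 and listing-222, and the titles are assumptions for illustration only; the real Airbnb page is nested more deeply and its class names may have changed. It shows that a selector pinned to one id matches only inside that listing, while a selector without the id is evaluated relative to each listing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real Airbnb page structure.
SAMPLE_HTML = """
<div class="listing-card-wrapper">
  <div id="listing-111">
    <a class="linkContainer_55zci1" href="/rooms/111"><span>First flat</span></a>
  </div>
</div>
<div class="listing-card-wrapper">
  <div id="listing-222">
    <a class="linkContainer_55zci1" href="/rooms/222"><span>Second flat</span></a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = soup.find_all("div", "listing-card-wrapper")

# A selector pinned to one listing's id can only match inside that listing.
pinned = [len(listing.select("#listing-111 a")) for listing in listings]

# Without the id, the same kind of selector works for every listing.
hrefs = [listing.select_one("a.linkContainer_55zci1")["href"] for listing in listings]
titles = [listing.select_one("a > span").get_text(strip=True) for listing in listings]

print(pinned)
print(hrefs)
print(titles)
```

If even the listing with the matching id returns nothing on the live page, the markup served to requests probably differs from what Chrome DevTools shows, so it is worth checking the selector against response.text rather than the browser's DOM.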
Alternatively, you can use Python's lxml, which can be an order of magnitude faster than BeautifulSoup (when used properly), like this:
import requests
from lxml import html

url = "https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1"
response = requests.get(url)
root = html.fromstring(response.content)

result_list = []

def remove_non_ascii(text):
    # Drop non-ASCII characters such as the currency symbol.
    return ''.join(i for i in text if ord(i) < 128)

# Use the first priceCurrency value found on the page for every listing.
currency = root.xpath('//div[@itemprop="offers"]/meta[@itemprop="priceCurrency"]/@content')[0].strip()

for row in root.xpath('//div[contains(@class, "listing-card-wrapper")]'):
    if len(row):  # skip wrappers that have no child elements
        # XPath expressions starting with ".//" are relative to the current listing.
        href = row.xpath('.//a[@class="linkContainer_55zci1"]/@href')[0].strip()
        title = row.xpath('.//div[@class="ellipsized_1iurgbx"]/span/text()')[0].strip()
        price = remove_non_ascii(row.xpath('.//div[@class="inline_g86r3e"]/span//text()')[0].strip())
        result_list.append({'url': "https://www.airbnb.com" + href,
                            'title': title, 'price': price, 'currency': currency})

print(result_list)
This results in (the u'' prefixes show that the output was printed under Python 2):
[{'url': 'https://www.airbnb.com/rooms/5316912', 'currency': 'INR', 'price': u' 3,823', 'title': 'Small City apt. next to the Metro'},
 {'url': 'https://www.airbnb.com/rooms/16989400', 'currency': 'INR', 'price': u' 2,347', 'title': 'Cozy room close to city center'},
 {'url': 'https://www.airbnb.com/rooms/17628374', 'currency': 'INR', 'price': u' 6,774', 'title': 'Cosy, quiet apartment in downtown Copenhagen'},
 {'url': 'https://www.airbnb.com/rooms/1206721', 'currency': 'INR', 'price': u' 4,426', 'title': 'Apt.close to Metro, Airport and CHP'},
 {'url': 'https://www.airbnb.com/rooms/13813273', 'currency': 'INR', 'price': u' 3,622', 'title': 'Large room in Vesterbro'},
 {'url': 'https://www.airbnb.com/rooms/14083881', 'currency': 'INR', 'price': u' 9,322', 'title': 'City Room'},
 {'url': 'https://www.airbnb.com/rooms/6221130', 'currency': 'INR', 'price': u' 5,365', 'title': 'cosy flat 2 min from Central Statio'},
 {'url': 'https://www.airbnb.com/rooms/15804159', 'currency': 'INR', 'price': u' 3,823', 'title': 'Cozy, central near waterfront. Quality breakfast!'},
 {'url': 'https://www.airbnb.com/rooms/17266268', 'currency': 'INR', 'price': u' 3,756', 'title': 'Cosy room in Frederiksberg'},
 {'url': 'https://www.airbnb.com/rooms/2647233', 'currency': 'INR', 'price': u' 3,353', 'title': 'Bedroom & Living Room Frederiksberg'},
 {'url': 'https://www.airbnb.com/rooms/12083235', 'currency': 'INR', 'price': u' 5,969', 'title': 'Wonderful Copenhagen is right here'},
 {'url': 'https://www.airbnb.com/rooms/7787976', 'currency': 'INR', 'price': u' 7,042', 'title': 'Homely renovated flat with garden'},
 {'url': 'https://www.airbnb.com/rooms/17556785', 'currency': 'INR', 'price': u' 1,610', 'title': u'Small Cosy home above our Caf\xe9 ( Breakfast incl )'},
 {'url': 'https://www.airbnb.com/rooms/894420', 'currency': 'INR', 'price': u' 10,261', 'title': 'Wonderful apt. right in the city!'},
 {'url': 'https://www.airbnb.com/rooms/17028460', 'currency': 'INR', 'price': u' 7,847', 'title': 'Nyhavn 3-bed apartment for families'},
 {'url': 'https://www.airbnb.com/rooms/17651114', 'currency': 'INR', 'price': u' 6,371', 'title': 'Spacious place by canals in heart of Copenhagen'},
 {'url': 'https://www.airbnb.com/rooms/10564051', 'currency': 'INR', 'price': u' 3,420', 'title': u'\u623f\u95f4\u5728\u54e5\u672c\u54c8\u6839\u7684\u5fc3\u810f'},
 {'url': 'https://www.airbnb.com/rooms/17709435', 'currency': 'INR', 'price': u' 2,951', 'title': u'Hyggelig lejlighed t\xe6t p\xe5 centrum.'}]
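The prices in that output are still strings with a leading space and a thousands separator. If you want to sort or compare them, something along these lines may help. This is only a sketch: price_to_int is a hypothetical helper that is not part of the code above, and it assumes whole-number prices with no decimal part.

```python
def remove_non_ascii(text):
    # Same idea as in the scraper: drop characters such as the currency symbol.
    return "".join(ch for ch in text if ord(ch) < 128)


def price_to_int(raw):
    """Hypothetical helper: turn a scraped price such as ' 3,823' into 3823.

    Assumes whole-number prices; returns None when no digits are present.
    """
    digits = "".join(ch for ch in remove_non_ascii(raw) if ch.isdigit())
    return int(digits) if digits else None


# Two rows taken from the output shown above.
rows = [
    {"title": "City Room", "price": " 9,322"},
    {"title": "Large room in Vesterbro", "price": " 3,622"},
]

# Pick the cheaper of the two listings by its numeric price.
cheapest = min(rows, key=lambda row: price_to_int(row["price"]))
print(cheapest["title"], price_to_int(cheapest["price"]))
```

Stripping everything except digits is deliberately crude; if the site ever returns prices with decimals, the helper would need to handle the decimal separator explicitly.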
You can also refer to the documentation on scraping and lxml to learn more.