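How can I extract content from elements that have different classes and different content, and add it to a list in the order it appears on the page?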

Moo*_*rer 0 text-extraction beautifulsoup web-scraping

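While scraping, I need to handle two scenarios differently. Two similar classes both contain a building's price, and the prices must be added to Excel in the order they appear on the page, because they have to line up with the other data I am scraping.

The listings I am scraping use two different classes for the price. One looks like this: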

<div class="xl-price rangePrice">
                                375.000 €  
                            </div>

The other looks like this:

<div class="xl-price-promotion rangePrice">
                                <span>from </span> 250.000 € <br><span>to</span> 695.000 €  
                            </div>

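My code can extract either one, but not both at once. What I need it to do is go through every price on the search results page and append it to the list `pricelist`.

I do the same for square metres, building type, and so on, and write each list item to an Excel file.

For that reason it is essential that the prices are added in page order; otherwise the row positions of the prices in Excel will not match those of the square metres and building types.

Does anyone know why my code fails to extract both classes?

Here is my code and the page I am trying to extract prices from: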

Fetching the site and looping through the first pages (`range(1, 4)` covers pages 1 to 3):

    for number in range(1, 4):
        listplace = (number - 1) * len(buildinglist1)
        immo_page = requests.get(f'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={number}',
                                 headers=header)
        soup = BeautifulSoup(immo_page.content, 'lxml')  # html parser

        pricelist = ['Price']


        for item in soup.findAll('div', attrs={'class': 'xl-price'}):
            # item = item.text.strip().split()
            try:
                for item in soup.findAll('div', attrs={'class': 'xl-price-promotion rangePrice'}):
                    temp_list = []
                    item = item.text.strip().split()
                    item.remove('from'), item.remove('€'), item.remove('to'), item.remove('€')
                    for price in item: temp_list.append(price.replace('.', ''))
                    print(temp_list)
                    temp_list = [int(temp_list[0]) + int(temp_list[1])]
                    print(temp_list)
                    for item in temp_list: pricelist.append(item / 2)
            except ValueError:
                for item in soup.findAll('div', attrs={'class': 'xl-price rangePrice'}):
                    item = item.contents[0]
                    item = item.strip()[0:-1]
                    item = item.replace(' ', '')
                    item = item.replace('.', '')
                    pricelist.append(item)
        print(pricelist)

So this is how I am trying to get the prices and append them to the list.

Output when only one of the two is used (here, the output of the code in the `except` branch):

['Price', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', 
'499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', 
'210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000']
['Price', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', 
'445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', 
'325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000']
['Price', '235000']
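
Each 'Price' marks a new page. As you can see for page 3, the list is incomplete: it only shows the first value it encounters, a single price, and does not pick up the range prices.

  • When a listing has more than one price, I take the average of those prices and append it to the price list.

Any help is much appreciated!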
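The averaging rule in the bullet above can be sketched as a small helper. This is only an illustration: `parse_price` is a hypothetical name, and it assumes the price text looks like the two snippets shown earlier, with `.` as the thousands separator.

```python
import re


def parse_price(text):
    """Return the price found in `text`, averaging when a range is given.

    Hypothetical helper. Assumes text such as '375.000 €' or
    'from 250.000 € to 695.000 €', with '.' as the thousands separator.
    """
    # Collect every run of digits and dots, e.g. '250.000' and '695.000'.
    amounts = [int(m.replace('.', '')) for m in re.findall(r'\d[\d.]*', text)]
    if not amounts:
        return None  # no numeric price in the text
    return sum(amounts) / len(amounts)


print(parse_price('375.000 €'))                    # 375000.0
print(parse_price('from 250.000 € to 695.000 €'))  # 472500.0
```

Returning `None` when no number is found still yields one value per listing, so the row positions stay aligned with the other columns.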

αԋɱ*_*cαη 5

import requests
from bs4 import BeautifulSoup
import csv

types = []
sqs = []
prices = []
des = []
links = []

for url in range(1, 11):
    print(f"Extracting Page# {url}")
    r = requests.get(
        f"https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={url}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for ty in soup.findAll('div', attrs={'class': 'title-bar-left'}):
        ty = ty.text.strip()
        types.append(ty)
    for sq in soup.select('div[class*="surface-ch"]'):
        sq = sq.text.strip()
        if 'm²' in sq:
            sq = sq[0:sq.find('m')]
        else:
            sq = 'N/A'
        sqs.append(sq)
    for price in soup.select('div[class*="-price"]'):
        price = price.get_text(strip=True)
        if 'from' in price:
            price = price.replace('from', 'From: ')
            price = price.replace('to', ' To: ')
        else:
            price = price[0:price.find('€') + 1]
        prices.append(price)
    for de in soup.select('div[class*="-desc"]'):
        de = de.get_text(strip=True)
        des.append(de)
    for a in soup.findAll('a'):
        # Use separate names so the page counter `url` is not reused here.
        href = a.get('href')
        if href is not None and 'for-sale/leuven/3000/id' in href:
            links.append(href)
final = []
for item in zip(types, sqs, prices, des, links):
    final.append(item)
with open('output.csv', 'w+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Type', 'Size', 'Price', 'Desc', 'Link'])
    writer.writerows(final)
    print("Operation Completed")
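As for why the question's code cannot combine both classes: each nested `findAll` call rescans the whole page for every match of the outer loop, which is consistent with the repeated blocks in the output shown in the question, and the `try`/`except` never chooses a branch per listing. A minimal sketch of a single pass that visits both kinds of `div` in document order and averages ranges follows. The sample markup is modelled on the two snippets in the question rather than fetched from the site, and the averaging step is the asker's stated rule, not part of the answer's code above.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's snippets (an assumption, not live data).
html = """
<div class="xl-price rangePrice">375.000 €</div>
<div class="xl-price-promotion rangePrice">
    <span>from </span> 250.000 € <br><span>to</span> 695.000 €
</div>
<div class="xl-price rangePrice">235.000 €</div>
"""

soup = BeautifulSoup(html, 'html.parser')

pricelist = ['Price']
# A list passed to class_ matches a div carrying either class;
# find_all returns the matches in document order.
for div in soup.find_all('div', class_=['xl-price', 'xl-price-promotion']):
    text = div.get_text(' ', strip=True)
    # Keep only the numeric tokens, dropping '€', 'from' and 'to'.
    amounts = [int(tok.replace('.', ''))
               for tok in text.split()
               if tok.replace('.', '').isdigit()]
    # Exactly one entry per listing keeps rows aligned with the other columns.
    pricelist.append(sum(amounts) / len(amounts) if amounts else None)

print(pricelist)  # ['Price', 375000.0, 472500.0, 235000.0]
```

Because every price `div` contributes exactly one entry, the list keeps the same length and order as the listings on the page.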

View the output online: click here
