requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

sna*_*ers 2 python web-scraping python-requests

I'm working on a web scraping project and I ran into the following error.

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h

My code is below. I retrieve all the links from the HTML table, and they print as expected. But when I try to loop over the links and call requests.get on each one, I get the error above.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)

fur*_*ras 5

Your error comes from the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

ref['href'] gives you a single URL (a string), but you use it as if it were a list in the next for loop.

So you effectively have

for link in ref['href']:

which gives you the URL http://properties.kimcore... one character at a time, starting with h.
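
In Python, iterating over a string yields its characters one at a time, which is exactly what happens here. A minimal sketch to demonstrate:

url = "http://properties.kimcorealty.com/"
# iterating over a string gives single characters, not URLs
for ch in url:
    print(ch)  # the first iteration already prints 'h'
    break

requests.get('h') then fails with MissingSchema, because the single character 'h' has no http:// schema.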

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print(df)

BTW: if you used a trailing comma, (ref['href'], ), then you would get a one-element tuple, and the second for loop would work correctly.
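
A quick illustration of the difference, using a placeholder URL:

links = "http://example.com/page"      # a plain string
for link in links:
    print(link)  # 'h', 't', 't', 'p', ... one character per iteration
    break

links = ("http://example.com/page", )  # trailing comma -> one-element tuple
for link in links:
    print(link)  # prints the whole URL once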


EDIT: create the table_data list at the start and add all the data to this list; convert it to a DataFrame only at the end.

But now I see that it reads the same page multiple times - every column in a row contains the same URL. You only need to take the URL from one column.

EDIT: now it doesn't read the same URL multiple times.

EDIT: now it gets the text and href from the first link in each row and adds them to every record appended with append().

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all the rows in the table except the first one ([1:]), which holds the headers
rows = soup.select('table tr')[1:]
for row in rows: 

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))            

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)
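
Side note: the same MissingSchema error also shows up when a page uses relative hrefs. If this table ever returned a link like /property/123 (a hypothetical path, for illustration only), urllib.parse.urljoin could resolve it against the base URL before calling requests.get:

from urllib.parse import urljoin

import requests

base_url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"
href = "/property/123"  # hypothetical relative link, for illustration only
full_url = urljoin(base_url, href)
# -> "http://properties.kimcorealty.com/property/123"
page = requests.get(full_url)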