Can't parse the tabular content from some search results using post requests

python beautifulsoup web-scraping python-3.x python-requests

I'm trying to fetch some tabular content from a webpage using the script below. To populate that content manually, options have to be picked from three dropdowns before hitting the Submit button. I've tried to mimic the POST request accordingly, but I may be going wrong somewhere, as the script fails to grab the table I'm after.

This is how I've tried:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.lgindiasocial.com/microsites/brand-store-web-five/locate.aspx'

headers = {
    'x-microsoftajax': 'Delta=true',
    'origin': 'https://www.lgindiasocial.com',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'referer': 'https://www.lgindiasocial.com/microsites/brand-store-web-five/locate.aspx',
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(URL)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['ScriptManager1'] = 'UpdatePanel1|btnsubmit'
    payload['ddlState:'] = 'Assam'
    payload['ddlCity'] = 'Golaghat'
    payload['ddllocation'] = 'Golaghat'
    s.headers.update(headers)
    r = s.post(URL,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    item = soup.select_one("table")
    print(item)

When I run the script, I get None as the output.
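(For debugging, it helps to look at what the server actually sent back before searching it for a table; a minimal check, printing the status code and the start of the response body:)

print(r.status_code)   # often 200 even when the postback was rejected
print(r.text[:300])    # a successful ASP.NET async response is a pipe-delimited delta, not plain HTML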

How can I get the tabular content from the search results using a post request?

EDIT: If I copy the content of the payload directly from the dev tools and use that very payload in the script, I get the desired result.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.lgindiasocial.com/microsites/brand-store-web-five/locate.aspx'

payload = "ScriptManager1=UpdatePanel1%7Cbtnsubmit&hidcity=&ddlState=Assam&ddlCity=Golaghat&ddllocation=Golaghat&__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=M%2BqldpZhV90EX2sawXMrHD7jYtOMXnrPuP8XfVtS21GKmxK0YYuBnqm3I7tU%2BKMtFGZgzWpsYK%2FYJtfTBUK%2F0WobR21tjbWjdrZiXS5FlLcS6qgYMNKfqyZRcK13dbz667H7T6QZqpITTRSqsM%2BrM91VW989KXoknFdx0H6EkRFCJRu4WsBsUxeJnd5Lf5IAUN%2BTNKDYE5GuclDNKnmU1pMmHhrjKQysvYtw8cjD5DdDkNb7NDkLiVxm7DISyXZtVJyOBV6dFa%2Blm1%2FR9M7F2nyepARAl0XIiNP9dhFvomLNdlP%2BU%2FNyllJ5IXW4D%2Fl5Kfx5yaRP8XSKURtAc915i%2F2T48a0dyAR42tJ40eit1IWs7MCwgesNtF35zkuKN1SRhyhHqcnKjcMYW%2BkLqKsLvKpLQcDuXrIAzYyqlgJZ%2FlBQJo%2BiM4tTOH4mEqDkSZW%2Fk94KX1OM70s9%2FS%2Fd5trrHIgNoKw1bCRI8IQ41ZEicMsJPTp67KnqoMZz0F0cCmo%2F49zYkuHw0kqaZmKCrRUNW8Xcr%2F5A3AfNg%2FB8WURD0g2x%2BwzcLXDcVCJ6ngf0LdOc%2BTppM6EOZpTGJGjjDqK116tzWAOPfiJHgBuIPkiZJTaEHnwwjcYXuuLN%2FTgPFUJkXVjBSyRdCnPXsebInNd4Wsu2lnNdwZUO3rnNuu5eY%2FHf7YemcmCEzji%2FxLG%2FynnG0sG61TC1bJCyFw2E3V6ZGshbuqDfh7QQyxqPDEt2uaCN7s%2FOZ%2FwiXeVY2henUVBZSVrxUvF6QT0eO4SIY0OlNYBLK7cO4YG4zC0tURSBr7lZwR%2B%2FowLieNGSO7sOeLQVwL71GKnzBAOZVQH1hw%2B8FIRPoc0pn3v7RjK5CMgTtrZlar67Cv1lTi2nUyAIpX%2BhGkaQeOsg%2ByaIqDIo%2FWwcrg9VV9QP%2FdmwP8hTtq3KTVs0Ncja4Yvizm12BkEwWtMJ9fqzLBXt%2F2J2EjsG7GudgXypwSU7U8oY%2Fq%2BCk93y%2FeTr1ftEFbpGRTRm4hNVXeoCYRyuJceU%2BvO4U5E29ZPqBIolidYtKKH7lnRxKNk2BHtY93VNHPZEjTEDnHcGbgtHmxlBjHRQZlzJKWTjY5ccdFABihGx%2FzY0VCwaehpx2BWxy5qXqW1fX7e5uxxxHteYVt7YyrzYPsX%2B%2FlKiYwt23fsJzmmVkHwmu5%2FTSk1Ms9yJmBE%2B8pEF%2Bum01L8jRH4zxyTaD4s779uLZwLAUUzpi5cfseKTrjGv7uNjCpNci9BXbSdCdqrKa8aPiJX0lWUH9zid%2B8Jc7Jhx%2Bb6nzJpbZ8E9sPpUlcHVGUSzqixsiK91W%2FDDk2LCOvTqJJ9JXmy5cwRhL9r95okWq%2BDImTetFhdYk9%2F9VH3JsACpv4dqqdviEjjFpvmEp7SBMLSWw7toPUIRortPtriz3u9velTqNpHgmbmig8Znb%2F4Q8JrYfjPZzfRxN%2FuQXQyxUNUY2IsYbC5Bm7JWTMZe869muBdE%2FlMLujUkOFCXaOwZXuZHbr7neq0nro3RvYUggBLqxGFlG1Bp52iDNklcx8nfjVMOhOybfCMcxz6mq4Ew2hdLv4IslLRawI5u%2FPQe0vu0TG9LeBeR6Ok1sf72rWpvhD6yl4GTy8oJC1UglabWo8i5aMprxxAWuz%2BzLzizI3aRTQsl1MFKsD9gIGZsaFNAIb7gEgFgw%2B%2BSjTGR51mGES3sOUYXscIJVBciBs3F9vnr8u5gfKD3hLwqvc4djKMBxVQfjLEs%2FQwb7mlOx8XodaV6uOrkiZpw2WZNja5RPBIp4VXeXKXIxqBNsNA4eGT%2Bx2b2JadVB8%3D&__VIEWSTATEGENERATOR=06ED1D24&__VIEWSTATEENCRYPTED=&__ASYNCPOST=true&btnsubmit=Submit"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    s.headers.update({'content-type':'application/x-www-form-urlencoded; charset=UTF-8'})
    r = s.post(URL,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    item = soup.select_one("table")
    print(item)
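(Side note: to compare this raw payload with the dictionary the first script builds, the string can be decoded back into individual fields with the standard library; a quick sketch:)

from urllib.parse import parse_qsl

fields = dict(parse_qsl(payload, keep_blank_values=True))
print(sorted(fields))  # reveals every field the browser sent, e.g. hidcity and the __* state fields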


First of all, there is a small typo in your code (an extra colon in the key):

payload['ddlState:'] = 'Assam'
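It should read:

payload['ddlState'] = 'Assam'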

The bigger problem has to do with how the page is built. The page has three dropdowns, and each selection fires a POST request. Every such POST request returns a modified __VIEWSTATE, which has to be included in the form data of the subsequent request.
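(That round-tripping is what the dict comprehension in the code below does; pulled out as a standalone helper just for illustration, it would look roughly like this:)

from bs4 import BeautifulSoup

def collect_form_fields(html):
    # Gather every <input name=...> from a response, including the hidden
    # ASP.NET state fields such as __VIEWSTATE, so the next POST carries
    # the state the server returned for the previous one.
    soup = BeautifulSoup(html, "lxml")
    return {i['name']: i.get('value', '') for i in soup.select('input[name]')}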

In your code you only grab the __VIEWSTATE from the input fields of the original GET response, but you need the __VIEWSTATE from the last POST response. So the following should work:

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(URL)
    soup = BeautifulSoup(r.text, "lxml")

    # first POST = Select State
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ScriptManager1'] = 'UpdatePanel1|btnsubmit'
    payload['ddlState'] = 'Assam'
    payload['ddlCity'] = 'Select City'
    payload['ddllocation'] = 'Select Location'
    payload['__EVENTTARGET'] = 'ddlState'
    r = s.post(URL, data=payload)
    soup = BeautifulSoup(r.text, "lxml")

    # second POST = Select City
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ScriptManager1'] = 'UpdatePanel1|btnsubmit'
    payload['ddlCity'] = 'Golaghat'
    payload['ddllocation'] = 'Select Location'
    payload['__EVENTTARGET'] = 'ddlCity'
    r = s.post(URL, data=payload)
    soup = BeautifulSoup(r.text, "lxml")

    # third POST = Select Location
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ScriptManager1'] = 'UpdatePanel1|btnsubmit'
    payload['ddlCity'] = 'Golaghat'
    payload['ddllocation'] = 'Golaghat'
    payload['__EVENTTARGET'] = ''

    s.headers.update(headers)
    r = s.post(URL, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    item = soup.select_one("table")
    print(item)

There is still some room to optimize this code; I kept the three steps explicit to make the problem transparent.
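(One possible refactor along those lines, just a sketch: wrap each postback in a helper that reuses the hidden fields from the previous response and overrides only the values that changed. The helper name and structure are mine, not from the original answer; URL and headers are as defined in the question.)

import requests
from bs4 import BeautifulSoup

def postback(session, soup, overrides, event_target=''):
    # Rebuild the payload from the previous response's input fields,
    # then apply only the dropdown values that changed in this step.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ScriptManager1'] = 'UpdatePanel1|btnsubmit'
    payload['__EVENTTARGET'] = event_target
    payload.update(overrides)
    r = session.post(URL, data=payload)
    return BeautifulSoup(r.text, "lxml")

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    soup = BeautifulSoup(s.get(URL).text, "lxml")
    soup = postback(s, soup, {'ddlState': 'Assam', 'ddlCity': 'Select City', 'ddllocation': 'Select Location'}, 'ddlState')
    soup = postback(s, soup, {'ddlCity': 'Golaghat', 'ddllocation': 'Select Location'}, 'ddlCity')
    s.headers.update(headers)
    soup = postback(s, soup, {'ddlCity': 'Golaghat', 'ddllocation': 'Golaghat'})
    print(soup.select_one("table"))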