Scraping sub-pages of a list with "Load More" pagination using BeautifulSoup

tay*_*ese 3 python parsing screen-scraping beautifulsoup

Very new here, so apologies in advance. I'm looking to get a list of all the company descriptions from https://angel.co/companies to play around with. The web-based parsing tools I've tried haven't cut it, so I'm hoping to write a simple Python script. Should I first get an array of all the company URLs and then loop through them? Any resources or direction would be helpful - I've looked through BeautifulSoup's documentation and a few posts/video tutorials, but I'm getting hung up on simulating the json request, among other things (see here: Get all links with BeautifulSoup from a single-page website ("Load More" feature)).

I can see a script that I believe is calling up the additional listings:

o.on("company_filter_fetch_page_complete", function(e) {
    return t.ajax({
        url: "/companies/startups",
        data: e,
        dataType: "json",
        success: function(t) {
            return t.html ? 
                (E().find(".more").empty().replaceWith(t.html),
                 c()) : void 0
        }
    })
}),

Thanks!

Pad*_*ham 5

The data you want to scrape is loaded dynamically with Ajax; you need to do quite a bit of work to get the html you actually want:

import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    }

with requests.Session() as s:
    # Initial GET to pull a valid csrf token out of the page's meta tag.
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    header["X-CSRF-Token"] = csrf
    # POST to the search endpoint to get the company ids and the rest of the query parameters.
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
    _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
    rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
    url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
    rsp = s.get(url, headers=header)
    print(rsp.json())
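As an aside, the manual `"ids%5B%5D={}&"` formatting can also be done with the standard library's `urlencode`, which percent-encodes the `ids[]` brackets for you. A minimal offline sketch, using a made-up response dict shaped like the json this endpoint returns:

```python
from urllib.parse import urlencode

# Hypothetical search_data response (same shape as the real endpoint's json).
search_data = {"ids": [296769, 297064, 60], "total": 908164,
               "page": 1, "sort": "signal"}

ids = search_data.pop("ids")
# One ids[]=... pair per element; urlencode turns the brackets into %5B%5D.
query = urlencode([("ids[]", i) for i in ids] + sorted(search_data.items()))
url = "https://angel.co/companies/startups?" + query
print(query)
```

This produces the same `ids%5B%5D=...` pairs as the string formatting above, without hand-rolling the escaping.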

We first need to get a valid csrf token, which is what the initial request does; then we need to post to https://angel.co/company_filters/search_data:


That gives us:

{"ids":[296769,297064,60,63,112,119,130,160,167,179,194,236,281,287,312,390,433,469,496,516],"total":908164,"page":1,"sort":"signal","new":false,"hexdigest":"3f4980479bd6dca37e485c80d415e848a57c43ae"}

Those are the parameters we need for our final request to https://angel.co/companies/startups:


That request then gives us more json, containing the html with all the company info:

{"html":"<div class=\" dc59 frs86 _a _jm\" data-_tn=\"companies/results ...........

There is way too much to post here, but that is what you need to parse.
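Since the interesting part of the response is that "html" string, here is a minimal offline sketch of feeding it to BeautifulSoup; the markup is a made-up stand-in that only reuses the class names (base startup, text, name, pitch) that the selectors further down rely on:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the "html" field of the json response.
html = """
<div class="base startup">
  <div class="text">
    <div class="name">Frontback</div>
    <div class="pitch">Me, now.</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml works too
companies = []
for comp in soup.select("div.base.startup"):
    text = comp.select_one("div.text")
    companies.append((text.select_one("div.name").get_text(strip=True),
                      text.select_one("div.pitch").get_text(strip=True)))
print(companies)
```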

So putting it all together:

In [3]: header = {
   ...:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
   ...:     "X-Requested-With": "XMLHttpRequest",
   ...: }

In [4]: with requests.Session() as s:
   ...:         r = s.get("https://angel.co/companies").content
   ...:         csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
   ...:         header["X-CSRF-Token"] = csrf
   ...:         ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
   ...:         _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
   ...:         rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
   ...:         url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
   ...:         rsp = s.get(url, headers=header)
   ...:         soup = BeautifulSoup(rsp.json()["html"], "lxml")
   ...:         for comp in soup.select("div.base.startup"):
   ...:                 text = comp.select_one("div.text")
   ...:                 print(text.select_one("div.name").text.strip())
   ...:                 print(text.select_one("div.pitch").text.strip())
   ...:         
Frontback
Me, now.
Outbound
Optimizely for messages
Adaptly
The Easiest Way to Advertise Across The Social Web.
Draft
Words with Friends for Fantasy (w/ real money)
Graphicly
an automated ebook publishing and distribution platform
Appstores
App Distribution Platform
eVenues
Online Marketplace & Booking Engine for Unique Meeting Spaces
WePow
Video & Mobile Recruitment
DoubleDutch
Event Marketing Automation Software
ecomom
It's all good
BackType
Acquired by Twitter
Stipple
Native advertising for the visual web
Pinterest
A Universal Social Catalog
Socialize
Identify and reward your most influential users with our drop-in social platform.
StyleSeat
Largest and fastest growing marketplace in the $400B beauty and wellness industry
LawPivot
99 Designs for legal
Ostrovok
Leading hotel booking platform for Russian-speakers
Thumb
Leading mobile social network that helps people get instant opinions
AppFog
Making developing applications on the cloud easier than ever before
Artsy
Making all the world’s art accessible to anyone with an Internet connection.

As far as paging goes, you are limited to 20 pages per day, but to get all 20 it is simply a case of adding page:page_no to our form data, i.e. data={"sort": "signal", "page": page}, to get the new parameters needed. You can see what gets posted when you click "Load More".
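In other words, the only thing that changes per request is the page key in the form data; a trivial sketch of the 20 payloads:

```python
# Form data for each of the 20 allowed pages; "sort" stays fixed.
payloads = [{"sort": "signal", "page": page} for page in range(1, 21)]
print(payloads[0], payloads[-1])
```

Each dict would be passed as data= to the search_data POST, one request per page.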


So the final code:

import requests
from bs4 import BeautifulSoup

def parse(soup):
    # Pull the name and one-line pitch out of each company block.
    for comp in soup.select("div.base.startup"):
        text = comp.select_one("div.text")
        yield text.select_one("div.name").text.strip(), text.select_one("div.pitch").text.strip()

def connect(page):
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }

    with requests.Session() as s:
        r = s.get("https://angel.co/companies").content
        csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
        header["X-CSRF-Token"] = csrf
        ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal","page":page}, headers=header).json()
        _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
        rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
        url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
        rsp = s.get(url, headers=header)
        soup = BeautifulSoup(rsp.json()["html"], "lxml")
        for n, p in parse(soup):
            yield n, p

for i in range(1, 21):
    for name, pitch in connect(i):
        print(name, pitch)

Obviously what you parse is up to you, but everything you see in your browser from the results will be available.
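For example, a hypothetical sketch that collects the (name, pitch) pairs the final code prints and writes them out as csv instead (csv handles the quoting of commas inside pitches for you):

```python
import csv
import io

# Stand-in rows; in the real script these would come from connect(page).
rows = [("Frontback", "Me, now."), ("Outbound", "Optimizely for messages")]

buf = io.StringIO()  # swap for open("companies.csv", "w", newline="") to write a file
writer = csv.writer(buf)
writer.writerow(["name", "pitch"])  # header row
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```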