如何从Instagram网页浏览器网页抓取粉丝?

use*_*765 5 python selenium web-scraping instagram-api

谁能告诉我如何访问底层URL以查看给定用户的Instagram粉丝?我可以使用Instagram API执行此操作,但鉴于审批流程有待更改,我决定切换到抓取.

Instagram网络浏览器允许您查看任何给定公共用户的关注者列表 - 例如,查看Instagram的关注者,访问" https://www.instagram.com/instagram ",然后单击关注者URL以打开通过查看者分页的窗口(注意:您必须登录到您的帐户才能查看此内容).

我注意到,当弹出此窗口时,URL会更改为" https://www.instagram.com/instagram/followers ",但我似乎无法查看此URL的基础页面源.

由于它出现在我的浏览器窗口中,我认为我将能够刮擦.但是我必须使用像Selenium这样的软件包吗?有谁知道底层URL是什么,所以我不必使用Selenium?

例如,我可以通过访问"instagram.com/instagram/media/"直接访问基础Feed数据,我可以从中搜索和分页所有迭代.我想对关注者列表做类似的事情,并直接访问这些数据(而不是使用Selenium).

Lev*_*ker 15

编辑:2018年12月更新:

自发布以来,Insta land的情况发生了变化.这是一个更新的脚本,更加pythonic和更好地利用XPATH/CSS路径.

请注意,要使用此更新的脚本,您必须安装explicitpackage(pip install explicit),或将每行转换waiter为纯selenium显式等待.

import itertools

from explicit import waiter, XPATH
from selenium import webdriver


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    waiter.find_write(driver, "//div/input[@name='username']", username, by=XPATH)
    waiter.find_write(driver, "//div/input[@name='password']", password, by=XPATH)
    waiter.find_element(driver, "//div/button[@type='submit']", by=XPATH).click()

    # Wait for the user dashboard page to load
    waiter.find_element(driver, "//a/span[@aria-label='Find People']", by=XPATH)


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    # driver.find_element_by_partial_link_text("follower").click()
    waiter.find_element(driver, "//a[@href='/instagram/followers/']", by=XPATH).click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # model by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advange of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be large amount of time between that call and this one,
        # and the element might have gone stale. Lets just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(follower_index))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = 'instagram'
    driver = webdriver.Chrome()
    try:
        login(driver)
        # Print the first 75 followers for the "instagram" account
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
            if count >= 75:
                break
    finally:
        driver.quit()
Run Code Online (Sandbox Code Playgroud)

我做了一个快速的基准测试,以显示性能如何以指数方式下降,您尝试以这种方式抓取的追随者越多:

$ python example.py
Followers of the "instagram" account
Found    100 followers in 11 seconds
Found    200 followers in 19 seconds
Found    300 followers in 29 seconds
Found    400 followers in 47 seconds
Found    500 followers in 71 seconds
Found    600 followers in 106 seconds
Found    700 followers in 157 seconds
Found    800 followers in 213 seconds
Found    900 followers in 284 seconds
Found   1000 followers in 375 seconds
Run Code Online (Sandbox Code Playgroud)

原帖:您的问题有点令人困惑.例如,我不确定"从中我可以通过所有迭代进行刮擦和分页"实际上意味着什么.您目前使用什么来刮擦和分页?

无论如何,instagram.com/instagram/media/端点都不是同一类型instagram.com/instagram/followers.的media端点似乎是一个REST API,用于返回一个可解析容易JSON对象.

followers从我所知道的,端点实际上并不是一个RESTful端点.相反,在点击"关注者"按钮后,Instagram AJAX会将信息发送到页面源(使用React?).我不认为你能够在不使用像Selenium这样的东西的情况下获得这些信息,Selenium可以加载/渲染向用户显示关注者的javascript.

此示例代码将起作用:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "//div[@style='position: relative; z-index: 1;']/div/div[2]/div/div[1]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    # You'll need to figure out some scrolling magic here. Something that can
    # scroll to the bottom of the followers modal, and know when its reached
    # the bottom. This is pretty impractical for people with a lot of followers

    # Finally, scrape the followers
    xpath = "//div[@style='position: relative; z-index: 1;']//ul/li/div/div/div/div/a"
    followers_elems = driver.find_elements_by_xpath(xpath)

    return [e.text for e in followers_elems]


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        login(driver)
        followers = scrape_followers(driver, "instagram")
        print(followers)
    finally:
        driver.quit()
Run Code Online (Sandbox Code Playgroud)

出于多种原因,这种方法存在问题,其中主要原因是它相对于API而言有多慢.

  • 2019 年 10 月的新闻:XPATH 更改:`"//div/label/input[@name='username']"` (2认同)

小智 6

更新:2020 年 3 月

这只是Levi 的回答,在某些部分有一些小的更新,因为现在它没有成功退出驱动程序。默认情况下,这也会获得所有关注者,正如其他人所说,它不适合很多关注者

import itertools

from explicit import waiter, XPATH
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep

def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")
    sleep(3)
    # Login
    driver.find_element_by_name("username").send_keys(username)
    driver.find_element_by_name("password").send_keys(password)
    submit = driver.find_element_by_tag_name('form')
    submit.submit()

    # Wait for the user dashboard page to load
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    # driver.find_element_by_partial_link_text("follower").click
    sleep(2)
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)
    allfoll = int(driver.find_element_by_xpath("//li[2]/a/span").text)
    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # model by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advange of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            if follower_index > allfoll:
                raise StopIteration
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be large amount of time between that call and this one,
        # and the element might have gone stale. Lets just re-acquire it to avoid
        # tha
        last_follower = waiter.find_element(driver, follower_css.format(group+11))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = ""  # <account to check>
    driver = webdriver.Firefox(executable_path="./geckodriver")
    try:
        login(driver)
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
    finally:
        driver.quit()
Run Code Online (Sandbox Code Playgroud)