Selenium/python:每次滚动后从动态加载的网页中提取文本

jru*_*003 11 javascript css python selenium selenium-webdriver

我正在使用Selenium/python自动向下滚动社交媒体网站并抓取帖子.我目前正在提取所有的文字在一个"打" 滚动一定的次数(下面的代码),而是我想只提取每个滚动后的新装入的文字.

例如,如果页面最初包含文本"A,B,C",那么在第一次滚动后它显示"D,E,F",我想要存储"A,B,C",然后滚动,然后存储"D,E,F"等.

我想要提取的具体项目是帖子的日期和消息文本,可以分别使用css选择器'.message-date''.message-body'(例如dates = driver.find_elements_by_css_selector('.message-date'))获得.

任何人都可以建议如何在每次滚动后只提取新加载的文本?

这是我当前的代码(我完成滚动提取所有日期/消息):

from selenium import webdriver
import sys
import time
from selenium.webdriver.common.keys import Keys

#load website to scrape
driver = webdriver.PhantomJS()
driver.get("https://stocktwits.com/symbol/USDJPY?q=%24USDjpy")

#Scroll the webpage
ScrollNumber=3 #max scrolls
print(str(ScrollNumber)+ " scrolldown will be done.")
for i in range(1,ScrollNumber):  #scroll down X times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3) #Delay between 2 scrolls down to be sure the page loaded
    ## I WANT TO SAVE/STORE ANY NEWLY LOADED POSTS HERE RATHER 
    ## THAN EXTRACTING IT ALL IN ONE GO AT THE END OF THE LOOP

# Extract messages and dates.
## I WANT TO EXTRACT THIS DATA ON THE FLY IN THE ABOVE
## LOOP RATHER THAN EXTRACTING IT HERE
dates = driver.find_elements_by_css_selector('.message-date')
messages = driver.find_elements_by_css_selector('.message-body')
Run Code Online (Sandbox Code Playgroud)

Guy*_*Guy 6

您可以将消息数存储在变量中并使用xpathposition()获取新添加的帖子

dates = []
messages = []
num_of_posts = 1
for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    dates.extend(driver.find_elements_by_xpath('(//div[@class="message-date"])[position()>=' + str(num_of_posts) + ']'))
    messages.extend(driver.find_elements_by_xpath('(//div[contains(@class, "message-body")])[position()>=' + str(num_of_posts) + ']'))
    num_of_posts = len(dates)
Run Code Online (Sandbox Code Playgroud)