jru*_*003 11 javascript css python selenium selenium-webdriver
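I am using Selenium with Python to automatically scroll down a social media site and scrape posts. At the moment I extract all the text in one "hit" after scrolling a fixed number of times (code below), but instead I would like to extract only the newly loaded text after each scroll.
For example, if the page initially contains the text "A, B, C" and then shows "D, E, F" after the first scroll, I want to store "A, B, C", scroll, then store "D, E, F", and so on.
The specific items I want to extract are the date and message text of each post, which can be obtained with the CSS selectors '.message-date' and '.message-body' respectively (e.g. dates = driver.find_elements_by_css_selector('.message-date')).
Can anyone suggest how to extract only the newly loaded text after each scroll?
Here is my current code (it extracts all dates/messages after I have finished scrolling):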
from selenium import webdriver
import sys
import time
from selenium.webdriver.common.keys import Keys

#load website to scrape
driver = webdriver.PhantomJS()
driver.get("https://stocktwits.com/symbol/USDJPY?q=%24USDjpy")

#Scroll the webpage
ScrollNumber=3 #max scrolls
print(str(ScrollNumber)+ " scrolldown will be done.")
for i in range(1,ScrollNumber): #scroll down X times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3) #Delay between 2 scrolls down to be sure the page loaded
    ## I WANT TO SAVE/STORE ANY NEWLY LOADED POSTS HERE RATHER
    ## THAN EXTRACTING IT ALL IN ONE GO AT THE END OF THE LOOP

# Extract messages and dates.
## I WANT TO EXTRACT THIS DATA ON THE FLY IN THE ABOVE
## LOOP RATHER THAN EXTRACTING IT HERE
dates = driver.find_elements_by_css_selector('.message-date')
messages = driver.find_elements_by_css_selector('.message-body')
You can store the number of messages in a variable and use XPath with position() to get only the newly added posts:
dates = []
messages = []
num_of_posts = 0  # posts already collected; XPath position() is 1-based
for i in range(1, ScrollNumber):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    # position() > num_of_posts skips the posts collected on earlier passes
    dates.extend(driver.find_elements_by_xpath('(//div[@class="message-date"])[position()>' + str(num_of_posts) + ']'))
    messages.extend(driver.find_elements_by_xpath('(//div[contains(@class, "message-body")])[position()>' + str(num_of_posts) + ']'))
    num_of_posts = len(dates)
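The same counting idea can also be done on the Python side by slicing the full element list instead of filtering in XPath. The sketch below is only an illustration: `take_new_items` is a hypothetical helper, and the `snapshots` lists stand in for what `driver.find_elements_by_css_selector('.message-body')` might return after each scroll. It assumes the page only appends posts and never removes or reorders earlier ones.

```python
def take_new_items(all_items, seen_count):
    """Return the items past the first `seen_count`, plus the updated count.

    Hypothetical helper: `all_items` is the full list found on the page
    right now, `seen_count` is how many were already stored earlier.
    """
    new_items = all_items[seen_count:]
    return new_items, seen_count + len(new_items)


# Simulated page contents after each scroll (stand-ins for the lists
# Selenium would return; assumes posts are only ever appended).
snapshots = [
    ["A", "B", "C"],
    ["A", "B", "C", "D", "E", "F"],
    ["A", "B", "C", "D", "E", "F", "G"],
]

seen = 0
batches = []
for snapshot in snapshots:
    new_posts, seen = take_new_items(snapshot, seen)
    batches.append(new_posts)  # store only what this scroll added

print(batches)  # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
```

With a real driver, the list passed in would be the result of the selector call made after each scroll, and reading `.text` from each new element right away avoids holding on to elements that may later go stale.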