I've created a script using Python with Selenium, with a proxy implemented in it, that logs in to Facebook and scrapes the name of the user whose post sits at the top of my feed. I want the script to keep running indefinitely, executing once every five minutes.
Since logging in continuously like this could get my account banned, I'd like to use proxies in the script so that all the work is done anonymously.
This is what I've written so far:
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_first_user(random_proxy):
    options = webdriver.ChromeOptions()
    # suppress Facebook's notification prompt
    prefs = {"profile.default_content_setting_values.notifications": 2}
    options.add_experimental_option("prefs", prefs)
    options.add_argument(f'--proxy-server={random_proxy}')

    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://www.facebook.com/")
        driver.find_element_by_id("email").send_keys("username")
        driver.find_element_by_id("pass").send_keys("password", Keys.RETURN)
        user = wait.until(EC.presence_of_element_located(
            (By.XPATH, "//h4[@id][@class][./span[./a]]/span/a"))).text
    return user

if __name__ == '__main__':
    proxies = [`list of proxies`]
    while True:
        random_proxy = proxies.pop(random.randrange(len(proxies)))
        print(get_first_user(random_proxy))
        time.sleep(60 * 5)  # sleep() takes seconds, so this waits five minutes
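One detail worth flagging in the loop above: `proxies.pop(...)` shrinks the list on every iteration, and once it is empty `random.randrange(0)` raises `ValueError`. A minimal sketch of a pool that hands out each proxy once per cycle and then refills (the `ProxyPool` name and the proxy strings here are made up for illustration):

```python
import random

class ProxyPool:
    """Hand out proxies in random order; refill once every address has been used."""

    def __init__(self, proxies):
        self._all = list(proxies)
        self._remaining = list(proxies)

    def next_proxy(self):
        if not self._remaining:  # every proxy used once: start a new cycle
            self._remaining = list(self._all)
        return self._remaining.pop(random.randrange(len(self._remaining)))

pool = ProxyPool(["111.222.0.1:8080", "111.222.0.2:8080"])
proxy = pool.next_proxy()  # pass this to --proxy-server=...
```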
How can I stay undetected while continuously scraping data from a site that requires authentication?
I'm not sure why you want to log in to your Facebook account every 5 minutes to scrape content. Logging in each time from a random proxy address is also likely to raise a red flag with Facebook's security rules.
Instead of logging in every 5 minutes, I would recommend staying logged in. Selenium has a refresh function for reloading a web page under automation's control. Using that method, you can refresh your Facebook feed at a predefined interval, such as every 5 minutes.
The code below reloads the page with this refresh method. It also checks the post at the top of your feed for the user's name.
While testing I noticed that Facebook uses some randomized markup, which is probably there to mitigate scraping. I also noticed that Facebook changes the username format for posts linked to groups, so you will need to do more testing if you want the usernames linked to those posts. I strongly recommend more testing to determine which user elements are not being scraped correctly.
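Because the markup varies, one way to keep the scraper maintainable is to centralize the fallback logic: try a list of lookups in order and return the first that succeeds. The sketch below is plain Python and the `first_match` name is my own; in practice each lookup would wrap a `driver.find_element_by_xpath(...)` call that raises `NoSuchElementException` when the element is absent:

```python
def first_match(lookups, default="No user information found"):
    """Return the result of the first lookup that succeeds.

    Each lookup is a zero-argument callable that either returns a value
    or raises an exception (e.g. Selenium's NoSuchElementException).
    """
    for lookup in lookups:
        try:
            return lookup()
        except Exception:
            continue
    return default
```

With Selenium this would be called as `first_match([lambda: driver.find_element_by_xpath(xp).text for xp in xpath_variants])`, which keeps the XPath variants in one list that is easy to extend as Facebook's markup changes.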
from time import sleep
from random import randint
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

# disable the banner "Chrome is being controlled by automated test software"
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])

driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
driver.get('https://www.facebook.com')
driver.implicitly_wait(20)
driver.find_element_by_id("email").send_keys("your_username")
driver.find_element_by_id("pass").send_keys("your_password")
driver.implicitly_wait(10)
driver.find_element_by_xpath("//button[text()='Log In']").click()

# this function checks for a standard username tag
def user_element_exist():
    try:
        if driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a"):
            return True
    except NoSuchElementException:
        return False

# this function looks for a username linked to Facebook Groups at the top of your feed
def group_element():
    try:
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b"):
            poster_name = driver.find_element_by_xpath(
                "//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b").text
            return poster_name
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span"):
            poster_name = driver.find_element_by_xpath(
                "//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span").text
            return poster_name
    except NoSuchElementException:
        return "No user information found"

while True:
    element_exists = user_element_exist()
    if not element_exists:
        user_name = group_element()
        print(user_name)
        driver.refresh()
    elif element_exists:
        user_name = driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a").text
        print(user_name)
        driver.refresh()

    # set the sleep timer to fit your needs
    sleep(300)  # this sleeps for 300 seconds, which is 5 minutes
    # I would likely use a random sleep function instead:
    # sleep(randint(180, 360))
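The randomized delay suggested in the final comment can be pulled into a small helper so the bounds are explicit and easy to tune. The `jittered_delay` name is my own; the 180–360 second bounds are just the values from the comment above, nothing Facebook-specific:

```python
from random import randint
from time import sleep

def jittered_delay(low=180, high=360):
    """Pick a random delay (in seconds) between low and high, inclusive."""
    return randint(low, high)

# in the main loop, instead of a fixed sleep(300):
# sleep(jittered_delay())
```

Varying the interval this way avoids a perfectly regular five-minute request pattern, which is one of the easier signals for a site to spot.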