Sur*_*ian 7 python selenium xpath web-scraping webdriverwait
推特中的所有主题都可以在这个链接中找到 我想用里面的每个子类别抓取。
BeautifulSoup 在这里似乎没有用。我尝试使用 selenium,但我不知道如何匹配单击主类别后出现的 Xpath。
from selenium import webdriver
from selenium.common import exceptions
url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()
main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')
topics = {}
for main_topic in main_topics[2:]:
print(main_topic.text.strip())
topics[main_topic.text.strip()] = {}
Run Code Online (Sandbox Code Playgroud)
我知道我可以使用 来单击主类别main_topics[3].click(),但我不知道如何递归单击它们,直到我只找到Follow右侧的那些。
要使用Selenium和python抓取所有主要主题,例如艺术与文化、商业与金融等,您必须引发WebDriverWait for并且您可以使用以下任一定位器策略:visibility_of_all_elements_located()
使用XPATH和文本属性:
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
Run Code Online (Sandbox Code Playgroud)
使用XPATH和get_attribute():
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
Run Code Online (Sandbox Code Playgroud)
控制台输出:
['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
Run Code Online (Sandbox Code Playgroud)
要使用 Selenium 和WebDriver抓取所有主要和子主题,您可以使用以下定位器策略:
使用XPATH和get_attribute("textContent"):
driver.get("https://twitter.com/i/flow/topics_selector")
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
for element in elements:
element.click()
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
driver.quit()
Run Code Online (Sandbox Code Playgroud)
控制台输出:
['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
Run Code Online (Sandbox Code Playgroud)
注意:您必须添加以下导入:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
372 次 |
| 最近记录: |