如何从推特上抓取所有主题

Sur*_*ian 7 python selenium xpath web-scraping webdriverwait

推特中的所有主题都可以在这个链接中找到 我想用里面的每个子类别抓取。

BeautifulSoup 在这里似乎没有用。我尝试使用 selenium,但我不知道如何匹配单击主类别后出现的 Xpath。

from selenium import webdriver
from selenium.common import exceptions

url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()

main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')

topics = {}
for main_topic in main_topics[2:]:
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}
Run Code Online (Sandbox Code Playgroud)

我知道我可以使用 来单击主类别main_topics[3].click(),但我不知道如何递归单击它们,直到我只找到Follow右侧的那些。

Deb*_*anB 8

要使用Selenium抓取所有主要主题,例如艺术与文化商业与金融等,您必须引发WebDriverWait for并且您可以使用以下任一定位器策略visibility_of_all_elements_located()

  • 使用XPATH文本属性:

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
    Run Code Online (Sandbox Code Playgroud)
  • 使用XPATHget_attribute()

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
    Run Code Online (Sandbox Code Playgroud)
  • 控制台输出:

    ['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
    
    Run Code Online (Sandbox Code Playgroud)

要使用 Selenium 和WebDriver抓取所有主要子主题,您可以使用以下定位器策略

  • 使用XPATHget_attribute("textContent")

    driver.get("https://twitter.com/i/flow/topics_selector")
    elements =  WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
    for element in elements:
        element.click()
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
    driver.quit()
    
    Run Code Online (Sandbox Code Playgroud)
  • 控制台输出:

    ['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
    
    Run Code Online (Sandbox Code Playgroud)
  • 注意:您必须添加以下导入:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    Run Code Online (Sandbox Code Playgroud)