I am using the R programming language and am trying to understand how to use Selenium to interact with a web page.
For example, using Google Maps - I am trying to find the names, addresses and longitudes/latitudes of all pizza places around a given area. As I understand it, this would involve typing in the location you are interested in, clicking the "Nearby" button, entering what you are looking for (e.g. "pizza"), scrolling all the way to the bottom to make sure every pizza place has loaded - and then copying the name, address and longitude/latitude of each one.
I have been teaching myself how to use Selenium in R, and was able to figure out parts of this problem on my own. Here is what I have done so far:
Part 1: Search for an address (e.g. Statue of Liberty, New York, USA) and return its longitude/latitude:
library(RSelenium)
library(wdman)
library(netstat)

# Download the Selenium binaries, then capture the server command without starting it
selenium()
selenium_object <- selenium(retcommand = TRUE, check = FALSE)

# Start a Chrome session on a free port
remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = FALSE, port = free_port())
remDr <- remote_driver$client

# Search for the address and give the map time to settle
remDr$navigate("https://www.google.com/maps")
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("Statue of Liberty", key = "enter"))
Sys.sleep(5)

# The coordinates are embedded in the URL after the "@"
url <- remDr$getCurrentUrl()[[1]]
long_lat <- gsub(".*@(-?[0-9.]+),(-?[0-9.]+),.*", "\\1,\\2", url)
long_lat <- unlist(strsplit(long_lat, ","))
> long_lat
[1] "40.7269409" "-74.0906116"
Part 2: Search for all pizza places around a given location:
library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = TRUE, check = FALSE)
remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = FALSE, port = free_port())
remDr <- remote_driver$client

# Center the map on the coordinates from Part 1
remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)

# Clear the search box and search for "pizza" around that point
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)
But from here I do not know how to proceed. I do not know how to scroll the page all the way to the bottom so that all available results are shown - and I do not know how to start extracting the names.
Through some research (i.e. inspecting the HTML), I found the following:
The name of a restaurant location can be found in this tag: <a class="hfpxzc" aria-label=
The address of a restaurant location can be found in this tag: <div class="W4Efsd">
In the end, I am looking for a result like this:
name address longitude latitude
1 pizza land 123 fake st, city, state, zip code 45.212 -75.123
Can someone please show me how to proceed?
Note: seeing as more people probably use Selenium through Python - I am more than happy to learn how to solve this problem in Python, and then try to convert the answer into R code.
Thanks!
UPDATE: Some further progress on the addresses:
remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)

# Names live in the aria-label attribute of the a.hfpxzc result links
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- lapply(name_elements, function(x) x$getElementAttribute("aria-label")[[1]])

# Note: .W4Efsd matches several rows per result card (ratings as well as addresses)
address_elements <- remDr$findElements(using = 'css selector', '.W4Efsd')
addresses <- lapply(address_elements, function(x) x$getElementText()[[1]])

result <- data.frame(name = unlist(names), address = unlist(addresses))
I see you updated your question to say that a Python answer is fine, so here is how it can be done in Python. You can use the same approach for R.
The page is lazy-loaded, which means that as you scroll, the data is paginated and more of it gets loaded in.
So what you need to do is keep finding the last HTML element of the data and scrolling to it, which causes more content to load.
You need to understand how the data is loaded. Here is what I did:
First, disable the browser's internet access in the network panel (F12 -> Network -> Offline).
Then, scroll to the last loaded element, and you will see a loading indicator (it just hangs, since there is no internet).
Now comes the important part: find out which HTML tag this loading indicator sits under.
As you can see, the element sits under the div.qjESne CSS selector.
You can call the JavaScript scrollIntoView() function, which scrolls a given element into view in the browser viewport.
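In RSelenium, that call could look like this (a minimal sketch, assuming remDr is a connected client; RSelenium's executeScript converts a web element passed in args into the form that arguments[0] expects):
# Sketch: scroll the last loaded result card into view to trigger the next batch
cards <- remDr$findElements(using = 'css selector', 'div.qjESne')
last_card <- cards[[length(cards)]]
remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(last_card))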
To know when to stop scrolling for more data, we need to find the element that only appears once there is no data left.
If you keep scrolling until there are no more results, you will see:
It is an element under the CSS selector span.HlvSq.
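The equivalent check in R could be (again a sketch assuming RSelenium): findElements simply returns an empty list until that banner exists.
# Sketch: non-empty once the "You've reached the end of the list." banner appears
end_banner <- remDr$findElements(using = 'css selector', 'span.HlvSq')
no_more_results <- length(end_banner) > 0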
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()
driver.get(URL)

# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        print("No more elements")
        break

If you inspect the page, you will see that the data lives inside cards under the CSS selector div.lI9IFe.
What you need to do is wait for the scrolling to finish, and then you can grab all the data under the CSS selector div.lI9IFe:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()
driver.get(URL)

# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)
titles = []
addresses = []

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        # Now we can parse the data, since all the elements have loaded
        for data in driver.find_elements(By.CSS_SELECTOR, "div.lI9IFe"):
            title = data.find_element(
                By.CSS_SELECTOR, "div.qBF1Pd.fontHeadlineSmall"
            ).text
            restaurant = data.find_element(
                By.CSS_SELECTOR, ".W4Efsd > span:nth-of-type(2)"
            ).text

            titles.append(title)
            addresses.append(restaurant)

        # This converts the lists of titles and addresses into a dataframe
        df = pd.DataFrame(list(zip(titles, addresses)), columns=["title", "addresses"])
        print(df)
        break

This prints:
                             title               addresses
0                   Domino's Pizza  · 741 Communipaw Ave A
1        Tommy's Family Restaurant       · 349 Central Ave
2     VIP RESTAURANT LLC BARSHAY'S           · 175 Sip Ave
3    The Hutton Restaurant and Bar         · 225 Hutton St
4                        Barge Inn            · 324 3rd St
..                             ...                     ...
116            Bettie's Restaurant     · 579 West Side Ave
117               Mahboob-E-El Ahi     · 580 Montgomery St
118                Samosa Paradise        · 804 Newark Ave
119                     TACO DRIVE        · 195 Newark Ave
120                Two Boots Pizza        · 133 Newark Ave

[121 rows x 2 columns]
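Since the question started in R, here is how the same approach might translate to RSelenium - a sketch under the same assumptions as above (the div.qjESne, span.HlvSq and div.lI9IFe selectors, and an already-connected remDr sitting on the Maps results page); findChildElement is RSelenium's equivalent of Python's element-level find_element:
# Sketch: port of the Python scroll-and-parse loop to RSelenium
cards <- remDr$findElements(using = 'css selector', 'div.qjESne')

repeat {
  # Scroll the last loaded card into view to trigger the next batch
  last_card <- cards[[length(cards)]]
  remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(last_card))
  Sys.sleep(1)

  # Refresh the list of loaded cards
  cards <- remDr$findElements(using = 'css selector', 'div.qjESne')

  # Stop once the "You've reached the end of the list." banner appears
  if (length(remDr$findElements(using = 'css selector', 'span.HlvSq')) > 0) break
}

# Everything is loaded - parse each result card under div.lI9IFe
titles <- character(0)
addresses <- character(0)
for (card in remDr$findElements(using = 'css selector', 'div.lI9IFe')) {
  title <- card$findChildElement(using = 'css selector', 'div.qBF1Pd.fontHeadlineSmall')$getElementText()[[1]]
  address <- card$findChildElement(using = 'css selector', '.W4Efsd > span:nth-of-type(2)')$getElementText()[[1]]
  titles <- c(titles, title)
  addresses <- c(addresses, address)
}

result <- data.frame(title = titles, address = addresses)
For the longitude/latitude column you asked about, the trick from Part 1 of your question should still apply: open (or read the href of) each result link and parse the coordinates out of the URL.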