Tags: javascript, python, selenium, beautifulsoup, reactjs
I want to scrape the anchor links with class="_1UoZlX" from the search results on this particular page - https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof
When I created a soup from the page, I realized that the search results are rendered with React JS, so I cannot find them in the page source (or in the soup).
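One quick way to confirm that the results are rendered client-side is to compare a plain requests fetch against what a real browser shows. A minimal sketch (reusing the _1UoZlX class from the question):

import requests
from bs4 import BeautifulSoup

url = 'https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof'
# Fetch the raw HTML; no JavaScript is executed here
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
# If React renders the results in the browser, this prints 0,
# because the anchors do not exist in the raw page source
print len(soup.find_all('a', {'class': '_1UoZlX'}))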
Here is my code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ['https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
urls=[]
for url in listUrls:
    browser.get(url)
    wait = WebDriverWait(browser, 20)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "_1UoZlX"})
    for result in results:
        link = result["href"]
        print link
        urls.append(link)
print urls
This is the error I receive:
Traceback (most recent call last):
  File "fetch_urls.py", line 19, in <module>
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen
Someone mentioned in this answer that there is a way to use selenium to handle the JavaScript on a page. Can someone elaborate on that? I did some googling but could not find an approach that works for this particular case.
There is no problem with your code, but the website you are scraping never stops loading for some reason, and that blocks the parsing of the page and the subsequent code you wrote.
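If the page itself never finishes loading, one workaround (a sketch, not part of the original answer) is to cap the load with set_page_load_timeout() and swallow the resulting TimeoutException, then parse whatever has rendered so far:

from selenium.common.exceptions import TimeoutException

browser.set_page_load_timeout(30)  # stop waiting for the load event after 30s
try:
    browser.get(url)
except TimeoutException:
    pass  # the DOM may already hold the rendered results
soup = BeautifulSoup(browser.page_source, "html.parser")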
I tried the same thing against Wikipedia to confirm:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ["https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"]
# browser = webdriver.PhantomJS('/usr/local/bin/phantomjs')
browser = webdriver.Chrome("./chromedriver")
urls=[]
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "mw-redirect"})
    for result in results:
        link = result["href"]
        urls.append(link)
print urls
Output:
[u'/wiki/List_of_states_and_territories_of_India_by_area', u'/wiki/List_of_Indian_states_by_GDP_per_capita', u'/wiki/Constitutional_republic', u'/wiki/States_and_territories_of_India', u'/wiki/National_Capital_Territory_of_Delhi', u'/wiki/States_Reorganisation_Act', u'/wiki/High_Courts_of_India', u'/wiki/Delhi_NCT', u'/wiki/Bengaluru', u'/wiki/Madras', u'/wiki/Andhra_Pradesh_Capital_City', u'/wiki/States_and_territories_of_India', u'/wiki/Jammu_(city)']
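Note that the hrefs come back relative to the wiki root. If you need absolute URLs, a small follow-up (assuming Python 2, as in the post; on Python 3 the import is urllib.parse):

from urlparse import urljoin

base = "https://en.wikipedia.org"
absolute_urls = [urljoin(base, link) for link in urls]
# e.g. https://en.wikipedia.org/wiki/Bengaluru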
PS: I used the Chrome driver to run the script against a real Chrome browser for debugging. Download chromedriver from https://chromedriver.storage.googleapis.com/index.html?path=2.27/
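PhantomJS has since been deprecated and recent Selenium releases no longer support it, so headless Chrome is the closer equivalent today. A minimal sketch, assuming chromedriver is on your PATH:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
browser = webdriver.Chrome(options=options)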