Mil*_*ano · 9 · Tags: python, selenium, scroll, selenium-webdriver
I am trying to scrape some data from a flight search page.
The page works like this:
You fill in a form and click the search button, which works fine. Clicking the button redirects you to the results page, and that is where the problem starts. The page keeps appending results for about a minute, which by itself is no big deal; the problem is getting all of those results. In a real browser you have to scroll down the page for the results to appear, so I tried scrolling down with Selenium. It scrolls to the bottom so fast, or rather jumps instead of scrolling, that the page never loads any new results.
When you scroll down slowly the page keeps loading results, but if you scroll down very quickly it stops loading them.
I'm not sure my code helps to understand the problem, but I'm attaching it anyway.
import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

SEARCH_STRING = """URL"""


class spider():
    def __init__(self):
        self.driver = webdriver.Firefox()

    @staticmethod
    def prepare_get(dep_airport, arr_airport, dep_date, arr_date):
        return SEARCH_STRING % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)

    def find_flights_html(self, dep_airport, arr_airport, dep_date, arr_date):
        if isinstance(dep_airport, list):
            dep_airport = str(r'%20').join(dep_airport)
        wait = WebDriverWait(self.driver, 60)  # wait for results
        self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        self.driver.find_element_by_xpath('//body').send_keys(Keys.CONTROL + Keys.END)
        return self.driver.page_source

    @staticmethod
    def get_info_from_borderbox(div):
        price = div.find('div', class_='pricebox').find('div', class_=re.compile('price'))
        departure = div.find_all('div', class_='departure')[1].contents
        date_departure = departure[1].text
        airport_departure = departure[5].text
        arrival = div.find_all('div', class_='arrival')[0].contents
        date_arrival = arrival[1].text
        airport_arrival = arrival[3].text[1:]
        print 'DEPARTURE: '
        print date_departure, airport_departure
        print 'ARRIVAL: '
        print date_arrival, airport_arrival

    @staticmethod
    def get_flights_from_result_page(html):
        def match_tag(tag, classes):
            return (tag.name == 'div'
                    and 'class' in tag.attrs
                    and all(c in tag['class'] for c in classes))

        soup = mLib.getSoup_html(html)  # mLib is my own helper module
        divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))
        for div in divs:
            spider.get_info_from_borderbox(div)
        print len(divs)


spider_inst = spider()
print spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS', 'BRU', 'PAR'], 'MAD', '2015-07-15', '2015-08-15'))
So the main problem, I think, is that it scrolls too fast to trigger the loading of new results.
Do you know how to make this work?
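The failure mode can be reproduced without a browser. Below is a minimal sketch in plain Python (no Selenium; `LazyPage` is an invented toy model of an infinite-scroll page, not the real site) showing why one jump to the bottom picks up only the next batch, while stepping down and letting each batch render collects everything:

```python
# Toy model of an infinite-scroll page: a new batch of results is appended
# only when the scroll position reaches the currently rendered bottom.
class LazyPage:
    def __init__(self, total_batches, batch_size=10, height_per_batch=1000):
        self.total_batches = total_batches
        self.batch_size = batch_size
        self.height_per_batch = height_per_batch
        self.loaded_batches = 1          # first batch is rendered on arrival
        self.scroll_position = 0

    @property
    def scroll_height(self):
        return self.loaded_batches * self.height_per_batch

    @property
    def results(self):
        return self.loaded_batches * self.batch_size

    def scroll_to(self, y):
        # Mimics window.scrollTo: you cannot scroll past the rendered bottom.
        self.scroll_position = min(y, self.scroll_height)
        # Reaching the bottom triggers loading of the next batch.
        if (self.scroll_position == self.scroll_height
                and self.loaded_batches < self.total_batches):
            self.loaded_batches += 1


def jump_to_bottom(page):
    # One big jump, like scrollTo(0, document.body.scrollHeight).
    page.scroll_to(10**9)
    return page.results


def scroll_in_steps(page, step=500):
    # Keep stepping until the bottom stops moving away from us.
    while page.scroll_position < page.scroll_height:
        page.scroll_to(page.scroll_position + step)
    return page.results
```

With five batches, the single jump reaches the bottom once and triggers only one extra batch, while stepped scrolling keeps chasing the growing bottom until all batches are in. The real page adds latency on top of this, which is why the working answers below combine stepping with sleeps or explicit waits.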
I needed to solve the same problem in order to scrape a social media site, and this worked for me:
y = 1000
for timer in range(0, 50):
    driver.execute_script("window.scrollTo(0, " + str(y) + ")")
    y += 1000
    time.sleep(1)
The one-second sleep after every 1000 pixels gives the page time to load.
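The fixed `range(0, 50)` above guesses at the page length. A variant can instead stop once `document.body.scrollHeight` stops growing. Here is a hedged sketch of that loop; `FakeDriver` is invented purely so the logic can be exercised without a browser (in real use you would pass a Selenium WebDriver and a non-zero `pause`):

```python
import time


def scroll_until_stable(driver, step=1000, pause=0.0):
    """Scroll down in fixed steps until scrollHeight stops growing."""
    y = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    while y < last_height:
        y += step
        driver.execute_script("window.scrollTo(0, %d)" % y)
        time.sleep(pause)  # give lazy-loaded content time to arrive
        last_height = driver.execute_script("return document.body.scrollHeight")
    return last_height


# Minimal stand-in for a WebDriver: the page grows by 1000px each time
# the bottom is reached, up to a cap.
class FakeDriver:
    def __init__(self, max_height=5000):
        self.height = 1000
        self.max_height = max_height
        self.y = 0

    def execute_script(self, script, *args):
        if script.startswith("return"):
            return self.height
        # crude parse of "window.scrollTo(0, N)"
        self.y = min(int(script.split(",")[1].rstrip(")")), self.height)
        if self.y == self.height and self.height < self.max_height:
            self.height += 1000
        return None
```

The termination condition (`y < last_height` with `last_height` re-read every iteration) is what lets the loop keep extending itself while new results arrive.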
You can scroll smoothly with Selenium like this:
total_height = int(driver.execute_script("return document.body.scrollHeight"))
for i in range(1, total_height, 5):
    driver.execute_script("window.scrollTo(0, {});".format(i))
小智 · 7
After some experimenting, I finally found a solution that works well:
def __scroll_down_page(self, speed=8):
    current_scroll_position, new_height = 0, 1
    while current_scroll_position <= new_height:
        current_scroll_position += speed
        self.__driver.execute_script("window.scrollTo(0, {});".format(current_scroll_position))
        new_height = self.__driver.execute_script("return document.body.scrollHeight")
Here is another approach that worked for me: scroll the last search result into view, wait for additional elements to load, then scroll again:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC


class wait_for_more_than_n_elements(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            count = len(EC._find_elements(driver, self.locator))
            return count > self.count
        except StaleElementReferenceException:
            return False


driver = webdriver.Firefox()

dep_airport = ['BTS', 'BRU', 'PAR']
arr_airport = 'MAD'
dep_date = '2015-07-15'
arr_date = '2015-08-15'

dep_airport = str(r'%20').join(dep_airport)

url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 60)
wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
wait.until(EC.invisibility_of_element_located((By.XPATH,
    u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))

while True:  # TODO: make the endless loop end
    results = driver.find_elements_by_css_selector("div.flightbox")
    print "Results count: %d" % len(results)

    # scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])

    # wait for more results to load
    wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

Notes:
wait_for_more_than_n_elements is a custom expected condition, parameterized with the current len(results), that detects when the next batch has loaded so that we can scroll again.
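The custom-expected-condition pattern above is just a callable that WebDriverWait polls until it returns something truthy. Below is a hedged, browser-free sketch of that contract; `StubDriver` and `poll_until` are invented stand-ins (with real Selenium you would pass the condition object to `WebDriverWait(driver, timeout).until(...)`):

```python
class more_than_n_elements:
    """True once the driver reports more than `count` matching elements."""
    def __init__(self, css_selector, count):
        self.css_selector = css_selector
        self.count = count

    def __call__(self, driver):
        return len(driver.find_elements(self.css_selector)) > self.count


class StubDriver:
    """Returns one more fake element on each poll, like a page still loading."""
    def __init__(self):
        self.elements = []

    def find_elements(self, selector):
        self.elements.append(object())   # a new result "arrived"
        return list(self.elements)


def poll_until(condition, driver, max_polls=100):
    # WebDriverWait.until does roughly this, plus timing and a timeout.
    for polls in range(1, max_polls + 1):
        if condition(driver):
            return polls
    raise TimeoutError("condition never became true")
```

Starting from 3 already-seen results, the condition stays false until a fourth element appears, which is exactly the "wait for more than n" behavior the scroll loop relies on.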