Posts by Kar*_*ren

Scraping a dynamic/JavaScript-generated website with Python/Selenium

I am trying to scrape this website:

http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210

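using Python and Selenium (see the code below). The content is generated dynamically, and data that is not visible in the browser is apparently not loaded. I tried making the browser window larger and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still a lot of data left to scrape in the vertical direction. Scrolling does not seem to work at all.

Does anyone have a good idea of how to do this?

Thanks!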

import csv
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(5)  # wait for the page to load

soup = BeautifulSoup(driver.page_source, "html.parser")

table = soup.find("table", {"id": "DataTable"})

# Collect the text of every cell, row by row
tbody = table.find("tbody")
rows = []
for row in tbody.find_all("tr"):
    cells = row.find_all(re.compile("td|th"))
    # Drop non-ASCII characters from each cell
    rows.append([cell.text.encode("ascii", "ignore").decode("ascii") for cell in cells])

# Write the collected rows to a CSV file
with open("body.csv", "w", newline="") as csv_file:
    file_writer = csv.writer(csv_file)
    file_writer.writerows(rows)
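
One possible direction, since only the visible part of the table seems to be loaded, is to scroll in small steps and accumulate whatever rows are present after each step, instead of jumping to the bottom once. The sketch below is only an illustration under assumptions: `collect_while_scrolling`, `read_rows` and `scroll_down` are hypothetical names, and how to actually scroll this particular page (window scroll, an inner scrollable element, or a paging control) is not known from the question. Rows are de-duplicated by content, which assumes no two genuine rows are identical.

```python
import time
from typing import Callable, Iterable, List, Tuple


def collect_while_scrolling(
    read_rows: Callable[[], Iterable[Tuple]],
    scroll_down: Callable[[], None],
    pause: float = 1.0,
    max_idle_rounds: int = 3,
) -> List[Tuple]:
    """Accumulate rows that become visible as the page is scrolled stepwise.

    read_rows   -- hypothetical callback returning the rows currently in the DOM
    scroll_down -- hypothetical callback that scrolls one step further
    Stops after max_idle_rounds consecutive steps that yield no new rows.
    """
    seen = set()
    ordered = []
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        new_rows = 0
        for row in read_rows():
            if row not in seen:  # assumes distinct rows differ in content
                seen.add(row)
                ordered.append(row)
                new_rows += 1
        idle_rounds = 0 if new_rows else idle_rounds + 1
        scroll_down()
        time.sleep(pause)  # give the page time to load the next chunk
    return ordered
```

With Selenium, `read_rows` could parse `driver.page_source` with BeautifulSoup and return each `tr` as a tuple of cell texts, and `scroll_down` could call `driver.execute_script("window.scrollBy(0, window.innerHeight);")`. Whether that scroll call triggers loading on this page is an assumption that would need checking.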


python selenium

Kar*_*ren


3 votes, 1 answer, 6230 views
