Speeding up BeautifulSoup

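I'm running a scraper for this course-listings website, and I'm wondering whether there is a faster way to scrape the page once I've loaded it into BeautifulSoup. It's taking longer than I expected.

Any tips?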

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10) # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))

parsedClasses = {}

for i in range(len(select.options)):
    print(i)
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()

    soup = BeautifulSoup(driver.page_source, "lxml")

    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print(courseName)
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append((int(classInfo[0].string), int(classInfo[1].string), …
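
One direction that might be worth trying, sketched under assumptions: `find_all_next` with no limit walks the rest of the document for every section, so capping it with `limit` could reduce the work per section. The sample markup, the `course-CSE131` id, the `parse_courses` helper, and the choice of `limit=2` are all hypothetical (the real page structure and the number of cells needed are not shown above), and whether this loop is the actual bottleneck is unverified. The built-in `html.parser` is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the question;
# the real page structure is not shown, so this is only a stand-in.
SAMPLE_HTML = """
<div class="CrsOpen">
  <table class="ResultTable" id="course-CSE131"></table>
  <div class="SecOpen">
    <span class="ItemRowCenter">01</span>
    <span class="ItemRowCenter">30</span>
  </div>
</div>
"""


def parse_courses(html):
    """Collect the first two ItemRowCenter values for each open section."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = {}
    for course in soup.select(".CrsOpen"):
        # Hypothetical: the id is used as-is, without the [13:] slice.
        name = course.find_next(class_="ResultTable")["id"]
        rows = []
        for section in course.select(".SecOpen"):
            # limit=2 (an assumed count) stops the forward scan after two
            # matches instead of collecting every later match in the page.
            cells = section.find_all_next(class_="ItemRowCenter", limit=2)
            rows.append(tuple(int(cell.string) for cell in cells))
        parsed[name] = rows
    return parsed


if __name__ == "__main__":
    print(parse_courses(SAMPLE_HTML))
```

Timing the Selenium steps and the parsing step separately would show whether the parsing loop is where the time actually goes.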


python selenium beautifulsoup html-parsing web-scraping

tbo*_*son

2014-08-28

9
votes

3
answers

10k
views