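I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The URL is static, so how do I submit a query to this page and get the results back? Assume we need to select "All Years" and "All Types" from the respective drop-down menus.
Somebody out there must know how to do this.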
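For context, an ASP.NET WebForms page usually keeps its state in hidden form fields such as `__VIEWSTATE` and `__EVENTVALIDATION`, and expects them to be sent back with every POST. The sketch below shows one way to collect those fields and combine them with drop-down choices. It uses only the standard library so it runs without extra packages. The sample HTML, the control names (`ctl00$Year`, `ctl00$Type`) and the field values are made up for illustration, and it assumes the drop-downs are plain `<select>` elements, which may not be true of the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect hidden <input> values and <select> options from a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}         # hidden input name -> value
        self.options = {}        # select name -> {option label: option value}
        self._select = None      # name of the <select> currently open
        self._in_option = False
        self._option_value = None
        self._label_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.hidden[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select" and attrs.get("name"):
            self._select = attrs["name"]
            self.options[self._select] = {}
        elif tag == "option" and self._select:
            self._in_option = True
            self._option_value = attrs.get("value")
            self._label_parts = []

    def handle_data(self, data):
        if self._in_option:
            self._label_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._in_option:
            label = "".join(self._label_parts).strip()
            # An <option> without a value attribute submits its label text.
            value = label if self._option_value is None else self._option_value
            self.options[self._select][label] = value
            self._in_option = False
        elif tag == "select":
            self._select = None
            self._in_option = False


# Hypothetical page fragment; a real page would come from the first GET.
SAMPLE_HTML = """
<form method="post" action="Legislation.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgK" />
  <select name="ctl00$Year">
    <option value="This Year">This Year</option>
    <option value="All">All Years</option>
  </select>
  <select name="ctl00$Type">
    <option value="All">All Types</option>
  </select>
</form>
"""


def build_payload(html, choices):
    """Merge the page's hidden fields with the chosen drop-down labels."""
    parser = FormFieldParser()
    parser.feed(html)
    payload = dict(parser.hidden)
    for select_name, label in choices.items():
        payload[select_name] = parser.options[select_name][label]
    return payload


payload = build_payload(
    SAMPLE_HTML, {"ctl00$Year": "All Years", "ctl00$Type": "All Types"}
)
body = urlencode(payload)  # form-encoded body for the POST request
print(body)
```

The resulting `body` would then be POSTed to the same URL the form was loaded from. If the real page renders its drop-downs through JavaScript widgets, the field names to send have to be read from the browser's network inspector instead.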
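I'm running a scraper against this course listing website, and I'd like to know whether there is a faster way to scrape the page once I've loaded it into BeautifulSoup. It takes longer than I expected.
Any tips?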
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.implicitly_wait(10)  # seconds
driver.get("https://acadinfo.wustl.edu/Courselistings/Semester/Search.aspx")
select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
parsedClasses = {}
for i in range(len(select.options)):
    print(i)
    # Look the drop-down up again on every iteration, as the original does.
    select = Select(driver.find_element_by_name("ctl00$Body$ddlSchool"))
    select.options[i].click()
    upperLevelClassButton = driver.find_element_by_id("Body_Level500")
    upperLevelClassButton.click()
    driver.find_element_by_name("ctl00$Body$ctl15").click()
    # Hand the rendered page to BeautifulSoup for parsing.
    soup = BeautifulSoup(driver.page_source, "lxml")
    courses = soup.select(".CrsOpen")
    for course in courses:
        courseName = course.find_next(class_="ResultTable")["id"][13:]
        parsedClasses[courseName] = []
        print(courseName)
        for section in course.select(".SecOpen"):
            classInfo = section.find_all_next(class_="ItemRowCenter")
            parsedClasses[courseName].append((int(classInfo[0].string), int(classInfo[1].string), …

There are already many good resources on Stack Overflow, but I'm still having problems. I have gone through these sources:
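I'm trying to access http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx and select a parish. I believe this forces a postback, which then lets me select a year, posts back again, and opens up further selections. Following the sources above, I have written my script in several different ways, but I have not managed to submit the form so that the site lets me enter a year.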
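As background on the cascade described above: when a WebForms drop-down posts back on change, the browser normally submits the form with `__EVENTTARGET` set to the name of the control that changed and `__EVENTARGUMENT` left empty, along with the hidden fields from the most recent response. A minimal sketch of building those payloads follows. The control names (`ctl00$Content$ddlParish`, `ctl00$Content$ddlYear`), the selected values, and the state strings are hypothetical placeholders, and it is an assumption that this particular site uses auto-postback drop-downs.

```python
from urllib.parse import urlencode


def autopostback_payload(hidden_fields, event_target, form_values):
    """Build the form body sent when an auto-postback control changes.

    hidden_fields: __VIEWSTATE and friends taken from the latest response
    event_target:  name of the control whose change triggers the postback
    form_values:   current values of the visible form controls
    """
    payload = dict(hidden_fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    payload.update(form_values)
    return payload


# Step 1: choose a parish, using the hidden fields from the initial GET.
step1 = autopostback_payload(
    {"__VIEWSTATE": "state-from-GET", "__EVENTVALIDATION": "ev-from-GET"},
    "ctl00$Content$ddlParish",                      # hypothetical name
    {"ctl00$Content$ddlParish": "ACADIA"},          # hypothetical value
)

# Step 2: choose a year. The hidden fields must be re-read from the
# response to step 1; reusing the ones from the GET is a common mistake.
step2 = autopostback_payload(
    {"__VIEWSTATE": "state-from-step1", "__EVENTVALIDATION": "ev-from-step1"},
    "ctl00$Content$ddlYear",                        # hypothetical name
    {"ctl00$Content$ddlParish": "ACADIA", "ctl00$Content$ddlYear": "2012"},
)

print(urlencode(step1))
print(urlencode(step2))
```

Each payload would be POSTed to the page URL in turn, with the hidden fields refreshed from every response before the next step is built.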
My current code:
import urllib
from bs4 import BeautifulSoup
import mechanize

headers = [
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    # NOTE: this Origin does not match the host being requested below.
    ('Origin', 'http://www.indiapost.gov.in'),
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'),
    ('Content-Type', 'application/x-www-form-urlencoded'),
    ('Referer', 'http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx'),
    ('Accept-Encoding', 'gzip,deflate,sdch'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]

br = mechanize.Browser()
br.addheaders = headers
url = 'http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx'
response = br.open(url)
# first HTTP request without form data
soup = BeautifulSoup(response)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"}) …

I'm porting a bash script that uses curl to POST a payload defined in the code to a URL, and that script works. The basic problem is that, with robobrowser, I'm having trouble posting with the page's forms.
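Walking through the site step by step:
I have been able to authenticate to the site and perform a GET with both RoboBrowser and Requests + bs4, but I'm having a hard time POSTing back to the page itself.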
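One thing worth checking when a GET works but a POST does not is whether the POST goes out with the same cookies as the login and with a form-encoded body. The sketch below builds such a request with the standard library only. The helper names `make_session` and `form_post` are illustrative, the two payload keys are taken from the login payload in this question, and the request is constructed but not sent, so the snippet runs offline.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def form_post(url, payload):
    """Build a form-encoded POST request without sending it."""
    data = urlencode(payload).encode("ascii")
    return Request(
        url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


opener, jar = make_session()
login = form_post(
    "http://example.com/SubLogin.aspx",
    {
        "SubLoginControl:mailbox": "123456",
        "SubLoginControl:btnLogOn": "Logon",
    },
)

# opener.open(login) would send the request; every later request made
# through the same opener then carries the session cookies in `jar`.
print(login.get_method(), login.full_url)
print(login.data.decode("ascii"))
```

Note that the colons in the field names are percent-encoded in the body, which is what a browser or curl's `--data-urlencode` would also produce.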
Using RoboBrowser (liboncall.py)
#!/usr/bin/python
from robobrowser import RoboBrowser
from bs4 import BeautifulSoup as BS

oc_mailbox = '123456'
oc_password_hashed = 'ABCDEFG'
# Named oc_base_uri so that the URIs below resolve (was base_uri).
oc_base_uri = 'http://example.com'
auth_uri = oc_base_uri + '/SubLogin.aspx'
find_uri = oc_base_uri + '/FindMe.aspx'
phne_uri = oc_base_uri + '/PhoneLists.aspx'

p_auth_payload = {
    'SubLoginControl:javascriptTest': 'true',
    'SubLoginControl:mailbox': oc_mailbox,
    'SubLoginControl:phoneNumber': '',
    'SubLoginControl:password': oc_password_hashed,
    'SubLoginControl:btnLogOn': 'Logon',
    'SubLoginControl:webLanguage': 'en-US',
    'SubLoginControl:initialLanguage': 'en-US',
    'SubLoginControl:errorCallBackNumber': 'Entered telephone number contains non-dialable characters.',
    'SubLoginControl:cookieMailbox': 'mailbox',
    'SubLoginControl:cookieCallbackNumber': 'callbackNumber',
    'SubLoginControl:serverDomain': ''
}
p_find_payload = …