Dae*_*oom 7 r web-scraping rcurl
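I am trying to scrape a table from an interactive .aspx web page. I have read all of the R web-scraping questions on Stack Overflow and I think I am close, but I can't quite get it to work.
I want to extract the data from the table that the page generates. Eventually I would like to loop over every date and state option, but my challenge right now is simply getting R to submit my parameters and pull in the resulting table for any particular query.
From what I have gathered, the answer probably involves the RCurl and XML packages: post the "form" with my parameters, then read the HTML of the resulting page.
My most recent attempt looks like this: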
library(RCurl)
library(XML)
curl = getCurlHandle()
link = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_HabitationWiseLabTesting_S.aspx"
html = getURL(link, curl = curl)
params = list('ctl00$ContentPlaceHolder$ddFinYear' = '2005-2006',
'ctl00$ContentPlaceHolder$ddState' = 'BIHAR')
html2 = postForm(link, .params = params, curl = curl)
table = readHTMLTable(html2 )
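I am having a hard time telling what the problem actually is. On the one hand, html == html2 yields FALSE, so I think html2 has progressed to some point after the form was submitted. But I can't tell whether the form was submitted incorrectly, or whether the submission worked and it is the reading of the table that fails.
Any advice or help is appreciated. Thanks!
I was able to extract the contents of the table with the following code: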
library(RDCOMClient)
library(stringr)
url <- "https://ejalshakti.gov.in/IMISReports/Reports/Physical/rpt_RWS_TargetAchievement_S.aspx?Rep=0&RP=Y&APP=IMIS"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
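IEApp$Navigate(url)
# Wait until the browser has finished loading before accessing the document
while (IEApp[["Busy"]]) Sys.sleep(1)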
doc <- IEApp$Document()
mouseEvent <- doc$createEvent("MouseEvent")
mouseEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$getElementByID("ContentPlaceHolder_ddfinyear")
web_Obj_Date[['Value']] <- "2015-2016"
web_Obj_Submit <- doc$getElementByID("ContentPlaceHolder_btnGO")
web_Obj_Submit$dispatchEvent(mouseEvent)
Sys.sleep(5)
html_Content <- doc$documentElement()$innerText()
text_Table <- stringr::str_extract_all(string = html_Content, pattern = "Financial Year:((.|\\r\\n)*)Disclaimer and Privacy Policy")[[1]]
strsplit(text_Table, "\r\n")[[1]]
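As for the RCurl attempt in the question, one likely reason the POST does not return the table is that ASP.NET WebForms pages normally expect their hidden state fields (such as `__VIEWSTATE` and `__EVENTVALIDATION`) to be sent back with the form. The following is only a minimal sketch of that idea, not a tested solution for this site: `hidden_fields` is a hypothetical helper, `sample_page` is made-up HTML standing in for the result of `getURL(link, curl = curl)`, and the real page may require further fields (for example the submit button's name/value pair) that are not shown in the question.

```r
library(XML)

# Hypothetical helper: collect every hidden <input> of a page as a named list,
# so the values can be sent back together with the user's own parameters.
hidden_fields <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  nodes <- getNodeSet(doc, "//input[@type='hidden']")
  values <- vapply(nodes, function(n) xmlGetAttr(n, "value", default = ""),
                   character(1))
  names(values) <- vapply(nodes, function(n) xmlGetAttr(n, "name", default = ""),
                          character(1))
  # Drop inputs without a name; they are not submitted with the form
  as.list(values[names(values) != ""])
}

# Made-up page used only for illustration (assumption, not the real site)
sample_page <- '<html><body><form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <select name="ctl00$ContentPlaceHolder$ddState"></select>
</form></body></html>'

fields <- hidden_fields(sample_page)

# Merge the hidden state fields with the parameters from the question
params <- c(fields,
            list("ctl00$ContentPlaceHolder$ddFinYear" = "2005-2006",
                 "ctl00$ContentPlaceHolder$ddState"   = "BIHAR"))

# With the real page, the merged list would then be posted, e.g.:
# html2 <- postForm(link, .params = params, curl = curl)
```

In practice `hidden_fields(html)` would be called on the page fetched in the first `getURL` step, using the same curl handle for the POST so that any session cookies are kept.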