Finding the web address behind a JavaScript link

And*_*age 7 javascript r

I am using RCurl in R to try to download data from a website, but I can't work out which URL to use. Here is the site:

http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX

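Look at the top right corner, just above the displayed sheet: there is a link that downloads the data as a .csv file. I would like to know whether there is a way to find a regular HTTP address for that .csv file, because RCurl cannot handle JavaScript commands.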

shh*_*its 10

I'll give you a quick and dirty way to get the data. First, you can use Fiddler2 (http://www.fiddler2.com/fiddler2/) to inspect the POST that the browser sends. That yields the following POST:

POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
Host: www.invescopowershares.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
Content-Type: application/x-www-form-urlencoded
Content-Length: 70669

__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........

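So we can see that three parameters of interest are being posted: __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE (__EVENTARGUMENT is also sent, but it is empty).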
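The values in the captured body are percent-encoded, which is why __EVENTTARGET there looks different from the string used in the R code further down. As a minimal sketch (base R only; the `body` string is just the visible start of the capture, with the elided __VIEWSTATE left out), the fields can be split and decoded like this:

```r
# Visible start of the captured POST body; the long __VIEWSTATE value is omitted.
body <- paste0(
  "__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent",
  "%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT="
)

# Split into name=value pairs. strsplit drops a trailing empty value,
# so a pair such as "__EVENTARGUMENT=" comes back with length 1.
pairs <- strsplit(strsplit(body, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Percent-decode each value (%24 is "$"); missing values become "".
fields <- vapply(
  pairs,
  function(p) if (length(p) > 1) utils::URLdecode(p[2]) else "",
  character(1)
)
names(fields) <- vapply(pairs, function(p) p[1], character(1))

print(fields)
```

The decoded __EVENTTARGET is the control name that the download link passes to the server.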

The postForm call needs to take this form:

postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)

Quick and dirty again: I simply open a browser and grab the relevant parameters it received, as follows:

library(RCurl) # provides postForm
library(rcom)  # COM automation of Internet Explorer (Windows only)
ie = comCreateObject('InternetExplorer.Application')
ie[["visible"]]=T # true for debugging
ie$Navigate2("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
while(comGetProperty(ie,"busy")||comGetProperty(ie,"ReadyState")<4){
Sys.sleep(1)
print(comGetProperty(ie,"ReadyState"))
}
myDoc<-comGetProperty(ie,"Document")
myPW<-comGetProperty(myDoc,"parentWindow")
comInvoke(myPW,"execScript","var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;","JavaScript")
event.val<-myPW[["dumVar1"]]
view.state<-myPW[["dumVar2"]]
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
ie$Quit()
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')


> fin.data[1,]
  ticker SecurityNum                      Name CouponRate maturitydate
1    PGX   949746879 WELLS FARGO & COMPANY PFD       0.08             
     rating  Shares PercentageOfFund PositionDate
1 BBB+/Baa3 2538656       0.04442112   06/11/2012

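__EVENTVALIDATION and __VIEWSTATE may always be the same, or they may be tied to a session cookie. You could probably fetch them with RCurl, but as I said, this is a quick and dirty solution, so we just use Internet Explorer. Caveats: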

1). This requires Windows with IE installed for the rcom part to work.

2). If you are running IE9, you may need to add invescopowershares.com to the Compatibility View settings (Microsoft appears to have blocked COM calls of the form event.val<-myPW[["dumVar1"]]).

EDIT (UPDATE)

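Having looked at the site in more detail, __EVENTVALIDATION and __VIEWSTATE are set as JavaScript variables on the initial page. We can parse them out in the following quick and dirty way, without invoking a browser.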

library(RCurl) # provides getURL and postForm
dum<-getURL("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
event.val<-unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
event.val<-unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
view.state<-unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
view.state<-unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')

The above should work cross-platform.
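The strsplit calls depend on the exact whitespace that follows each hidden field, so they would break if the page layout shifted slightly. A somewhat more tolerant sketch is shown here, using a hypothetical helper `get_hidden_field` and a made-up HTML snippet in place of the real page. It still assumes that the `id` attribute appears before `value`, as the strsplit version does.

```r
# Hypothetical helper: return the value attribute of the input whose id
# equals `name`, or NA if no such input is found.
get_hidden_field <- function(html, name) {
  pattern <- paste0('id="', name, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up stand-in for the downloaded page; the values are placeholders,
# not real tokens from the site.
sample_html <- paste0(
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />\r\n',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />'
)

view.state <- get_hidden_field(sample_html, "__VIEWSTATE")
event.val  <- get_hidden_field(sample_html, "__EVENTVALIDATION")
```

With the real page, `get_hidden_field(dum, "__VIEWSTATE")` would take the place of the two strsplit calls for view.state, and likewise for event.val.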


Jef*_*eff 7

Clicking the "Download" link executes the following JavaScript code:

__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')

The __doPostBack function appears to simply fill in a couple of hidden form fields on the page and then submit a POST request.
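In standard ASP.NET pages, the two arguments of __doPostBack end up in the hidden fields __EVENTTARGET and __EVENTARGUMENT. As a rough sketch, a hypothetical helper `parse_postback` (not part of RCurl) can pull those two values out of the link's JavaScript so they can be reused as POST fields:

```r
# Hypothetical helper: extract the two single-quoted arguments of a
# __doPostBack(...) call. Returns NULL if the string does not match.
parse_postback <- function(js) {
  pattern <- "__doPostBack\\('([^']*)','([^']*)'\\)"
  m <- regmatches(js, regexec(pattern, js))[[1]]
  if (length(m) < 3) return(NULL)
  list("__EVENTTARGET" = m[2], "__EVENTARGUMENT" = m[3])
}

js <- "__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')"
postback <- parse_postback(js)
print(postback)
```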

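A quick Google search shows that RCurl is able to submit POST requests. So what you need to do is look at the page source, find the form named "aspnetForm", take all the fields from that form, and build your own POST request that submits those fields to the action URL (http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX).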

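There is no guarantee that this will work, though. There is a hidden form field named __VIEWSTATE that seems to encode some information, and I don't know how it affects things.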
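One way to approach "take all the fields from that form" is to collect every hidden input, __VIEWSTATE included, and send them back unchanged. The sketch below uses a hypothetical helper `collect_hidden_fields` and a made-up HTML snippet. It relies on simple regular expressions rather than a real HTML parser, so treat it as an illustration only.

```r
# Hypothetical helper: return a named list of all hidden <input> values.
collect_hidden_fields <- function(html) {
  tags <- regmatches(html, gregexpr('<input[^>]*type="hidden"[^>]*>', html))[[1]]
  attr_of <- function(tag, attr) {
    m <- regmatches(tag, regexec(paste0(attr, '="([^"]*)"'), tag))[[1]]
    if (length(m) < 2) "" else m[2]
  }
  values <- lapply(tags, attr_of, attr = "value")
  names(values) <- vapply(tags, attr_of, character(1), attr = "name",
                          USE.NAMES = FALSE)
  values
}

# Made-up stand-in for the page source; the values are placeholders.
sample_html <- paste0(
  '<form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX">',
  '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />',
  '<input type="text" name="q" value="ignored" />',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />',
  '</form>'
)

fields <- collect_hidden_fields(sample_html)

# Fill in the field that __doPostBack would set for the download link.
fields[["__EVENTTARGET"]] <-
  "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"

# With RCurl loaded and the real page source, the request might then be
# sent like this (an untested assumption about how the site responds):
# web.data <- postForm(ftarget, .params = fields, style = "POST")
```

Sending every hidden field back as received avoids having to know what __VIEWSTATE actually encodes.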