Can*_*ice 5 r web-scraping rvest xml2
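A related Stack Overflow question about scraping the table on the PGA website's leaderboard page was previously posted on this page. To summarize that post: the leaderboard is apparently difficult to scrape because of the way the page and table are rendered with JavaScript.
I can inspect the page and see that a `<script>` tag contains a `global.leaderboardConfig` object with useful information.
Is it possible to get this object as a list in R? I am able to get all 76 script elements on the page using `xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script')`, but I am not sure how to identify the specific script tag needed, nor how to grab the object from it.
Edit: in the network tab of devtools, there is also a request that provides the link to the API call that fetches the data. Is it easier to grab all of the network requests and sift through them than to grab the object from the script tag?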
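The site generates the `hmac` and `expire` URL parameter values from a JS function that uses a specific algorithm. The parameters of this algorithm come from the JS file that hosts the function, and that file is generated from the epoch time passed to it as a URL parameter. As a result, the `hmac` value is different each time, since it is computed from a file whose URL is constantly changing.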
The algorithm consists of bitwise AND and XOR operations, like this (pseudocode):

```
step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step
step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step
....
....
```
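To make the pseudocode above concrete, here is a minimal JavaScript sketch of the chained computation. It is only an illustration under stated assumptions: the function and constant names are invented, the two constants are the `encodedId` and `init` values from the R script in this answer, and real xor keys have to be extracted from the site's JS file.

```javascript
// Sketch of the chained step computation (names are assumptions, not the site's code).
const FIXED_VALUE = 1798339286; // "encodedId" in the R script
const BIT_MASK = 4294967295;    // 0xFFFFFFFF, "init" in the R script

// xorKeys: the step values extracted from the JS file, in order.
// seed: first character code; 101 is "e", as in the R script.
function buildToken(xorKeys, seed = 101) {
  let value = seed;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    // JavaScript bitwise operators work on 32-bit integers.
    const step = ((value * value - FIXED_VALUE) & BIT_MASK) ^ key;
    result += String.fromCharCode(step);
    value = step; // each step feeds the next one
  }
  return result;
}
```

Calling `buildToken(keys)` with the extracted keys should yield the URL parameter string. With an empty key list it returns just the seed character.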
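These `xorKey` numbers are generated dynamically by https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter, and use a regex to extract all the `stepValues` needed in the algorithm above (they start with `-1`). You also need to reproduce the algorithm above in R.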
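The request URL and the extraction step can be sketched in JavaScript as follows. This is a hedged illustration: the function names are invented and the sample string is made up, standing in for the real file content, which looks different. The regex is the same `-1[0-9]+` pattern the R script uses.

```javascript
// Build the URL of the dynamically generated JS file for an epoch time in ms.
function stepFileUrl(epochMs = Date.now()) {
  return `https://microservice.pgatour.com/js?_=${epochMs}`;
}

// Extract the xor step values (numbers starting with -1) from JS source text.
function extractStepValues(jsSource) {
  const matches = jsSource.match(/-1[0-9]+/g) || [];
  return matches.map(Number);
}

// Made-up sample, NOT real content from the site.
const sample = "a=(b&c)^-1798329013;d=(e&f)^-1798314605;";
console.log(extractStepValues(sample));
```

No network call is made here. Fetching `stepFileUrl()` and passing the response body to `extractStepValues` is left to the HTTP client of your choice.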
The following script generates the URL parameters and makes the API call:

```r
library(httr)
library(stringr)
library(bitops)

# Fixed values used by the algorithm
init <- 4294967295       # bit mask (0xFFFFFFFF)
value <- 101             # seed; 101 is the character code of "e"
encodedId <- 1798339286  # encoded form of the hardcoded id
result <- rawToChar(as.raw(value))

# The epoch time (in milliseconds) is dynamic
time <- as.numeric(as.POSIXct(Sys.time())) * 1000
output <- content(
  GET("https://microservice.pgatour.com/js",
      query = list("_" = format(time, digits = 13))),
  as = "text", encoding = "UTF-8"
)

# Extract the xor step values (they all start with -1) and convert to numeric
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl = TRUE))
stepsNum <- as.numeric(unlist(steps))

# Reproduce the algorithm: each step yields one character of the result
for (t in stepsNum) {
  step <- bitXor(bitAnd(value * value - encodedId, init), t)
  result <- paste0(result, rawToChar(as.raw(step)))
  value <- step
}
print(result)

# Extract the leaderboard config URL from the page source
output <- content(GET("https://www.pgatour.com/leaderboard.html"),
                  as = "text", encoding = "UTF-8")
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*)'")[2])
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")
print(data)
```
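The last part of the script (pulling `leaderboardUrl` out of the page source, unescaping `\/`, and appending the `userTrackingId` parameter) can be looked at in isolation. The JavaScript sketch below mirrors that logic. The function name, the HTML fragment and the token are invented placeholders, not values from the site.

```javascript
// Extract leaderboardUrl from page source and append the tracking parameter.
function buildLeaderboardUrl(html, token) {
  const m = html.match(/leaderboardUrl:\s*'(.*)'/);
  if (!m) return null; // the config entry was not found
  const configUrl = m[1].replace(/\\\//g, "/"); // turn "\/" into "/"
  // Like the R script, the token is appended without URL encoding.
  return `${configUrl}?userTrackingId=${token}`;
}

// Invented fragment standing in for the real page source.
const html = "leaderboardUrl: 'https:\\/\\/example.com\\/leaderboard.json',";
console.log(buildLeaderboardUrl(html, "TOKEN"));
```

Returning `null` when the pattern is missing makes a page layout change easier to notice than a silently malformed URL.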
Kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract
I searched through the JavaScript code and reversed the obfuscated code into something understandable. It is a long way to go, so let's go there step by step.

leaderboardUrl

You already gave a first hint in the question: there is a `leaderboardUrl` in the config.

There is a JS file named `stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js` which has an occurrence of `leaderboardUrl` in `config.leaderboardUrl`:

```javascript
{
  key: "getLeaderboardData",
  value: function (t, r, n) {
    var o = this,
      e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), // <== HERE
      a = [this.performFetch(e)].concat(
        g(
          "initial" === n && this.config.translationsUrl
            ? [y.default.load(this.config.translationsUrl)]
            : []
        )
      ),
    ..........
}
```

Let's look at the `performFetch` function, which seems to send the request:

```javascript
{
  key: "performFetch",
  value: function (t) {
    var r = this,
      e =
        1 < arguments.length && void 0 !== arguments[1]
          ? arguments[1]
          : {};
    return t
      ? ((0, a.isProtectedUrl)(t) &&
          (t = this.getUrlWithAuth(t)), // <== HERE
        (0, o.default)(t, e)
          .then(function (e) {
            return r.checkFetchResponseStatus(e, t);
    .................
```

We have spotted this `getUrlWithAuth` function:

```javascript
{
  key: "getUrlWithAuth",
  value: function (e) {
    var t = u.setTrackingUserId,
      r = u.UserIdTracker,
      n = r && r.getTrackingUserIdParam && r.getUserId; // <== HERE
    if (t && n) {
      var o = r.getTrackingUserIdParam(), // <== HERE
        a = t(r.getUserId());
      return u.setUrlParameter(e, o, a);
    }
    return e;
  },
},
```

Now we have `getUserId` and `getTrackingUserIdParam`, which look like the functions and variables that add the authorization parameters to the URL. The problem is that we have to find where these functions are defined.

I found a file named `main.c03ddfd249437fcce43410c35a21c6f8.js`, which has one occurrence of `getUserId` and `getTrackingUserIdParam`:

```javascript
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},
```

I skipped a lot of code in the snippet above to make it clearer.

You can see that there is substitution going on here: using the `t` array as a base, strings are looked up at an offset through the `A` function, and there is an init function that updates the initial `t` array so that it decodes to the correct strings.

You can paste this snippet into a nodejs script, modify it a bit, and use the following:

```javascript
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t);
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);
// prints e[pgatour] = e[pgatour] || {};
```

Here `e` is `window`, so you "just" need to replace all the `A(XXX)` calls to get a better idea of what is going on.

You will find this:

```javascript
onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
          , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
          , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},
```

which gives the following when decoded:

```javascript
onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
          , o = window["pgatour"]["setTrackingUserId"](I)
          , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},
```

The function we are looking for is `window["pgatour"]["setTrackingUserId"]`. But we could have known that from step 1. Remember, in the first JS file:

```javascript
var t = u.setTrackingUserId
```

and `u` is `window.pgatour`.

But here, the input parameter `I` is hardcoded:

```javascript
var I = A(428) + A(423) + A(407);
```

which is equivalent to `var I = "id8730931"`.

Now let's look at the `window["pgatour"]["setTrackingUserId"]` function.

Open the Chrome developer console on the site and paste `window["pgatour"]["setTrackingUserId"]`; you will get something like this:

```javascript
function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................
```

Yes :( more obfuscated code to deal with.

By looking through the application scripts, you can find that it is located in this file. Here is the URL of the JS file:

```
https://microservice.pgatour.com/js?_=1618868625306
```

There is a URL parameter specifying an epoch time, and the code changes according to this parameter.

Looking at the code itself, after substituting the input parameter, we get something built on `String.fromCharCode` and `Math.abs`.

We can make a nodejs script that reproduces this algorithm in a simpler way by extracting the step values (used during the xor phase). Its output:

```
exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd
```

If you change the epoch time, it gives a different result.

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

Then you just need to convert this nodejs script to R and make the HTTP call with the URL parameter.

Note that `encodedId` comes from the input id `id8730931`, transformed using this function (these values don't seem to change with the epoch time):

```javascript
function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................
```

My guess is that the server checks that the hmac correctly references the initial id string, `id8730931`, so hardcoding it is safe (since it is also hardcoded on the server side).
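As an aside on the decoding step earlier in this answer: the mechanics of the string-table substitution (a lookup table, an accessor that subtracts an offset, and an init loop that rotates the table until a checksum matches) are easier to see in a toy version. The JavaScript sketch below uses an invented table, offset and checksum. None of its values come from the site, and the loop is bounded for safety, unlike the original endless `for` with `try/catch`.

```javascript
// Toy model of the rotating string-table obfuscation (all values invented).
const table = ["tour", "42abc", "pga", "7xyz"];
const OFFSET = 100;

// Accessor: obfuscated code calls lookup(101) instead of writing "pga".
const lookup = (index) => table[index - OFFSET];

// Rotate the table until the numeric prefixes produce the expected checksum.
function rotateUntil(expected, maxTurns = table.length) {
  for (let turn = 0; turn < maxTurns; turn++) {
    const sum = parseInt(lookup(100), 10) * 2 + parseInt(lookup(102), 10);
    if (sum === expected) return true; // table is now in decoding order
    table.push(table.shift());         // move the first entry to the end
  }
  return false;
}

rotateUntil(91);
console.log(lookup(101) + lookup(103)); // prints pgatour
```

Before the rotation, `lookup(101)` would return the wrong fragment. This is why the init function has to run before any `A(XXX)` call can be replaced by its string.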