Extract a JSON object from a <script> element on a website in R, using rvest and xml2

Can*_*ice 5 r web-scraping rvest xml2

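A related Stack Overflow question was posted earlier about scraping the table on the leaderboard page of the PGA website. To summarize that post: the leaderboard is apparently hard to scrape because of the way the page uses JavaScript to render the page and the table.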

Inspecting the page, I can see a global.leaderboardConfig object inside a <script> tag that contains useful information.

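Is it possible to get this object as a list in R? I am able to get all 76 script elements on the page with xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script'), but I am not sure how to identify the specific script tag I need, or how to get the object out of it.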

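Edit: in the Network tab of devtools there is also a request that provides a link to the API call that fetches the data. Would it be easier to capture all network requests and filter them than to pull the object out of the script tag?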

Ber*_*tel 8

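The site generates the hmac and expiry URL parameter values from a JS function that uses a specific algorithm. The algorithm's parameters depend on the epoch time that is passed as a URL parameter to the JS file hosting the function. As a result, the hmac value is different every time, because it is computed from a file whose URL keeps changing.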

The algorithm consists of bitwise AND and XOR operations, like this (pseudocode):

step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step

step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step

....
....
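As a rough illustration, the chain above can be written as a loop in JavaScript. This is a minimal sketch, not the site's code: the names deriveToken and makeKeys are hypothetical, the seed 101 and the constant 1798339286 are the ones used in the R script in this answer, and makeKeys exists only to fabricate keys for a known string so that the round trip can be checked offline.

```javascript
// Sketch of the step chain: step = ((value^2 - FIXED) & MASK) ^ key.
// SEED and FIXED are the constants used in the R script in this answer.
const SEED = 101;          // char code of "e", the first character of the result
const FIXED = 1798339286;  // "fixedValue" in the pseudocode
const MASK = 0xffffffff;   // "bitMask" (all 32 bits set)

// Apply one XOR key per output character, feeding each step into the next.
function deriveToken(xorKeys) {
  let value = SEED;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    const step = ((value * value - FIXED) & MASK) ^ key;
    result += String.fromCharCode(step);
    value = step;
  }
  return result;
}

// Hypothetical helper (for testing only): build the keys that yield `target`.
function makeKeys(target) {
  const keys = [];
  let value = target.charCodeAt(0);
  for (let i = 1; i < target.length; i++) {
    const code = target.charCodeAt(i);
    keys.push(((value * value - FIXED) & MASK) ^ code);
    value = code;
  }
  return keys;
}

console.log(deriveToken(makeKeys("exp=1~acl=*"))); // prints exp=1~acl=*
```

Because XOR is its own inverse, keys built this way decode back to the same string, which makes the loop easy to verify without touching the network.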

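These xorKey numbers are generated dynamically by https://microservice.pgatour.com/js, based on the epoch time. You only need to request this JS file with the current epoch time as its URL parameter and use a regular expression to extract all the step values the algorithm above needs (they start with -1). You then need to reproduce the algorithm above.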
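The extraction step can be sketched in JavaScript as follows. The function name extractXorKeys and the sample string are made up for illustration; only the pattern -1[0-9]+ comes from this answer.

```javascript
// Pull every XOR key (an integer literal starting with -1) out of a script body.
function extractXorKeys(scriptSource) {
  const matches = scriptSource.match(/-1[0-9]+/g) || [];
  return matches.map(Number);
}

// Made-up stand-in for the downloaded script; the real file is obfuscated.
const sample = "k1=-1798329093;k2=-1798324790;other=42;";
console.log(extractXorKeys(sample));
```

Note that this pattern also matches any subtraction of a number that begins with 1, so it relies on how the generated file happens to be written.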

The following script generates the URL parameter and makes the API call:

library(httr)
library(stringr)
library(bitops)

# fixed values
init <- 4294967295       # 32-bit mask (0xFFFFFFFF)
value <- 101             # char code of the first output character ("e")
encodedId <- 1798339286  # fixed value derived from the id (see the end of this answer)
result <- rawToChar(as.raw(value))

# epoch time in milliseconds; it changes on every call
time <- as.numeric(Sys.time()) * 1000

# download the JS file generated for this epoch time
output <- content(GET("https://microservice.pgatour.com/js", query = list(
    "_" = format(time, digits = 13)
  )), as = "text", encoding = "UTF-8")

# extract the XOR keys (integers starting with -1) and convert them to numeric
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl = TRUE))
stepsNum <- as.numeric(unlist(steps))

# replay the algorithm: one output character per key
for (t in stepsNum) {
    step <- bitXor(bitAnd(value * value - encodedId, init), t)
    result <- paste0(result, rawToChar(as.raw(step)))
    value <- step
}
print(result)

# extract the leaderboard config url from the page source
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'([^']*)'")[2])

# call the API with the generated value as the userTrackingId parameter
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")

print(data)
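The last part of the script, pulling leaderboardUrl out of the page source and appending the generated value, can be sketched in JavaScript as well. The function names and the sample page fragment (including its example.com URL) are hypothetical; only the leaderboardUrl key and the userTrackingId parameter name come from this answer.

```javascript
// Find the leaderboardUrl value in the page source and undo the "\/" escaping.
function extractLeaderboardUrl(html) {
  const match = html.match(/leaderboardUrl:\s*'([^']*)'/);
  return match ? match[1].replace(/\\\//g, "/") : null;
}

// Append the generated value as the userTrackingId query parameter.
function withTrackingId(configUrl, token) {
  return `${configUrl}?userTrackingId=${token}`;
}

// Made-up page fragment; the host and path are placeholders.
const page = "global.leaderboardConfig = { leaderboardUrl: 'https:\\/\\/example.com\\/lb.json' };";
const configUrl = extractLeaderboardUrl(page);
console.log(withTrackingId(configUrl, "exp=1~acl=*~hmac=abc"));
```

Matching `[^']*` instead of `.*` keeps the capture from running past the closing quote when the same line contains other quoted values.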

kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract

How was this algorithm found?

I searched through the JavaScript code and reversed the obfuscated code into something understandable. It is a long road, so let's go through it step by step.

Task n°1 - Looking for leaderboardUrl

You already gave the first hint in your question: the config object, which contains leaderboardUrl.

The JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js contains an occurrence of leaderboardUrl, as config.leaderboardUrl:

{
    key: "getLeaderboardData",
    value: function (t, r, n) {
      var o = this,
        e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()),  <===================== HERE
        a = [this.performFetch(e)].concat(
          g(
            "initial" === n && this.config.translationsUrl
              ? [y.default.load(this.config.translationsUrl)]
              : []
          )
        ),
        ..........
}

Let's look at the performFetch function, which appears to send the request:

{
    key: "performFetch",
    value: function (t) {
      var r = this,
        e =
          1 < arguments.length && void 0 !== arguments[1]
            ? arguments[1]
            : {};
      return t
        ? ((0, a.isProtectedUrl)(t) &&
            (t = this.getUrlWithAuth(t)), <===================== HERE
          (0, o.default)(t, e)
            .then(function (e) {
              return r.checkFetchResponseStatus(e, t);
    .................

There we find the getUrlWithAuth function:

  {
    key: "getUrlWithAuth",
    value: function (e) {
      var t = u.setTrackingUserId, 
        r = u.UserIdTracker, 
        n = r && r.getTrackingUserIdParam && r.getUserId; <===================== HERE
      if (t && n) {
        var o = r.getTrackingUserIdParam(), <===================== HERE
          a = t(r.getUserId());
        return u.setUrlParameter(e, o, a);
      }
      return e;
    },
  },

Now we have getUserId and getTrackingUserIdParam, which look like the function and the variable that add the authorization parameter to the URL. The problem is that we still have to find where they are defined.

Task n°2 - Deobfuscation challenge: substitution

I found a file named main.c03ddfd249437fcce43410c35a21c6f8.js that contains an occurrence of getUserId and getTrackingUserIdParam:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},

I skipped a lot of code in the snippet above to make it clearer.

You can see that substitution is at work here: the t array is the base, the A function indexes into it with an offset to return a string fragment, and an init function rotates the initial t array so that the lookups decode to the correct strings.

You can paste this snippet into a Node.js script, modify it slightly, and use something like the following:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];

var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);

// prints e[pgatour] = e[pgatour] || {};

Here e is window, so you "just" need to replace every A(XXX) call to get a better idea of what is going on.

You will find this:

onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
            , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
            , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},

Which gives, after decoding:

onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
            , o = window["pgatour"]["setTrackingUserId"](I)
            , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},

The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known that since task n°1. Remember, in the first JS file:

var t = u.setTrackingUserId

and u is window.pgatour.

But here, the input parameter I is hard-coded:

var I = A(428) + A(423) + A(407);

This is equivalent to var I = "id8730931"

Now let's look at the window["pgatour"]["setTrackingUserId"] function.

Task n°3 - Encryption/reverse

Open the Chrome developer console on the site and paste window["pgatour"]["setTrackingUserId"]; the console prints the source of the function.

Yes :( again, more obfuscated code to deal with.

By looking through the application scripts, you can find that it lives in this file. Here is the URL of the JS file:

https://microservice.pgatour.com/js?_=1618868625306

There is a URL parameter that specifies an epoch time, and the code changes depending on this parameter.

Looking at the code itself, after replacing the input parameters, we get code built from String.fromCharCode and Math.abs that follows the pattern of the pseudocode at the top of this answer.

We can write a script that reproduces this algorithm more simply by extracting the step values (the XOR keys); the R script at the top of this answer is a port of it.

Output:

exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd
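To inspect a generated value like the one above, it can be split into its named fields. This is a small sketch with a hypothetical parseToken helper; reading exp as an expiry time in epoch seconds is an assumption based on its magnitude, not something the site documents.

```javascript
// Split a value such as "exp=...~acl=...~hmac=..." into its named fields.
function parseToken(token) {
  const fields = {};
  for (const part of token.split("~")) {
    const eq = part.indexOf("=");
    if (eq < 0) continue; // skip malformed parts
    fields[part.slice(0, eq)] = part.slice(eq + 1);
  }
  return fields;
}

const token = "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const { exp, acl, hmac } = parseToken(token);

// Assumption: exp is an expiry time in epoch seconds.
console.log(new Date(Number(exp) * 1000).toISOString(), acl, hmac.length);
```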

If you change the epoch time, it gives a different result.
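Since the result depends on the epoch time passed in the script URL, it helps to build that URL in one place. A minimal sketch, with buildScriptUrl as a hypothetical name; the host, path and "_" parameter are the ones shown in this answer.

```javascript
// Build the script URL with an epoch time in milliseconds as the "_" parameter.
function buildScriptUrl(epochMs = Date.now()) {
  const url = new URL("https://microservice.pgatour.com/js");
  url.searchParams.set("_", String(epochMs));
  return url.toString();
}

console.log(buildScriptUrl(1618868625306));
// prints https://microservice.pgatour.com/js?_=1618868625306
```

Passing a fixed timestamp makes a run reproducible while debugging; calling it with no argument uses the current time, as the site does.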

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

Then you just need to convert this script to R and make the HTTP call with the URL parameter.

Note that encodedId comes from the input id, id8730931, converted with this function (these values do not seem to change with the epoch time):

function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................

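My guess is that the server checks that the hmac correctly refers to the initial id string, id8730931, so hard-coding it is safe (since it is hard-coded on the server as well).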

  • Thank you. I got halfway there and then lost the thread. Really appreciate the detail + (2 upvotes)