How can I extract a JavaScript object from crawled JavaScript code?

Won*_*Kim 4 javascript web-crawler node.js puppeteer

TL;DR

I want a function parseParameter that parses the data out of crawled JavaScript code, like in the snippet below. someCrawledJSCode is the crawled JavaScript code.

const data = parseParameter(someCrawledJSCode);
console.log(data);  // data1: {...}

Problem

I am crawling some JavaScript code with Puppeteer and want to extract a JSON object from it, but I don't know how to parse the crawled JavaScript code.

Example of the crawled JavaScript code:

const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{
      prop1: 'hi',
      prop2: 'hello',
    },
    {
      prop1: 'foo',
      prop2: 'bar',
    }]
  }
});

In this code, I want to get the prices array (or the whole data1 object).

What I've tried

I tried parsing the code as JSON, but that didn't work. So I searched for parsing tools and found Esprima, but I don't think it helps with this problem either.

Tho*_*orf 6

Short answer: Don't (re)build a parser in Node.js, use the browser instead

I strongly advise against evaluating or parsing crawled data in Node.js if you are using Puppeteer for crawling anyway. With Puppeteer you already have a browser with a great sandbox for JavaScript code, running in another process. Why give up that isolation and rebuild a parser in your Node.js script? If the parsing code breaks, your whole script fails. In the worst case, you might even expose your machine to serious risks by running untrusted code inside your main thread.

Instead, try to do as much parsing as possible inside the context of the page. You can even make an evil eval call there. The worst that could happen? Your browser hangs or crashes.

Example

Imagine the following HTML page (very much simplified). You are trying to read the text that is pushed into an array. The only information you have is that the pushed object has an additional property id, which is set to 'target-data'.

<html>
<body>
  <!-- ... -->
  <script>
    var arr = [];
    // some complex code...
    arr.push({
      id: 'not-interesting-data',
      data: 'some data you do not want to crawl',
    });
    // more complex code here...
    arr.push({
      id: 'target-data',
      data: 'THIS IS THE DATA YOU WANT TO CRAWL', // <---- You want to get this text
    });
    // more code...
    arr.push({
      id: 'some-irrelevant-data',
      data: 'again, you do not want to crawl this',
    });
  </script>
  <!-- ... -->
</body>
</html>

Bad code

Here is a simple example of what your code might look like right now:

await page.goto('http://...');
const crawledJsCode = await page.evaluate(() => document.querySelector('script').innerHTML);

In this example, the script extracts the JavaScript code from the page. Now we have the JavaScript code from the page and we "only" need to parse it, right? Well, this is the wrong approach. Don't try to rebuild a parser inside Node.js. Just use the browser. There are basically two approaches you can take to do that in your case.

  1. Inject proxy functions into the page and fake some built-in functions (recommended)
  2. Parse the data on the client-side (!) by using JSON.parse, a regex or eval (eval only if really necessary)

Option 1: Inject proxy functions into the page

In this approach you are replacing native browser functions with your own "fake functions". Example:

const originalPush = Array.prototype.push;
Array.prototype.push = function (item) {
    if (item && item.id === 'target-data') {
        const data = item.data; // This is the data we are trying to crawl
        window.exposedDataFoundFunction(data); // send this data back to Node.js
    }
    originalPush.apply(this, arguments);
};

This code replaces the original Array.prototype.push function with our own function. Everything works as normal, but when an item with our target id is pushed into an array, a special condition is triggered. To inject this function into the page, you can use page.evaluateOnNewDocument. To receive the data in Node.js, you have to expose a function to the browser via page.exposeFunction:

// called via window.exposedDataFoundFunction from within the fake Array.prototype.push function
await page.exposeFunction('exposedDataFoundFunction', data => {
    // handle the data in Node.js
});

Now it doesn't really matter how complex the page's code is, whether the push happens inside some asynchronous handler, or whether the page changes the surrounding code. As long as the page pushes the target data into an array, we will get it.

You can use this approach for a lot of crawling tasks. Check how the data is processed and replace the low-level functions processing the data with your own proxy versions.

Option 2: Parse the data

Let's assume the first approach does not work for some reason. The data is in some script tag, but you are not able to get it by using fake functions.

Then you should parse the data, but not inside your Node.js environment. Do it inside the page context. You could run a regular expression or use JSON.parse there. But do it before returning the data to Node.js. This approach has the benefit that if your parsing code crashes for some reason, it will not take down your main script; only the browser page crashes.

To give some example code: instead of returning the raw code as in the "bad code" sample above, we change it to this:

const crawledJsCode = await page.evaluate(() => {
    const code = document.querySelector('script').innerHTML; // instead of returning this
    const match = code.match(/some tricky regex which extracts the data you want/); // we run our regex in the browser
    return match; // and only return the results
});

This will only return the parts of the code we need, which can then be further processed from within Node.js.


Independent of which approach you choose, both ways are much better and more secure than running unknown code inside your main thread. If you absolutely have to process the data in your Node.js environment, use a regular expression for it, as shown in the answer from trincot. You should never use eval to run untrusted code.