How do I read and parse HTML in Node.js?

Kuc*_*ova 3 html parsing node.js

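I have a simple project and need some help with it. I need to read an HTML file and convert it to JSON, capturing the matches separately as code and as text. How can I achieve this?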

So I have two kinds of HTML tags:

<p>In practice, it is usually a bad idea to modify global variables inside the function scope since it often is the cause of confusion and weird errors that are hard to debug.<br />
If you want to modify a global variable via a function, it is recommended to pass it as an argument and reassign the return-value.<br />
For example:</p>

<pre><code class="{python} language-{python}">a_var = 2

def a_func(some_var):
    return 2**3

a_var = a_func(a_var)
print(a_var)
</code></pre>

My code:

const fs = require('fs')
const showdown  = require('showdown')

var read =  fs.readFileSync('./test.md', 'utf8')

function importer(mdFile) {

    var result = []
    let json = {}

    var converter = new showdown.Converter()
    var text      = mdFile
    var html      = converter.makeHtml(text);

    for (var i = 0; i < html.length; i++) {
        htmlRead = html[i]
        if(html == html.match(/<p>(.*?)<\/p>/g))
            json.text = html.match(/<p>(.*?)<\/p>/g)

       if(html == html.match(/<pre>(.*?)<\/pre>/g))
            json.code = html.match(/<pre>(.*?)<\/pre>/g)

    }

    return html
}
console.log(importer(read))

How can I get these matches in my code?

New code: this writes every p tag into the same JSON object. How can I write each p tag into its own JSON block?

$('html').each(function(){
    if ($('p').text != undefined) {
        json.code = $('p').text()
        json.language = "Text"
    }
})

小智 6

I suggest using Cheerio. It brings a jQuery-style API to Node.js.

const cheerio = require('cheerio')

var html = "<p>In practice, it is usually a bad idea to modify global variables inside the function scope since it often is the cause of confusion and weird errors that are hard to debug.<br />If you want to modify a global variable via a function, it is recommended to pass it as an argument and reassign the return-value.<br />For example:</p>"

const $ = cheerio.load(html)
var paragraph = $('p').html(); //Contents of paragraph. You can manipulate this in any other way you like

//...You would do the same for any other element you require

You should take a look at Cheerio and read its documentation. I find it really neat!
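As background for why a parser is suggested over the regular expressions in the question, here is a minimal dependency-free sketch. It assumes the HTML resembles the snippet in the question. The sample string and the names `naive`, `paragraphs` and `codeBlocks` are illustrative and not from the original post. By default `.` does not match line breaks, so `/<p>(.*?)<\/p>/g` finds nothing when a paragraph spans several lines.

```javascript
// Sample markup shaped like the HTML in the question (shortened, hypothetical).
const html = [
  '<p>First line.<br />',
  'Second line.</p>',
  '<pre><code class="language-python">a_var = 2',
  'print(a_var)',
  '</code></pre>',
].join('\n');

// `.` stops at line breaks, so the multi-line paragraph is never matched.
const naive = html.match(/<p>(.*?)<\/p>/g);

// [\s\S] matches any character, line breaks included.
// matchAll yields one match per element and exposes the capture group at index 1.
const paragraphs = [...html.matchAll(/<p>([\s\S]*?)<\/p>/g)].map((m) => m[1]);
const codeBlocks = [...html.matchAll(/<pre>([\s\S]*?)<\/pre>/g)].map((m) => m[1]);

console.log(naive); // null
console.log(paragraphs);
console.log(codeBlocks);
```

Even the adjusted patterns stop working as soon as a tag carries attributes (for example `<p class="x">`) or elements are nested, which is the kind of input a parser such as Cheerio is built for.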

Edit: for the new part of your question

You can iterate over each element and push it into an array of JSON objects, like this:

var jsonObject = []; //An array of JSON objects that will hold everything
$('p').each(function() { //Loop for each paragraph
   //Now let's take the content of the paragraph and put it into a json object
    jsonObject.push({"paragraph":$(this).html()}); //Add data to the main jsonObject    
});

The resulting array of JSON objects should then look like this:

[
  {
    "paragraph": "text"
  },
  {
    "paragraph": "text 2"
  },
  {
    "paragraph": "text 3"
  }
]

I believe you should also read up on JSON and how it works.
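As a starting point, the sketch below shows the difference between an in-memory array of objects and its JSON text, using only the built-in `JSON.stringify` and `JSON.parse`. The names `blocks`, `jsonText` and `roundTripped` are illustrative, and the data is made up to mirror the sample result above.

```javascript
// A plain JavaScript array of objects, shaped like the sample result.
const blocks = [
  { paragraph: 'text' },
  { paragraph: 'text 2' },
];

// JSON.stringify serializes the value to a JSON string (indented by 2 spaces).
const jsonText = JSON.stringify(blocks, null, 2);

// JSON.parse turns that string back into ordinary objects.
const roundTripped = JSON.parse(jsonText);

console.log(typeof jsonText); // "string"
console.log(roundTripped[1].paragraph); // "text 2"
```

The array only becomes JSON once it is serialized, for example before writing it to a file or sending it over the network.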