NodeJS: Extract a sentence from HTML text based on a phrase

Question

I have some text stored in a database, like this:

let text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>"

The text can contain many paragraphs and HTML tags.

Now, I also have a phrase:

let phrase = 'lose touch'

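What I want to do is search for phrase in text and return the complete sentence in which phrase appears inside a strong tag.

In the example above, even though the first paragraph also contains the phrase 'lose touch', it should return the second sentence, because there the phrase is inside a strong tag. The result would be: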

They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.

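On the client side I could build a DOM tree from this HTML text, convert it to an array, and search each item in the array. In NodeJS, however, document is not available, so this is basically just plain text with HTML tags. How can I find the right sentence in this text?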

Answer 1

pla*_*ter 2

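I think this might help you.

If I understand the question correctly, there is no need to involve the DOM.

The solution works even if the p or strong tags have attributes.

If you want to search tags other than p, just update the paragraph regular expression and it should work.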

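const search_phrase = "lose touch";
// Escape regex metacharacters so the phrase is matched literally.
const escaped_phrase = search_phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
// Backslashes are doubled because this pattern is built from a string.
// No "g" flag: a global regex keeps lastIndex between test() calls and can skip matches.
const strong_regex = new RegExp(`<\\s*strong\\b[^>]*>${escaped_phrase}<\\s*/\\s*strong\\s*>`);
// \b keeps "p" from also matching tags such as <pre>; [\s\S] lets a paragraph span several lines.
const paragraph_regex = /<\s*p\b[^>]*>([\s\S]*?)<\s*\/\s*p\s*>/g;
const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";

const paragraphs = text.match(paragraph_regex);

if (paragraphs && paragraphs.length) {
    const paragraphs_with_strong_text = paragraphs.filter(paragraph => strong_regex.test(paragraph));
    console.log(paragraphs_with_strong_text);
    // logs an array holding only the second paragraph:
    // <p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>
}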
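
The code above returns whole paragraphs, while the question asks for the sentence. As a rough sketch of how the match might be narrowed down, the hypothetical helper below (findSentencesWithStrongPhrase is not part of the original answer) splits each paragraph into sentences first. It assumes that a sentence ends with ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g." or punctuation wrapped in closing tags would be split incorrectly.

```javascript
// Hypothetical helper (not from the original answer): returns the sentences
// in which `phrase` appears inside a <strong> element.
function findSentencesWithStrongPhrase(html, phrase) {
    // Escape regex metacharacters so the phrase is matched literally.
    const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // No "g" flag, so test() carries no state between calls.
    const strongRegex = new RegExp(`<strong\\b[^>]*>${escaped}</strong\\s*>`, "i");
    const paragraphRegex = /<p\b[^>]*>([\s\S]*?)<\/p\s*>/gi;
    const results = [];
    for (const match of html.matchAll(paragraphRegex)) {
        // Assumption: a sentence ends at ., ! or ? followed by whitespace.
        const sentences = match[1].split(/(?<=[.!?])\s+/);
        for (const sentence of sentences) {
            if (strongRegex.test(sentence)) {
                results.push(sentence.trim());
            }
        }
    }
    return results;
}

const sampleHtml = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
const sentencesFound = findSentencesWithStrongPhrase(sampleHtml, "lose touch");
console.log(sentencesFound);
// sentencesFound holds one entry:
// They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.
```

The lookbehind in the split pattern and String.prototype.matchAll both need a reasonably recent Node.js release (12 or later should be enough).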

PS: This code is not optimized; you can change it to suit your application's requirements.
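
To illustrate the remark about searching tags other than p, here is one possible way to build the element pattern from a tag name. The helper elementRegex and the sample list markup are assumptions made for this sketch, not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): builds a global regex
// matching one element of the given tag name, with or without attributes.
function elementRegex(tagName) {
    // Assumption: tagName is a plain tag name such as "p", "li" or "div".
    // \b stops "p" from also matching "<pre>"; [\s\S] lets content span lines.
    return new RegExp(`<${tagName}\\b[^>]*>([\\s\\S]*?)</${tagName}\\s*>`, "gi");
}

const listHtml = '<ul><li class="a">We <strong>lose touch</strong> daily.</li><li>Nothing here.</li></ul>';
const strongPhrase = /<strong\b[^>]*>lose touch<\/strong\s*>/i;
// Keep only the <li> elements whose content has the phrase inside <strong>.
const items = (listHtml.match(elementRegex("li")) || []).filter(item => strongPhrase.test(item));
console.log(items);
// items holds one entry: <li class="a">We <strong>lose touch</strong> daily.</li>
```

Because the pattern stops at the first closing tag, elements nested inside another element of the same name (for example a div inside a div) are not matched correctly by this approach.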
