use*_*852 3 html-sanitizing node.js
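How do I get "sanitize-html" to actually remove HTML tags and keep only their contents? Currently, if I configure it to keep div elements, the output also contains <div>some content</div>, but I only want the inner part ('some content').
In short: I don't want the tags, attributes, etc., only the contents of those elements.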
var Crawler = require("js-crawler");
var download = require("url-download");
var sanitizeHtml = require('sanitize-html');
var util = require('util');
var fs = require('fs');

new Crawler().configure({depth: 1})
  .crawl("http://www.cnn.com", function onSuccess(page) {
    var clean = sanitizeHtml(page.body, {
      allowedTags: [ 'p', 'em', 'strong', 'div' ],
    });
    console.log(clean);
    fs.writeFile('sanitized.txt', clean, function (err) {
      if (err) throw err;
      console.log('It\'s saved! in same location.');
    });
    console.log(util.inspect(clean, {showHidden: false, depth: null}));
    var str = JSON.stringify(clean.toString());
    console.log(str);
    /*download(page.url, './download')
      .on('close', function () {
        console.log('One file has been downloaded.');
      });*/
  });
Tom*_*ell 12
I am the author of sanitize-html.
You can set allowedTags to an empty array. sanitize-html does not discard the contents of a disallowed tag, only the tag itself (with the exception of a few tags like "script" and "style" for which this would not make sense). Otherwise it wouldn't be much use for its original intended purpose, which is cleaning up markup copied and pasted from word processors and the like into a rich text editor.
However, if you have markup like:
<div>One</div><div>Two</div>
That will come out as:
OneTwo
To work around that, you can use the textFilter option to ensure the text of a tag is always followed by at least one space:
textFilter: function(text) {
  return text + ' ';
}
However, this will also introduce extra spaces in sentences that contain inline tags like "strong" and "em".
So the more I think about it, the best answer for you is probably a completely different npm module:
https://www.npmjs.com/package/html-to-text
It's widely used and much better suited to your use case. sanitize-html is really meant for situations where you want the tags... just not the wrong tags.