Get the page title from a scraped web page

Question

I want to get the page title from a web page I am scraping:

var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
http.get(urlOpts, function (response) {
    response.on('data', function (chunk) {
        var str = chunk.toString();
        var re = new RegExp("(<\s*title[^>]*>(.+?)<\s*/\s*title)\>", "g");
        console.log(str.match(re));
    });
});


Output:

user@dev ~ $ node app.js
['node.js']
null
null

I only need to get the title.

Answer 1

bdu*_*kes 7

I would suggest using RegExp.exec instead of String.match. You can also define the regular expression with the literal syntax, and define it only once:

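var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
// No "g" flag: a global regex keeps lastIndex between exec() calls, so after
// one match the next exec() would start searching partway into the next chunk.
var re = /(<\s*title[^>]*>(.+?)<\s*\/\s*title)>/i;
http.get(urlOpts, function (response) {
    response.on('data', function (chunk) {
        var str = chunk.toString();
        var match = re.exec(str);
        // match[2] is the inner capture group: the text between the title tags
        if (match && match[2]) {
            console.log(match[2]);
        }
    });
});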

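To illustrate why exec is the better fit here, the small sketch below compares the two calls on a hard-coded string. The sample markup and the variable names are made up for the example, and the pattern is simplified.

```javascript
// Sample input, invented for this illustration.
var html = '<head><title>node.js</title></head>';

// With the "g" flag, match() returns only the full matched strings,
// so the capture group holding the title text is not available.
var viaMatch = html.match(/<title>(.+?)<\/title>/g);

// exec() returns a single match whose capture groups start at index 1.
var viaExec = /<title>(.+?)<\/title>/.exec(html);

console.log(viaMatch);   // the whole element, tags included
console.log(viaExec[1]); // only the text between the tags
```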

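This code also assumes that the title falls entirely within one chunk and is not split across two. If the title might be split between chunks, it is better to keep an aggregate of the chunks received so far. You may also want to stop looking for the title once you have found it.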
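A rough sketch of that idea follows: accumulate the chunks, test the accumulated text after each one, and stop reading once a title turns up. The helper names createTitleScanner and fetchTitle are hypothetical, invented for this example. The sketch also assumes the page is UTF-8 and that re-scanning the buffer on every chunk is acceptable, since the title normally appears early in the document.

```javascript
var http = require('http');

// Non-global on purpose, so exec() always searches from the start of the buffer.
// [\s\S] lets the title text span line breaks.
var TITLE_RE = /<\s*title[^>]*>([\s\S]*?)<\s*\/\s*title\s*>/i;

// Hypothetical helper: returns a function that accepts string chunks and
// returns the title once the accumulated text holds a complete <title> element.
function createTitleScanner() {
    var buffer = '';
    return function feed(chunk) {
        buffer += chunk;
        var match = TITLE_RE.exec(buffer);
        return match ? match[1].trim() : null;
    };
}

// Hypothetical wrapper: calls callback(err, title) exactly once.
// title is null if the response ends without a <title> element.
function fetchTitle(urlOpts, callback) {
    var done = false;
    function finish(err, title) {
        if (done) { return; }
        done = true;
        callback(err, title);
    }
    var request = http.get(urlOpts, function (response) {
        var feed = createTitleScanner();
        response.setEncoding('utf8'); // deliver strings, not Buffers (assumes UTF-8)
        response.on('data', function (chunk) {
            var title = feed(chunk);
            if (title !== null) {
                finish(null, title);
                response.destroy(); // the rest of the body is not needed
            }
        });
        response.on('end', function () { finish(null, null); });
        response.on('error', finish);
    });
    request.on('error', finish);
}
```

It could be called as fetchTitle(urlOpts, function (err, title) { console.log(err || title); }) with the same urlOpts object as above. Because the scanner is separate from the HTTP code, it can be tried out by feeding it strings directly.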
