C++ Screen Scraping from HTML

Question

I am trying to extract the text "Lady Gaga Fame Monster" from the HTML below using substr and find, but I have not been able to retrieve it.

<div class="album-name"><strong>Album</strong> > Lady Gaga Fame Monster</div>


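I tried to extract the whole string first, but cout << line_found only prints as far as Album, because the space after it stops the extraction from going any further.

I also tried cout << extract_line. The HTML it prints contains no spaces at all.

I followed this basic tutorial, http://www.cplusplus.com/reference/string/string/substr/, where substr works even when the string contains spaces. I am following it closely, but in my program the extraction stops as soon as it reaches a space. I have spent 2 days on this without finding a solution. Any help would be really appreciated. Thanks.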

Here is the source code:

#include "parser.h"
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>

using namespace std;

int main() {

    string line_found, extract_line, result, finalResult="";
    int firstPosition, secondPosition, input, location;

    ifstream sourceFile ("cd1.htm"); // extracts from sourcefile

    while(!sourceFile.eof())
    {
        sourceFile >> extract_line;
        location = extract_line.find("album-name");
       // cout << extract_line;

       if (location >=0)
       {       
            line_found = extract_line.substr(location);
            cout << line_found << endl;
            firstPosition= line_found.find_first_of(">");

            result = line_found.substr(firstPosition);

       }
    }    
    return 0;
}


Answer 1

Mar*_*tos 6

The >> operator does not read lines. It reads whitespace-delimited tokens. Use std::getline instead (see here).
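To make the difference concrete, here is a minimal sketch. The helper names first_token, first_line and find_line are made up for illustration (they are not from the question), and the input comes from a string stream rather than cd1.htm so the snippet stands on its own.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads with operator>>: skips leading whitespace, then stops at the next
// whitespace character, so only one token is returned.
std::string first_token(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    in >> token;
    return token;
}

// Reads with std::getline: consumes everything up to (not including) '\n'.
std::string first_line(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);
    return line;
}

// Returns the first line of `in` that contains `marker`, or "" if none does.
// Looping on std::getline itself ends cleanly at end of input, unlike
// while (!in.eof()), which attempts one extra read after the last line.
std::string find_line(std::istream& in, const std::string& marker) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(marker) != std::string::npos) {
            return line;
        }
    }
    return "";
}
```

With the div from the question, first_token returns only <div, while first_line returns the whole line. The token that contains album-name is class="album-name"><strong>Album</strong>, which is why the output in the question ends at Album. An std::ifstream can be passed to find_line directly, and comparing the result of find against std::string::npos is more robust than storing it in an int and testing location >= 0.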

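Better still, don't use string-searching tools to parse HTML. It is a disaster waiting to happen. In fact, it is already happening to you. Note that your line contains multiple instances of >, so you may well find the wrong one and get into a complete mess trying to skip over all the irrelevant ones. (You could try searching for " > ", but that breaks as soon as you come across something like ...class="album-name" > <strong>..., which is perfectly valid HTML.)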
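The fragility described above can be reproduced in a few lines. This is only an illustration of the pitfall, not a recommended parser. The function names count_gt, from_first_gt and after_spaced_gt are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Counts how many '>' characters the line contains.
std::size_t count_gt(const std::string& line) {
    return static_cast<std::size_t>(std::count(line.begin(), line.end(), '>'));
}

// Mirrors the approach in the question: keep everything from the first '>'.
// Returns "" when the line has no '>' at all.
std::string from_first_gt(const std::string& line) {
    const std::size_t pos = line.find('>');
    return pos == std::string::npos ? "" : line.substr(pos);
}

// The tempting workaround: keep everything after the first " > ".
// Returns "" when " > " does not occur.
std::string after_spaced_gt(const std::string& line) {
    const std::string sep = " > ";
    const std::size_t pos = line.find(sep);
    return pos == std::string::npos ? "" : line.substr(pos + sep.size());
}
```

On the div from the question, count_gt reports five > characters, and from_first_gt stops at the one that closes the opening div tag, so the result still starts with ><strong>Album. after_spaced_gt happens to land on Lady Gaga Fame Monster</div> for that exact line, but if the markup is written as class="album-name" > <strong>, the same call returns text beginning with <strong>Album instead.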

If the HTML is proper XHTML, use an XML parser instead. For example, Expat is small, fast and (relatively) easy to use. You can find a nice, simple introduction here.

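If the HTML is messy, you are going to have a hard time in C++. There is a related SO question here. Alternatively, use a language with a good HTML library, such as Python (Beautiful Soup), which you can call from C++.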
