C++: Get HTML Source Code

Question

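I'd like to know how to download a website's HTML source into a string without using LibCurl. I searched online and found examples that use Wininet.

Below is the sample code I use with Wininet. How can I do the same thing using Winsock?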

#include <windows.h>
#include <wininet.h>
#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

#pragma comment ( lib, "Wininet.lib" )

int main()
{
    // Open a WinINet session using the system's configured proxy settings.
    HINTERNET hInternet = InternetOpenA("InetURL/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);

    // NULL user name and password: no credentials are sent.
    HINTERNET hConnection = hInternet
        ? InternetConnectA(hInternet, "google.com", 80, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0)
        : NULL;

    HINTERNET hData = hConnection
        ? HttpOpenRequestA(hConnection, "GET", "/", NULL, NULL, NULL, INTERNET_FLAG_KEEP_CONNECTION, 0)
        : NULL;

    char buf[2048];
    string lol;
    DWORD bytesRead = 0;
    DWORD totalBytesRead = 0;

    if (hData == NULL || !HttpSendRequestA(hData, NULL, 0, NULL, 0))
    {
        printf("Request failed, error %lu\n", GetLastError());
    }
    else
    {
        // http://msdn.microsoft.com/en-us/library/aa385103(VS.85).aspx
        // To ensure all data is retrieved, an application must continue to call the
        // InternetReadFile function until the function returns TRUE and the
        // lpdwNumberOfBytesRead parameter equals zero.
        while (InternetReadFile(hData, buf, sizeof(buf), &bytesRead) && bytesRead != 0)
        {
            fwrite(buf, 1, bytesRead, stdout); // print this chunk to the screen.
            lol.append(buf, bytesRead);        // append exactly the bytes read; the data is not null-terminated.

            printf("%lu bytes read\n", bytesRead);

            totalBytesRead += bytesRead;
        }
    }

    printf("\n\n END -- %lu bytes read\n", bytesRead);
    printf("\n\n END -- %lu TOTAL bytes read\n", totalBytesRead);

    // Close only the handles that were actually opened.
    if (hData) InternetCloseHandle(hData);
    if (hConnection) InternetCloseHandle(hConnection);
    if (hInternet) InternetCloseHandle(hInternet);

    cout << "\nThe beginning." << endl << endl << endl;

    cout << lol << endl;

    system("PAUSE");
    return 0;
}


This WinSock example works for sites with no additional path. How do I get the HTML of a page like this: (www.website.com/page)

#include <iostream>
#include <winsock2.h>
#include <ws2tcpip.h>
#include <string>
#include <fstream>
#include <stdlib.h>
using namespace std;

#pragma comment ( lib, "Ws2_32.lib" )

string get_source()
{
    WSADATA WSAData;
    if (WSAStartup(MAKEWORD(2, 0), &WSAData) != 0)
        return ""; // Winsock could not be initialized.

    SOCKADDR_IN sin = {};

    char buffer[1024];

    ////////////////This is portion that is confusing me//////////////////////////////////////////////////
    string srequete = "GET /id/AeroNX/ HTTP/1.1\r\n";
    srequete += "Host: steamcommunity.com\r\n";
    srequete += "Connection: close\r\n";
    srequete += "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n";
    srequete += "Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3\r\n";
    srequete += "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n";
    srequete += "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3\r\n";
    srequete += "Referer: http://pozzyx.net/\r\n";
    srequete += "\r\n";
    ///////////////////////////////////////////////////////////////////////////////////////////////////////

    int i = 0;
    string source = "";

    SOCKET sock = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, "63.228.223.103", &sin.sin_addr); // epguides.com // why won't it work for 72.233.89.200 (whatismyip.com)
    sin.sin_family = AF_INET;
    sin.sin_port = htons(80); // HTTP port.

    // Connect to the web site; only talk to the server if that succeeded.
    if (sock != INVALID_SOCKET && connect(sock, (SOCKADDR *)&sin, sizeof(sin)) == 0)
    {
        // Send the request straight from the std::string; no intermediate char array is needed.
        send(sock, srequete.c_str(), (int)srequete.size(), 0); // why do we send the string??

        do
        {
            i = recv(sock, buffer, sizeof(buffer), 0); // buffer receives the incoming data.
            if (i > 0)
                source.append(buffer, i); // recv does not null-terminate; append only the i bytes received.
        } while (i > 0); // 0 means the server closed the connection, SOCKET_ERROR means failure.
    }

    if (sock != INVALID_SOCKET)
        closesocket(sock); // close the socket.
    WSACleanup();

    return source;
}

int main()
{
    ofstream fout;
    fout.open("Buffer.txt");
    fout << get_source(); // the string url doesn't matter
    fout.close();
    system("PAUSE");
    return 0;
}


Answer 1

Dav*_*hun 9

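OK, I see you only need a little help with HTTP, not a breakdown of the whole thing. Still, after giving you the short answer, I'll leave my full description here for future readers.

Short answer: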

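In the first line, where you say GET /foo/bar.html HTTP/1.1, the middle part (/foo/bar.html) is the path to the resource. So, for example, if you want to get http://www.myserver.com/foo/bar.html then you put /foo/bar.html there. If you want to get http://www.myserver.com/get/my/file.html then the first line of the request is GET /get/my/file.html HTTP/1.1. The remaining lines of your request don't need to change to fetch a different resource (although Host: does need to change if you are fetching something from a completely different server, e.g. Host: www.myserver.com).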
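As a rough illustration of the short answer, the request text can be assembled from a host name and a resource path. This is only a sketch: buildGetRequest is a hypothetical helper name, and it sends just the three header lines discussed in this answer rather than the longer browser-style list from the question.

```cpp
#include <string>

// Hypothetical helper (not part of Winsock or any library): builds a minimal
// HTTP/1.1 GET request for `path` on `host`.
std::string buildGetRequest(const std::string& host, const std::string& path)
{
    // An empty path is treated as the site root.
    const std::string resource = path.empty() ? "/" : path;

    std::string request = "GET " + resource + " HTTP/1.1\r\n";
    request += "Host: " + host + "\r\n";
    request += "Connection: close\r\n";
    request += "\r\n"; // the blank line ends the headers (and this request)
    return request;
}
```

With this, buildGetRequest("www.myserver.com", "/get/my/file.html") yields a request whose first line is GET /get/my/file.html HTTP/1.1, and the result can be passed to send() as request.c_str() with request.size() as the length.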

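Full description of HTTP:

Are you trying to fetch the page without using any library at all, just plain TCP sockets? If so, you will have to implement the HTTP protocol yourself (the client side, anyway), but the good news is that HTTP is very easy to learn and almost as easy to implement. :)

To send a request for a page, open a TCP connection to port 80 on the web server. Then send this: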

GET <resource> HTTP/1.1\r\n
Host: <web_server_name>\r\n
Connection: close\r\n
\r\n


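Note that I have written out the \r\n line endings explicitly. Two things matter about them: 1) the protocol requires \r\n, not just \n, and 2) the HTTP headers must end with a double \r\n\r\n, that is, a blank line. (Your request has no data section, so the end of the headers is also the end of the whole request message.)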

Replace <resource> with the path of the file you want to fetch, and <web_server_name> with the DNS name of the web server. For example, to retrieve http://www.cc.gatech.edu/~davel/classes/cs3251/summer2011/test/hypertext.html, <web_server_name> (the Host field) is www.cc.gatech.edu and <resource> is /~davel/classes/cs3251/summer2011/test/hypertext.html.
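The split between the Host value and the resource path can be sketched as a small function. splitHttpUrl is a hypothetical name, and the sketch assumes a plain http:// URL with no port number or user info; it does no validation.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "http://host/path" into {host, resource}.
// Assumes a plain http:// URL without a port; a missing path becomes "/".
std::pair<std::string, std::string> splitHttpUrl(const std::string& url)
{
    const std::string scheme = "http://";
    std::size_t hostStart = 0;
    if (url.compare(0, scheme.size(), scheme) == 0)
        hostStart = scheme.size(); // skip the scheme if it is present

    // The first '/' after the host marks the start of the resource path.
    const std::size_t pathStart = url.find('/', hostStart);
    if (pathStart == std::string::npos)
        return { url.substr(hostStart), "/" };

    return { url.substr(hostStart, pathStart - hostStart), url.substr(pathStart) };
}
```

For the URL in the example above, the first element is www.cc.gatech.edu and the second is /~davel/classes/cs3251/summer2011/test/hypertext.html.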

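The web server will send an HTTP response message back on the same socket. If all goes well, you will receive a message that looks something like this: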

HTTP/1.1 200 OK\r\n
Date: Mon, 23 May 2005 22:38:34 GMT\r\n
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)\r\n
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT\r\n
ETag: "3f80f-1b6-3e1cb03b"\r\n
Content-Type: text/html; charset=UTF-8\r\n
Content-Length: 131\r\n
Connection: close\r\n
\r\n
<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>


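Notice the double \r\n\r\n again, which marks the end of the HTTP headers. After it comes the data section, which contains the HTML source of the page. I have not shown the line breaks in the data section explicitly, because they are part of the data itself rather than of the HTTP protocol (so they are not necessarily \r\n). Also notice the Content-Length field. It tells you how many bytes long the data section is (the HTML source, in this case), so you can read the right amount from the socket. There is no \r\n after the end of the data section. (The data itself may or may not end with a newline. If it does, that newline is counted in the Content-Length bytes.)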
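If the whole response has already been collected into one string (as get_source() in the question does), the headers and the data section can be separated at the first blank line. A minimal sketch, with HttpMessage and splitResponse as hypothetical names:

```cpp
#include <string>

// Hypothetical container for the two halves of a raw HTTP response.
struct HttpMessage
{
    std::string head;  // status line and header fields, without the final blank line
    std::string body;  // everything after the blank line (the HTML source)
    bool complete;     // false if no "\r\n\r\n" was found
};

// Splits a raw response at the first "\r\n\r\n", which ends the headers.
HttpMessage splitResponse(const std::string& raw)
{
    HttpMessage msg{ "", "", false };
    const std::string delimiter = "\r\n\r\n";

    const std::size_t end = raw.find(delimiter);
    if (end == std::string::npos)
        return msg; // headers not finished yet (or not HTTP at all)

    msg.head = raw.substr(0, end);
    msg.body = raw.substr(end + delimiter.size());
    msg.complete = true;
    return msg;
}
```

Only the first blank line is treated as the boundary, so blank lines inside the HTML itself stay in the body untouched.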

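The only somewhat difficult part is receiving and parsing the HTTP message. I find the easiest way to receive HTTP is to read from the socket one line at a time, parsing each header field (you don't have to handle every field; many of them can be ignored). Once you get a blank line, you know the headers are finished. Then read from the socket the number of bytes of data payload given by Content-Length. (Before relying on that, it is a good idea to do some error checking: verify that 1) you got 200 OK on the first line of the response, since anything else means a redirect or some kind of error, and 2) there actually is a Content-Length field somewhere in the headers. A server may instead use Transfer-Encoding: chunked, in which case there is no Content-Length.)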
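The line-by-line parsing described here could look roughly like the following. It works on header text that has already been read into a string, not on a live socket; ResponseHead and parseHead are hypothetical names, and field names are lower-cased because HTTP header names are case-insensitive.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Hypothetical result type: the status code plus the header fields.
struct ResponseHead
{
    int status = 0;                             // e.g. 200; stays 0 if the status line is malformed
    std::map<std::string, std::string> fields;  // keys are lower-cased field names
};

// Parses the status line and header fields, one line at a time.
ResponseHead parseHead(const std::string& head)
{
    ResponseHead result;
    std::istringstream lines(head);
    std::string line;
    bool firstLine = true;

    while (std::getline(lines, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();          // getline splits on '\n'; drop the '\r'
        if (line.empty())
            break;                    // blank line: the headers are finished

        if (firstLine)
        {
            // Status line, e.g. "HTTP/1.1 200 OK".
            std::istringstream statusLine(line);
            std::string version;
            statusLine >> version >> result.status;
            firstLine = false;
            continue;
        }

        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;                 // not a "Name: value" line; ignore it

        std::string name = line.substr(0, colon);
        for (char& c : name)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        const std::size_t valueStart = line.find_first_not_of(" \t", colon + 1);
        result.fields[name] = (valueStart == std::string::npos) ? "" : line.substr(valueStart);
    }
    return result;
}
```

A caller would then check that status is 200 and that fields contains "content-length" before converting that value with std::stoul and reading that many bytes of body.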

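Also, the Connection: close field in the request (echoed in the response) means that the server may close the TCP connection after sending you the response. If you want to make many requests you might use Connection: keep-alive instead, but that gets a bit more complicated because you have to pay attention to the Connection field in the response. Technically, the server is allowed to send back Connection: close and close the socket even if you asked for keep-alive. So always asking for Connection: close makes for simpler code, and it is enough if you just want a single page.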
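With Connection: close, the simplest receive loop just reads until the server closes the connection. The sketch below keeps the socket out of the picture so the loop itself is easy to see: Reader and readUntilClose are hypothetical names, and the callable is assumed to follow the return-value convention of recv() (positive byte count, 0 on orderly close, negative on error).

```cpp
#include <functional>
#include <string>

// Assumed to behave like recv(): returns the number of bytes written into
// `buffer`, 0 when the peer has closed the connection, or a negative value on error.
using Reader = std::function<int(char* buffer, int capacity)>;

// Appends everything the reader delivers to `out` until the connection closes.
// Returns true on an orderly close and false if the reader reported an error.
bool readUntilClose(const Reader& readSome, std::string& out)
{
    char buffer[1024];
    for (;;)
    {
        const int n = readSome(buffer, static_cast<int>(sizeof(buffer)));
        if (n == 0)
            return true;   // server closed the connection: the response is complete
        if (n < 0)
            return false;  // error: stop instead of looping forever
        // The received bytes are not null-terminated, so append exactly n of them.
        out.append(buffer, static_cast<std::size_t>(n));
    }
}
```

With Winsock, the callable could be a lambda that forwards to recv(sock, buffer, capacity, 0) on a connected socket.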

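The Wikipedia page on HTTP is of some help, but it is short on detail. (Although I shamelessly lifted my example HTTP response from there.) https://en.wikipedia.org/wiki/Http

If someone has a link to a better online HTTP reference (one that is easier to follow than the standards documents), please feel free to add it to this post by editing, or put it in a comment.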
