Parsing HTML with Go

Ake*_*eem 2 html go html-parsing web-scraping

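I'm trying to build a web scraper in Go. I'm fairly new to the language, and I'm not sure what I'm doing wrong with the html parser. I'm trying to parse the HTML to find anchor tags, but I keep getting html.TokenTypeEnd.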

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "io/ioutil"
    "net/http"
)

func GetHtml(url string) (text string, resp *http.Response, err error) {
    var bytes []byte
    if url == "https://www.coastal.edu/scs/employee" {
        resp, err = http.Get(url)
        if err != nil {
            fmt.Println("There seems to be an error with the Employee Console.")
        }
        bytes, err = ioutil.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("Cannot read byte response from Employee Console.")
        }
        text = string(bytes)
    } else {
        fmt.Println("Issue with finding URL. Looking for: " + url)
    }

    return text, resp, err
}

func main() {
    htmlSrc, response, err := GetHtml("https://www.coastal.edu/scs/employee")
    if err != nil {
        fmt.Println("Cannot read HTML source code.")
    }
    _ = htmlSrc
    htmlTokens := html.NewTokenizer(response.Body)
    i := 0
    for i < 1 {

        tt := htmlTokens.Next()
        fmt.Printf("%T", tt)
        switch tt {

        case html.ErrorToken:
            fmt.Println("End")
            i++

        case html.TextToken:
            fmt.Println(tt)

        case html.StartTagToken:
            t := htmlTokens.Token()

            isAnchor := t.Data == "a"
            if isAnchor {
                fmt.Println("We found an anchor!")
            }

        }

    }
}

Whenever I print with fmt.Printf("%T", tt), I get html.TokenTypeEnd.

Cer*_*món 8

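The application reads to the end of the response body in GetHtml, so by the time the tokenizer is created there is nothing left to read. The first read on the body returns EOF, and the tokenizer returns html.ErrorToken. The output "html.TokenTypeEnd" is the type name printed by fmt.Printf("%T", tt) (html.TokenType, with no trailing newline) followed directly by the "End" printed in the html.ErrorToken case.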

Create the tokenizer from the string that GetHtml already returned instead of from the drained body:

htmlTokens := html.NewTokenizer(strings.NewReader(htmlSrc))

This also requires adding "strings" to the imports.

Also, close the response body in GetHtml to prevent leaking the connection.
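One way this could look is sketched below. fetchHTML is a hypothetical variant of GetHtml, not the asker's function: it returns as soon as http.Get fails, so resp is never used while nil, and it defers the close. The status check is an extra assumption about what counts as success, and newDemoServer is a made-up local test server that stands in for the real site so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchHTML (hypothetical) downloads url and returns the body as a string.
func fetchHTML(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err // resp is nil here, so it must not be touched
	}
	// Close the body on every return path below to avoid leaking the connection.
	defer resp.Body.Close()

	// Assumption: anything other than 200 OK is treated as a failure.
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// newDemoServer (hypothetical) starts a local server that serves a fixed page.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="/x">link</a>`)
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	src, err := fetchHTML(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(src)
}
```

Returning only the string and the error keeps the response from escaping the function, so the caller cannot accidentally read from a body that has already been consumed and closed.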

The code can be simplified to:

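package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    response, err := http.Get("https://www.coastal.edu/scs/employee")
    if err != nil {
        log.Fatal(err)
    }
    // Close the body when main returns so the connection is not leaked.
    defer response.Body.Close()

    // The body has not been read yet, so the tokenizer can consume it directly.
    htmlTokens := html.NewTokenizer(response.Body)
loop:
    for {
        tt := htmlTokens.Next()
        // Debug output: prints the type name html.TokenType with no newline.
        fmt.Printf("%T", tt)
        switch tt {
        case html.ErrorToken:
            // Returned at the end of the input or on a read error.
            fmt.Println("End")
            break loop
        case html.TextToken:
            fmt.Println(tt)
        case html.StartTagToken:
            t := htmlTokens.Token()
            isAnchor := t.Data == "a"
            if isAnchor {
                fmt.Println("We found an anchor!")
            }
        }
    }
}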