Ake*_*eem 2 html go html-parsing web-scraping
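I am trying to build a web scraper in Go. I am fairly new to the language, and I am not sure what I am doing wrong with the html parser. I am trying to parse the HTML to find anchor tags, but I keep getting html.TokenTypeEnd.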
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"io/ioutil"
	"net/http"
)

func GetHtml(url string) (text string, resp *http.Response, err error) {
	var bytes []byte
	if url == "https://www.coastal.edu/scs/employee" {
		resp, err = http.Get(url)
		if err != nil {
			fmt.Println("There seems to be an error with the Employee Console.")
		}
		bytes, err = ioutil.ReadAll(resp.Body)
		if err != nil {
			fmt.Println("Cannot read byte response from Employee Console.")
		}
		text = string(bytes)
	} else {
		fmt.Println("Issue with finding URL. Looking for: " + url)
	}
	return text, resp, err
}

func main() {
	htmlSrc, response, err := GetHtml("https://www.coastal.edu/scs/employee")
	if err != nil {
		fmt.Println("Cannot read HTML source code.")
	}
	_ = htmlSrc
	htmlTokens := html.NewTokenizer(response.Body)
	i := 0
	for i < 1 {
		tt := htmlTokens.Next()
		fmt.Printf("%T", tt)
		switch tt {
		case html.ErrorToken:
			fmt.Println("End")
			i++
		case html.TextToken:
			fmt.Println(tt)
		case html.StartTagToken:
			t := htmlTokens.Token()
			isAnchor := t.Data == "a"
			if isAnchor {
				fmt.Println("We found an anchor!")
			}
		}
	}
}
Whenever I print the token type with the following statement, I get html.TokenTypeEnd:
fmt.Printf("%T", tt)
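The application reads the response body to the end in GetHtml. When the tokenizer then reads from the same body, the read returns EOF, so the first call to Next returns html.ErrorToken. The printed html.TokenTypeEnd is the %T output html.TokenType followed directly by the End that the ErrorToken case prints.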
Create the tokenizer from the string that GetHtml already returned (this needs the strings package in the import list):
htmlTokens := html.NewTokenizer(strings.NewReader(htmlSrc))
Also, close the response body in GetHtml to prevent connection leaks.
The code can be simplified to:
response, err := http.Get("https://www.coastal.edu/scs/employee")
if err != nil {
	log.Fatal(err)
}
defer response.Body.Close()
htmlTokens := html.NewTokenizer(response.Body)
loop:
for {
	tt := htmlTokens.Next()
	fmt.Printf("%T", tt)
	switch tt {
	case html.ErrorToken:
		fmt.Println("End")
		break loop
	case html.TextToken:
		fmt.Println(tt)
	case html.StartTagToken:
		t := htmlTokens.Token()
		isAnchor := t.Data == "a"
		if isAnchor {
			fmt.Println("We found an anchor!")
		}
	}
}