Extracting text content from HTML in Golang

use*_*591 2 regex string byte substring go

What is the best way to extract inner substrings from a string in Golang?

Input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

Output:

"this is paragraph \n
 this is paragraph 2"

Is there a string package or library that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
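    // output: this is paragraph \n
    //         this is paragraph 2
}

// getInnerStrings should return every substring of str that sits
// between start and end, one per line.
func getInnerStrings(start, end, str string) string {
    // Brain freeze: regex? A loop over bytes?
    return "" // placeholder so the stub compiles
}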

Thanks

thw*_*hwd 5

Don't use regular expressions to try to interpret HTML. Use a full-featured HTML tokenizer and parser.
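
To illustrate the warning, here is a minimal sketch of how a naive pattern breaks on perfectly ordinary markup. The helper name naiveParagraphs and the sample inputs are made up for this illustration; only the standard regexp package is used.

```go
package main

import (
	"fmt"
	"regexp"
)

// paragraphRe matches a literal "<p>", then as little as possible, then "</p>".
var paragraphRe = regexp.MustCompile(`<p>(.*?)</p>`)

// naiveParagraphs (hypothetical helper) returns the captured text of every match.
func naiveParagraphs(s string) []string {
	var out []string
	for _, m := range paragraphRe.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Simple input in the style of the question: both paragraphs are found.
	fmt.Printf("%q\n", naiveParagraphs("junk <p> one </p> junk <p> two </p>"))

	// An attribute on the tag: the literal "<p>" no longer matches.
	fmt.Printf("%q\n", naiveParagraphs(`<p class="intro"> one </p>`))

	// A paragraph spanning two lines: "." does not match "\n" by default.
	fmt.Printf("%q\n", naiveParagraphs("<p> one\ntwo </p>"))
}
```

Each failure can be patched with a more elaborate pattern, but nesting, comments, and unclosed tags keep adding cases, which is the reason to hand the job to a tokenizer.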

I suggest you read the CodingHorror article on this topic.
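
A rough sketch of the tokenizer approach, under some assumptions. To stay inside the standard library it uses encoding/xml with the lenient settings that package documents for typical HTML; this is a stand-in, not a real HTML parser, and a dedicated package such as golang.org/x/net/html is the sturdier choice for real pages. The helper name paragraphText is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// paragraphText (hypothetical helper) walks the token stream and collects
// the text found inside <p> elements, one trimmed paragraph per line.
func paragraphText(src string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	// Lenient mode that encoding/xml documents for handling typical HTML.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var paragraphs []string
	var current strings.Builder
	depth := 0 // greater than zero while inside at least one <p>

	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "p") {
				depth++
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "p") && depth > 0 {
				depth--
				if depth == 0 {
					paragraphs = append(paragraphs, strings.TrimSpace(current.String()))
					current.Reset()
				}
			}
		case xml.CharData:
			if depth > 0 {
				current.Write(t) // copy now; the slice is reused by the decoder
			}
		}
	}
	return strings.Join(paragraphs, "\n"), nil
}

func main() {
	in := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	out, err := paragraphText(in)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(out)
}
```

Because the decision is made on tokens rather than on raw characters, a tag with attributes such as `<p class="intro">` is handled the same way as a bare `<p>`.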

  • OK, https://godoc.org/code.google.com/p/go.net/html#example-Parse solved my problem. Thanks! (2 upvotes)