Extracting text content from HTML in Golang

Question

use*_*591 2 regex string byte substring go

What is the best way to extract inner substrings from a string in Golang?

Input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

Output:

"this is paragraph \n
 this is paragraph 2"

Is there a string package or library that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // output: this is paragraph \n
    //         this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
    // Brain Freeze
    // Regex?
    // Bytes Loop?
}

Thanks

Answer 1

thw*_*hwd 5

Don't use regular expressions to try to interpret HTML. Use a full-featured HTML tokenizer and parser.

I suggest you read this article on CodingHorror.

OK, https://godoc.org/code.google.com/p/go.net/html#example-Parse solved my problem. Thanks! (2 upvotes)
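
For the narrow case in the question, where the delimiters are fixed literal strings such as "<p>" and "</p>", a rough sketch using only the standard strings package might look like the following. This is not an HTML parser, and it is not the approach recommended in the answer: it assumes no nesting, no attributes on the tag, and exact-case matches, so for real HTML a tokenizer/parser is still the safer choice. The function name and signature are taken from the question; trimming whitespace around each match is my own assumption, so the output differs slightly from the spacing shown in the question.

```go
package main

import (
	"fmt"
	"strings"
)

// getInnerStrings returns every substring of str that sits between a literal
// start delimiter and the next literal end delimiter. Each match is trimmed
// of surrounding whitespace (an assumption, not stated in the question) and
// the matches are joined with newlines.
// It does plain substring matching only: no nesting, no tag attributes,
// no case-insensitive matching.
func getInnerStrings(start, end, str string) string {
	if start == "" || end == "" {
		return "" // empty delimiters would never advance through the input
	}
	var parts []string
	for {
		i := strings.Index(str, start)
		if i < 0 {
			break // no further start delimiter
		}
		str = str[i+len(start):]
		j := strings.Index(str, end)
		if j < 0 {
			break // unterminated start delimiter: ignore the remainder
		}
		parts = append(parts, strings.TrimSpace(str[:j]))
		str = str[j+len(end):]
	}
	return strings.Join(parts, "\n")
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(getInnerStrings("<p>", "</p>", longString))
}
```

With the input from the question this prints the two paragraph texts on separate lines. Anything like `<p class="x">` or `<P>` is silently missed, which is exactly the kind of case a proper tokenizer handles.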
