How to replace all HTML tags with an empty string in Go

Loi*_*int 5 go

I'm trying to replace all HTML tags, such as <div> and </div>, with an empty string ("") in Go. I'm using the regex pattern ^[^.\/]*$/g, which is meant to match all closing tags, e.g. </div>.

My solution:

package main

import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}

But the output is identical to the source string. What's wrong? Please help. Thanks.

The expected result is: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"
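
A note on why the pattern above leaves the input unchanged, as a minimal sketch. Go's regexp package uses RE2 syntax, which has no /.../g delimiters or flags, so the trailing /g is read as two literal characters that would have to appear after the end of the text ($). That cannot happen, so nothing ever matches. The names questionPattern and matchesQuestionPattern are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// questionPattern is the pattern from the question, unchanged.
var questionPattern = regexp.MustCompile(`^[^.\/]*$/g`)

// matchesQuestionPattern reports whether the question's pattern matches s.
func matchesQuestionPattern(s string) bool {
	return questionPattern.MatchString(s)
}

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	// `$` asserts end of text, then the literal "/g" must follow: impossible.
	fmt.Println(matchesQuestionPattern(s)) // false

	// With no match, ReplaceAllString returns the input as-is.
	fmt.Println(questionPattern.ReplaceAllString(s, "") == s) // true
}
```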

Bil*_*nko 8

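For those who came here looking for a quick solution, there is a library that does this: bluemonday

bluemonday provides a way to describe a whitelist of HTML elements and attributes as a policy, and to apply that policy to untrusted user strings that may contain markup. All elements and attributes not on the whitelist are stripped.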

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program
    // Policy creation/editing is not safe to use in multiple goroutines
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}

https://play.golang.org/p/jYARzNwPToZ


D. *_*ell 8

The problem with RegEx

Here is a very simple RegEx replace method that removes HTML tags from well-formed HTML in a string.

strip_html_regex.go

package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}

Note: this does not work with malformed HTML. Do not use this.
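
The same limitation can also show up with valid markup. A small sketch, assuming the same lazy pattern: a `>` inside a quoted attribute value ends the match early, so part of the attribute leaks into the result. The names tagPattern and stripWithRegex are hypothetical, used only for this demonstration.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern is the same lazy pattern as in strip_html_regex.go,
// compiled once here for the demonstration.
var tagPattern = regexp.MustCompile(`<.*?>`)

// stripWithRegex removes everything tagPattern matches.
func stripWithRegex(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	// The `>` sits inside a quoted attribute value.
	in := `<a title="a > b">link</a>`

	// The match for the opening tag stops at the first `>`,
	// so the rest of the attribute is treated as text.
	fmt.Printf("%q\n", stripWithRegex(in)) // " b\">link"
}
```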

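A better way

Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the parts that are not inside an HTML tag. Whenever we identify a valid part of the string, we can simply take a slice of that part and append it using a strings.Builder.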
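
To illustrate the slicing idea: ranging over a string yields the byte offset of each rune, so those offsets can be used directly in a slice expression even when the text contains multi-byte characters. A small sketch (firstTagStart is a hypothetical helper, not part of the code in this answer):

```go
package main

import "fmt"

// firstTagStart returns the byte offset of the first `<` in s, or -1.
func firstTagStart(s string) int {
	for i, c := range s {
		if c == '<' {
			return i
		}
	}
	return -1
}

func main() {
	s := "h\u00e9llo<b>" // "héllo<b>"; 'é' occupies two bytes in UTF-8

	i := firstTagStart(s)
	fmt.Println(i)     // 6
	fmt.Println(s[:i]) // héllo
}
```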

strip_html.go

package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The byte index of the previous start tag character `<`
    end := 0    // The byte index just past the previous end tag character `>`

    for i, c := range s {
        // Keep going if the character is not `<` or `>`
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only act on the first `<` of a tag.
            // This makes sure we strip out `<<br>` not just `<br>`,
            // and that the text before the tag is written only once.
            if !in {
                start = i
                // Write the valid string between the close and start of the two tags.
                builder.WriteString(s[end:start])
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }

    // Save any trailing text, unless the string ends inside an unclosed tag.
    // Doing this after the loop also handles a multi-byte final character.
    if !in {
        builder.WriteString(s[end:])
    }
    return builder.String()
}

If we run both functions on the OP's text and on some malformed HTML, you can see that the results are inconsistent.

main.go

package main

import "fmt"

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Print("\n:: stripHTMLTags ::\n\n")

    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

    // Regex Malformed HTML examples
    fmt.Print("\n:: stripHtmlRegex ::\n\n")

    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}

Output:

afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag

:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.

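Note: the RegEx method does not strip all HTML tags consistently. To be honest, I'm not good enough at RegEx to write a match string that handles stripping HTML properly.

Benchmarks

Besides being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is also roughly 4 times faster than stripHtmlRegex in the benchmark below.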

> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8          51516             22726 ns/op
BenchmarkStripHtmlTags-8          230678              5135 ns/op
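
The benchmark source is not shown above. One factor that may contribute to the gap is that stripHtmlRegex compiles its pattern on every call. The sketch below is an assumption about how one could check that, not the benchmark used for the numbers above: it uses testing.Benchmark from a plain program to compare compiling per call with a pattern compiled once. The function names are hypothetical, and no particular result is implied.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

const pattern = `<.*?>`

// stripCompileEachCall mirrors stripHtmlRegex: it compiles on every call.
func stripCompileEachCall(s string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(s, "")
}

// compiled is built once, at package initialization.
var compiled = regexp.MustCompile(pattern)

// stripPrecompiled reuses the already compiled pattern.
func stripPrecompiled(s string) string {
	return compiled.ReplaceAllString(s, "")
}

func main() {
	input := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	each := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripCompileEachCall(input)
		}
	})
	once := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			stripPrecompiled(input)
		}
	})

	// BenchmarkResult prints iterations and ns/op.
	fmt.Println("compile each call:", each)
	fmt.Println("precompiled:      ", once)
}
```

Both variants return the same output for the same input, so only the cost of compilation differs between them.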


sh.*_*seo 7

If you want to strip all HTML tags, use a dedicated HTML tag-stripping package.

Matching HTML tags with a regular expression is not a good idea.

package main

import (
    "fmt"
    "github.com/grokify/html-strip-tags-go"
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    stripped := strip.StripTags(text)

    fmt.Println(text)
    fmt.Println(stripped)
}