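I am trying to replace all HTML tags, such as <div> and </div>, with an empty string ("") in Go using a regular expression. I am using the pattern ^[^.\/]*$/g to match all closing tags, for example </div>.
My solution: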
package main

import (
	"fmt"
	"regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
	r := regexp.MustCompile(Template)
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	res := r.ReplaceAllString(s, "")
	fmt.Println(res)
}
But the output is identical to the source string. What is wrong? Please help. Thanks.
The expected result is: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"
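One likely reason the output is unchanged: Go's regexp package has no /pattern/g notation, so the trailing /g is read as two literal characters that would have to follow the end of the input, and the anchored pattern can never match. The sketch below checks this and, as an assumption that is not part of the question, tries a simple unanchored tag pattern instead. The names questionPattern, naiveTagPattern, questionPatternMatches and stripWithNaivePattern are hypothetical.

```go
package main

import "regexp"

// questionPattern is the pattern from the question, compiled unchanged.
// The "/g" suffix is not a flag in Go; it is matched literally after "$".
var questionPattern = regexp.MustCompile(`^[^.\/]*$/g`)

// naiveTagPattern is an assumed replacement (not from the original post):
// "<", then any run of characters other than ">", then ">".
var naiveTagPattern = regexp.MustCompile(`<[^>]*>`)

// questionPatternMatches reports whether the question's pattern matches s.
func questionPatternMatches(s string) bool {
	return questionPattern.MatchString(s)
}

// stripWithNaivePattern removes every match of naiveTagPattern from s.
func stripWithNaivePattern(s string) string {
	return naiveTagPattern.ReplaceAllString(s, "")
}
```

ReplaceAllString already replaces every match, so no global flag is needed. A pattern like this is only reasonable for well-formed input.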
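For those who came here looking for a quick solution, there is a library that does this: bluemonday.
The bluemonday package provides a way to describe a whitelist of HTML elements and attributes as a policy, and to apply that policy to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist are stripped.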
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	// Do this once for each unique policy, and use the policy for the life of the program
	// Policy creation/editing is not safe to use in multiple goroutines
	p := bluemonday.StripTagsPolicy()

	// The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
	html := p.Sanitize(
		`<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
	)

	// Output:
	// Google
	fmt.Println(html)
}
https://play.golang.org/p/jYARzNwPToZ
Here is a very simple RegEx replace method that removes HTML tags from well-formed HTML in a string.
strip_html_regex.go
package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
	r := regexp.MustCompile(regex)
	return r.ReplaceAllString(s, "")
}
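Note: this does not work with malformed HTML. Do not use it.
Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the parts that are not inside an HTML tag. Whenever we identify a valid part of the string, we can simply take a slice of that part and append it using a strings.Builder.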
strip_html.go
package main

import (
	"strings"
	"unicode/utf8"
)

const (
	htmlTagStart = '<'
	htmlTagEnd   = '>'
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
	// Setup a string builder and allocate enough memory for the new string.
	var builder strings.Builder
	builder.Grow(len(s) + utf8.UTFMax)

	in := false // True if we are inside an HTML tag.
	start := 0  // The index of the previous start tag character `<`
	end := 0    // The index just past the previous end tag character `>`

	for i, c := range s {
		// Keep going if the character is not `<` or `>`
		if c != htmlTagStart && c != htmlTagEnd {
			continue
		}

		if c == htmlTagStart {
			// Only act on the first `<` of a tag.
			// This makes sure we strip out `<<br>` not just `<br>`,
			// and that the text before the tag is written only once.
			if !in {
				start = i
				in = true
				// Write the valid string between the close and start of the two tags.
				builder.WriteString(s[end:start])
			}
			continue
		}

		// else c == htmlTagEnd
		in = false
		end = i + 1
	}

	// Save the trailing text if the string does not end inside a tag.
	// Doing this after the loop also handles a multi-byte final character.
	if !in {
		builder.WriteString(s[end:])
	}
	return builder.String()
}
If we run both functions with the OP's text and some malformed HTML, you will see that the results are not consistent.
main.go
package main

import "fmt"

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	res := stripHtmlTags(s)
	fmt.Println(res)

	// Malformed HTML examples
	fmt.Println("\n:: stripHTMLTags ::\n")
	fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
	fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

	// Regex Malformed HTML examples
	fmt.Println(":: stripHtmlRegex ::\n")
	fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
	fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}
Output:
afsdf4534534!@@!!#345345afsdf4534534!@@!!#
:: stripHTMLTags ::
Do something bold.
I broke this
This is broken link.
start this tag
:: stripHtmlRegex ::
Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.
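Note: the RegEx method does not strip all HTML tags consistently. To be honest, I am not good enough at RegEx to write a pattern that handles stripping HTML properly.
Besides being safer and more aggressive at stripping malformed HTML tags, stripHtmlTags is also about four times faster than stripHtmlRegex.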
> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8 51516 22726 ns/op
BenchmarkStripHtmlTags-8 230678 5135 ns/op
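The benchmark functions themselves are not shown above. A minimal sketch of what they might look like follows, using the benchmark names from the output. The input string benchInput and the helper nsPerOp are assumptions for illustration, and stripHtmlTags is a condensed version of the scanner from strip_html.go, so absolute numbers will differ from the ones reported.

```go
package main

import (
	"regexp"
	"strings"
	"testing"
)

// benchInput is a made-up input; the original benchmark input is not shown.
var benchInput = strings.Repeat(
	"Do something <strong>bold</strong> with <a href='#'>this link</a>. ", 20)

// stripHtmlRegex compiles the pattern on every call, as in strip_html_regex.go.
func stripHtmlRegex(s string) string {
	r := regexp.MustCompile(`<.*?>`)
	return r.ReplaceAllString(s, "")
}

// stripHtmlTags is a condensed version of the scanner from strip_html.go:
// it keeps only the text found outside `<` ... `>` pairs.
func stripHtmlTags(s string) string {
	var builder strings.Builder
	builder.Grow(len(s))
	in := false // True while inside a tag.
	end := 0    // Index just past the previous `>`.
	for i, c := range s {
		switch c {
		case '<':
			if !in {
				builder.WriteString(s[end:i])
				in = true
			}
		case '>':
			in = false
			end = i + 1
		}
	}
	if !in {
		builder.WriteString(s[end:])
	}
	return builder.String()
}

func BenchmarkStripHtmlRegex(b *testing.B) {
	for i := 0; i < b.N; i++ {
		stripHtmlRegex(benchInput)
	}
}

func BenchmarkStripHtmlTags(b *testing.B) {
	for i := 0; i < b.N; i++ {
		stripHtmlTags(benchInput)
	}
}

// nsPerOp (hypothetical helper) runs both benchmarks without "go test"
// by calling testing.Benchmark and returns the measured ns/op values.
func nsPerOp() (regexNs, tagsNs int64) {
	return testing.Benchmark(BenchmarkStripHtmlRegex).NsPerOp(),
		testing.Benchmark(BenchmarkStripHtmlTags).NsPerOp()
}
```

If the two Benchmark functions are saved in a file whose name ends in _test.go, go test -bench=. picks them up directly and nsPerOp is not needed. Part of the regex cost here comes from compiling the pattern on every call, so hoisting the compiled pattern into a package-level variable would likely narrow the gap.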
If you want to strip all HTML tags, use a dedicated HTML tag-stripping package.
Matching HTML tags with a regular expression is not a good idea.
package main

import (
	"fmt"

	// The package name is strip, so the import is aliased explicitly.
	strip "github.com/grokify/html-strip-tags-go"
)

func main() {
	text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
	stripped := strip.StripTags(text)
	fmt.Println(text)
	fmt.Println(stripped)
}