Golang regular expressions and non-Latin characters

Question

I need some advice from experienced gophers.

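I am parsing words out of sentences, and my \w+ regular expression works fine with Latin characters. However, it fails completely on Cyrillic characters.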

Here is a sample application:

package main

import (
    "fmt"
    "regexp"
)

func get_words_from(text string) []string {
    words := regexp.MustCompile("\\w+")
    return words.FindAllString(text, -1)
}

func main() {
    text := "One, two three!"
    text2 := "Раз, два три!"
    text3 := "Jedna, dva tři čtyři pět!"
    fmt.Println(get_words_from(text))
    fmt.Println(get_words_from(text2))
    fmt.Println(get_words_from(text3))
}

It produces the following output:

 [One two three]
 []
 [Jedna dva t i ty i p t]

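It returns an empty result for the Russian sentence and splits the Czech words into fragments. I don't know how to fix this. Could someone give me some advice?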

Or maybe there is a better way to split a sentence into words without punctuation?

Answer 1

Wik*_*żew 16

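In Go regular expressions, the \w shorthand class matches only ASCII word characters (letters, digits, and the underscore), so you need the Unicode letter class \p{L} instead. The Go regexp syntax documentation defines it as: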

    \w    word characters (== [0-9A-Za-z_])
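To make the difference concrete, here is a minimal sketch that tests a few single-character strings against both classes. The variable names `asciiWord` and `unicodeLetter` are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical names for this sketch. Each pattern is anchored so that it
// must match the whole (single-character) input.
var (
	asciiWord     = regexp.MustCompile(`^\w$`)
	unicodeLetter = regexp.MustCompile(`^\p{L}$`)
)

func main() {
	// A Latin letter, a Czech letter, a Cyrillic letter, an underscore, a digit.
	for _, s := range []string{"a", "ř", "я", "_", "7"} {
		fmt.Printf("%q  \\w=%t  \\p{L}=%t\n",
			s, asciiWord.MatchString(s), unicodeLetter.MatchString(s))
	}
}
```

The non-ASCII letters ř and я fail \w but match \p{L}, while the underscore and the digit match \w but not \p{L}.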

Use a character class that also includes digits and the underscore:

    regexp.MustCompile("[\\p{L}\\d_]+")

Demo output:

[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]

Bonus: if you use backticks, you don't have to double-escape: ``regexp.MustCompile(`[\p{L}\d_]+`)`` (7 upvotes)
Yes, I posted this before I knew about raw string literals in Go (https://golang.org/ref/spec#String_literals). (2 upvotes)
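As a quick check of the point made in these comments, the sketch below compares the interpreted and raw spellings of the pattern. The constant names are made up for this illustration.

```go
package main

import "fmt"

// Hypothetical constant names for this sketch.
const (
	// Interpreted string literal: each backslash must itself be escaped.
	interpretedPattern = "[\\p{L}\\d_]+"
	// Raw string literal: backslashes are taken literally.
	rawPattern = `[\p{L}\d_]+`
)

func main() {
	// Both literals denote the same string, so either can be passed
	// to regexp.MustCompile.
	fmt.Println(interpretedPattern == rawPattern)
}
```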
