Regex for word boundaries, but including emoji

Question

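I have a text corpus that I'm parsing with a regular expression to find the most common words. Currently I'm using .match(/(?!'.*')\b[\w']+\b/g). My problem is that \w does not match non-alphanumeric characters, so my emoji are never picked up. Specifically, I'm trying to write a regex that recognizes words (including contractions) and emoji, split on word boundaries.

As an example, I'd like to be able to take "Hey there! , let's go to the moon " and get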

Array( "Hey", "there", "", "let's", "go", "to", "the", "moon", "", "")

Answer 1

rev*_*evo 1

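To solve this, you may find the spread operator helpful. Spreading a string iterates over it by code point rather than by UTF-16 code unit, so an emoji outside the Basic Multilingual Plane arrives as a single character: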

var str = "Hey there! , let's go to the moon ";
var words = [], word = '';
// Spreading the string iterates by code point, so astral characters stay whole
[...str].forEach(function(char) {
    // Test if current char is an English letter, a digit or '
    if (/[a-z0-9']/i.test(char)) {
        word += char;
        return;
    }
    // Any other char ends the current word, so flush it first
    if (word !== '') {
        words.push(word);
        word = '';
    }
    // Keep a non-whitespace char outside the ASCII range as its own token
    if (/(?=\S)[^\u0000-\u007f]/.test(char)) {
        words.push(char);
    }
});
// Flush a word that runs to the end of the string
if (word !== '') {
    words.push(word);
}
console.log(words);
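
Since the question asked for a regex, here is a minimal sketch of a single-pattern alternative, assuming an engine that supports Unicode property escapes with the `u` flag (recent browsers and Node.js do). The function name `tokenize` and the sample emoji are illustrative and do not come from the original post. Unlike the question's pattern, it makes no attempt to skip quoted text, and it treats each pictographic code point as its own token, so multi-code-point emoji sequences (for example those joined by U+200D) are not kept together.

```javascript
// Hypothetical helper: split text into words (with apostrophes) and emoji.
function tokenize(text) {
  // [\p{L}\p{N}']+            : a run of letters, digits, or apostrophes
  // \p{Extended_Pictographic} : a single pictographic (emoji-like) code point
  const pattern = /[\p{L}\p{N}']+|\p{Extended_Pictographic}/gu;
  // match() returns null when nothing matches, so fall back to an empty array
  return text.match(pattern) || [];
}

// Sample input with assumed emoji (grinning face, crescent moon, rocket)
const sample = "Hey there! \u{1F600}, let's go to the moon \u{1F319}\u{1F680}";
console.log(tokenize(sample));
```

Because `\p{L}` covers letters in any script, this also counts non-English words, which the character-by-character loop above (limited to `[a-z0-9']`) would push one character at a time.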
