How to check whether text selected on a web page contains only whole words in JavaScript?

Bac*_*b32 4 html javascript selection

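In vanilla JavaScript, I am trying to determine whether the text a user has selected on a web page consists only of whole words (not counting symbols).

For example,

suppose we have text like the following somewhere on the page.

Hello, a text for the example! (when all of it is selected)

should result in ['Hello', 'a', 'text', 'for', 'the', 'example']

However,

Hello, a text for the example! (when the first three letters, "Hel", are left out of the selection)

should result in ['a', 'text', 'for', 'the', 'example'] because Hello was not fully selected as a word.

So far, I have a getSelectionText function that returns all of the selected text.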

function getSelectionText() {
    var text = "";
    if (window.getSelection) {
        text = window.getSelection().toString();
    } else if (document.selection && document.selection.type !== "Control") {
        text = document.selection.createRange().text;
    }
    return text;
}

// Just adding the function as listeners.
document.onmouseup = document.onkeyup = function() {
    console.log(getSelectionText());
};

Is there a good way to adapt my function so that it works as described?

nem*_*035 5

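The main obstacle to achieving what you want is telling your program what a "word" actually is.

One way is to have a complete dictionary of all English words.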

const setOfAllEnglishWords = new Set([
  "Hello",
  "a",
  "text",
  "for",
  "the",
  "example"
  // ... many many more
]);

const selection = "lo, a text for the example!";
const result = selection
  .replace(/[^A-Za-z0-9\s]/g, "") // remove punctuation by replacing anything that is not a letter or a digit with the empty string
  .split(/\s+/)                   // split text into words by using 1 or more whitespace as the break point
  .filter(word => setOfAllEnglishWords.has(word));

console.log(result);

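This could take a lot of memory. Based on a quick Google search, the Oxford English Dictionary has about 218,632 words. With an average word length of 4.5 letters and JS storing 2 bytes per character, that gives us 218632 * (4.5 * 2) = 1967688 B ≈ 1.97 MB, which could take around a minute to download on a slow 3G connection.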
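
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: `estimateDictionaryBytes` is a hypothetical helper name, and the figures are the ones quoted above, not measured values.

```javascript
// Rough size estimate for a word list held as JS strings.
// Assumes 2 bytes per character (UTF-16 code units) and ignores
// any per-string or per-Set overhead the engine may add.
function estimateDictionaryBytes(wordCount, averageWordLength, bytesPerChar = 2) {
  return wordCount * averageWordLength * bytesPerChar;
}

const bytes = estimateDictionaryBytes(218632, 4.5);
console.log(bytes);                            // 1967688
console.log((bytes / 1e6).toFixed(2) + " MB"); // 1.97 MB
```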

A better approach might be to build the dictionary yourself on every page load by collecting all the unique words on the page.

function getSetOfWordsOnPage() {
  const walk = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT
  );

  const dict = new Set();
  let n;
  while ((n = walk.nextNode())) {
    for (const word of n.textContent
      .replace(/[^A-Za-z0-9\s]/g, "")
      .split(/\s+/)
      .map(word => word.trim())
      .filter(word => !!word)) {
      dict.add(word);
    }
  }
  return dict;
}

const setOfWordsOnThePage = getSetOfWordsOnPage();

function getSelectionText() {
  if (window.getSelection) {
    return window.getSelection().toString();
  } else if (document.selection && document.selection.type !== "Control") {
    return document.selection.createRange().text;
  }
  return "";
}

// Just adding the function as listeners.
document.querySelector("#button").addEventListener("click", () => {
  const result = getSelectionText()
    .replace(/[^A-Za-z0-9\s]/g, "") // remove punctuation
    .split(/\s+/) // split text into words
    .filter(word => setOfWordsOnThePage.has(word));
  console.log(result);
});
HTML for the snippet above:
<button id="button">Show result</button>
<p>this is some text</p>
<p>again this is a text!!!!!</p>
<p>another,example,of,a,sentence</p>

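Maybe we can go a step further. Do we even need to remember the words at all? It seems that the definition "a word is text surrounded by whitespace" is enough.

Additionally, as OP mentioned in the comments below, the solutions above also have a bug: a partially selected word is wrongly matched if the selected part happens to be a valid word itself.

To avoid the unnecessary overhead of remembering the words on the page, and to fix the partially-selected-valid-word bug, we can inspect the contents of the leftmost (anchor) and rightmost (focus) nodes of the selection (for a selection made left to right, the anchor is where it starts and the focus is where it ends). If those nodes contain other text that was not selected, we ignore the corresponding edge word.

The assumption we make here is that, for any arbitrary selection of text, there can be at most 2 partially selected words, one at each end of the selection.

Note: the approach below also handles capitalization by assuming that THIS, tHiS and this are all the same word.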

function removePunctuation(string) {
  return string.replace(/[^A-Za-z0-9\s]/g, " ");
}

function splitIntoWords(string) {
  return removePunctuation(string)
    .split(/\s+/)
    .map(word => word.toLowerCase().trim())
    .filter(word => !!word);
}

function getSelectedWords() {
  const selection = window.getSelection();
  const words = splitIntoWords(selection.toString());

  if (selection.anchorNode) {
    const startingsWords = splitIntoWords(selection.anchorNode.textContent);
    if (words[0] !== startingsWords[0]) {
      words.shift(); // remove the start since it's not a whole word
    }
  }

  if (selection.focusNode) {
    const endingWords = splitIntoWords(selection.focusNode.textContent);
    if (words[words.length - 1] !== endingWords[endingWords.length - 1]) {
      words.pop(); // remove the end since it's not a whole word
    }
  }

  return words;
}

// Just adding the function as listeners.
document.querySelector("#button").addEventListener("click", () => {
  console.log(getSelectedWords());
});
HTML for the snippet above:
<button id="button">Show result</button>
<p><div>this is</div> <div>some text</div></p>
<p><span>again</span><span> </span><span>this</span><span> </span><span>is</span><span> </span><span>a</span> <span>text</span><span>!!!!!</span></p>
<p>another,example,of,a,sentence</p>
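
The word comparison above only recognises an edge word as whole when it is also the first or last word of its node. A possible alternative, sketched here as a pure function so it can be tried without a DOM, is to look at the characters immediately outside the selection. `wholeWordsInRange` and `isWordChar` are hypothetical names. In a browser the offsets could come from `Range.startOffset` and `Range.endOffset` when the selection lies inside a single text node; a selection spanning several nodes would need each end checked against its own node.

```javascript
// A "word character" here matches the regex used in the answer: ASCII letters and digits.
function isWordChar(ch) {
  return ch !== undefined && /[A-Za-z0-9]/.test(ch);
}

// Returns the lowercased words that lie entirely inside text.slice(start, end).
function wholeWordsInRange(text, start, end) {
  // If the character just before the selection is part of a word,
  // the first selected word is cut off: skip past its remainder.
  if (start > 0 && isWordChar(text[start - 1])) {
    while (start < end && isWordChar(text[start])) start++;
  }
  // Same check at the other end: if a word continues past the
  // selection, back up to before the cut-off word.
  if (end < text.length && isWordChar(text[end])) {
    while (end > start && isWordChar(text[end - 1])) end--;
  }
  return text
    .slice(start, end)
    .replace(/[^A-Za-z0-9\s]/g, " ") // punctuation becomes a separator
    .split(/\s+/)
    .filter(Boolean)                 // drop empty strings
    .map(word => word.toLowerCase());
}

const text = "Hello, a text for the example!";
console.log(wholeWordsInRange(text, 0, text.length)); // [ 'hello', 'a', 'text', 'for', 'the', 'example' ]
console.log(wholeWordsInRange(text, 3, text.length)); // [ 'a', 'text', 'for', 'the', 'example' ]
```

Because the check is based on neighbouring characters, selecting "the" inside "there" yields no word at all, which addresses the partially-selected-valid-word case without a dictionary.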

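Note: this code still breaks if you split a word across multiple HTML elements, like <span>w</span><span>o</span><span>r</span><span>d</span>. That scenario breaks our definition of a word, and to handle it you would need to include some kind of dictionary to test word validity, essentially combining the last two solutions above.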
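
One way such a combination might look is sketched below. It keeps an edge word either when it matches the edge word of its node (the node-based check) or when a dictionary recognises it (the fallback for words split across elements). `filterEdgeWords` and `keepEdgeWord` are hypothetical names, and the dictionary is assumed to be a `Set` of lowercased words such as the one built from the page earlier.

```javascript
// An edge word is kept if it matches the node's own edge word,
// or, failing that, if the dictionary recognises it.
function keepEdgeWord(word, nodeEdgeWord, dictionary) {
  return word === nodeEdgeWord || dictionary.has(word);
}

// words: lowercased words from the selection
// firstWordOfAnchorNode / lastWordOfFocusNode: edge words of the edge nodes
// dictionary: Set of known lowercased words (an assumption of this sketch)
function filterEdgeWords(words, firstWordOfAnchorNode, lastWordOfFocusNode, dictionary) {
  const result = words.slice(); // do not mutate the caller's array
  if (result.length > 0 && !keepEdgeWord(result[0], firstWordOfAnchorNode, dictionary)) {
    result.shift(); // first word is neither whole nor known
  }
  if (result.length > 0 &&
      !keepEdgeWord(result[result.length - 1], lastWordOfFocusNode, dictionary)) {
    result.pop(); // last word is neither whole nor known
  }
  return result;
}

const dictionary = new Set(["hello", "a", "text", "word"]);
// "word" split as <span>w</span>...<span>d</span>: the edge nodes contain "w" and "d"
console.log(filterEdgeWords(["word"], "w", "d", dictionary)); // [ 'word' ]
// "lo" is a fragment of "hello" and is not in the dictionary
console.log(filterEdgeWords(["lo", "a", "text"], "hello", "text", dictionary)); // [ 'a', 'text' ]
```

The trade-off is that the dictionary fallback can again accept a fragment that happens to be a valid word on its own, so this only helps when split-across-elements words matter more than that edge case.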