Detecting stop words after lemmatization in spaCy

Daw*_*zuk 2 python nlp stop-words lemmatization spacy

How can I detect whether a word is a stop word after stemming and lemmatization in spaCy?

Assume this sentence:

s = "something good\nsomethings 2 bad"

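In this case "something" is a stop word. Obviously (to me?) "Something" and "somethings" are stop words too, but they need to be lowercased or stemmed first. The script below reports True for the first one, but not for the others.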

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en')
tokenizer = Tokenizer(nlp.vocab)

s = "something good\nSomething 2 somethings"
tokens = tokenizer(s)

for token in tokens:
  print(token.lemma_, token.is_stop)

Returns:

something True
good False
"\n" False
Something False
2 False
somethings False

Is there a way to detect this through the spaCy API?

Ine*_*ani 5

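Stop words in spaCy are just a set of strings that set a flag on the lexeme, which is the context-independent entry in the vocabulary (see the English stop list). The flag simply checks whether text in STOP_WORDS, which is why "something" returns True for is_stop and "somethings" doesn't.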
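To make the exact-match behaviour concrete, here is a minimal sketch that runs without spaCy. TOY_STOP_WORDS is a made-up set standing in for the real STOP_WORDS, and naive_is_stop is a hypothetical helper that mimics the membership check described above; it is not spaCy's actual implementation.

```python
# Hypothetical stand-in for spaCy's STOP_WORDS (the real list is much longer).
TOY_STOP_WORDS = {"something", "the", "a"}

def naive_is_stop(text):
    """Exact, case-sensitive membership test on the raw token text."""
    return text in TOY_STOP_WORDS

# Only the exact string "something" matches; the capitalised and
# plural forms are different strings, so the lookup misses them.
for word in ["something", "Something", "somethings"]:
    print(word, naive_is_stop(word))
```

Because the lookup compares raw strings, any normalisation (lowercasing, lemmatizing) has to happen before the membership test.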

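What you can do, however, is check whether the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):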

def extended_is_stop(token):
    stop_words = nlp.Defaults.stop_words
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words
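The same three-way check can be tried without loading a model. In this sketch FakeToken is a hypothetical stand-in for a spaCy token that carries only the attributes the function reads, the lemmas are filled in by hand rather than produced by a lemmatizer, and the stop list is a toy set.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the attributes used below.
FakeToken = namedtuple("FakeToken", ["text", "lower_", "lemma_", "is_stop"])

TOY_STOP_WORDS = {"something"}  # toy list, not spaCy's

def make_token(text, lemma):
    # is_stop mimics the exact-match flag; the lemma is supplied by hand
    return FakeToken(text, text.lower(), lemma, text in TOY_STOP_WORDS)

def extended_is_stop(token, stop_words=TOY_STOP_WORDS):
    # Same logic as above: original flag, then lowercase form, then lemma
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words

tokens = [
    make_token("something", "something"),
    make_token("Something", "something"),
    make_token("somethings", "something"),
    make_token("good", "good"),
]
results = [(t.text, extended_is_stop(t)) for t in tokens]
print(results)
```

With real spaCy tokens, the lemma check only helps if the pipeline actually assigns lemmas, so the text would need to go through nlp(...) rather than a bare Tokenizer.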

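If you're using spaCy v2.0 and want to solve this more elegantly, you could also implement your own is_stop function via a custom Token attribute extension. You can choose any name for your attribute, and it will be available via token._., for example token._.is_stop: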

import spacy
from spacy.tokens import Token
from spacy.lang.en.stop_words import STOP_WORDS  # import stop words from language data

stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS
Token.set_extension('is_stop', getter=stop_words_getter)  # set attribute with getter

nlp = spacy.load('en')
doc = nlp("something Something somethings")
assert doc[0]._.is_stop  # this was a stop word before, and still is
assert doc[1]._.is_stop  # this is now also a stop word, because its lowercase form is
assert doc[2]._.is_stop  # this is now also a stop word, because its lemma is