How to search for (separable) phrases in a text

Question

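I am looking for a way to search a text for phrasal verbs or idioms, regardless of tense or of any prepositions/adverbs that may come in between. For example, if I am looking for

call off

I would also like to find usages such as

My boss called the meeting off.

Is this possible (using spaCy)? If so, which NLP feature or capability am I looking for?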

Answer 1

Dav*_*ale 6

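Yes, you can do this with spaCy: you need a dependency parser to detect the relations between words, and a lemmatizer to find the normal forms of those words. spaCy has both.

The dependency parser shows the syntactic relations between pairs of words, like this: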

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('My boss called the meeting off.')
displacy.render(doc, style="dep", jupyter=True)


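Idiomatic expressions tend to be represented by compact subtrees of such a syntax tree, characterized by specific relations between their words. The exact form and position of the words that make up an idiom may vary from sentence to sentence, but the relations between them stay the same.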
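To make the invariance concrete, here is a minimal sketch that does not use spaCy at all. The two parses are hand-written (head lemma, dependency label, child lemma) triples of the kind a dependency parser might produce for "My boss called the meeting off." and "They will call off the strike."; they are assumptions for illustration, not actual parser output.

```python
# Hand-written (head lemma, dependency, child lemma) triples.
# These are illustrative assumptions, not real parser output.
parse_a = {  # "My boss called the meeting off."
    ("boss", "poss", "my"),
    ("call", "nsubj", "boss"),
    ("call", "dobj", "meeting"),
    ("meeting", "det", "the"),
    ("call", "prt", "off"),
}
parse_b = {  # "They will call off the strike."
    ("call", "nsubj", "they"),
    ("call", "aux", "will"),
    ("call", "prt", "off"),
    ("call", "dobj", "strike"),
    ("strike", "det", "the"),
}

# The idiom is one lemma-level relation, independent of tense and word order.
target = ("call", "prt", "off")
found = [target in parse for parse in (parse_a, parse_b)]
print(found)
# [True, True]
```

The surface forms differ ("called ... off" versus "call off"), yet both parses contain the same triple, which is why matching on lemmas and relations is more robust than matching on strings.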

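So to search for an expression, we can iterate over all words in the document, looking for a word with the lemma "call" that has an attached ("child") word with the dependency relation "prt" and the lemma "off":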

def detect_collocations(doc, parent_lemma, dep, child_lemma):
    """ Create a generator of all occurrences of a collocation in a document.
    The elements of generator are all pairs of tokens with lemmas `parent_lemma` and `child_lemma`
    and dependency of type `dep` between them that are found in a spacy document `doc`.
    """
    for token in doc:
        if token.lemma_ == parent_lemma:
            for child in token.children:
                if child.dep_ == dep and child.lemma_ == child_lemma:
                    yield token, child

result = list(detect_collocations(doc, 'call', 'prt', 'off'))
print(result)
# [(called, off)]


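Because the function above returns pairs of spacy.Token objects, you can extract metadata from them, such as their positions, in order to highlight them in the text: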

positions = {t.idx for pair in result for t in pair}
for token in doc:
    print('_{}_'.format(token) if token.idx in positions else token, end=' ')
# My boss _called_ the meeting _off_ .


Here is a colab notebook you can play with.

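I think this is a great answer! In addition, I'd like to point you to a new feature of spaCy 3, the DependencyMatcher: https://nightly.spacy.io/usage/rule-based-matching#dependencymatcher spaCy 3 has not been officially released yet, but you can install the release candidate with "pip install spacy-nightly". (3 upvotes)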
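As a rough sketch of what the DependencyMatcher approach could look like for the same "call ... off" example, assuming a spaCy 3.x installation: the pattern name "CALL_OFF" and the node names "verb" and "particle" are made up for illustration. To keep the sketch independent of a downloaded model, the Doc is built by hand with the parse reported in the answer; with a trained pipeline you would use `spacy.load("en_core_web_sm")` and parse the text instead.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# A blank pipeline only supplies the vocabulary; no trained model is loaded.
nlp = spacy.blank("en")

# Hand-annotated parse (an assumption for this sketch, mirroring the
# parse described in the answer for this sentence).
words = ["My", "boss", "called", "the", "meeting", "off", "."]
lemmas = ["my", "boss", "call", "the", "meeting", "off", "."]
heads = [1, 2, 2, 4, 2, 2, 2]  # absolute index of each token's head
deps = ["poss", "nsubj", "ROOT", "det", "dobj", "prt", "punct"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, heads=heads, deps=deps)

# Anchor node: a token with lemma "call". Second node: a direct child
# (REL_OP ">") with dependency "prt" and lemma "off".
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "call"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "particle",
     "RIGHT_ATTRS": {"DEP": "prt", "LEMMA": "off"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("CALL_OFF", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
pairs = [tuple(doc[i].text for i in token_ids) for _, token_ids in matcher(doc)]
print(pairs)
```

The pattern expresses the same condition as detect_collocations (parent lemma, dependency label, child lemma), but declaratively, so longer idioms can be described by adding more nodes rather than more nested loops.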
