Clause extraction / long sentence segmentation in Python

Question

Pau*_*ler 9 python nlp stanford-nlp spacy bert-language-model

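I'm currently working on a project involving sentence vectors (from a RoBERTa pretrained model). These vectors are of lower quality when sentences are long, and my corpus contains many long sentences with subclauses.

I've been looking for methods for clause extraction / long sentence segmentation, but I was surprised to find that none of the major NLP packages (e.g. spacy or stanza) offer this out of the box.

I suppose this could be done with the dependency parsing of either spacy or stanza, but handling all kinds of convoluted sentences and edge cases properly would probably be quite complicated.

I've come across an implementation of the ClausIE information extraction system with spacy that does something similar, but it hasn't been updated and doesn't work on my machine.

I've also come across this repository for sentence simplification, but I get an annotation error from Stanford coreNLP when I run it locally.

Is there an obvious package/method that I've overlooked? If not, is there a simple way of implementing this with stanza or spacy?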

Answer 1

pol*_*m23 8

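Here is code that works on your specific example. Extending it to cover all cases is not simple, but it can be done over time on an as-needed basis.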

import spacy
# import deplacy  # optional: only needed for the deplacy.render call below

en = spacy.load('en_core_web_sm')

text = "This all encompassing experience wore off for a moment and in that moment, my awareness came gasping to the surface of the hallucination and I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug and this was the most pathetic death experience of all time."

doc = en(text)
# deplacy.render(doc)  # optional: print the dependency tree

seen = set()  # tokens already assigned to a conjunct clause

chunks = []
for sent in doc.sents:
    # Clauses coordinated with the main verb are direct 'conj' children of the root
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']

    for head in heads:
        words = list(head.subtree)
        seen.update(words)
        chunk = ' '.join(ww.text for ww in words)
        chunks.append((head.i, chunk))

    # Everything not covered by a conjunct belongs to the main clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join(ww.text for ww in unseen)
    chunks.append((sent.root.i, chunk))

# Restore document order using the index of each clause head
chunks = sorted(chunks, key=lambda x: x[0])

for ii, chunk in chunks:
    print(chunk)

deplacy is optional, but I find it useful for visualizing dependencies.

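Also, I see you expressing surprise that this is not a built-in feature of the common NLP libraries. The reason is simple: most applications don't need it, and while it looks like a simple task, it ends up being really complicated and application-specific the more cases you try to cover. On the other hand, for any specific application, like the example I gave, it is relatively easy to hack together a good-enough solution.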
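As one illustration of how the approach might be extended, here is a minimal sketch that also splits chained coordination ("A and B and C"), where each conjunct hangs off the previous one rather than directly off the root. It is not part of the code above and makes several assumptions: the parse is passed in as plain `(text, head_index, dep_label)` tuples so the logic can be tried without loading a model, the root is its own head, the labels `conj` and `cc` follow the scheme used by spaCy's English models, and the input is a well-formed tree. The function name `split_coordinated_clauses` and the `drop_cc` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

# (text, head_index, dep_label); the root token is its own head.
Token = Tuple[str, int, str]


def split_coordinated_clauses(tokens: List[Token], drop_cc: bool = True) -> List[str]:
    """Split a parsed sentence into its top-level coordinated clauses.

    A token is a clause head if it is the root, or if it is a 'conj'
    whose head is itself a clause head. Noun-level coordination such as
    "apples and pears" is therefore left intact.
    Assumes a well-formed tree (no cycles other than the root self-loop).
    """
    cache: Dict[int, bool] = {}

    def is_clause_head(i: int) -> bool:
        if i not in cache:
            _, head, dep = tokens[i]
            cache[i] = head == i or (dep == "conj" and is_clause_head(head))
        return cache[i]

    def owner(i: int) -> int:
        # Walk up the tree until the nearest clause head is reached.
        while not is_clause_head(i):
            i = tokens[i][1]
        return i

    groups: Dict[int, List[str]] = {}
    for i, (text, head, dep) in enumerate(tokens):
        # Optionally drop the conjunction that joins two clauses.
        if drop_cc and dep == "cc" and is_clause_head(head):
            continue
        groups.setdefault(owner(i), []).append(text)

    # Order clauses by the position of their head, as in the code above.
    return [" ".join(groups[h]) for h in sorted(groups)]


# Hand-annotated parse of "I came and I saw and I won"
chained = [
    ("I", 1, "nsubj"), ("came", 1, "ROOT"), ("and", 1, "cc"),
    ("I", 4, "nsubj"), ("saw", 4, "conj"), ("and", 4, "cc"),
    ("I", 7, "nsubj"), ("won", 7, "conj"),
]
for clause in split_coordinated_clauses(chained):
    print(clause)
```

With a real spaCy parse, the same tuples could be built with `[(t.text, t.head.i, t.dep_) for t in doc]`. Subordinate clauses (`advcl`, `ccomp`, `relcl` and so on) are deliberately left attached to their parent clause here; whether to split those as well depends on the application.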
