Is there a bigram or trigram feature in spaCy?

ven*_*nev 9 nlp tokenize n-gram python-3.x spacy

The code below splits the sentence into individual tokens, and the output is shown below:

 "cloud"  "computing"  "is" "benefiting"  " major"  "manufacturing"  "companies"


import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

Ideally, what I want is for "cloud computing" to be read as one unit, since it is technically a single term.

Basically I am looking for a bigram. Is there any functionality in spaCy that allows bigrams or trigrams?

Dhr*_*hak 9

spaCy allows the detection of noun chunks. So, to parse your noun phrases as single entities, do this:

  1. Detect the noun chunks: https://spacy.io/usage/linguistic-features#noun-chunks

  2. Merge the noun chunks

  3. Do the dependency parse again; "cloud computing" is now parsed as a single entity.

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
...     noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
... 
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
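Note: the transcript above uses the older spaCy 1.x/2.x API (`spacy.load('en')`, `Span.merge`), which was removed in spaCy 3.x. A minimal sketch of the same idea, assuming spaCy 3.x with the en_core_web_sm model installed, using the built-in merge_noun_chunks pipeline component:

import spacy

# assumes spaCy 3.x and that en_core_web_sm is installed
nlp = spacy.load("en_core_web_sm")
# built-in component that retokenizes every noun chunk into a single token
nlp.add_pipe("merge_noun_chunks")

doc = nlp("Cloud computing is benefiting major manufacturing companies")
print([(token.text, token.pos_) for token in doc])
# expected output along the lines of:
# [('Cloud computing', 'NOUN'), ('is', 'AUX'), ('benefiting', 'VERB'),
#  ('major manufacturing companies', 'NOUN')]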

  • Thanks for the answer, but the solution you provide is a "go-around way", not a generic solution. Take this text for example: `doc = nlp("big data cloud computing cyber security machine learning")`. It is not a coherent sentence, it is a collection of words. In this case I don't get "cloud computing", I get `['big data cloud', 'cyber security machine learning']` (4 upvotes)
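For a bag of keywords like that, noun chunks depend on the parser seeing a coherent sentence, so a model-independent alternative is to slide a window over the tokens directly. A minimal sketch (the `ngrams` helper below is only illustrative, not a spaCy API):

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# plain sliding window over the tokens: n=2 gives bigrams, n=3 trigrams
def ngrams(doc, n):
    return [doc[i:i + n].text for i in range(len(doc) - n + 1)]

print(ngrams(doc, 2))
# ['Cloud computing', 'computing is', 'is benefiting', 'benefiting major', ...]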

Suz*_*ana 8

If you have a spaCy doc, you can pass it to textacy:

import textacy.extract

ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
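For context, a minimal end-to-end sketch, assuming spaCy 3.x with en_core_web_sm and a recent textacy; `min_freq=2` keeps only n-grams that occur at least twice in the doc:

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("We love bigrams on spacy. We love bigrams on spacy.")

# spans of 2 consecutive tokens, keeping only those seen at least twice
ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
print(ngrams)
# with textacy's default stop-word filtering this yields something like
# [love bigrams, love bigrams]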


小智 6

Warning: this is just an extension of Zuzana's correct answer.

My reputation does not allow me to comment, so I am making this answer only to reply to Adit Sanghvi's question above: "How do you do it when you have a list of documents?"

  1. First you need to create a list with the text of the documents

  2. Then you join the texts into one single document

  3. Now you use the spaCy parser to transform the text document into a spaCy document

  4. You use Zuzana's answer to create the bigrams

Here is the example code:

Step 1

doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love  make bigrams']
listOfDocuments = [doc1, doc2, doc3]
# flatten the single-element lists into one flat list of texts
textList = [text for doc in listOfDocuments for text in doc]
print(textList)

This will print this text:

['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']

Then steps 2 and 3:

import spacy

# 'parser' is simply a loaded spaCy pipeline (model name assumed here)
parser = spacy.load('en_core_web_sm')

doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)

And it will print:

all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams

And finally step 4 (Zuzana's answer):

import textacy.extract

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)

which will print this:

[make bigrams, make bigrams, make bigrams]
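If you want counts rather than the repeated spans themselves, a small follow-up sketch that tallies the bigram texts:

from collections import Counter

# ngrams is the list of spaCy Span objects from the previous step
bigram_counts = Counter(ngram.text for ngram in ngrams)
print(bigram_counts)
# Counter({'make bigrams': 3})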