Asked by ven*_*nev · Tags: nlp, tokenize, n-gram, python-3.x, spacy
The code below splits a sentence into individual tokens, producing output like this:
"cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)
Ideally, I would like "cloud computing" to be read as a single token, since it is effectively one term.
Basically, I am looking for bigrams. Does spaCy have any feature that supports bigrams or trigrams?
spaCy can detect noun chunks. So to parse your noun phrases as single entities, do this:
1. Detect the noun chunks: https://spacy.io/usage/linguistic-features#noun-chunks
2. Merge the noun chunks
3. Run the dependency parse again; "cloud computing" will now be parsed as a single entity.
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> with doc.retokenize() as retokenizer:
...     for noun_phrase in list(doc.noun_chunks):
...         retokenizer.merge(noun_phrase)
...
>>> [(token.text, token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
If you already have a spaCy doc, you can pass it to textacy:
import textacy
ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
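For intuition, what the `ngrams(doc, 2, min_freq=2)` call computes can be sketched in plain Python. This is a simplified sketch over whitespace-split tokens, not textacy's actual implementation:

```python
from collections import Counter

def bigrams_min_freq(tokens, min_freq=2):
    """Return the bigrams (as word pairs) occurring at least min_freq times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return [bigram for bigram, count in counts.items() if count >= min_freq]

tokens = "to be or not to be".split()
print(bigrams_min_freq(tokens))  # [('to', 'be')]
```

textacy does the same kind of windowed counting, but over real spaCy tokens, with options to filter stop words and punctuation.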
Answered by 小智 · 6
Caveat: this is just an extension of Zuzana's correct answer.
My reputation does not let me comment yet, so I am posting this answer only to address Adit Sanghvi's question above: "How do you do it when you have a list of documents?"
1. First, create a list containing the texts of the documents
2. Then join the texts into a single document
3. Now parse that text with the spaCy parser to get a spaCy document
4. Use Zuzana's answer to create the bigrams
Here is the example code:
步骤1
doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1, doc2, doc3]
textList = [text for doc in listOfDocuments for text in doc]  # flatten to a list of strings
print(textList)
This prints the following text:
['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']
Then steps 2 and 3:
import spacy
parser = spacy.load('en_core_web_sm')  # the spaCy parser
doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)
which prints:
all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams
Finally, step 4 (Zuzana's answer):
import textacy
ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)
which prints:
[make bigrams, make bigrams, make bigrams]
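One caveat with joining all the documents into a single string: bigrams can form across document boundaries (for example, "code how" spans the end of the first document and the start of the second). A minimal plain-Python sketch of the alternative, counting bigrams per document so no pair crosses a boundary (whitespace-split tokens stand in for spaCy tokens here):

```python
from collections import Counter

docs = [
    'all what i want is that you give me back my code because i worked a lot on it. Just give me back my code',
    'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy',
    'i love to repeat phrases to make bigrams because i love make bigrams',
]

counts = Counter()
for text in docs:
    tokens = text.split()                    # stand-in for spaCy tokenization
    counts.update(zip(tokens, tokens[1:]))   # bigrams stay within one document

print(counts[('make', 'bigrams')])   # 3
print(('code', 'how') in counts)     # False: no bigram crosses a boundary
```

With spaCy, the same effect is obtained by parsing each document separately (e.g. via `nlp.pipe(docs)`) and extracting n-grams per doc, instead of joining the texts first.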