Pro*_*ltk 6 python text nltk sentence spacy
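I am planning to use spaCy and textacy to identify sentence structure in English.
For example: "The cat sat on the mat" is SVO; "The cat jumped and picked up the biscuit" is SVVO; "The cat ate biscuit and cookies" is SVOO.
The program should read a paragraph and return the output for each sentence as SVO, SVOO, SVVO or another custom structure.
Efforts so far: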
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
# Dependency labels treated as subject, verb and object
SUBJ = ["nsubj", "nsubjpass"]
VERB = ["ROOT"]
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if tok.dep_ in SUBJ]
obj_toks = [tok for tok in text if tok.dep_ in OBJ]
vrb_toks = [tok for tok in text if tok.dep_ in VERB]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print("SVO:", text_ext)
Output:
(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
Question 1: The SVO is overwritten. Why?
Question 2: How do I identify a sentence as SVOO, SVO, SVVO, etc.?
Edit 1:
An approach I am conceptualizing:
from __future__ import unicode_literals
import spacy
import en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
# Fine-grained tags that must all appear in the sentence
chk_set = set(['PRP', 'MD', 'NN'])
result = chk_set.issubset(t.tag_ for t in doc)
if result == False:
    print("SVO not identified")
elif result == True:  # shouldn't do this
    print("SVO")
else:
    # Unreachable: result is always True or False
    print("Others...")
Edit 2:
Made some further progress:
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))
Current output:
det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct
Expected output:
SVO SVVO SVOO
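The idea is to break the dependency labels down into a simple subject-verb-object model.
If no other option is available, I could implement this with regular expressions, but that is my last resort.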
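One way the idea above could be sketched without regular expressions is to map each dependency label to a letter and join the letters per sentence. The mapping `DEP_TO_LETTER`, the helper `pattern_from_deps`, and the rule that a `conj` repeats the previous letter are all assumptions made for illustration; the label sequences are copied from the current output above, split by sentence by hand.

```python
# Assumed mapping from spaCy dependency labels to pattern letters.
DEP_TO_LETTER = {
    "nsubj": "S",
    "nsubjpass": "S",
    "ROOT": "V",
    "dobj": "O",
    "pobj": "O",
}


def pattern_from_deps(deps):
    """Collapse one sentence's dependency labels into a string such as 'SVO'."""
    letters = []
    for dep in deps:
        if dep in DEP_TO_LETTER:
            letters.append(DEP_TO_LETTER[dep])
        elif dep == "conj" and letters:
            # Heuristic (assumption): a conjunct plays the same role
            # as the last subject, verb or object seen before it.
            letters.append(letters[-1])
    return "".join(letters)


# Label sequences taken from the output above, one string per sentence.
sentences = [
    "det nsubj ROOT prep det pobj punct",
    "det nsubj ROOT cc conj prt det dobj punct",
    "det nsubj ROOT dobj cc conj punct",
]
patterns = [pattern_from_deps(s.split()) for s in sentences]
print(" ".join(patterns))  # SVO SVVO SVOO
```

The `conj` heuristic only looks at label order, so it would likely mislabel a sentence such as "The cat ate the biscuit and slept"; checking the head of each conjunct in the dependency tree should be more robust.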
Edit 3:
After studying this link, I made some improvements.
import en_core_web_sm

def testSVOs():
    # findSVOs comes from the code behind the link mentioned above
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)
Current output:
[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
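Expected output:
I am expecting a notation for each sentence. I am able to extract the SVO triples, but how do I convert them into SVO notation? It is more about pattern identification than the sentence content itself.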
SVO SVO SVOO
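Question 1: The SVO is overwritten. Why?
This is a textacy issue. That part does not work very well; see this blog.
Question 2: How do I identify a sentence as SVOO, SVO, SVVO, etc.?
You should parse the dependency tree. spaCy provides the information; you just need to write a set of rules that extract it using the .head, .lefts, .rights and .children attributes.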
>>> for word in text:
...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN sat
       sat   VBD       ROOT       VERB sat
        on    IN       prep        ADP sat
       the    DT        det        DET mat
       mat    NN       pobj       NOUN on
         .     .      punct      PUNCT sat
        of    IN       ROOT        ADP of
       the    DT        det        DET lab
       art    NN   compound       NOUN lab
       lab    NN       pobj       NOUN of
         .     .      punct      PUNCT of
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN jumped
    jumped   VBD       ROOT       VERB jumped
       and    CC         cc      CCONJ jumped
    picked   VBD       conj       VERB jumped
        up    RP        prt       PART picked
       the    DT        det        DET biscuit
   biscuit    NN       dobj       NOUN picked
         .     .      punct      PUNCT jumped
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN ate
       ate   VBD       ROOT       VERB ate
   biscuit    NN       dobj       NOUN ate
       and    CC         cc      CCONJ biscuit
   cookies   NNS       conj       NOUN biscuit
         .     .      punct      PUNCT ate
I suggest you look at this code; just add pobj to the OBJECTS list and you will have SVO and SVOO covered. With a little fiddling you can get SVVO as well.
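As a rough illustration of the rule-based approach described in this answer, the sketch below labels each token as S, V or O from its dependency label and resolves a `conj` token by following `.head` to the word it is coordinated with. The names `role`, `sentence_pattern`, `Tok` and `build` are hypothetical; `Tok` is only a stand-in so the example runs without a model, and treating every `ROOT` as the verb is an assumption carried over from the question.

```python
SUBJECTS = {"nsubj", "nsubjpass"}
OBJECTS = {"dobj", "pobj"}


def role(tok):
    """Return 'S', 'V', 'O' or '' for a token with .dep_ and .head."""
    if tok.dep_ in SUBJECTS:
        return "S"
    if tok.dep_ in OBJECTS:
        return "O"
    if tok.dep_ == "ROOT":
        return "V"  # assumption: the root of the sentence is its main verb
    if tok.dep_ == "conj" and tok.head is not tok:
        # A conjunct inherits the role of the word it is attached to.
        return role(tok.head)
    return ""


def sentence_pattern(sent):
    """Join the roles of all tokens in one sentence, e.g. 'SVVO'."""
    return "".join(role(tok) for tok in sent)


class Tok(object):
    """Hypothetical stand-in for a spaCy token; only dep_ and head are used."""

    def __init__(self, text, dep_):
        self.text, self.dep_, self.head = text, dep_, self


def build(words, deps, heads):
    """Create stand-in tokens and link each one to its head by index."""
    toks = [Tok(w, d) for w, d in zip(words, deps)]
    for tok, h in zip(toks, heads):
        tok.head = toks[h]
    return toks


# Dependencies copied from the parse table above.
sent2 = build(
    "The cat jumped and picked up the biscuit .".split(),
    "det nsubj ROOT cc conj prt det dobj punct".split(),
    [1, 2, 2, 2, 2, 4, 7, 4, 2],
)
sent3 = build(
    "The cat ate biscuit and cookies .".split(),
    "det nsubj ROOT dobj cc conj punct".split(),
    [1, 2, 2, 2, 3, 3, 2],
)
print(sentence_pattern(sent2), sentence_pattern(sent3))  # SVVO SVOO
```

With a real spaCy `doc`, the same function should work on each sentence in `doc.sents`, since spaCy tokens expose `dep_` and `head` and the root token is its own head.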