Sea*_*ean 14 python text-parsing nltk
I am trying to find the dependency path between two words in Python, given the dependency tree.
For the sentence
Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
I used practnlptools (https://github.com/biplab-iitb/practNLPTools) to get the dependency parse results, which look like:
nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)
It can also be visualized as follows (image from https://demos.explosion.ai/displacy/):

"机器人"和"是"之间的路径长度为1,"机器人"和"可怕"之间的路径长度为4.
我的问题在上面给出了依赖解析结果,我怎样才能获得两个单词之间的依赖路径或依赖路径长度?
根据我目前的搜索结果,nltk的ParentedTree会有帮助吗?
谢谢!
Hug*_*hot 11
Your problem can easily be conceived of as a graph problem where we have to find the shortest path between two nodes.
To turn your dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to take this:
'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
and make it look like this:
[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
This way you can feed the list of tuples to a graph constructor from the networkx module, which will go through the list and build a graph for you, and also gives you a neat method that returns the length of the shortest path between two given nodes.
Necessary imports
import re
import networkx as nx
from practnlptools.tools import Annotator
How to get your string into the desired list-of-tuples format
annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
dp_list = dep_parse.split('\n')
pattern = re.compile(r'.+?\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
    m = pattern.search(dep)
    edges.append((m.group(1), m.group(2)))
How to build the graph
graph = nx.Graph(edges) # Well that was easy
How to compute the shortest path length
print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
Running this script will reveal that, given this dependency parse, the shortest path is actually of length 2, since you can get from Robots-1 to awesomeness-12 by going through remind-8:
1. xsubj(remind-8, Robots-1)
2. prep_of(remind-8, awesomeness-12)
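If you also want the actual nodes along the path rather than just its length, networkx's shortest_path returns the node sequence. A minimal sketch, reusing the graph built above:
path = nx.shortest_path(graph, source='Robots-1', target='awesomeness-12')
print(path)           # ['Robots-1', 'remind-8', 'awesomeness-12']
print(len(path) - 1)  # path length counted in edges: 2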
If you don't like this result, you might want to consider filtering out some dependencies, in this case not allowing the xsubj dependency to be added to the graph; a sketch of this follows below.
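A minimal sketch of that filtering, assuming the same dp_list and pattern as above: any relation named xsubj is skipped before the graph is built.
edges = []
for dep in dp_list:
    if dep.startswith('xsubj('):
        continue  # drop xsubj relations from the graph
    m = pattern.search(dep)
    edges.append((m.group(1), m.group(2)))

graph = nx.Graph(edges)
print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))  # now 3, via are-5 and remind-8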
HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).
The sentence:
Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
has the following dependency tree (visualization not reproduced here):
Here is the code to find the shortest dependency path between two words:
import networkx as nx
import spacy
nlp = spacy.load('en')
# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
print('document: {0}'.format(document))
# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
    # FYI https://spacy.io/docs/api/token
    for child in token.children:
        edges.append(('{0}-{1}'.format(token.lower_, token.i),
                      '{0}-{1}'.format(child.lower_, child.i)))
graph = nx.Graph(edges)
# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
Output:
4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
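If you also need the dependency labels along that path, one possible sketch (not part of the original answer, and assuming the same document object as above) is to store each child's dep_ attribute as an edge attribute and read it back along the shortest path:
graph = nx.Graph()
for token in document:
    for child in token.children:
        graph.add_edge('{0}-{1}'.format(token.lower_, token.i),
                       '{0}-{1}'.format(child.lower_, child.i),
                       dep=child.dep_)  # label of the head -> child arc

path = nx.shortest_path(graph, source='robots-0', target='awesomeness-11')
for u, v in zip(path, path[1:]):
    print(u, v, graph[u][v]['dep'])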
To install spacy and networkx:
sudo pip install networkx
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB
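Note that these commands target an older spaCy release; on newer spaCy versions (2.x/3.x) the rough equivalent is:
pip install networkx spacy
python -m spacy download en_core_web_sm
and in the script nlp = spacy.load('en_core_web_sm'); the parse=True argument is no longer needed, since the dependency parser runs by default.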
Some benchmarks on spacy's dependency parsing: https://spacy.io/docs/api/