NLP - Information extraction in Python (spaCy)

kat*_*ehl 9 python nlp information-extraction spacy

I am trying to extract this type of information from the following paragraph structure:

 women_ran  men_ran  kids_ran  walked
         1        2         1       3
         2        4         3       1
         3        6         5       2

text = ["On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.", "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.", "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park."]

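I am using Python's spaCy as my NLP library. I am new to NLP work, and I am hoping for some guidance on the best way to extract this tabular information from these sentences.

If I only needed to determine whether anyone was running or walking, I would just use a classification model from sklearn, but the information I need to extract is obviously more granular than that (I am trying to retrieve the subcategories and the values for each). Any guidance would be greatly appreciated.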

syl*_*sm_ 12

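You will want to use dependency parsing. You can view a visualization of your example sentence using the displaCy visualizer.

You can implement the rules you need in a few different ways, much as there are always multiple ways to write an XPath query, a DOM selector, and so on.

Something like this should work: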

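import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
# (the 'en' shortcut used by older spaCy versions was removed in spaCy 3)
nlp = spacy.load('en_core_web_sm')
docs = [nlp(t) for t in text]
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Tokens that act as the nominal subject of a verb
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers to the left of the subject, e.g. "2" in "2 men"
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))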

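For your example text you should get the following (the exact parse, and therefore the output, can vary with the spaCy model version):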

document.sentence: 0.0, subject: men, action: ran, numbers: 2
document.sentence: 0.0, subject: child, action: ran, numbers: 1
document.sentence: 0.1, subject: people, action: walking, numbers: 3
document.sentence: 1.0, subject: person, action: walking, numbers: One
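To get from printed triples like these to the table in the question, one option is a small post-processing step. The following is only a sketch in plain Python: the lookup tables `WORD_NUMBERS`, `SUBJECT_GROUPS`, `RUN_FORMS` and `WALK_FORMS`, and the helpers `to_int`, `column_for` and `build_rows`, are hypothetical names made up for illustration, and the input triples are simply the four output lines above typed in by hand.

```python
from collections import defaultdict

# Hypothetical lookup tables (assumptions for this sketch); extend them for your data.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SUBJECT_GROUPS = {"woman": "women", "women": "women",
                  "man": "men", "men": "men",
                  "child": "kids", "children": "kids",
                  "kid": "kids", "kids": "kids"}
RUN_FORMS = {"run", "runs", "ran", "running"}
WALK_FORMS = {"walk", "walks", "walked", "walking"}


def to_int(number_text):
    """Convert '2' or 'One' to an int; return None if it is not recognised."""
    lowered = number_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return WORD_NUMBERS.get(lowered)


def column_for(subject, action):
    """Map a (subject, action) pair to a table column name, or None."""
    subject, action = subject.lower(), action.lower()
    if action in WALK_FORMS:
        # The table has a single 'walked' column, whoever is walking.
        return "walked"
    if action in RUN_FORMS and subject in SUBJECT_GROUPS:
        return SUBJECT_GROUPS[subject] + "_ran"
    return None


def build_rows(triples):
    """Collect (doc_index, subject, action, number_text) tuples into one dict per document."""
    rows = defaultdict(dict)
    for doc_index, subject, action, number_text in triples:
        column = column_for(subject, action)
        value = to_int(number_text)
        if column is not None and value is not None:
            rows[doc_index][column] = value
    return dict(rows)


# The four lines of output shown above, entered by hand.
triples = [
    (0, "men", "ran", "2"),
    (0, "child", "ran", "1"),
    (0, "people", "walking", "3"),
    (1, "person", "walking", "One"),
]
rows = build_rows(triples)
print(rows)
```

Cells that the dependency rule did not pick up (for example the women in the first sentence) are simply absent from the resulting dicts, so the rule itself would still need to be extended to fill the whole table.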