Converting sentences into interrogative sentences with Python NLTK

Gio*_*elm 1 python string text nlp nltk

I have thousands of sentences about events that happened in the past. For example:

sentence1 = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
sentence2 = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
sentence3 = 'The Hindu Medang kingdom flourishes and declines.'

I would like to convert them into questions of the form:

question1 = 'When were the Knights Templar founded to protect Christian pilgrims in Jerusalem?'
question2 = 'When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?'
question3 = 'When did the Hindu Medang kingdom flourish and decline?'

I realize this is a complex problem, and I would be fine with a success rate of 80%.

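From what I gathered by searching the web, NLTK is the way to go for this kind of problem. I started trying a few things, but this is my first time using the library and I could not get any further than this: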

import nltk
question = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
tokens = nltk.word_tokenize(question)
tagged = nltk.pos_tag(tokens)

This sounds like a problem many people must have run into and solved. Any suggestions?

Igo*_*gor 7

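NLTK is definitely the right tool to use here. However, the quality of the tokenizer and POS-tagger output depends on your corpus and on the type of sentences. Also, there is usually no real out-of-the-box solution for this (afaik); it takes some tweaking. If you don't have much time to invest in it, I doubt your success rate will even reach 80%.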

That being said, here is a basic list-based example that should help you capture and successfully convert some of your sentences.

import nltk

question_one = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
question_two = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'

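def modify(inputStr):
    # Note: PunktWordTokenizer was removed in NLTK 3.x; on current releases
    # nltk.word_tokenize(inputStr) is the usual replacement.
    tokens = nltk.PunktWordTokenizer().tokenize(inputStr)
    tagged = nltk.pos_tag(tokens)
    # Indices of tokens tagged 'VBP' (non-3rd-person present verb, e.g. 'are')
    auxiliary_verbs = [i for i, w in enumerate(tagged) if w[1] == 'VBP']
    if auxiliary_verbs:
        # Move the first such verb to the front of the sentence
        tagged.insert(0, tagged.pop(auxiliary_verbs[0]))
    else:
        # No 'VBP' token found: fall back to 'did'
        tagged.insert(0, ('did', 'VBD'))
    tagged.insert(0, ('When', 'WRB'))

    return ' '.join([t[0] for t in tagged])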

question_one = modify(question_one)
question_two = modify(question_two)

print(question_one)
print(question_two)

Output:

When are The Knights Templar founded to protect Christian pilgrims in Jerusalem.
When did Alfonso VI of Castile captures the Moorish Muslim city of Toledo , Spain.

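As you can see, you still need to fix the casing ('The' is still capitalized), 'captures' is now in the wrong tense, and you will want to extend the auxiliary_verbs types (only 'VBP' is probably too limited). But it's a start. Hope this helps!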
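The remaining fixes mentioned above (casing, verb tense, a wider notion of auxiliary, and the final question mark) could be sketched roughly as follows. This is only an illustration, not a complete solution: `to_when_question`, `naive_base_form`, `PAST_BE` and `ATTACH_LEFT` are hypothetical names introduced here, the function works on an already tagged token list of the kind `nltk.pos_tag` returns, and the de-inflection rule is a crude suffix heuristic rather than real lemmatization. The tags in the usage lines are written by hand for illustration; an actual tagger may label some tokens differently.

```python
# Forms of "to be" that can be fronted; present forms are shifted to the past
# because the questions ask about past events (assumption taken from the
# desired outputs in the question).
PAST_BE = {'is': 'was', 'are': 'were', 'was': 'was', 'were': 'were'}
# Punctuation tokens that should attach to the preceding word when joining.
ATTACH_LEFT = {',', ';', ':'}


def naive_base_form(verb):
    """Crude de-inflection of a 3rd-person singular verb (hypothetical helper)."""
    if verb.endswith('ies'):
        return verb[:-3] + 'y'
    if verb.endswith(('ches', 'shes', 'sses', 'xes', 'zes')):
        return verb[:-2]
    if verb.endswith('s') and not verb.endswith('ss'):
        return verb[:-1]
    return verb


def to_when_question(tagged):
    """Build a 'When ...?' question from (word, tag) pairs."""
    tokens = list(tagged)
    if not tokens:
        return ''
    # Strip sentence-final punctuation, whether it is a separate token
    # ('.') or still attached to the last word ('Jerusalem.').
    word, tag = tokens[-1]
    word = word.rstrip('.!?')
    if word:
        tokens[-1] = (word, tag)
    else:
        tokens.pop()
    # Lowercase a sentence-initial determiner ('The' -> 'the'); proper
    # nouns such as 'Alfonso' keep their capital.
    if tokens and tokens[0][1] == 'DT':
        tokens[0] = (tokens[0][0].lower(), 'DT')
    # Look for the first verb token that is a form of "to be".
    be_index = next(
        (i for i, (w, t) in enumerate(tokens)
         if t.startswith('VB') and w.lower() in PAST_BE),
        None,
    )
    if be_index is not None:
        # Front the auxiliary in its past form.
        aux, _ = tokens.pop(be_index)
        words = [PAST_BE[aux.lower()]] + [w for w, _ in tokens]
    else:
        # No auxiliary: use 'did' and reduce 'VBZ' verbs to a base form.
        words = ['did'] + [
            naive_base_form(w) if t == 'VBZ' else w for w, t in tokens
        ]
    sentence = 'When'
    for w in words:
        sentence += w if w in ATTACH_LEFT else ' ' + w
    return sentence + '?'


# Hand-written tags, for illustration only.
tagged_three = [('The', 'DT'), ('Hindu', 'NNP'), ('Medang', 'NNP'),
                ('kingdom', 'NN'), ('flourishes', 'VBZ'), ('and', 'CC'),
                ('declines', 'VBZ'), ('.', '.')]
print(to_when_question(tagged_three))
# When did the Hindu Medang kingdom flourish and decline?
```

Past-tense verbs ('VBD') and irregular forms are left untouched here, and a form of "to be" inside a subordinate clause would be fronted by mistake, so a lemmatizer and some parsing would still be needed for sentences beyond this simple shape.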