Gio*_*elm 1 python string text nlp nltk
I have thousands of sentences about events that happened in the past. For example:
sentence1 = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
sentence2 = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
sentence3 = 'The Hindu Medang kingdom flourishes and declines.'
I would like to convert them into questions of the form:
question1 = 'When were the Knights Templar founded to protect Christian pilgrims in Jerusalem?'
question2 = 'When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?'
question3 = 'When did the Hindu Medang kingdom flourish and decline?'
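I realize that this is a complex problem, and I would be fine with a success rate of 80%.
From what I gathered by searching the web, NLTK is the way to go for this kind of problem. I started trying a few things, but this is my first time using the library and I cannot get any further than this: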
import nltk
question = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
tokens = nltk.word_tokenize(question)
tagged = nltk.pos_tag(tokens)
This sounds like a problem many people must have run into and solved. Any suggestions?
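NLTK is definitely the right tool to use here. However, the quality of the tokenizer and POS-tagger output depends on your corpus and on the type of sentences. Also, there is usually no real out-of-the-box solution (afaik); it takes some tweaking. If you don't have much time to invest in this, I doubt you will even reach an 80% success rate.
That said, here is a basic list-comprehension-based example that should help you capture and successfully convert some of your sentences.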
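import nltk

question_one = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
question_two = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'

def modify(inputStr):
    # PunktWordTokenizer is not available in recent NLTK releases;
    # nltk.word_tokenize(inputStr) is the usual replacement there
    # (it also splits the final period into its own token).
    tokens = nltk.PunktWordTokenizer().tokenize(inputStr)
    tagged = nltk.pos_tag(tokens)
    # Positions of tokens tagged 'VBP' (non-3rd-person present verbs such as 'are')
    auxiliary_verbs = [i for i, w in enumerate(tagged) if w[1] == 'VBP']
    if auxiliary_verbs:
        # Move the first such verb to the front: "... are founded" -> "are ... founded"
        tagged.insert(0, tagged.pop(auxiliary_verbs[0]))
    else:
        # No auxiliary found, so fall back to do-support
        tagged.insert(0, ('did', 'VBD'))
    tagged.insert(0, ('When', 'WRB'))
    return ' '.join([t[0] for t in tagged])

question_one = modify(question_one)
question_two = modify(question_two)
print(question_one)
print(question_two)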
Output:
When are The Knights Templar founded to protect Christian pilgrims in Jerusalem.
When did Alfonso VI of Castile captures the Moorish Muslim city of Toledo , Spain.
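As you can see, you still need to fix the casing ('The' is still capitalized), 'captures' is now in the wrong tense, and you will want to extend the set of auxiliary_verbs tags (just 'VBP' is probably too limited). But it's a start. Hope this helps!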
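For what it's worth, here is a rough sketch of how those leftovers (casing, tense, and the trailing period) could be handled. It is only an illustration under a few assumptions: the names `PAST_AUX`, `base_form` and `to_when_question` are made up for this example, the input is assumed to be already tagged in the `(word, tag)` format that `nltk.pos_tag` returns, with the final period as its own token, and the tags in the usage example are written by hand, so a real tagger may disagree with them. The de-inflection is deliberately naive and only covers regular `-s` forms; a proper lemmatizer would be the more robust choice.

```python
# Hypothetical helpers; tag names follow the Penn Treebank tag set.

# Forms of "be" that get moved to the front, mapped to their past tense.
PAST_AUX = {"is": "was", "are": "were", "am": "was", "was": "was", "were": "were"}


def base_form(word):
    """Naively strip a regular third-person -s ending (assumption: regular verbs only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"                      # carries -> carry
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]                            # flourishes -> flourish
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                            # captures -> capture
    return word


def to_when_question(tagged):
    """Turn a list of (word, tag) pairs into a 'When ...?' question string."""
    tokens = list(tagged)
    # Drop the sentence-final period; a question mark is appended at the end.
    if tokens and tokens[-1][0] == ".":
        tokens.pop()
    # Look for a form of "be" by its surface form rather than by tag.
    aux_index = next(
        (i for i, (word, _) in enumerate(tokens) if word.lower() in PAST_AUX), None
    )
    if aux_index is not None:
        word, _ = tokens.pop(aux_index)
        front = PAST_AUX[word.lower()]              # are -> were
    else:
        front = "did"
        # After "did", every VBZ verb has to go back to its base form.
        tokens = [(base_form(w), "VB") if t == "VBZ" else (w, t) for w, t in tokens]
    # Lowercase a leading determiner ("The" -> "the"); proper nouns stay as they are.
    if tokens and tokens[0][1] == "DT":
        tokens[0] = (tokens[0][0].lower(), "DT")
    # Join the words, attaching commas and similar marks to the previous word.
    out = ""
    for word in ["When", front] + [w for w, _ in tokens]:
        if not out or word in {",", ";", ":"}:
            out += word
        else:
            out += " " + word
    return out + "?"


# Hand-written tags for illustration only (not actual tagger output).
tagged_two = [
    ("Alfonso", "NNP"), ("VI", "NNP"), ("of", "IN"), ("Castile", "NNP"),
    ("captures", "VBZ"), ("the", "DT"), ("Moorish", "JJ"), ("Muslim", "NNP"),
    ("city", "NN"), ("of", "IN"), ("Toledo", "NNP"), (",", ","),
    ("Spain", "NNP"), (".", "."),
]
print(to_when_question(tagged_two))
# When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?
```

Matching the auxiliary on the word itself instead of on the 'VBP' tag is one way around the "too limited" problem, but it only covers forms of "be"; past-tense main verbs (tag 'VBD') and irregular verbs are not handled at all here.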