Tagging Dutch sentences with NLTK

Ste*_*reo 2 python nltk

I'm getting started with NLTK and want to tag Dutch sentences, but I'm having trouble specifying the corpus.

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino

pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')

This gives:

[('Python', 'UNK'),
 ('is', 'UNK'),
 ('een', 'UNK'),
 ('goede', 'UNK'),
 ('data', 'UNK'),
 ('science', 'UNK'),
 ('taal', 'UNK'),
 ('.', 'UNK')]

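Clearly I'm not specifying the corpus correctly. I have downloaded the alpino corpus. Can anyone help me figure out how to specify it properly?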

alv*_*vas 6

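The default nltk.pos_tag is trained on English text, so the tagset argument will not make it tag Dutch. You have to train a new tagger on the alpino corpus to roll your own Dutch tagger.

Note, however, that the model will only be as good as:

  • the data it is trained on
  • the algorithm it is trained with

An example with UnigramTagger and BigramTagger: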

>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
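
The None tags in the output above appear for words the tagger never saw during training ('NLTK', 'goeda', 'NLP'), because the backoff chain ends without a fallback. The following is a minimal pure-Python sketch of that idea, not NLTK's implementation: train_unigram, tag_with_backoff and the toy corpus are hypothetical names and data made up for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_backoff(words, model, default_tag=None):
    """Tag known words from the model; unseen words fall back to default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


# Toy training data, invented for this sketch (not taken from alpino).
toy_corpus = [
    [("dit", "det"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
    [("een", "det"), ("goede", "adj"), ("taal", "noun")],
]

model = train_unigram(toy_corpus)
sent = "NLTK is een goeda taal".split()

# Without a fallback, unseen words stay untagged (None).
no_backoff = tag_with_backoff(sent, model)
# With a fallback, unseen words receive the default tag instead.
with_backoff = tag_with_backoff(sent, model, default_tag="noun")

print(no_backoff)
print(with_backoff)
```

In NLTK itself, a comparable effect should be achievable by ending the backoff chain with nltk.tag.DefaultTagger, for example UnigramTagger(training_corpus, backoff=DefaultTagger('noun')). Whether 'noun' is a sensible default for your data is an assumption to check.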

An example with PerceptronTagger:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents()) 
>>> tagger = PerceptronTagger(load=False)  # start empty instead of loading the pretrained English model
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]

As @WasiAhmed pointed out, here is another good example: https://github.com/evanmiltenburg/Dutch-tagger. As @evanmiltenburg notes on GitHub, try a faster tagger for production use.


Edited

To evaluate the tagger, you can hold out a test_set as follows:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
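
The split above takes the first 90% of the sentences for training and the last 10% for testing, in corpus order. A small helper can make the same split reusable. This is only a sketch: split_train_test and its optional seeded shuffle are assumptions added here, not part of the answer or of NLTK.

```python
import random


def split_train_test(items, train_fraction=0.9, seed=None):
    """Split a list into (train, test).

    With seed=None the original order is kept, as in the snippet above.
    Passing a seed shuffles a copy first (an optional extra, assumed here).
    """
    items = list(items)
    if seed is not None:
        random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Stand-in data: 20 dummy "sentences" instead of the alpino corpus.
dummy_sents = [[("w%d" % i, "noun")] for i in range(20)]

train_set, test_set = split_train_test(dummy_sents, train_fraction=0.75)
print(len(train_set), len(test_set))
```

Shuffling matters if the corpus is ordered by source or topic, since the last 10% may then differ systematically from the rest. Whether that applies to alpino is not something this sketch checks.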

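Then use the tagger.evaluate() function to get the accuracy. The input to .evaluate() has the same form as the input to .train(), i.e. a list of sentences, where each sentence is a list of ('word', 'tag') tuples: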

>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
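
The number returned above is a token-level accuracy: the share of words in the test set whose predicted tag matches the gold tag. A minimal pure-Python sketch of that calculation follows. token_accuracy, the toy tagger and the toy gold data are hypothetical and only illustrate the measure; this is not NLTK's code.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs;
    gold_sents is a list of sentences, each a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            correct += predicted_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


def always_noun(words):
    """Toy tagger for the sketch: labels every word as 'noun'."""
    return [(word, "noun") for word in words]


# Toy gold-standard data, invented for illustration.
gold = [
    [("een", "det"), ("taal", "noun")],
    [("taal", "noun"), ("is", "verb")],
]

accuracy = token_accuracy(always_noun, gold)
print(accuracy)
```

With a trained NLTK tagger, tagger.tag could be passed as tag_fn. Newer NLTK releases, as far as I know, expose the same measurement as tagger.accuracy() and mark evaluate() as deprecated, so check which one your installed version expects.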