I am just getting started with NLTK and want to POS-tag Dutch sentences, but I am having trouble specifying the corpus.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino
pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')
gives:
[('Python', 'UNK'),
('is', 'UNK'),
('een', 'UNK'),
('goede', 'UNK'),
('data', 'UNK'),
('science', 'UNK'),
('taal', 'UNK'),
('.', 'UNK')]
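Clearly I am not specifying the corpus correctly. I have downloaded the alpino corpus. Can anyone help me figure out how to specify it properly?
The default nltk.pos_tag is trained on English text, so you have to train a new tagger on the alpino corpus to roll your own Dutch tagger. The tagset argument only maps the output tags onto another tagset; it does not select a training corpus, which is why every token came back as 'UNK'.
Note, however, that such a model will perform roughly as follows:
From the UnigramTagger and BigramTagger example: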
>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
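The None tags above appear because a unigram or bigram tagger can only label words it saw during training. The following is a minimal pure-Python sketch of that behaviour, not NLTK's actual implementation: the helper names `train_unigram` and `tag_with_default` and the toy training data are invented for illustration. In NLTK itself, a similar effect can be had by passing an `nltk.tag.DefaultTagger` as the `backoff` of the UnigramTagger.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_default(model, tokens, default_tag=None):
    """Look each token up; unseen tokens get default_tag (None mimics no backoff)."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]

# Toy training data, invented for illustration (not taken from alpino).
toy_train = [
    [("dit", "det"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
    [("een", "det"), ("goede", "adj"), ("taal", "noun")],
]
model = train_unigram(toy_train)
sent = "NLTK is een goeda taal".split()

no_backoff = tag_with_default(model, sent)                         # unseen words stay None
with_backoff = tag_with_default(model, sent, default_tag="noun")   # unseen words become 'noun'
print(no_backoff)
print(with_backoff)
```

Guessing 'noun' for every unknown word is crude, but it is often a reasonable last resort, since unseen tokens tend to be names and other nouns.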
With PerceptronTagger:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents())
>>> tagger = PerceptronTagger(load=False)  # start untrained instead of loading the English model
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]
As @WasiAhmed points out, there is another good example here: https://github.com/evanmiltenburg/Dutch-tagger. And as @evanmiltenburg notes on GitHub, try a faster tagger for production use.
To evaluate the tagger, you can hold out a test_set as follows:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
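The split above takes the last 10% of the corpus in its original order. If neighbouring sentences in the corpus happen to be related (an assumption, not something checked here), a shuffled hold-out may give a fairer estimate. Below is a small sketch; `holdout_split`, its parameters, and the toy sentences are hypothetical and not part of NLTK.

```python
import random

def holdout_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle a copy of the sentences reproducibly, then split off a test set."""
    shuffled = list(sentences)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for list(alp.tagged_sents()); invented for illustration.
toy_sents = [[("w%d" % i, "noun")] for i in range(20)]
toy_train, toy_test = holdout_split(toy_sents)
print(len(toy_train), len(toy_test))
```

With the real corpus, the same function would be called on `alp_tagged_sents` in place of `toy_sents`.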
Then use tagger.evaluate() to get the accuracy. .evaluate() takes the same input as .train(): a list of sentences, where each sentence is a list of ('word', 'tag') tuples:
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
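The score above is a per-token accuracy: the fraction of test tokens whose predicted tag equals the gold tag. Here is a rough sketch of that computation in plain Python, assuming a tagging function with the same shape as `tagger.tag`; `token_accuracy` and `always_noun` are illustrative names, not NLTK functions. As far as I know, recent NLTK releases deprecate `evaluate()` in favour of an equivalent `accuracy()` method, so check the version you have installed.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]   # strip the gold tags
        predicted = tag_fn(words)            # list of (word, tag) pairs
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

def always_noun(tokens):
    """Trivial baseline tagger: label every token as a noun."""
    return [(tok, "noun") for tok in tokens]

# Toy gold data, invented for illustration (not taken from alpino).
gold = [
    [("een", "det"), ("taal", "noun")],
    [("taal", "noun"), ("is", "verb")],
]
score = token_accuracy(always_noun, gold)
print(score)
```

Passing `tagger.tag` and `test_set` in place of `always_noun` and `gold` should give a comparable figure to the one reported by the tagger's own method.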