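I am trying to learn how to tag Spanish words using NLTK.
From the NLTK book, it is quite easy to tag English words using their example. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.
I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything suggesting I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.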
alv*_*vas 15
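First, you need to read the tagged sentences from a corpus. NLTK provides a nice interface so you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml.
Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
Then you can use the tagger to tag the sentence you want. Here is some example code: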
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
# Read the corpus into a list,
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()
# Train the unigram tagger
uni_tag = ut(cess_sents)
sentence = "Hola , esta foo bar ."
# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))
# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%
# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])
# Evaluate on the remaining 10% held out for testing.
bi_tag.evaluate(cess_sents[train:])
# Using the tagger.
bi_tag.tag(sentence.split(" "))
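One of the "fancier options" is chaining taggers with backoff, so that a tagger that has not seen a context hands the token to a simpler one. The following is a minimal sketch on a toy corpus; the sentences and the simplified tags are made up for illustration and are not the real cess_esp tagset.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Toy training data (hypothetical sentences and simplified tags,
# not taken from cess_esp).
train_sents = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("gata", "NOUN"), ("duerme", "VERB")],
]

# A bigram tagger on its own: "perro" was never seen at the start of a
# sentence, so it gets None, and that None then breaks the context for
# the following token as well.
plain_bigram = BigramTagger(train_sents)
no_backoff = plain_bigram.tag(["perro", "come"])

# Backoff chain: bigram -> unigram -> default tag for unknown words.
default = DefaultTagger("UNK")
unigram = UnigramTagger(train_sents, backoff=default)
bigram = BigramTagger(train_sents, backoff=unigram)
with_backoff = bigram.tag(["el", "perro", "ladra"])

print(no_backoff)
print(with_backoff)
```

With the real corpus you would pass `cess_sents[:train]` instead of `train_sents`; the `"UNK"` default tag is only a placeholder choice.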
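Training a tagger on a large corpus may take a significant amount of time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use.
Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html.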
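As a rough illustration of that idea, a trained tagger can be written to disk with `pickle` and read back later. This is only a sketch: the toy training sentence, its tags, and the file name `toy_unigram.tagger` are assumptions made for the example.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Toy training data (hypothetical; use cess.tagged_sents() in practice).
train_sents = [[("hola", "INTJ"), ("mundo", "NOUN")]]
tagger = UnigramTagger(train_sents)

# Save the trained tagger to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_unigram.tagger")
with open(path, "wb") as outfile:
    pickle.dump(tagger, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# Later (for example in another session): load it without retraining.
with open(path, "rb") as infile:
    restored = pickle.load(infile)

restored_tags = restored.tag(["hola", "mundo"])
print(restored_tags)
```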
Given the tutorial in the previous answer, here is a more object-oriented approach from spaghetti tagger: https://github.com/alvations/spaghetti-tagger
# -*- coding: utf8 -*-
from pickle import dump, load

from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

def loadtagger(taggerfilename):
    # Load a pickled tagger from disk.
    with open(taggerfilename, 'rb') as infile:
        return load(infile)

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename, tagger):
        with open(tagfilename, 'wb') as outfile:
            dump(tagger, outfile, -1)
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print("Tagger trained with", corpusname,
          "using UnigramTagger and BigramTagger.")

# Function to unchunk corpus: split multi-word expressions
# (tokens joined with "_") back into separate words.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_", " ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self, mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print("*** First-time use of cess tagger ***")
            print("Training tagger ...")
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess', cess_sents)
            # Trains the tagger with no MWE.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe', tagged_cess_nomwe)
            print()
        # Load tagger.
        if self.mwe:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        else:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    # tag_sents replaced batch_tag in NLTK 3.
    return tagger.uni.tag_sents(sentences)

tagger = cesstag()
print(tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split()))
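To make the `mwe` switch above a little more concrete: the `unchunk` step appears to assume that the corpus writes multi-word expressions as single tokens joined with underscores, and splits them back into separate words before the tokens are re-tagged to train the "nomwe" taggers. Below is a small standalone sketch of that splitting step; the example sentence and its tags are invented for illustration.

```python
def unchunk(tagged_sents):
    """Return token lists with underscore-joined expressions split apart."""
    result = []
    for sent in tagged_sents:
        tokens = []
        for word, _tag in sent:
            # "a_pesar_de" -> ["a", "pesar", "de"]; the tag is dropped.
            tokens.extend(word.replace("_", " ").split())
        result.append(tokens)
    return result

# Hypothetical tagged sentence; the tags are illustrative only.
toy = [[("a_pesar_de", "PREP"), ("todo", "PRON"), (".", "PUNCT")]]
unchunked = unchunk(toy)
print(unchunked)
```

Because the original tags are discarded here, the split tokens need new tags, which is why the code above runs them through `batch_pos_tag` before training the second pair of taggers.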
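I ended up here while searching for POS taggers for languages other than English. Another option for your problem is the spaCy library, which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.
From the spaCy documentation: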
import es_core_news_sm
nlp = es_core_news_sm.load()
doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
print([(w.text, w.pos_) for w in doc])
Result:
[('El', 'DET'), ('copal', 'NOUN'), ('se', 'PRON'), ('usa', 'VERB'), ('principalmente', 'ADV') , ('para', 'ADP'), ('sahumar', 'VERB'), ('en', 'ADP'), ('distintas', 'DET'), ('ocasiones', 'NOUN') , ('como', 'SCONJ'), ('lo', 'PRON'), ('son', 'AUX'), ('las', 'DET'), ('fiestas', 'NOUN') , ('religiosas', 'ADJ'), ('.', 'PUNCT')]
And to visualize it in a notebook:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})