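I am learning natural language processing with NLTK. I came across code that uses PunktSentenceTokenizer, and I cannot understand what it is actually doing there. The code is: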
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A
tokenized = custom_sent_tokenizer.tokenize(sample_text) #B
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
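So, why do we use PunktSentenceTokenizer, and what is happening in the lines marked A and B? There is a training text and a sample text, but why are two data sets needed to get part-of-speech tags?
The lines marked A and B are the ones I cannot understand.
PS: I did try looking at the NLTK book, but could not work out what PunktSentenceTokenizer is actually for.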
alv*_*vas 26
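PunktSentenceTokenizer is the class behind NLTK's default sentence tokenizer, i.e. sent_tokenize(). It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
Given a paragraph with multiple sentences, e.g.: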
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
you can use sent_tokenize():
>>> from nltk import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models in NLTK are:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle finnish.pickle norwegian.pickle slovene.pickle
danish.pickle french.pickle polish.pickle spanish.pickle
dutch.pickle german.pickle portuguese.pickle swedish.pickle
english.pickle greek.pickle PY3 turkish.pickle
estonian.pickle italian.pickle README
Given text in another language, do this:
>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and "training data format for nltk punkt".
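As a rough illustration of what training your own model can look like, here is a minimal sketch using PunktTrainer from nltk.tokenize.punkt. The two training strings are made-up stand-ins for a real corpus, and the incremental train/finalize pattern is just one way to do it, not necessarily what the linked material recommends.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical stand-in for a real corpus. Punkt is unsupervised, so plain,
# unannotated text is all it needs.
training_chunks = [
    "The committee met on Monday. It approved the budget after a long debate.",
    "Dr. Meyer presented the results. Dr. Meyer also answered questions.",
]

trainer = PunktTrainer()
for chunk in training_chunks:
    # finalize=False keeps collecting statistics so more text can be added.
    trainer.train(chunk, finalize=False)
trainer.finalize_training()

# PunktSentenceTokenizer accepts learned parameters in place of raw text.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

text = "The vote was close. The motion passed anyway."
sentences = tokenizer.tokenize(text)
print(sentences)
```

With a corpus this small the learned statistics are not meaningful; in practice you would feed the trainer a large amount of text from the domain you want to split.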
Cen*_*tAu 15
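PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.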
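If you initialize the tokenizer without any arguments, it starts from default, untrained parameters (the pre-trained English model is what sent_tokenize() loads), but it still splits plain text on sentence-final punctuation: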
In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
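You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, which means you can train it on ordinary, unannotated text.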
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
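To get a feel for the kind of knowledge training gives the tokenizer, here is a small sketch that skips training and fills in a PunktParameters object by hand. The abbreviation "mr" and the sample sentence are made up for illustration; a trained tokenizer would derive such entries from the training text instead.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = "Mr. Smith went home. He slept."

# No training text and no parameters: the tokenizer knows no abbreviations.
plain = PunktSentenceTokenizer()

# Hand-written parameters standing in for what training would learn.
# Abbreviations are stored lowercased and without the trailing period.
params = PunktParameters()
params.abbrev_types.add("mr")
informed = PunktSentenceTokenizer(params)

plain_sentences = plain.tokenize(text)
informed_sentences = informed.tokenize(text)
print(plain_sentences)
print(informed_sentences)
```

The first tokenizer treats the period after "Mr" as a sentence end, while the second keeps "Mr. Smith went home." together.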
For most cases, the pre-trained version is perfectly fine, so you can simply call nltk.sent_tokenize() instead of training a tokenizer yourself.
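So what does all this have to do with POS tagging? The NLTK POS tagger works on tokenized sentences, so you need to break your text into sentences and then into word tokens before you can do POS tagging.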
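The two preprocessing steps can be sketched as follows. This is only an illustration: the helper name sentences_to_word_tokens is made up, and TreebankWordTokenizer is swapped in for nltk.word_tokenize() so that the snippet runs without any downloaded NLTK data.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer


def sentences_to_word_tokens(text, sent_tokenizer):
    """Split text into sentences, then split each sentence into word tokens."""
    word_tokenizer = TreebankWordTokenizer()
    return [word_tokenizer.tokenize(sent) for sent in sent_tokenizer.tokenize(text)]


sample = "This is one sentence. This is another sentence."
token_lists = sentences_to_word_tokens(sample, PunktSentenceTokenizer())
print(token_lists)

# Each inner list is the kind of input nltk.pos_tag() expects; calling it
# additionally requires the tagger model to be downloaded.
```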
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"