Abd*_*res 3 python nlp tokenize nltk text-segmentation
我有来自维基百科的这段文字:
神父提出了一项雄心勃勃的校园扩建计划。弗农·加拉格尔(Vernon F. Gallagher),1952年成立。首个学生宿舍假设大厅于1954年开放,罗克韦尔大厅(Rockwell Hall)于1958年11月投入使用,设有商学院和法学院。正是在F. Henry J. McAnulty任职期间。加拉格尔的雄心勃勃的计划付诸行动。
我正在使用NLTK nltk.sent_tokenize来获取句子。返回:
['An ambitious campus expansion plan was proposed by Fr.',
'Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
'It was during the tenure of Fr.',
'Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.'
]
Run Code Online (Sandbox Code Playgroud)
虽然NTLK可以处理F.亨利·J·麦卡纳尔蒂作为一个实体,它失败神父 弗农F.加拉格尔。这把句子分成两部分。
正确的令牌化应为:
[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
'It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.'
]
Run Code Online (Sandbox Code Playgroud)
如何改善令牌生成器的性能?
Kiss and Strunk(2006)Punkt算法的出色之处在于它是不受监督的。因此,给定新文本后,您应该重新训练模型并将模型应用于文本,例如
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}
# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
Run Code Online (Sandbox Code Playgroud)
如果在重新训练模型时没有足够的数据来生成良好的统计信息,则还可以在训练之前输入预先确定的缩写列表;请参阅如何避免NLTK的句子标记词在缩写词上分裂?
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'k']
>>> punkt_param.abbrev_types = set(abbreviation)
>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
1370 次 |
| 最近记录: |