Python: Tokenizing with phrases

Question

yav*_*voh 7 python nlp tokenize nltk

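I have blocks of text that I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want tokenized as a single token instead of going through the regular tokenization.

For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase "the west wing" to the tokenizer, the resulting tokens would be:

the west wing
is
an
american
...

What's the best way to accomplish this? I'd prefer to stay within the bounds of a tool like NLTK.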

Answer 1

小智 8

You can use NLTK's multi-word expression tokenizer, MWETokenizer. It takes a list of already-split tokens and merges the registered expressions into single tokens:

from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer()
tokenizer.add_mwe(('the', 'west', 'wing'))
tokenizer.tokenize('Something about the west wing'.split())

You will get:

['Something', 'about', 'the_west_wing']
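
Matching in MWETokenizer is case-sensitive, and the merged token is joined with an underscore by default. As a rough sketch of how the sentence from the question might be handled, the variant below lowercases the input before matching and passes separator=" " so the phrase keeps its spaces. The sample sentence is shortened, and splitting on whitespace is only a stand-in for whatever word tokenizer you actually use.

```python
from nltk.tokenize import MWETokenizer

# Register the phrase as a tuple of lowercase words; join matches with a space.
tokenizer = MWETokenizer([("the", "west", "wing")], separator=" ")

# Lowercase first because MWETokenizer compares tokens case-sensitively.
# str.split() is a simplification; any word tokenizer could be used here.
words = "The West Wing is an American television serial drama".lower().split()

tokens = tokenizer.tokenize(words)
print(tokens)
# ['the west wing', 'is', 'an', 'american', 'television', 'serial', 'drama']
```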

Answer 2

Fre*_*Foo 3

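If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search and replace before tokenizing that turns The West Wing into The_West_Wing.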
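
The search-and-replace variant could look roughly like the sketch below, which uses only the standard re module. The phrase list, the join_phrases helper, and the simple word/punctuation pattern used for the final split are all illustrative assumptions, not part of NLTK.

```python
import re

# Hypothetical phrase list; in practice this would come from your own data.
PHRASES = ["The West Wing", "Aaron Sorkin"]

def join_phrases(text, phrases, sep="_"):
    """Replace each known phrase with a single token joined by `sep`."""
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Inside each match, collapse runs of whitespace into the separator.
    return pattern.sub(lambda m: re.sub(r"\s+", sep, m.group(0)), text)

sentence = "The West Wing was created by Aaron Sorkin."
joined = join_phrases(sentence, PHRASES)

# Simplified tokenizer: runs of word characters (underscore included) or single
# punctuation marks. A real tokenizer could be substituted at this step.
tokens = re.findall(r"\w+|[^\w\s]", joined)
print(tokens)
# ['The_West_Wing', 'was', 'created', 'by', 'Aaron_Sorkin', '.']
```

Because the underscore counts as a word character, the joined phrase survives the later tokenization step as one token. The trade-off is that any text which already contains underscores becomes indistinguishable from a joined phrase.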

For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
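
One way regexp_tokenize might be used for this, as a sketch: build a pattern whose first alternative is the list of phrases, so a phrase is tried before the generic word and punctuation alternatives at each position. The phrase list, the fallback alternatives, and the shortened sentence are assumptions made for illustration. The groups are non-capturing so that each match comes back as a whole token.

```python
import re
from nltk.tokenize import regexp_tokenize

# Hypothetical phrase list; longest first so longer phrases take priority.
phrases = ["the west wing"]
alternatives = "|".join(
    re.escape(p) for p in sorted(phrases, key=len, reverse=True)
)

# Phrase alternative first, then plain words, then single punctuation marks.
# (?i) makes the whole pattern case-insensitive.
pattern = rf"(?i)\b(?:{alternatives})\b|\w+|[^\w\s]"

text = "The West Wing is an American television serial drama."
tokens = regexp_tokenize(text, pattern)
print(tokens)
# ['The West Wing', 'is', 'an', 'American', 'television', 'serial', 'drama', '.']
```

Unlike the underscore approach, this keeps the phrase exactly as it appears in the text, spaces and capitalization included.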
