小智 8
You can use NLTK's multi-word expression tokenizer, MWETokenizer:
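from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
# Register the phrase as a tuple of its individual tokens
tokenizer.add_mwe(('the', 'west', 'wing'))
# tokenize() expects a list of tokens, not a raw string
tokenizer.tokenize('Something about the west wing'.split())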
You will get:
['Something', 'about', 'the_west_wing']
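If you are looking for a fixed set of phrases, a simple solution is to tokenize your input and then "reassemble" the multi-word tokens. Alternatively, run a regular-expression search and replace before tokenizing, turning The West Wing into The_West_Wing.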
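The search-and-replace idea could be sketched roughly as follows, using only the standard `re` module. The helper name `join_phrases`, the `PHRASES` list, and the choice to match case-insensitively are assumptions made for illustration, not part of NLTK:

```python
import re

# Hypothetical fixed phrase list (an assumption for this example).
PHRASES = ["The West Wing"]

def join_phrases(text, phrases=PHRASES, separator="_"):
    """Replace each known phrase with a single separator-joined token."""
    # Try longer phrases first so they win over shorter overlapping ones.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,  # assumption: casing should not matter
    )
    # Keep the matched text's own casing; only swap whitespace for the separator.
    return pattern.sub(lambda m: re.sub(r"\s+", separator, m.group(0)), text)

tokens = join_phrases("Something about The West Wing").split()
print(tokens)  # ['Something', 'about', 'The_West_Wing']
```

After the replacement, any whitespace-based or NLTK tokenizer that treats the underscore as a word character should keep the phrase as one token.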
For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
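As a minimal sketch of the regexp_tokenize route, one option is to list the phrase as the first alternative in the pattern so it is tried before the single-word fallback. The pattern below is only an illustration and assumes lowercase input:

```python
from nltk.tokenize import regexp_tokenize

# The phrase alternative comes first, so it is preferred over a plain word match.
pattern = r"the west wing|\w+"
tokens = regexp_tokenize("Something about the west wing", pattern)
print(tokens)  # ['Something', 'about', 'the west wing']
```

Unlike MWETokenizer, this keeps the phrase with its original spaces instead of joining it with underscores.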