Finding multi-word terms in a tokenized text in Python

Question

I have a text that I have tokenized; more generally, a plain list of words would do as well. For example:

    >>> from nltk.tokenize import word_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''
    >>> word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

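If I have a Python dict whose keys are both single words and multi-word terms, how can I efficiently and correctly check for their presence in the text? The ideal output would be key:location_in_text pairs, or something similar. Thanks in advance!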

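P.S. To explain "correctly": if my dict contains "lease", I do not want "Please" to be marked as a match. Plurals also need to be recognized. I am wondering whether this can be solved elegantly, without a lot of if-else clauses.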

Answer 1

alv*_*vas 5

If you already have a gazetteer list of multi-word expressions, you can use MWETokenizer, e.g.:

    >>> from nltk.tokenize import MWETokenizer
    >>> from nltk import sent_tokenize, word_tokenize

    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''

    >>> # Each multi-word expression is a tuple of tokens; matched runs are joined with '_'.
    >>> mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')

    >>> [mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]
    [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
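MWETokenizer merges the expressions into single tokens but does not report where they occur. If you want the key:location pairs from the question, one option is to scan the token list directly. The following is only a rough sketch in plain Python, not part of NLTK: `find_terms` and `normalize` are hypothetical helper names, and the plural handling is a naive "strip a trailing s" rule that I am assuming is good enough for illustration. Because whole tokens are compared, "lease" never matches inside "Please".

```python
from collections import defaultdict


def normalize(token):
    """Lowercase a token and strip a trailing plural 's' (naive, illustrative rule)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def find_terms(tokens, terms):
    """Map each term to the token indices where it starts in `tokens`.

    `terms` is an iterable of strings; multi-word terms are space-separated.
    """
    # Index terms by their normalized first word so that each position
    # only checks the terms that could start there.
    by_first = defaultdict(list)
    for term in terms:
        words = tuple(normalize(w) for w in term.split())
        if words:
            by_first[words[0]].append((term, words))

    norm = [normalize(t) for t in tokens]
    found = defaultdict(list)
    for i, tok in enumerate(norm):
        for term, words in by_first.get(tok, ()):
            # Compare whole tokens, so substrings of a token never match.
            if tuple(norm[i:i + len(words)]) == words:
                found[term].append(i)
    return dict(found)


tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
          'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
print(find_terms(tokens, ['New York', 'muffin', 'lease', 'Hong Kong']))
# {'muffin': [1], 'New York': [6]}
```

The locations are indices into the token list, not character offsets. For real plural handling you would probably want to swap `normalize` for a proper lemmatizer (for example NLTK's WordNetLemmatizer) rather than rely on the suffix rule above.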
