Python nltk stemmers never remove prefixes

Jon*_*mon 3 python nlp stemming porter-stemmer nltk

I'm trying to preprocess words to remove common prefixes such as "un" and "re", but all of nltk's common stemmers seem to completely ignore prefixes:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

PorterStemmer().stem('unhappy')
# u'unhappi'

SnowballStemmer('english').stem('unhappy')
# u'unhappi'

LancasterStemmer().stem('unhappy')
# 'unhappy'

PorterStemmer().stem('reactivate')
# u'reactiv'

SnowballStemmer('english').stem('reactivate')
# u'reactiv'

LancasterStemmer().stem('reactivate')
# 'react'

Isn't removing common prefixes and suffixes part of a stemmer's job? Is there another stemmer that can do this reliably?

alv*_*vas 5

You're right. Most stemmers only stem suffixes. In fact, the original paper by Martin Porter is titled:

Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.

Possibly the only stemmers in NLTK that perform prefix stemming are the Arabic stemmers.

If we take a look at their prefix_replace function, it simply removes the old prefix and substitutes it with the new one:

def prefix_replace(original, old, new):
    """
    Replaces the old prefix of the original string with a new prefix.

    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]
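For instance, reproducing the function above: swapping a known prefix for the empty string effectively strips it, and the function does no validation of its own.

```python
def prefix_replace(original, old, new):
    # Replace the old prefix of the original string with a new prefix.
    return new + original[len(old):]

print(prefix_replace("unhappy", "un", ""))    # happy
print(prefix_replace("illegal", "il", "in"))  # inlegal -- no sanity check on the result
```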

But we can do better!

First, do you have a fixed list of prefixes and substitutions for the language you need to process?

Let's go with the (unfortunately) de facto language, English, and do some linguistic work to find the common prefixes in English:

https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes

Without much work, you can write a prefix-stemming function that runs before NLTK's suffix stemming, e.g.

import re
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",     # e.g. anti-government, anti-racist, anti-war
    "auto": "",     # e.g. autobiography, automobile
    "de": "",       # e.g. de-classify, decontaminate, demotivate
    "dis": "",      # e.g. disagree, displeasure, disqualify
    "down": "",     # e.g. downgrade, downhearted
    "extra": "",    # e.g. extraordinary, extraterrestrial
    "hyper": "",    # e.g. hyperactive, hypertension
    "il": "",       # e.g. illegal
    "im": "",       # e.g. impossible
    "in": "",       # e.g. insecure
    "ir": "",       # e.g. irregular
    "inter": "",    # e.g. interactive, international
    "mega": "",     # e.g. megabyte, mega-deal, megaton
    "mid": "",      # e.g. midday, midnight, mid-October
    "mis": "",      # e.g. misaligned, mislead, misspelt
    "non": "",      # e.g. non-payment, non-smoking
    "over": "",     # e.g. overcook, overcharge, overrate
    "out": "",      # e.g. outdo, out-perform, outrun
    "post": "",     # e.g. post-election, post-war
    "pre": "",      # e.g. prehistoric, pre-war
    "pro": "",      # e.g. pro-communist, pro-democracy
    "re": "",       # e.g. reconsider, redo, rewrite
    "semi": "",     # e.g. semicircle, semi-retired
    "sub": "",      # e.g. submarine, sub-Saharan
    "super": "",    # e.g. super-hero, supermodel
    "tele": "",     # e.g. television, telepathic
    "trans": "",    # e.g. transatlantic, transfer
    "ultra": "",    # e.g. ultra-compact, ultrasound
    "un": "",       # e.g. unhappy, unemployment
    "up": "",       # e.g. upgrade, uphill
}

porter = PorterStemmer()

def stem_prefix(word, prefixes):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor the pattern at the start of the word and
        # allow a dash between prefix and root.
        word, nsub = re.subn(r"^{}-?".format(prefix), "", word)
        if nsub > 0:
            return word
    return word  # no prefix matched; return the word unchanged

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))


word = "extraordinary"
porter_english_plus(word)
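A side note on why stem_prefix sorts the prefixes from longest to shortest: a short prefix like "in" would otherwise clip a longer one like "inter". A standalone sketch of the same loop, with a two-entry prefix set just for illustration:

```python
import re

def stem_prefix(word, prefixes):
    # Try longer prefixes first so "inter" wins over "in".
    for prefix in sorted(prefixes, key=len, reverse=True):
        word, nsub = re.subn(r"^{}-?".format(prefix), "", word)
        if nsub > 0:
            return word
    return word

print(stem_prefix("interactive", {"in", "inter"}))  # active, not teractive
print(stem_prefix("insecure", {"in", "inter"}))     # secure
```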

Now that we have a simplistic prefix stemmer, can we do better?

# E.g. this is not satisfactory:
>>> porter_english_plus("united")
"ited"

What if we check whether the prefix-stemmed word appears in a word list before accepting it?

import re

from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",     # e.g. anti-government, anti-racist, anti-war
    "auto": "",     # e.g. autobiography, automobile
    "de": "",       # e.g. de-classify, decontaminate, demotivate
    "dis": "",      # e.g. disagree, displeasure, disqualify
    "down": "",     # e.g. downgrade, downhearted
    "extra": "",    # e.g. extraordinary, extraterrestrial
    "hyper": "",    # e.g. hyperactive, hypertension
    "il": "",       # e.g. illegal
    "im": "",       # e.g. impossible
    "in": "",       # e.g. insecure
    "ir": "",       # e.g. irregular
    "inter": "",    # e.g. interactive, international
    "mega": "",     # e.g. megabyte, mega-deal, megaton
    "mid": "",      # e.g. midday, midnight, mid-October
    "mis": "",      # e.g. misaligned, mislead, misspelt
    "non": "",      # e.g. non-payment, non-smoking
    "over": "",     # e.g. overcook, overcharge, overrate
    "out": "",      # e.g. outdo, out-perform, outrun
    "post": "",     # e.g. post-election, post-war
    "pre": "",      # e.g. prehistoric, pre-war
    "pro": "",      # e.g. pro-communist, pro-democracy
    "re": "",       # e.g. reconsider, redo, rewrite
    "semi": "",     # e.g. semicircle, semi-retired
    "sub": "",      # e.g. submarine, sub-Saharan
    "super": "",    # e.g. super-hero, supermodel
    "tele": "",     # e.g. television, telepathic
    "trans": "",    # e.g. transatlantic, transfer
    "ultra": "",    # e.g. ultra-compact, ultrasound
    "un": "",       # e.g. unhappy, unemployment
    "up": "",       # e.g. upgrade, uphill
}

porter = PorterStemmer()

# Use a set for fast membership checks.
whitelist = set(wn.words()) | set(words.words())

def stem_prefix(word, prefixes, roots):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor the pattern at the start of the word and
        # allow a dash between prefix and root.
        candidate, nsub = re.subn(r"^{}-?".format(prefix), "", word)
        # Only accept the stripped form if it is a known root.
        if nsub > 0 and candidate in roots:
            return candidate
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))

This resolves the issue of stripping a prefix and producing a senseless root, e.g.

>>> stem_prefix("united", english_prefixes, whitelist)
"united"

But the Porter stemmer would still remove the suffix -ed, which may or may not be the desired output, especially when the goal is to retain linguistically sound units in the data:

>>> porter_english_plus("united")
"unit"

So, depending on the task, it is sometimes more beneficial to use a lemmatizer rather than a stemmer.

See also: