如何使用pandas/sklearn删除停止短语/停止ngram(多字符串)？

Question

如何使用pandas/sklearn删除停止短语/停止ngram(多字符串)？

Mel*_*uce 2 python nlp pandas scikit-learn

我想阻止某些短语进入我的模型.例如,我想防止"红玫瑰"进入我的分析.我理解如何通过这样做添加单词添加到scikit-learn的CountVectorizer的停止列表中给出的单个停用词:

from sklearn.feature_extraction import text
additional_stop_words=['red','roses']

Run Code Online (Sandbox Code Playgroud)

然而,这也导致其他ngrams,如"红色郁金香"或"蓝玫瑰"未被检测到.

我正在构建一个TfidfVectorizer作为我的模型的一部分,我意识到我需要的处理可能必须在这个阶段之后输入,但我不知道如何做到这一点.

我最终的目标是在一段文字上进行主题建模.以下是我正在处理的代码片段(几乎直接来自https://de.dariah.eu/tatom/topic_model_python.html#index-0):

from sklearn import decomposition

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_topics=num_topics,
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))

Run Code Online (Sandbox Code Playgroud)

编辑

示例数据帧(我试图插入尽可能多的边缘情况),df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.

Run Code Online (Sandbox Code Playgroud)

Answer 1

and*_*ece 6

TfidfVectorizer允许自定义预处理器.您可以使用它来进行任何所需的调整.

例如,要从示例语料库中删除所有出现的连续"红色"+"玫瑰"标记(不区分大小写),请使用:

import re
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses..." as fair game for removal.
    #       if that's not what you want, just use ["red roses"] instead.
    stop_phrases= ["red(\s?\\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())

Run Code Online (Sandbox Code Playgroud)

现在vocab已red roses删除所有引用.

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']

Run Code Online (Sandbox Code Playgroud)

更新(每条评论帖):

要将所需的停止短语与自定义停用词一起传递给包装函数,请使用:

desired_stop_phrases = ["red(\s?\\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)

Run Code Online (Sandbox Code Playgroud)

归档时间：	8 年，3 月前
查看次数：	1767 次
最近记录：	8 年，3 月前