Keras: Text Preprocessing (Stop Word Removal, etc.)

Question

I'm using Keras for a multi-label classification task (Toxic Comment Classification on Kaggle).

I'm using the Tokenizer class for some preprocessing, as follows:

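from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Cap the vocabulary at roughly the 10,000 most frequent words
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_sentences)
train_sentences_tokenized = tokenizer.texts_to_sequences(train_sentences)
max_len = 250
X_train = pad_sequences(train_sentences_tokenized, maxlen=max_len)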

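That's a good start, but I haven't removed stop words, applied stemming, and so on. For stop word removal, I do the following before the code above: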

from nltk.corpus import stopwords

def filter_stop_words(train_sentences, stop_words):
    # Rewrites each sentence in place, dropping whitespace-separated stop words
    for i, sentence in enumerate(train_sentences):
        new_sent = [word for word in sentence.split() if word not in stop_words]
        train_sentences[i] = ' '.join(new_sent)
    return train_sentences

stop_words = set(stopwords.words("english"))
train_sentences = filter_stop_words(train_sentences, stop_words)

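Shouldn't there be a simpler way to do this in Keras? I was hoping for stemming support as well, but the documentation doesn't indicate that there is any: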

https://keras.io/preprocessing/text/

Any help with best practices for stop word removal and stemming would be great!

Thanks!

Answer 1

nur*_*ric 5

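No, Keras is not a natural language processing library, so you have to handle any non-trivial processing yourself. At this stage it probably makes sense to use an actual NLP library with a Python interface, such as NLTK or spaCy. Tokenizer is a small utility class for basic natural language tasks; you can extend it yourself up to a point, but an NLP library offers far more functionality, including tokenization, part-of-speech tagging, and stemming.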
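
As a rough illustration of doing the cleanup outside Keras, here is a minimal sketch of a preprocessing step that runs before Tokenizer. The function name `preprocess_texts`, the regex tokenization, and the tiny demo stop list are all assumptions made for this example, not part of Keras or NLTK; the `stem` argument is any callable that maps a word to its stem.

```python
import re


def preprocess_texts(texts, stop_words, stem=None):
    """Lowercase, tokenize, drop stop words, and optionally stem each text.

    Returns new strings (the input list is not modified), so the result
    can be passed to Tokenizer.fit_on_texts / texts_to_sequences.
    """
    cleaned = []
    for text in texts:
        # Simple regex tokenization: an assumption for this sketch; an NLP
        # library's tokenizer would handle punctuation more carefully.
        words = re.findall(r"[a-z0-9']+", text.lower())
        kept = [w for w in words if w not in stop_words]
        if stem is not None:
            kept = [stem(w) for w in kept]
        cleaned.append(" ".join(kept))
    return cleaned


# Tiny hand-written stop list, for illustration only.
demo_stop_words = {"the", "is", "a", "and", "this"}
demo = preprocess_texts(
    ["This is a Toxic comment", "The cat and the dog"], demo_stop_words
)
print(demo)
```

With NLTK installed, one would probably pass `set(stopwords.words("english"))` as `stop_words` and `PorterStemmer().stem` (from `nltk.stem`) as `stem`, then feed the cleaned strings to the Tokenizer as before. Lowercasing before the stop word check matters because NLTK's English stop list is lowercase.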

Archived:	7 years, 5 months ago
Views:	10,043
Last activity:	4 years, 7 months ago