Tags: text-classification, lstm, keras, tensorflow, word-embedding
Previous implementation: I applied Elasticsearch, but the accuracy was very low because users may type arbitrary text, e.g. "I need" vs. "want".
Dataset information: each row of the dataset contains a text (or paragraph) and a label (the page number). The dataset is small; I only have 500 rows.
I then applied NLP, as shown in the code below.
Result: accuracy on the test (or validation) data is 23%, but on the training data it is 91%.
import os
import time
from time import strftime
import numpy as np
from keras.callbacks import CSVLogger, ModelCheckpoint
from keras.layers import Dense, Input, LSTM, ActivityRegularization
from keras.layers import Embedding, Dropout,Bidirectional
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.regularizers import l2
from keras.utils import to_categorical
import pickle
from DataGenerator import *
BASE_DIR = ''
GLOVE_DIR = 'D:/Dataset/glove.6B' # BASE_DIR + '/glove.6B/'
MAX_SEQUENCE_LENGTH = 50
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.2
# first, build index mapping words in the embeddings set
# to their embedding vector
np.random.seed(1337) # for reproducibility
print('Indexing word vectors.')
t_start = time.time()
embeddings_index = {}
if os.path.exists('pickle/glove.pickle'):
    print('Pickle found..')
    with open('pickle/glove.pickle', 'rb') as handle:
        embeddings_index = pickle.load(handle)
else:
    print('Pickle not found...')
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    with open('pickle/glove.pickle', 'wb') as handle:
        pickle.dump(embeddings_index, handle, protocol=pickle.HIGHEST_PROTOCOL)
print('Found %s word vectors.' % len(embeddings_index))
# second, prepare text samples and their labels
print('Processing text dataset')
texts = [] # list of text samples
labels = [] # list of label ids
labels_index = {} # dictionary mapping label name to numeric id
(texts, labels, labels_index) = get_data('D:/PolicyDocument/')
print('Found %s texts.' % len(texts))
# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))
print('Preparing embedding matrix. :', embedding_matrix.shape)
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all-zeros
        embedding_matrix[i] = embedding_vector
# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            mask_zero=True,
                            trainable=False)
print('Training model.')
csv_file = "logs/training_log_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".csv"
model_file = "models/Model_" + strftime("%Y-%m-%d %H-%M", time.localtime()) + ".mdl"
print("Model file:" + model_file)
csv_logger = CSVLogger(csv_file)
# train a two-layer LSTM model on top of the frozen embeddings
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
rate_drop_lstm = 0.15 + np.random.rand() * 0.25
num_lstm = np.random.randint(175, 275)
rate_drop_dense = 0.15 + np.random.rand() * 0.25
x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001))(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64)(x)
x = Dropout(0.25)(x)
x = ActivityRegularization(l1=0.01, l2=0.001)(x)
preds = Dense(len(labels_index), activation='softmax')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model_checkpoint = ModelCheckpoint(model_file, monitor='val_loss', verbose=0, save_best_only=True,
                                   save_weights_only=False, mode='auto')
model.fit(x_train, y_train,
          batch_size=1,
          nb_epoch=600,
          validation_data=(x_val, y_val), callbacks=[csv_logger, model_checkpoint])
score = model.evaluate(x_val, y_val, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
t_end = time.time()
total = t_end - t_start
ret_str = "Time needed(s): " + str(total)
print(ret_str)
Dropout and BN are very effective for feed-forward NNs. However, they can cause problems for RNNs (there are many papers on this topic).
The best way to make your RNN model generalize better is to increase the dataset size. In your case (an LSTM with about 200 cells), you would probably want around 100,000 or more labeled samples for training.
Besides simply reducing parameters such as the embedding size and the number of units in some layers, you can also tune the recurrent dropout in the LSTMs.
LSTMs seem to overfit quite easily (or so I have read).
Then, as you can see in the Keras documentation, you can pass dropout and recurrent_dropout as arguments to each LSTM layer.
An example with arbitrary numbers:
x = LSTM(num_lstm, return_sequences=True, W_regularizer=l2(0.001), recurrent_dropout=0.4)(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64, dropout=0.5, recurrent_dropout=0.3)(x)
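Side note: the snippet above mixes the old Keras 1 argument name W_regularizer with the Keras 2 arguments dropout/recurrent_dropout. If you are on Keras 2, a sketch of the same block with the renamed argument (kernel_regularizer) would look roughly like this, reusing num_lstm and embedded_sequences from the question; the rates are still arbitrary:

from keras.layers import LSTM, Dropout
from keras.regularizers import l2

# Keras 2 spelling of the same stacked-LSTM block; all rates are arbitrary examples
x = LSTM(num_lstm, return_sequences=True,
         kernel_regularizer=l2(0.001),
         dropout=0.4, recurrent_dropout=0.4)(embedded_sequences)
x = Dropout(0.5)(x)
x = LSTM(64, dropout=0.5, recurrent_dropout=0.3)(x)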
Other reasons might be a bug or insufficient data:
Have you tried mixing the test and validation data together and creating new train and validation sets?
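For example, a minimal re-splitting sketch, assuming scikit-learn is installed and reusing the data and labels arrays built in the question:

from sklearn.model_selection import train_test_split

# fresh shuffled 80/20 split; the seed is arbitrary, stratify keeps class ratios similar
x_train, x_val, y_train, y_val = train_test_split(
    data, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels.argmax(axis=1))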
How many sentences do you have in the training data? Are you training on a small subset? Use the whole set, or try data augmentation (creating new sentences with their classifications, though this can be very tricky with text).
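If you want to try augmentation, here is a very rough sketch for illustration only: it swaps a couple of words per sentence for WordNet synonyms and keeps the original label. It assumes NLTK with the wordnet corpus downloaded, and the helper name augment_sentence is made up here; the quality of such synthetic sentences needs manual checking.

import random
from nltk.corpus import wordnet  # requires nltk.download('wordnet') beforehand

def augment_sentence(sentence, n_replacements=2):
    # return a copy of the sentence with up to n_replacements words
    # swapped for a random WordNet synonym (the label stays the same)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n_replacements]:
        lemmas = wordnet.synsets(words[i])[0].lemma_names()
        synonyms = [l.replace('_', ' ') for l in lemmas if l.lower() != words[i].lower()]
        if synonyms:
            words[i] = random.choice(synonyms)
    return ' '.join(words)

# e.g. add one augmented copy of every training text with its original label
augmented_texts = [augment_sentence(t) for t in texts]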