Creating a model in Keras using one-hot encoding

Tim*_*jan 6 artificial-intelligence python-3.x keras tensorflow keras-2

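I am working on a sentence classification problem and trying to solve it with Keras. The vocabulary contains 36 unique words.

In this case, the full vocabulary is [W1, W2, W3, ..., W36].

So if I have a sentence with the words [W1 W2 W6 W7 W9] and I one-hot encode it, I get a numpy array like the one below: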

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Its shape is (5, 36).
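A minimal NumPy sketch of this encoding, assuming each word W_k maps to column k-1. The question does not state its mapping, and the array printed above evidently uses a different one. The helper name `one_hot_sentence` is made up for illustration.

```python
import numpy as np

VOCAB = [f"W{k}" for k in range(1, 37)]            # W1 .. W36
WORD_TO_COL = {w: i for i, w in enumerate(VOCAB)}  # assumed mapping: W_k -> column k-1

def one_hot_sentence(words):
    """Return an (N, 36) array with one row per word and a single 1 per row."""
    encoded = np.zeros((len(words), len(VOCAB)), dtype=int)
    for row, word in enumerate(words):
        encoded[row, WORD_TO_COL[word]] = 1
    return encoded

sentence = ["W1", "W2", "W6", "W7", "W9"]
x = one_hot_sentence(sentence)
print(x.shape)  # (5, 36)
```

The first dimension follows the sentence length, which is why sentences of different lengths produce arrays of different shapes.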

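I am stuck here. I have generated 20,000 numpy arrays of varying shape (N, 36), where N is the number of words in the sentence. So I have 20,000 sentences for training and 100 for testing, and every sentence is labeled with a (1, 36) one-hot encoding.

I have x_train, x_test, y_train, and y_test.

x_test and y_test have dimension (1, 36).

Can anyone suggest how I should proceed?

I have written some code, shown below: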

model = Sequential()
model.add(Dense(512, input_shape=(??????)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

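Any help would be greatly appreciated.

Update, and a response to @putonspectacles

Thank you very much for taking the time and effort to write such a detailed reply. I made a few small changes to your code that I believe are needed for it to run. Please find it below: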

num_classes = 5 
max_words = 20
sentences = ["The cat is in the house","The green boy","computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
#-Changed the below line the inner for loop sent to sent.split()  
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
vocab_size = len(words)
print(vocab_size)
#-changed the below line - the outer for loop sentences to sent_ints
X = np.array([to_categorical(pad_sequences((sent,), max_words),vocab_size+1)  for sent in sent_ints])
print(X)
print(y)
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
      optimizer='adam',
      metrics=['accuracy'])
model.fit(X,y)

Without these changes the code does not run. When I run the code above, I get the correct embeddings, as shown below:

[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]


[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]]



[[0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]]

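But I get the error "Error when checking input: expected dense_44_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)".

When I change the input shape to model.add(Dense(512, input_shape=(None, max_words, vocab_size + 1)))

I get the error "Input 0 is incompatible with layer lstm_27: expected ndim=3, found ndim=4".

I am struggling to solve this. It would be great if you could point me in the right direction.

I have accepted the answer because it addresses the goal of embedding the words. Thanks again.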

ors*_*ady 3

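Cool, you've narrowed the problem down: you want to classify a sentence. I assume you mean that you want to do better than a bag-of-words encoding, that is, you want word order to matter.

So we will pick a new model: an RNN (the LSTM variant). This kind of model effectively weighs the importance of each word (in order) as it builds up a representation of the sentence that best suits the task.

But we have to handle preprocessing differently. For efficiency (so that we can process sentences in batches rather than one at a time), we want every sentence to have the same number of words. So we pick max_words, say 20, pad shorter sentences up to that length, and truncate sentences longer than 20 words.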
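The padding rule can be sketched in plain Python as follows. `pad_or_truncate` is a hypothetical helper, not a Keras function. It pads and truncates at the front, which is my understanding of the default behaviour of Keras `pad_sequences` and matches the leading padding rows in the output printed in the question.

```python
def pad_or_truncate(seq, max_words, pad_value=0):
    """Left-pad short sequences with pad_value; keep the last max_words items of long ones.

    Assumes max_words is a positive integer.
    """
    if len(seq) >= max_words:
        return list(seq[-max_words:])
    return [pad_value] * (max_words - len(seq)) + list(seq)

short = pad_or_truncate([4, 2, 7], max_words=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_words=5)
print(short)  # [0, 0, 4, 2, 7]
print(long_)  # [3, 4, 5, 6, 7]
```

Using 0 as the pad value is the reason the word integers start at 1 rather than 0.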

Keras will help here. We will encode each word as an integer.

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

num_classes = 5 
max_words = 20
sentences = ["The cat is in the house",
             "The green boy",
             "computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)

words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)

So "The green boy" might now be [1, 3, 5]. Then we pad and one-hot encode:

# Pad to max_words and one-hot encode with vocab_size + 1 classes;
# the + 1 reserves index 0 as the padding sentinel.
# Pad all sentences in one call so X has no extra length-1 axis.
X = to_categorical(pad_sequences(sent_ints, maxlen=max_words),
                   num_classes=vocab_size + 1)
print(X.shape) # (3, 20, 16)
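The 4-D shape (3, 1, 20, 16) reported in the question appears when each sentence is padded on its own as a batch of one, which leaves an extra axis of length 1. Here is a NumPy-only sketch of the two shapes. The integer sequences are made up, `pad_batch` is a hypothetical stand-in for `pad_sequences`, and indexing into `np.eye` stands in for `to_categorical`.

```python
import numpy as np

max_words, vocab_size = 20, 15
sent_ints = [[1, 2, 3, 4, 5, 6], [1, 7, 8], [9, 10, 11, 12, 13, 14, 5, 15, 11]]

def pad_batch(seqs, maxlen):
    """Left-pad each integer sequence with 0 and stack into (len(seqs), maxlen)."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, seq in enumerate(seqs):
        seq = seq[-maxlen:]               # keep the last maxlen items
        out[i, maxlen - len(seq):] = seq  # right-align, zeros stay on the left
    return out

one_hot = np.eye(vocab_size + 1)  # row k is the one-hot vector for integer k

# Padding each sentence as its own batch of one keeps a length-1 axis.
X_bad = np.array([one_hot[pad_batch([s], max_words)] for s in sent_ints])
# Padding all sentences together gives the 3-D array the model expects.
X_good = one_hot[pad_batch(sent_ints, max_words)]

print(X_bad.shape)   # (3, 1, 20, 16)
print(X_good.shape)  # (3, 20, 16)
```

In Keras terms this would presumably correspond to calling `pad_sequences` once on the whole list of sentences, or to dropping the extra axis afterwards, for example with `np.squeeze(X, axis=1)`.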

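Now on to the model: we add a Dense layer to turn those one-hot words into dense vectors. Then we use an LSTM to turn the word vectors of the sentence into a dense sentence vector. Finally, we use a softmax activation to produce a probability distribution over the classes.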

model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])
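To see why a Dense layer turns one-hot words into dense vectors, note that multiplying a one-hot row by a weight matrix just selects one row of that matrix. Here is a small NumPy sketch; `W` and `b` are random stand-ins for the layer's kernel and bias, and 4 units are used instead of 512 to keep it readable.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_plus_pad, units = 16, 4                  # 16 = vocab_size + 1
W = rng.normal(size=(vocab_plus_pad, units))   # stand-in for the Dense kernel
b = rng.normal(size=units)                     # stand-in for the Dense bias

word_ids = np.array([0, 0, 1, 7, 8])           # a padded integer sentence
one_hot = np.eye(vocab_plus_pad)[word_ids]     # shape (5, 16)

dense_out = one_hot @ W + b                    # what a linear Dense layer computes per word
lookup_out = W[word_ids] + b                   # the same result by row lookup

print(np.allclose(dense_out, lookup_out))  # True
```

This equivalence is why an Embedding layer fed with the integer sequences directly is a common substitute for one-hot encoding followed by Dense. That is an observation on my part, not something the model above does.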

That should do it. You can then go ahead and train.

model.fit(X,y)

Edit:

This line:

# We need to split each sentence into words; right now it reads every
# letter. Note the sent.split() in the corrected version below.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]

should be:

sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
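The difference is easy to check without Keras. In this sketch the words are numbered in order of first appearance. That is an assumption made here so the output is reproducible; enumerating a set, as in the code above, can assign different integers from run to run.

```python
sentences = ["The cat is in the house",
             "The green boy",
             "computer programs are not alive while the children are"]

# Number each distinct word from 1 upward; 0 stays free for padding.
word_map = {}
for sent in sentences:
    for w in sent.split():
        word_map.setdefault(w, len(word_map) + 1)

# Iterating over a string yields characters, so the lookup fails on the first letter.
try:
    [word_map[ch] for ch in sentences[1]]
    char_lookup_ok = True
except KeyError:
    char_lookup_ok = False

# Splitting first yields whole words, which are the keys of word_map.
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]

print(char_lookup_ok)  # False
print(sent_ints[1])    # [1, 7, 8]
print(len(word_map))   # 15
```

The 15 distinct words plus the padding index give the 16 columns seen in the one-hot output.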