Tim*_*jan 6 artificial-intelligence python-3.x keras tensorflow keras-2
我正在研究句子分类问题并尝试使用 Keras 解决。词汇表中的唯一单词总数为 36。
在这种情况下,总词汇是 [W1,W2,W3....W36]
所以,如果我有一个单词为 [W1 W2 W6 W7 W9] 的句子,如果我对其进行编码,我会得到一个 numpy 数组,如下所示
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Run Code Online (Sandbox Code Playgroud)
形状是 (5,36)
我被困在这里。我已经生成了 20000 个形状各异的 numpy 数组,即 (N,36) 其中 N 是句子中的单词数。所以,我有 20,000 个句子用于训练,100 个用于测试,所有句子都标有 (1,36) 单热编码
我有 x_train、x_test、y_train 和 y_test
x_test 和 y_test 是维度 (1,36)
任何人都可以请建议我该怎么做?
我做了一些下面的编码
model = Sequential()
model.add(Dense(512, input_shape=(??????))),
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
Run Code Online (Sandbox Code Playgroud)
任何帮助将非常感激。
更新和回应@putonspectacles
非常感谢您花费时间和精力进行详细回复。我对您的代码进行了一些小的修改,我认为需要完成这些修改才能使代码正常工作。请在下面找到它
num_classes = 5
max_words = 20
sentences = ["The cat is in the house","The green boy","computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
#-Changed the below line the inner for loop sent to sent.split()
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
vocab_size = len(words)
print(vocab_size)
#-changed the below line - the outer for loop sentences to sent_ints
X = np.array([to_categorical(pad_sequences((sent,), max_words),vocab_size+1) for sent in sent_ints])
print(X)
print(y)
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X,y)
Run Code Online (Sandbox Code Playgroud)
如果没有这些更改,代码将无法工作。当我运行上面的代码时,我得到了如下所示的正确嵌入
[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]]
[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]]
[[0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]]
Run Code Online (Sandbox Code Playgroud)
但是我得到的错误是“检查输入时出错:预期dense_44_input有3维,但得到的数组形状为(3,1,20,16) ”
当我将输入形状更改为 model.add(Dense(512, input_shape=(None,max_words, vocab_size + 1)))
我收到错误“输入 0 与 lstm_27 层不兼容:预期 ndim=3,发现 ndim=4 ”
我正在努力解决这个问题。如果你能给我一个方向,那就太好了。
我接受了答案,因为它回答了嵌入单词的目标。再次感谢。
酷,你解决了这个问题。你想对一个句子进行分类。我假设你说我想要比词袋编码做得更好。您想要重视顺序。
然后我们将选择一个新模型——RNN (LSTM 版本)。该模型有效地总结了每个单词(按顺序)的重要性,因为它构建了最适合任务的句子的表示。
但我们必须以不同的方式处理预处理。为了提高效率(以便我们可以批量处理更多句子,而不是一次处理单个句子),我们希望所有句子都具有相同数量的单词。因此,我们选择 max_words,比如 20,然后填充较短的句子以达到最大单词数,然后将长度超过 20 个单词的句子剪掉。
Keras 将提供帮助。我们将用一个整数对每个单词进行编码。
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM
num_classes = 5
max_words = 20
sentences = ["The cat is in the house",
"The green boy",
"computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)
Run Code Online (Sandbox Code Playgroud)
所以“绿色男孩”现在可能是[1,3,5]。然后我们将填充并进行单热编码
# pad to max_words length and encode with len(words) + 1
# + 1 because we'll reserve 0 add the padding sentinel.
X = np.array([to_categorical(pad_sequences((sent,), max_words),
vocab_size + 1) for sent in sent_ints])
print(X.shape) # (3, 20, 16)
Run Code Online (Sandbox Code Playgroud)
现在到模型:我们将添加一层Dense将这些热门词转换为密集向量。然后我们使用 anLSTM将句子中的词向量转换为密集句子向量。最后,我们将使用 softmax 激活来生成各个类的概率分布。
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
Run Code Online (Sandbox Code Playgroud)
这应该完成。然后您可以继续训练。
model.fit(X,y)
Run Code Online (Sandbox Code Playgroud)
编辑:
这一行:
# we need to split the sentences in a words write now it reading every
# letter notice the sent.split() in the correct version below.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
Run Code Online (Sandbox Code Playgroud)
应该:
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
14899 次 |
| 最近记录: |