I'm still new to Julia and machine learning in general, but I'm very keen to learn. In the current project I'm working on, I have a problem of dimension mismatch and can't figure out what to do.

I have two arrays, as follows:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
And the following model built with Flux:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
softmax
)
I zipped both arrays and then fed them into the model using Flux.train!:
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
which immediately throws the following error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 to a 9, but then it just throws the error:
ERROR: DimensionMismatch("dimensions must match")
Could someone explain to me which dimensions are mismatched, and how to make them match?
First off, you should know that from an architectural standpoint, you are asking something very difficult of your network; softmax renormalizes the outputs to lie between 0 and 1 (weighted like a probability distribution), which means that asking your network to output a value like 77 to match y is going to be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'll drop the softmax at the end to give the network a fighting chance, especially since it's not what's causing the problem.
Let's take a look at Flux.train!(). Its definition is actually surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
gs = gradient(ps) do
loss(d...)
end
end
So let's start by pulling the first element out of your data and putting it into your loss function. You didn't specify a loss function or an optimizer in the question. Although softmax usually implies that you should use a crossentropy loss, your y values are very much not probabilities, so if we drop the softmax we can just use a plain mse() loss. For the optimizer, we'll default to good old ADAM:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
#softmax, # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and feed it into loss():
loss(first(data)...)
This gives us the error message you've seen before: ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. So we will change our model to expect 12 values instead of 10:
model = Chain(
LSTM(12, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
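This statefulness is easy to see in isolation. Here's a minimal sketch (assuming a Flux version of the same era, where recurrent layers are stateful and Flux.reset!() clears their hidden state); the tiny layer sizes are arbitrary:

```julia
using Flux

rnn = Chain(LSTM(1, 4))   # tiny LSTM: input dimensionality 1, 4 hidden units
x = [0.5f0]               # a single time point of dimensionality 1

y1 = rnn(x)               # the first call updates the internal state...
y2 = rnn(x)               # ...so the same input now gives a different output

Flux.reset!(rnn)          # clear the hidden state
y3 = rnn(x)               # back to the "fresh" output: y3 == y1, but y2 != y1
```

This is exactly why the loss function further down calls Flux.reset!() before each training example.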
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM layer a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reasons you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or somesuch. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to another, and they are therefore able to build up an internal representation of a sequence when applied upon one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
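That hypothetical accelerometer case would look something like this (the layer size of 32 and the random readings are just placeholders for illustration):

```julia
using Flux

accel_lstm = LSTM(3, 32)                       # 3 features (X, Y, Z) -> 32 hidden units
readings = [rand(Float32, 3) for _ in 1:1000]  # 1000 time points, each a length-3 vector

Flux.reset!(accel_lstm)
outputs = [accel_lstm(r) for r in readings]    # call the layer 1000 times, once per time point

# `outputs` now holds 1000 hidden-state vectors, each of length 32
```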
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (e.g. constantly predicting future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (e.g. reducing a sentence to "happy sentiment" or "sad sentiment"), you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
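The first rule of thumb can be sketched like this (a toy seq-to-seq model; the layer sizes and input series are made up for illustration):

```julia
using Flux

# One Chain, called once per time point; each call emits one prediction
seq2seq = Chain(LSTM(1, 16), Dense(16, 1))
xs = [Float32[x] for x in 1:10]   # toy input series: 10 time points of dimensionality 1

Flux.reset!(seq2seq)
ys = [seq2seq(x) for x in xs]     # one output time point per input time point
```

The second rule of thumb is what the rest of this answer builds, so let's get to it.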
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
LSTM(1, 128), # map from timepoint size 1 to 128
LSTM(128, 256), # blow it up even larger to 256
LSTM(256, 128), # bottleneck back down to 128
)
reducer = Chain(
Dense(128, 9),
#softmax, # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account; we will write our loss(x, y) function so that, instead of calling model(x), it does the pacman dance and then calls the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
    # Reset internal RNN state so that it doesn't "carry over" from
    # the previous invocation of `loss()`.
    Flux.reset!(pacman)

    # Declare `y_hat` before the loop so it is still visible afterward
    local y_hat

    # Iterate over every timepoint in `x`, feeding each one in as a
    # length-1 vector (our timepoint dimensionality is 1)
    for x_t in x
        y_hat = pacman([x_t])
    end

    # Take the very last output from the recurrent section, reduce it
    y_hat = reducer(y_hat)

    # Calculate reduced output difference against `y`
    return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate the dimensionality of your first LSTM, whereupon your inputs will all be "one-hot" encoded.
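For instance, a one-hot encoding of your tokens might look like the following sketch; the vocabulary range 1:83 is just an assumption based on the token ids visible in your arrays:

```julia
using Flux

token = 72
onehot_vec = Flux.onehot(token, 1:83)  # length-83 vector: all zeros, a single 1 at index 72

# A one-hot input like this is what would fix the input dimensionality
# of your first LSTM at the vocabulary size:
embed_lstm = LSTM(83, 128)
```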
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between [0...1] and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.1% certainty that the second class is correct, and 0.7% certainty it's the third class."
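You can see that renormalization by writing the arithmetic out by hand (this is the same computation Flux's softmax performs):

```julia
raw = [-2.0, 5.0, 0.101]
probs = exp.(raw) ./ sum(exp.(raw))

# probs ≈ [0.0009, 0.9917, 0.0074]: every entry lies in [0, 1] and they
# sum to 1.0, with ~99% of the weight on the second class and ~0.7% on
# the third.
```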
When training these networks, you're often going to want to batch multiple time series through your network at once for hardware efficiency reasons; this is both simple and complex, because on one hand it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you're instead going to be passing through an SxN matrix, but it also means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before any of the others in your batch, you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all out to the same length.
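Padding is simple enough to do by hand; here's a minimal sketch, assuming a pad token id of 0 (pick whatever sentinel your vocabulary reserves):

```julia
# Pad `seq` out to length `len` with the pad token
pad_to(seq, len, pad=0) = vcat(seq, fill(pad, len - length(seq)))

sequences = [[1, 2, 3], [4, 5], [6]]          # toy ragged data
maxlen = maximum(length.(sequences))
padded = [pad_to(seq, maxlen) for seq in sequences]

# Every series now has length 3 and the batch can be stacked into an SxN matrix
```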
Good luck in your ML journey!