I am trying to fill in the blank using a bidirectional RNN and PyTorch.
The input would be like: The dog is _____, but we are happy he is okay.
The output would be like:
1. hyper (Perplexity score here)
2. sad (Perplexity score here)
3. scared (Perplexity score here)
I found this idea here: https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27
import torch, torch.nn as nn
from torch.autograd import Variable
text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1
random_input = Variable(
torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)
bi_rnn = torch.nn.RNN(
input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)
bi_output, bi_hidden = bi_rnn(random_input)
# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)
linear = nn.Linear(hidden_size * 2, output_size)
# only predict on words
labels = random_input[1:-1]
# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)
I am trying to reimplement the code above from the bottom of the blog post. I am new to PyTorch and NLP, and I cannot understand what the input and output to the code are.
Question about the input: I am guessing the input is the few words that are given. Why does one need a beginning-of-sentence and an end-of-sentence token in this case? Why don't I see the input being a corpus on which the model is trained, like in other classic NLP problems? I would like to train the RNN on the Enron email corpus.
Question about the output: I see that the output is a tensor. My understanding is that a tensor is a vector, so maybe a word vector in this case. How do you use the tensor to output the words themselves?
As this question is rather open-ended, I will start from the last parts and move towards a more general answer to the main question posed in the title.
Quick note: as pointed out in the comments by @Qusai Alothman, you should find a better resource on this topic; the one linked is rather sparse when it comes to the necessary information.
Additional note: full code for the process described in the last section would take way too much space to provide as an exact answer; it would be more of a blog post. I will instead highlight the steps one should take to build such a network, with useful links along the way.
Final note: if there is anything dumb down below (or you would like to expand the answer in any way or form), please correct me / add information by posting a comment below.
Here the input is generated from a random normal distribution and has no connection to actual words. It is supposed to stand in for word embeddings, i.e. the representation of words as numbers which carry semantic (this is important!) meaning (and sometimes also depend on the context; see one of the current state-of-the-art approaches, e.g. BERT).
In your example it is provided with the shape:
seq_len, batch_size, embedding_size,
where
seq_len - the length of a single sentence (it varies across the dataset); we will get to it later.
batch_size - how many sentences should be processed in one step of the forward pass (in the case of PyTorch, the forward method of a class inheriting from torch.nn.Module).
embedding_size - the size of the vector with which one word is represented (it might range from the usual 100/300 using word2vec up to 4096 or so using more recent approaches like the BERT mentioned above).
In this case everything is hard-coded to size one, which is not really useful for a newcomer; it only outlines the idea that way.
Regarding the BOS and EOS tokens: correct me if I'm wrong, but you don't need them if your input is separated into sentences. They are used if you provide multiple sentences to the model and want to indicate unambiguously where each one begins and ends (they matter for models which depend on the previous/next sentences, which seems not to be the case here). They are encoded as special tokens (ones which are not present anywhere else in the corpus), so the neural network "can learn" that they represent the end and the beginning of a sentence (one special token for this approach would be enough).
If you were to use a serious dataset, I would advise splitting your text using libraries like spaCy or nltk (the first one is a pleasure to use, IMO); they do a really good job at this task.
Your dataset might already be split into sentences; in that case you are kind of ready to go.
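If you go the spaCy route, a minimal sentence-splitting sketch might look like this (it assumes the en_core_web_sm model has already been downloaded via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog is hyper, but we are happy he is okay. Neural networks can do some stuff.")
sentences = [sentence.text for sentence in doc.sents]
print(sentences)  # one string per detected sentence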
Regarding training on a corpus: I don't recall models being trained on corpora as-is, i.e. on raw strings. Usually words are represented by floating-point numbers via word embeddings which carry semantic meaning (e.g. king is more semantically related to queen than to, say, banana). Such embeddings were already linked above; some other noticeable ones might be GloVe or ELMo, and there are tons of other creative approaches.
Regarding the output: one should output indices into the embeddings, which in turn correspond to words represented by a vector (the more sophisticated approach mentioned above).
Each row in such an embedding represents a unique word, and its columns are that word's unique representation (in PyTorch, the first index might be reserved for words whose representation is unknown [if using pretrained embeddings]; you may also delete those words, or represent them as an average of the sentence/document, and there are some other viable approaches as well).
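Here is a minimal sketch of both directions of this mapping: word index to vector via an embedding table, and predicted index back to a word via the vocabulary. The vocabulary and sizes below are made up for illustration:

import torch

# Hypothetical vocabulary built from the corpus; row i of the embedding table belongs to vocabulary[i]
vocabulary = ["<unk>", "the", "dog", "is", "hyper"]
embedding = torch.nn.Embedding(num_embeddings=len(vocabulary), embedding_dim=3)

# Word index -> vector: look up embeddings for a (tiny) sentence
word_indices = torch.tensor([1, 2, 3, 4])  # "the dog is hyper"
print(embedding(word_indices).shape)  # torch.Size([4, 3])

# Predicted index -> word: the network outputs one score per vocabulary entry,
# argmax gives the index of the most probable word
logits = torch.tensor([0.1, 0.3, 0.2, 0.4, 2.5])
print(vocabulary[torch.argmax(logits).item()])  # "hyper"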
# for language models, use cross-entropy :)
loss = nn.MSELoss()
For this task it makes no sense, as Mean Squared Error is a regression metric, not a classification one.
We want to use a classification loss instead, so softmax should be used for the multiclass case (the network should output one score per class, i.e. class indices spanning [0, N), where N is the number of unique words in our corpus).
PyTorch's CrossEntropyLoss already takes logits (output of last layer without activation like softmax) and returns loss value for each example. I would advise this approach as it's numerically stable (and I like it as the most minimal one).
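A minimal sketch of that usage, with a made-up vocabulary size and random data just to show the expected shapes:

import torch

vocabulary_size = 10000  # number of unique words in the corpus (made up here)
batch_size = 32

# Raw, unnormalized scores from the last linear layer (no softmax applied)
logits = torch.randn(batch_size, vocabulary_size)
# Index of the correct word for each example, in [0, vocabulary_size)
targets = torch.randint(0, vocabulary_size, (batch_size,))

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, targets)
print(loss.item())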
This is a long one, I will only highlight steps I would undertake in order to create a model whose idea represents the one outlined in the post.
For the dataset, you may use the Enron corpus you mentioned above or start with something easier, like 20 newsgroups from scikit-learn.
The first step should roughly be to split your corpus into sentences, as described above.
Next, you would like to create your target (i.e. the word to be filled in) for each sentence.
Each word in turn should be replaced by a special token (say <target-token>) and moved to the target.
Example:
The sentence Neural networks can do some stuff. would give us the following sentences and their respective targets:
<target-token> networks can do some stuff. target: Neural
Neural <target-token> can do some stuff. target: networks
Neural networks <target-token> do some stuff. target: can
Neural networks can <target-token> some stuff. target: do
Neural networks can do <target-token> stuff. target: some
Neural networks can do some <target-token>. target: stuff
Neural networks can do some stuff <target-token> target: .
You should adjust this approach to the problem at hand by correcting typos if there are any, tokenizing, lemmatizing and so on; experiment! A small helper for generating such pairs is sketched below.
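Here is one possible helper along those lines (the function name and the tokenized input are hypothetical; adapt it to your own preprocessing):

def make_targets(tokens, target_token="<target-token>"):
    # For each position, replace one word with the special token and keep that word as the target
    examples = []
    for index, word in enumerate(tokens):
        masked = tokens[:index] + [target_token] + tokens[index + 1:]
        examples.append((masked, word))
    return examples

sentence = ["Neural", "networks", "can", "do", "some", "stuff", "."]
for masked, target in make_targets(sentence):
    print(" ".join(masked), "-> target:", target)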
Each word in each sentence should be replaced by an integer, which in turn points to its embedding.
I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend can be found in the open-source library flair.
You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.
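If you do go with spaCy's pretrained vectors, one way to build a word-to-index mapping plus a matching embedding matrix might look roughly like this (it assumes the en_core_web_md model, which ships with 300-dimensional vectors, is already downloaded; the vocabulary here is made up):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

# Hypothetical vocabulary built from your corpus
vocabulary = ["neural", "networks", "can", "do", "some", "stuff"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

# One row of pretrained vectors per word in the vocabulary
embedding_matrix = np.stack([nlp.vocab[word].vector for word in vocabulary])
print(embedding_matrix.shape)  # (6, 300)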
One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.
In my case, a good idea was to provide a custom collate_fn to the DataLoader, which is responsible for creating padded batches of data (or data already represented as torch.nn.utils.rnn.PackedSequence).
Important: currently you have to sort the batch by length (word-wise) and keep the indices able to "unsort" the batch into its original form; you should remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch there is a chance one might not have to do that; see this issue.
Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
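A rough sketch of such a collate_fn, assuming each dataset item is a (tensor of word indices, target index) pair; the names and details are illustrative only:

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch: list of (sequence_tensor, target_index) pairs
    sequences, targets = zip(*batch)
    # Sort by length (longest first), keeping the indices so the batch can be "unsorted" later
    lengths = torch.tensor([len(sequence) for sequence in sequences])
    lengths, sorted_indices = torch.sort(lengths, descending=True)
    sequences = [sequences[i] for i in sorted_indices]
    targets = torch.tensor([targets[i] for i in sorted_indices])
    # Pad to the longest sequence; seq_len-first layout matches batch_first=False cells
    padded = pad_sequence(sequences)
    return padded, lengths, targets

# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)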
You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).
Something along those lines when it comes to model construction:
import torch

class Filler(torch.nn.Module):
    def __init__(self, cell, embedding_words_count: int):
        super().__init__()
        self.cell = cell
        # We want to output a vector of size N (one score per unique word in the corpus)
        self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)

    def forward(self, batch):
        # Assuming batch was properly prepared (embedded and padded) before being passed into the network
        output, _ = self.cell(batch)
        # batch.shape[0] is the length of the longest, already padded sequence
        # batch.shape[1] is the batch size, e.g. 32
        # Here we create a view which allows us to handle bidirectional layers in a general manner
        output = output.view(
            batch.shape[0],
            batch.shape[1],
            2 if self.cell.bidirectional else 1,
            self.cell.hidden_size,
        )
        # Here the outputs of the bidirectional RNN are summed; you may concatenate them instead
        # Summing makes for an easier implementation and is another often-used approach
        summed_bidirectional_output = output.sum(dim=2)
        # The linear layer needs batch-first input, so we have to permute the dimensions.
        # You may also try batch_first=True in self.cell and prepare your batch that way;
        # in that case there is no need to permute
        linear_input = summed_bidirectional_output.permute(1, 0, 2)
        return self.linear(linear_input)
As you can see, information about shapes can be obtained in a general fashion. Such an approach will allow you to create a model with as many layers as you want, bidirectional or not (the batch_first argument is problematic, but you can get around it in a general way too; I left it out for improved clarity), see below:
model = Filler(
torch.nn.GRU(
# Size of your embeddings, for BERT it could be 4096, for spaCy's word2vec 300
input_size=300,
hidden_size=100,
num_layers=3,
batch_first=False,
dropout=0.4,
bidirectional=True,
),
# How many unique words are there in your dataset
embedding_words_count=10000,
)
You may pass torch.nn.Embedding into your model (if pretrained and already filled), create it from a numpy matrix, or use a plethora of other approaches; it's highly dependent on how exactly you structure your code. Still, please, make your code more general, and do not hardcode shapes unless it's totally necessary (usually it's not).
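For the numpy-matrix case, one possible sketch (the matrix here is random, merely a placeholder for real pretrained vectors):

import numpy as np
import torch

# Placeholder for a real pretrained matrix: one 300-dimensional vector per vocabulary word
pretrained = np.random.randn(10000, 300).astype(np.float32)

# freeze=True keeps the pretrained vectors fixed during training
embedding = torch.nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=True)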
Remember it's only a showcase, you will have to tune and fix it on your own.
This implementation returns logits and no softmax layer is used. If you wish to calculate perplexity, you may have to add it in order to obtain a correct probability distribution across all possible words.
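A minimal sketch of getting a perplexity-like score out of the logits (random placeholder data; cross_entropy applies log-softmax internally, so perplexity is simply the exponential of the mean loss):

import torch

# Placeholder logits produced by the model and the corresponding correct word indices
logits = torch.randn(32, 10000)
targets = torch.randint(0, 10000, (32,))

cross_entropy = torch.nn.functional.cross_entropy(logits, targets)
perplexity = torch.exp(cross_entropy)
print(perplexity.item())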
BTW: Here is some info on concatenation of bidirectional output of RNN.
I would highly recommend PyTorch ignite as it's quite customizable, you can log a lot of info using it, perform validation and abstract cluttering parts like for loops in training.
Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.
This is the outline of how I would approach this problem, you may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.
And please check PyTorch's 1.0 documentation and do not blindly follow tutorials or blog posts you see online, as they might go out of date really fast and the quality of the code varies enormously. For example, torch.autograd.Variable is deprecated, as can be seen in the link.