What is the difference between an MLP implemented from scratch and one implemented in PyTorch?


Following up on the question of how to update the learning rate in a two-layer multi-layer perceptron:

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

and a simple

  • two-layer Multi-Layer Perceptron (MLP)
  • sigmoid activations between the layers
  • Mean Squared Error (MSE) as the loss function/optimization criterion

If we train the model from scratch:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Training hyperparameters.
num_epochs = 5000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())
    #print(cost_delta)


# Visualize the losses
plt.plot(losses)
plt.show()

we see the loss drop sharply starting from epoch 0 and then quickly saturate:

[plot: training loss of the from-scratch model]

But if we train a similar model with PyTorch, the training curve falls off gradually before saturating:

[plot: training loss of the PyTorch model]

What is the difference between the from-scratch MLP and the PyTorch code?

Why do they converge at different points?

Apart from the weight initialization, np.random.rand() in the from-scratch code versus the default torch initialization, I can't seem to see any difference in the models.
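
For reference, here's a minimal sketch (same layer sizes as above) that just prints the range of values each initialization scheme produces, which is the only difference I can spot:

import numpy as np
import torch
from torch import nn

np.random.seed(0)
torch.manual_seed(0)

# From-scratch initialization: uniform values in [0, 1).
W1_numpy = np.random.random((2, 5))

# Default initialization of an equivalent PyTorch layer.
W1_torch = nn.Linear(2, 5).weight.detach().numpy()

print('numpy init range:', W1_numpy.min(), W1_numpy.max())  # all positive
print('torch init range:', W1_torch.min(), W1_torch.max())  # roughly symmetric around 0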

The PyTorch code:

from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

import matplotlib.pyplot as plt

torch.manual_seed(0)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# XOR gate inputs and outputs.
X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)


# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
print('Inputs Dim:', input_dim) # i.e. n=2 

num_data, output_dim = Y.shape
print('Output Dim:', output_dim) 
print('No. of Data:', num_data) # i.e. n=4

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
            # Use nn.Linear to get our simple perceptron.
            nn.Linear(input_dim, hidden_dim),
            # Use nn.Sigmoid to get our sigmoid non-linearity.
            nn.Sigmoid(),
            # Second layer neurons.
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )
model

# Initialize the optimizer
learning_rate = 0.3
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

# Initialize the stopping criteria
# For simplicity, just stop training after certain no. of epochs.
num_epochs = 5000 

losses = [] # Keeps track of the losses.

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Forward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: The optimizer takes a step and updates the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    losses.append(loss.data.item())


plt.plot(losses)

Answer by tel (11 upvotes)

A list of the differences between the hand-rolled code and the PyTorch code

It turns out that there are quite a few differences between your hand-rolled code and the PyTorch code. Here's what I found, listed roughly in order of how much impact each one has on the output:

  • Your code and the PyTorch code use two different functions to report the loss.
  • Your code and the PyTorch code set up the initial weights in very different ways. You mention this in your question, but it turns out to have a fairly significant effect on the results.
  • By default, torch.nn.Linear layers add an extra bunch of "bias" weights to the model. So the first layer of the PyTorch model effectively has 3x5 weights and the second layer has 6x1 weights, while the layers in the hand-rolled code have 2x5 and 5x1 weights, respectively (see the shape-check sketch after this list).
    • The bias seems to help the model learn and adapt somewhat quicker. If you turn the bias off, it takes roughly twice as many training epochs for the PyTorch model to get close to 0 loss.
  • Oddly enough, it also seems like the PyTorch model is using a learning rate that is effectively half of what you specify. Alternatively, it may be that there's a stray factor of 2 that's found its way into your hand-rolled math/code somewhere.
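
As a quick check of the bias point above, here's a small sketch (not part of either original snippet) that just prints the parameter shapes of the PyTorch model with and without bias; the extra per-layer bias vectors are what give the effective 3x5 and 6x1 shapes:

import torch
from torch import nn

torch.manual_seed(0)

for use_bias in (True, False):
    model = nn.Sequential(
        nn.Linear(2, 5, bias=use_bias),
        nn.Sigmoid(),
        nn.Linear(5, 1, bias=use_bias),
        nn.Sigmoid()
    )
    print('bias =', use_bias, '->', [tuple(p.shape) for p in model.parameters()])
    # bias=True  prints [(5, 2), (5,), (1, 5), (1,)]
    # bias=False prints [(5, 2), (1, 5)]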

How to get the same results from the hand-rolled and PyTorch code

By carefully accounting for the 4 factors above, it's possible to achieve complete parity between the hand-rolled and PyTorch code. With the right tweaks and settings, the two snippets will produce identical results:

[plot: hand_rolled_losses and torch_losses curves coinciding after the adjustments]

The most important tweak - make the loss reporting functions match

The critical difference is that you end up using two completely different functions to measure the loss in the two code snippets:

  • In the hand-rolled code, you measure the loss as layer2_error.mean(). If you unpack that variable, you can see that layer2_error.mean() is a somewhat muddled and fairly meaningless value:

    layer2_error.mean()
    == np.dot(cost_delta, cost_error).mean()
    == np.dot(mse_derivative(layer2, Y), mse(layer2, Y)).mean()
    == np.sum(.5 * (layer2 - Y) * ((layer2 - Y)**2).mean()).mean()
    
  • In the PyTorch code, on the other hand, the loss is measured in terms of the traditional definition of mse, i.e. the equivalent of np.mean((layer2 - Y)**2). You can prove this to yourself by modifying your PyTorch loop like so:

    def mse(x, y):
        return np.mean((x - y)**2)
    
    torch_losses = [] # Keeps track of the losses.
    torch_losses_manual = [] # for comparison
    
    # Step 2-4 of training routine.
    
    for _e in tqdm(range(num_epochs)):
        # Reset the gradient after every epoch. 
        optimizer.zero_grad() 
        # Step 2: Forward Propagation
        predictions = model(X)
    
        # Step 3: Back Propagation 
        # Calculate the cost between the predictions and the truth.
        loss = criterion(predictions, Y)
        # Remember to back propagate the loss you've computed above.
        loss.backward()
    
        # Step 4: The optimizer takes a step and updates the weights.
        optimizer.step()
    
        # Log the loss value as we proceed through the epochs.
        torch_losses.append(loss.data.item())
        torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))
    
    plt.plot(torch_losses, lw=5, label='torch_losses')
    plt.plot(torch_losses_manual, lw=2, label='torch_losses_manual')
    plt.legend()
    

Output:

[plot: torch_losses and torch_losses_manual curves coinciding]

Also important - use the same initial weights

PyTorch uses its own special routine for setting the initial weights, and it produces very different results from np.random.rand. I haven't been able to replicate it exactly yet, but as the next best thing we can just hijack PyTorch. Here's a function that grabs the same initial weights a PyTorch model uses:

import torch
from torch import nn
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]

Finally - in PyTorch, turn off all the bias weights and double the learning rate

Eventually you'll probably want to implement bias weights in your own code. For now, we'll just turn the bias off in the PyTorch model and compare the hand-rolled model's results to those of the bias-less PyTorch model.

Also, in order to get the results to match, you'll need to double the PyTorch model's learning rate. This effectively scales the results along the x-axis (i.e. doubling the rate means it takes half as many epochs to reach any specific feature on the loss curve).

Putting it together

To reproduce the hand_rolled_losses data from the plot at the start of this post, all you need to do is take your hand-rolled code and replace the mse function with:

def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

the line that initializes the weights with:

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

and the line that tracks the losses with:

losses.append(cost_error)

and you should be good to go.

To reproduce the torch_losses data from the plot, we'll also need to turn off the bias weights in the PyTorch model. To do that, you just have to change the lines defining the PyTorch model like so:

model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

You'll also need to change the line that defines the learning_rate:

learning_rate = 0.3 * 2

Complete code listings

The hand-rolled code

Here's the complete listing of my version of the hand-rolled neural network code, to help with reproducing my results:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats
import torch
from torch import nn

np.random.seed(0)
torch.manual_seed(0)

def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
    model = nn.Sequential(
        nn.Linear(nodes_in, nodes_hidden, bias=bias),
        nn.Sigmoid(),
        nn.Linear(nodes_hidden, nodes_out, bias=bias),
        nn.Sigmoid()
    )

    return [t.detach().numpy() for t in model.parameters()]

def sigmoid(x): # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.mean(np.square(predicted - truth))

def mse_derivative(predicted, truth):
    return predicted - truth

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Define the shape of the output vector. 
output_dim = len(Y.T)

W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]

num_epochs = 5000
learning_rate = 0.3
losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    cost_delta = mse_derivative(layer2, Y)
    layer2_delta = cost_delta *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

    # Log the loss value as we proceed through the epochs.
    losses.append(mse(layer2, Y))

# Visualize the losses
plt.plot(losses)
plt.show()

The PyTorch code

import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np

import torch
from torch import nn
from torch import tensor
from torch import optim

torch.manual_seed(0)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

num_epochs = 5000
learning_rate = 0.3 * 2

# XOR gate inputs and outputs.
X = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
Y = tensor([[0],[1],[1],[0]]).float().to(device)

# Use tensor.shape to get the shape of the matrix/tensor.
num_data, input_dim = X.shape
num_data, output_dim = Y.shape

# Step 1: Initialization. 

# Initialize the model.
# Set the hidden dimension size.
hidden_dim = 5
# Use Sequential to define a simple feed-forward network.
model = nn.Sequential(
    # Use nn.Linear to get our simple perceptron.
    nn.Linear(input_dim, hidden_dim, bias=None),
    # Use nn.Sigmoid to get our sigmoid non-linearity.
    nn.Sigmoid(),
    # Second layer neurons.
    nn.Linear(hidden_dim, output_dim, bias=None),
    nn.Sigmoid()
)

# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Initialize the loss function.
criterion = nn.MSELoss()

def mse(x, y):
    return np.mean((x - y)**2)

torch_losses = [] # Keeps track of the losses.
torch_losses_manual = [] # for comparison

# Step 2-4 of training routine.

for _e in tqdm(range(num_epochs)):
    # Reset the gradient after every epoch. 
    optimizer.zero_grad() 
    # Step 2: Forward Propagation
    predictions = model(X)

    # Step 3: Back Propagation 
    # Calculate the cost between the predictions and the truth.
    loss = criterion(predictions, Y)
    # Remember to back propagate the loss you've computed above.
    loss.backward()

    # Step 4: The optimizer takes a step and updates the weights.
    optimizer.step()

    # Log the loss value as we proceed through the epochs.
    torch_losses.append(loss.data.item())
    torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))

plt.plot(torch_losses, lw=5, c='C1', label='torch_losses')
plt.plot(torch_losses_manual, lw=2, c='C2', label='torch_losses_manual')
plt.legend()

Notes

Bias weights

You can find some very instructive examples showing what bias weights are and how to implement them in this tutorial. They list a bunch of pure-Python implementations of neural networks very similar to your hand-rolled one, so you could likely adapt some of their code to make your own bias implementation. For a rough idea of one common approach, see the sketch below.
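
In case it's useful, one common way to add bias weights to a hand-rolled network like the one above is to append a constant column of ones to each layer's input and give each weight matrix one extra row, so the bias gets learned by the same update rule as the other weights. Here's a minimal forward-pass sketch of that idea (the add_bias_column helper is just an illustrative name of mine, not something from the tutorial):

import numpy as np
np.random.seed(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def add_bias_column(layer):
    # Append a constant 1 to every sample; the last row of the
    # following weight matrix then acts as that layer's bias.
    return np.hstack([layer, np.ones((layer.shape[0], 1))])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
hidden_dim, output_dim = 5, 1

# One extra row per weight matrix holds the bias weights.
W1 = np.random.random((X.shape[1] + 1, hidden_dim))
W2 = np.random.random((hidden_dim + 1, output_dim))

layer1 = sigmoid(np.dot(add_bias_column(X), W1))
layer2 = sigmoid(np.dot(add_bias_column(layer1), W2))
print(layer2.shape)  # (4, 1)

The backward pass then needs a matching tweak: the error propagated back to layer1 has to drop the bias column before it's multiplied by sigmoid_derivative(layer1).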

A function for producing initial guesses for the weights

Here's a function I adapted from that same tutorial that can produce reasonable initial values for the weights. I think the algorithm PyTorch uses internally is somewhat different, but this produces similar results:

import scipy as sp
import scipy.stats

def tnorm_weights(nodes_in, nodes_out, bias_node=0):
    # see https://www.python-course.eu/neural_network_mnist.php
    wshape = (nodes_out, nodes_in + bias_node)
    bound = 1 / np.sqrt(nodes_in)
    X = sp.stats.truncnorm(-bound, bound)
    return X.rvs(np.prod(wshape)).reshape(wshape) 
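
Note that tnorm_weights returns matrices with shape (nodes_out, nodes_in), so to plug it into the hand-rolled code above (which expects (in, out)-shaped matrices) you'd transpose the results, e.g.:

# Continuing from the snippet above; input_dim, hidden_dim and
# output_dim are the same names used in the hand-rolled code.
W1 = tnorm_weights(input_dim, hidden_dim).T   # shape (input_dim, hidden_dim)
W2 = tnorm_weights(hidden_dim, output_dim).T  # shape (hidden_dim, output_dim)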