Flexible Neurons

16 October 2016 · neural-networks · theory

Code for this post

In the last post, we created a neuron that was able to learn a dataset generated by the simple linear function \(y = 0.58 x_1 + 0.67 x_2\). Now we will modify the function behind our dataset just a bit, and suddenly our neuron will fail to predict the target variable accurately. We will identify the problem and modify the neuron accordingly, making it more flexible. But first, we need to fix some general logic in our code.

Training in Epochs

Our current neuron stops training as soon as the error for a single input vector is very small. This can happen by chance, even when learning has not finished yet. It is better to look at a whole batch of training samples and calculate the Mean Squared Error. This makes the error measure more expressive and prevents stopping the training prematurely. We still adjust the weights after each single input, but look at the Mean Squared Error after each full iteration through the dataset, called an epoch.
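
Concretely, for \(N\) training samples with targets \(y_i\) and neuron predictions \(\hat{y}_i\), the Mean Squared Error is

\[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \]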

To be able to generate a dataset of arbitrary size, I added a dataset generator function instead of defining the dataset directly. That function encodes the solution that our neuron needs to learn.

import sys
import numpy as np

def generate_dataset(size):
    # Define the function behind the dataset
    def get_target(x):
        return 0.58 * x[0] + 0.67 * x[1]

    inputs = np.random.rand(size, 2)
    targets = np.apply_along_axis(get_target, 1, inputs)
    return inputs, targets

# Generate a dataset of a certain size
inputs, targets = generate_dataset(20)
print('inputs shape: %s targets shape: %s' % (inputs.shape, targets.shape))

Now, we can use that dataset generator to train our neuron in epochs:

# Start with a vector of random weights
weights = np.random.random_sample((2,))
mse = sys.float_info.max  # Mean Squared Error on the whole dataset
epoch = 1
while mse > 1e-6:
    # Feed each input vector to the neuron and keep the errors for calculating
    # the MSE.
    errors = []
    for x, y in zip(inputs, targets):
        prediction = np.dot(x, weights)
        errors.append(np.square(y - prediction))

        # Adjust the weights
        weights += 2 * (y - prediction) * x.T

    # Calculate the mean of the squared errors
    mse = np.mean(errors)
    print('MSE after epoch %d: %s' % (epoch, mse))

    epoch += 1

print('Learned weights: %s' % weights)

Now, the error will converge after 3 to 5 epochs. But with one slight modification of the dataset, the learning will fail:

Constant Offsets

Say we get a constant (untaxed) bonus of $50,000 per year. With respect to our dataset from the previous post, that would mean our annual net income has a constant offset. The generating function would be \(y = 0.58 x_1 + 0.67 x_2 + 0.5\) (sticking to 1 = $100,000). Try out that function in the code above: the MSE clearly does not converge to zero. That is because the neuron's function can only weight the input variables. So, we need to add a bias input, which is always 1. The neuron can then assign a weight to that input as well.
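
To see that the bias really is just another weight, here is a tiny sketch with made-up numbers: appending a constant 1 to the input vector and the bias to the weight vector gives exactly the same prediction as adding the bias explicitly. (The code below keeps the bias as a separate variable, which amounts to the same thing.)

import numpy as np

x = np.array([0.3, 0.9])          # one made-up input vector
weights = np.array([0.58, 0.67])  # the two input weights
bias = 0.5                        # weight of the bias input, which is always 1

# Explicit bias term ...
prediction_a = np.dot(x, weights) + bias

# ... equals a third input that is always 1, weighted by the bias
prediction_b = np.dot(np.append(x, 1.0), np.append(weights, bias))

print('%f == %f' % (prediction_a, prediction_b))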

In the code, that means we initialize a bias weight randomly with bias = np.random.random(). We add that bias to the prediction and update it just like the other weights with bias += 2 * (y - prediction). You can see the whole updated code below. However, when trying out these modifications, we notice an irritating fact: now the error becomes infinitely large after a couple of epochs. That is because of a second problem.

The Learning Rate

We update each weight in proportion to the derivative of the error with respect to that weight, in other words, in proportion to its influence on the error. This sounds intuitive. But currently, the weights become extremely large (\(\infty\)) or small (\(-\infty\)). That is because at each step, we look at the extreme weight and say "Wow, that was way too large. Let's go really far in the opposite direction." Thus, we calculate an even more extreme value at the next step. In other words, the length of the gradient is too large and gets even larger at the next step. That is a vicious cycle that we cannot escape once we have entered it.
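
A tiny one-dimensional example with made-up numbers shows the effect. Minimizing \((y - w x)^2\) for \(x = 2\) and \(y = 1\) with the full-size update from above makes the weight overshoot the optimum \(w = 0.5\) and grow without bound:

x, y = 2.0, 1.0
w = 0.0
for step in range(5):
    # Same update rule as above, just for a single weight and input
    w += 2 * (y - w * x) * x
    print('step %d: w = %.1f' % (step, w))
# Prints 4.0, -24.0, 172.0, -1200.0, 8404.0 -- the weight diverges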

A solution is to scale the step size with a parameter called the learning rate. In our case, setting the learning rate to 0.1 suffices to let the error converge to zero. We only need to adjust the update rules:

weights += learning_rate * 2 * (y - prediction) * x.T
bias += learning_rate * 2 * (y - prediction)
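
The factor \(2\,(y - \text{prediction})\) in these rules is just the negated derivative of the squared error. With the prediction \(\hat{y} = \sum_i w_i x_i + b\) and learning rate \(\eta\),

\[ \frac{\partial\, (y - \hat{y})^2}{\partial w_i} = -2\,(y - \hat{y})\, x_i \qquad \frac{\partial\, (y - \hat{y})^2}{\partial b} = -2\,(y - \hat{y}) \]

so stepping against the gradient gives \(w_i \leftarrow w_i + \eta \, 2\,(y - \hat{y})\, x_i\) and \(b \leftarrow b + \eta \, 2\,(y - \hat{y})\), which is exactly what the two lines above implement.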

Other ways to prevent exploding weights are starting with smaller initial weights and using regularization. I will cover these in later posts.

A Flexible Learning Rate

Finding the optimal weights can be compared to hiking blindfolded and trying to find the highest point in the mountains. The height is the inverted error function, since we want to maximize the height but minimize the error. Our x and y position is the weight vector. At each step, we can feel the gradient and follow it (or go in the opposite direction in the case of the error function). However, if we take steps that are too large, we may cross the optimum without noticing. A solution would be to choose a really low learning rate, for example \(10^{-10}\). The downside of this approach is that training the neuron takes very long and the danger of getting stuck in a local optimum increases. A local optimum would be a small hill in the Alps.

Local vs. Global Optimum. The height on the graph represents the negated error function. We want to minimize our error, i.e. maximize the height. Our (x,y) position on the graph is given by the value of our weights.

A small step size means we are very timid: once we reach the top of the small hill, we never leave it, because every point close to us is beneath our current height. A good compromise is a flexible learning rate / step size that is high in the beginning but decreases when we feel like we are close to an optimum. An indication for that is that the Mean Squared Error does not decrease anymore. The following code contains the bias input and a flexible learning rate. Every time the MSE does not improve, we halve the learning rate.

import sys
import numpy as np

def generate_dataset(size):
    # Define the function behind the dataset
    def get_target(x):
        return 0.58 * x[0] + 0.67 * x[1] + 0.5

    inputs = np.random.rand(size, 2)
    targets = np.apply_along_axis(get_target, 1, inputs)
    return inputs, targets

# Generate a dataset of a certain size
inputs, targets = generate_dataset(20)
print('inputs shape: %s targets shape: %s' % (inputs.shape, targets.shape))

# Start with a vector of random weights
weights = np.random.random_sample((2,))
bias = np.random.random()
mse = sys.float_info.max  # Mean Squared Error on the whole dataset
epoch = 1
learning_rate = 0.1

while mse > 1e-6:
    # Feed each input vector to the neuron and keep the errors for calculating
    # the MSE.
    errors = []
    for x, y in zip(inputs, targets):
        prediction = np.dot(x, weights) + bias
        errors.append(np.square(y - prediction))

        # Adjust the weights
        weights += learning_rate * 2 * (y - prediction) * x.T
        bias += learning_rate * 2 * (y - prediction)

    # Calculate the mean of the squared errors
    new_mse = np.mean(errors)

    # If the error did not improve, decrease the learning rate, since we might
    # be close to an optimum
    if new_mse > mse:
        learning_rate *= 0.5
        print('New learning rate: %f' % learning_rate)

    mse = new_mse
    print('MSE after epoch %d: %s' % (epoch, mse))

    epoch += 1

print('Learned weights: %s bias: %s' % (weights, bias))

How to Choose the Learning Rate

Choosing the initial learning rate and its decay factor is not trivial. There is also no general rule to follow, as the optimal learning rate depends on the dataset and the structure of the neural network. As far as I know, the common practice is to try out values between \(10^{-5}\) and \(1\). If you know a better way, please tell me. Below, I plotted the decrease of the MSE for the program above with different initial learning rates.
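
Such a comparison is easy to produce by hand: wrap the training loop from above into a function and sweep over a few initial learning rates. The sketch below introduces a train function just for this purpose and reuses the inputs and targets generated earlier; max_epochs is an added safety cap in case a run does not converge.

import sys
import numpy as np

def train(inputs, targets, learning_rate, threshold=1e-6, max_epochs=10000):
    # Same training loop as above, wrapped into a function
    weights = np.random.random_sample((2,))
    bias = np.random.random()
    mse = sys.float_info.max
    for epoch in range(1, max_epochs + 1):
        errors = []
        for x, y in zip(inputs, targets):
            prediction = np.dot(x, weights) + bias
            errors.append(np.square(y - prediction))
            weights += learning_rate * 2 * (y - prediction) * x.T
            bias += learning_rate * 2 * (y - prediction)
        new_mse = np.mean(errors)
        if new_mse > mse:
            learning_rate *= 0.5
        mse = new_mse
        if mse < threshold:
            return epoch
    return max_epochs

for lr in [0.001, 0.01, 0.1]:
    print('learning rate %g: converged after %d epochs'
          % (lr, train(inputs, targets, learning_rate=lr)))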

With all learning rates, the learning converged, but with a learning rate of 0.001 it took almost 2000 epochs. Larger learning rates tend to let the neuron converge faster. However, before introducing a learning rate, we had an implicit learning rate of 1, which caused the gradients to explode.

Are we rich?

Now that our neuron is so flexible, let us change the dataset in a minor way. Instead of predicting our net income for the current year, the neuron shall just tell us whether we are rich. We define being rich as having a net income over $100,000 per year. So, we change our dataset function to

def get_target(x):
    return 1 if 0.58 * x[0] + 0.67 * x[1] + 0.5 > 1 else 0

and suddenly the MSE does not converge to zero anymore. It settles at values around 0.05, and the neuron fails to predict the target value accurately.

Conclusion

We improved the neuron and enabled it to handle constant offsets and to adjust its learning rate. Yet, it still cannot predict a dataset as simple as this one. In the next post, I will explain why that is and how we can handle such datasets.

Next Post: From Regression to Classification
