ML Basics #2: Multilayer Perceptron

Sunday, Aug 25, 2019 | machine-learning

This is the second post in this series on the basics of Machine Learning. These posts are intended to serve as companion pieces to this zine on binary classification. The last post detailed the functioning of an artificial neuron, and how it can be trained to linearly segment a dataset. However, most real world datasets are not linearly separable, which begs the question:

What is the point of learning about a neuron?

Well, by the end of this post, we'll see that a bunch of neurons, when stacked together, can learn to create powerful non-linear solution spaces. Let's see how that works.

Setup
Multilayer Perceptron
Feedforward
- Hidden outputs
- Output
Error Function
Backpropagation
Training
Decision Boundary
References

Setup¶

In [0]:

!apt-get install imagemagick
!pip install -q celluloid

In [0]:

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML
import pickle
from math import ceil
from matplotlib.offsetbox import AnchoredText
from celluloid import Camera

In [0]:

def plot_points(X, y, alpha=None):
    dogs = X[np.argwhere(y==1)]
    cats = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in cats], [s[0][1] for s in cats], s = 25, \
                color = 'red', edgecolor = 'k', label='cat', alpha=alpha)
    plt.scatter([s[0][0] for s in dogs], [s[0][1] for s in dogs], s = 25, \
                color = 'blue', edgecolor = 'k', label='dog', alpha=alpha)
    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.legend()

This is what we've learnt so far: A neuron's weights and bias can be optimised using gradient descent to create a linear solution boundary to segment a dataset, as seen below:

Let's create a dataset slightly more complicated than the one above.

In [ ]:

X,y = make_classification(n_samples=1000, n_features=2, n_informative=2, 
                    n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2,
                          class_sep=1.8,
                   flip_y=0,weights=[0.5,0.5], random_state=580)

fig = plt.figure(dpi=120)
plot_points(X,y,alpha=.8)
plt.savefig('1.png')
plt.show()

This time, the features of the pets have a much more complicated relationship with the output classes cat and dog. As it's clearly obvious from the graph above, this dataset cannot be properly segmented by a linear decision boundary. This is where a Multilayer Perceptron (MLP) comes into picture.

Multilayer Perceptron¶

A multilayer perceptron (MLP) is a class of feedforward artificial neural networks. As the name suggests, it's comprised of multiple neurons stacked together. The following is one of the simplest MLPs possible.

The MLP above has 4 neurons in it, with 3 being in the hidden layer, and one in the output layer. The input layer is a simple one-to-one mapping with the input features. Let's try to understand the reasoning behind stacking them up like this.

Instead of having one neuron perform a weighted sum of the inputs, we'll make the 3 neurons in the hidden layer perform the same task. Each neuron has its own sets of weights and biases associated with it. The weighted sum is then passed on to a non-linear activation function (sigmoid in this case).

Till now in the process, it's like having 3 different neurons do the same task independently. The gamechanger is to then have the outputs of these 3 neurons be input to the neurons of the next layer (a single neuron in this case). This output neuron performs a weighted sum of the activations input to it, and passes them through a non-linear function. Since, we want the output of the model to be probabilities, we'll use sigmoid as the activation function. The whole process is termed as feedforward operation.

Feedforward¶

Let's see how the feedforward operation is performed in an MLP.

The neurons of the hidden layer have their weights and biases. The weights can be represented as the following matrix:

\large \text{weightsInputToHidden} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}

In [0]:

N_input = 2
N_hidden = 3
N_output = 1

In [9]:

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_input_to_hidden

Out[9]:

array([[-0.0451407 , -0.14543326,  0.14033849],
       [-0.08288667, -0.02778502,  0.00107311]])

\large \text{biasInputToHidden} =\begin{bmatrix} b_{1}&b_{2}&b_{3} \end{bmatrix}

In [10]:

bias_input_to_hidden = np.zeros(N_hidden)
bias_input_to_hidden

Out[10]:

array([0., 0., 0.])

Hidden outputs¶

The hidden outputs are calculated as follows:

In [0]:

def sigmoid(x):
    return 1/(1+np.exp(-x))

In [12]:

input_x = X[0,:].reshape(1,-1)
input_x.shape

Out[12]:

(1, 2)

In [13]:

output_y = y[0].reshape(1,-1)
output_y.shape

Out[13]:

(1, 1)

In [14]:

hidden_layer_in = np.dot(input_x,weights_input_to_hidden) + bias_input_to_hidden
hidden_layer_out = sigmoid(hidden_layer_in)
hidden_layer_out

Out[14]:

array([[0.46542901, 0.56663112, 0.41688494]])

Next, the output neuron has its set of weights and bias. This can be represented as:

\large \text{weightsHiddenToOutput} =\begin{bmatrix} w_{1o}\\w_{2o}\\w_{3o} \end{bmatrix}

In [15]:

weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))
weights_hidden_to_output

Out[15]:

array([[-0.06185345],
       [-0.05837421],
       [-0.03493544]])

\large \text{biasHiddenToOutput} =\begin{bmatrix} b_{o} \end{bmatrix}

In [16]:

bias_hidden_to_output = np.zeros(N_output)
bias_hidden_to_output

Out[16]:

array([0.])

Output¶

The operation performed by the output neuron is as follows:

In [17]:

output_layer_in = np.dot(hidden_layer_out,weights_hidden_to_output) + bias_hidden_to_output
output_layer_out = sigmoid(output_layer_in)
output_layer_out

Out[17]:

array([[0.48090202]])

Error Function¶

As we know from the last post, binary cross-entropy is used to calculate error for binary classification problems. Binary cross-entropy for a single data point is as follows: $\large E = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})$

In [18]:

error = (-1)*output_y*np.log(output_layer_out) - (1-output_y)*np.log(1-output_layer_out)
error

Out[18]:

array([[0.73209173]])

In order to have the MLP properly model the dataset, this error needs to be minimised. As we'll see below, this can be done through gradient descent.

Backpropagation¶

Since the prediction of the MLP is a function of all the parameters of the MLP — weights_input_to_hidden, bias_input_to_hidden, weights_hidden_to_output, and bias_hidden_to_output — the error is a function of these parameters too.

Thus, it can be minimised by performing gradient descent as follows: $\large parameter \leftarrow parameter - \alpha\frac{\partial E}{\partial parameter}$

Partial derivative wrt prediction¶

Let's start by calculating the partial derivative of the error wrt. the output prediction $\hat{y}$ .

In [19]:

del_y_out = (-1)*output_y/output_layer_out + (1-output_y)/(1-output_layer_out)
del_y_out

Out[19]:

array([[-2.07942565]])

Partial derivate wrt hidden output¶

We can then calculate the partial derivative of the error wrt to the other parameters by using the chain rule.

In [20]:

del_h = del_y_out*output_layer_out*(1-output_layer_out)*weights_hidden_to_output.T
del_h

Out[20]:

array([[0.032108  , 0.03030194, 0.01813492]])

Partial derivative wrt output layer weights¶

In [21]:

del_w_h_o = del_y_out*output_layer_out*(1-output_layer_out)*hidden_layer_out.T
del_w_h_o

Out[21]:

array([[-0.24160326],
       [-0.29413707],
       [-0.21640413]])

Partial derivative wrt output layer biases¶

In [22]:

del_b_o = del_y_out*output_layer_out*(1-output_layer_out)
del_b_o

Out[22]:

array([[-0.51909798]])

Partial derivative wrt hidden layer weights¶

In [23]:

del_w_i_h = del_h*hidden_layer_out*(1-hidden_layer_out)*input_x.T
del_w_i_h

Out[23]:

array([[-0.01928461, -0.01796252, -0.01064204],
       [ 0.02385166,  0.02221647,  0.01316232]])

Partial derivative wrt hidden layer biases¶

In [24]:

del_b_h = del_h*hidden_layer_out*(1-hidden_layer_out)
del_b_h

Out[24]:

array([[0.00798863, 0.00744095, 0.00440845]])

Let's perform the gradient descent step, with an appropriate learning rate.

In [0]:

learn_rate = 1e-3

In [0]:

# update
weights_hidden_to_output -= learn_rate*del_w_h_o
bias_hidden_to_output -= learn_rate*del_b_o[0,:]

weights_input_to_hidden -= learn_rate*del_w_i_h
bias_input_to_hidden -= learn_rate*del_b_h[0,:]

This is how backpropagation works in an MLP, and neural networks in general.

Let's put all of this in a single class titled MLPNeuralNet, with it's own implementaions of forward propagation (feed_forward), and backpropagation (back_propagate).

In [0]:

class MLPNeuralNet(object):
    def __init__(self,N_input, N_hidden,N_output):
        self.weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
        self.weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

        self.bias_input_to_hidden = np.zeros(N_hidden)
        self.bias_hidden_to_output = np.zeros(N_output)
        
        
    def feed_forward(self,input_x):
        self.hidden_layer_in = np.dot(input_x,self.weights_input_to_hidden) + self.bias_input_to_hidden
        self.hidden_layer_out = sigmoid(self.hidden_layer_in)
        self.output_layer_in = np.dot(self.hidden_layer_out,self.weights_hidden_to_output) + self.bias_hidden_to_output
        self.output_layer_out = sigmoid(self.output_layer_in)
        return self.output_layer_out

    def back_propagate(self,input_x,output_y, learn_rate):
        error = (-1)*(output_y*np.log(self.output_layer_out) + (1-output_y)*np.log(1-self.output_layer_out))
        del_y_out = (-1)*output_y/self.output_layer_out + (1-output_y)/(1-self.output_layer_out)
        del_h = del_y_out*self.output_layer_out*(1-self.output_layer_out)*self.weights_hidden_to_output.T
        del_w_h_o = del_y_out*self.output_layer_out*(1-self.output_layer_out)*self.hidden_layer_out.T
        del_b_o = del_y_out*self.output_layer_out*(1-self.output_layer_out)
        del_w_i_h = del_h*self.hidden_layer_out*(1-self.hidden_layer_out)*input_x.T
        del_b_h = del_h*self.hidden_layer_out*(1-self.hidden_layer_out)
        
        self.weights_hidden_to_output -= learn_rate*del_w_h_o
        self.bias_hidden_to_output -= learn_rate*del_b_o[0,:]

        self.weights_input_to_hidden -= learn_rate*del_w_i_h
        self.bias_input_to_hidden -= learn_rate*del_b_h[0,:]
        return error

Training¶

Let's write a training function that will run gradient descent on a MLPNeuralNet object.

In [0]:

def train(nn, features, outputs, num_epochs, learn_rate, print_error=False):
    
    for e in range(num_epochs):
        for input_x, output_y in zip(features, outputs):
            input_x = input_x.reshape(1,-1)
            output_y = output_y.reshape(1,-1)
            
            output_layer_out = nn.feed_forward(input_x)
            
            error = nn.back_propagate(input_x,output_y,learn_rate)
        
        if print_error:
            if e%200==0:
                print(error[0][0])

Let's divide up the datasets into training and validation sets.

In [0]:

np.random.seed(10)
indices = np.random.permutation(X.shape[0])
training_idx, validation_idx = indices[:900], indices[900:]
training_set_features, validation_set_features = X[training_idx,:], X[validation_idx,:]
training_set_labels, validation_set_labels = y[training_idx], y[validation_idx]

In [34]:

training_set_features.shape,training_set_labels.shape

Out[34]:

((900, 2), (900,))

In [35]:

validation_set_features.shape,validation_set_labels.shape

Out[35]:

((100, 2), (100,))

Let's plot the two sets. The transparent data points belong to the training set, while the others belong to the validation set.

In [ ]:

fig=plt.figure(dpi=120)
plot_points(training_set_features,training_set_labels,alpha=.2)
plot_points(validation_set_features,validation_set_labels,alpha=1.)
plt.legend().set_visible(False)
plt.savefig('2.png')
plt.show()

Training the MLP on the training set.

In [51]:

np.random.seed(10)
nn = MLPNeuralNet(N_input, N_hidden, N_output)
train(nn,training_set_features,training_set_labels,2000,1e-3,print_error=True)

0.6876729700588742
0.010870559674664732
0.006562504501112279
0.0050788103852413705
0.004590401623432532
0.004401605255706727
0.0037705871008028506
0.00293949778464105
0.002318797566627306
0.0019117365857087827

Let's see how well the model performs on the validation set.

In [52]:

((nn.feed_forward(validation_set_features)>=0.5).astype(int)==validation_set_labels.reshape(-1,1)).mean()*100

Out[52]:

97.0

It's correctly classifying points in the validation set with a 97% accuracy. That's pretty good. Let's see the decision boundary for the MLP after training.

Decision Boundary¶

Let's see how well the MLP models the dataset.

In [ ]:

h = .02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = (nn.feed_forward(np.c_[xx.ravel(), yy.ravel()]) >0.5).astype(int)

Z = Z.reshape(xx.shape)
fig = plt.figure(dpi=150)
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu)
plot_points(X,y,alpha=.8)
plt.title('Solution Boundary');
plt.savefig('3.png')

That's pretty good for this super simple MLP with only 4 neurons. As I mentioned before, the capability of a MLP of creating non-linear decision boundary comes from it's constituent neurons learning different decision boundaries of their own, and the output prediction being influenced by each of these individual decision boundaries by different weightages. These weights connecting different neurons are optimised by backpropagation during gradient descent.

Let's take a look at how the decision boundary of the MLP, as well as the decision boundaries of the neurons in the hidden layer shape up during the training process.

In [0]:

def display_linear_boundary(ax, m, b, x_1_lims,x_2_lims,color='g--',label=None,linewidth=3.):
    ax.set_xlim(x_1_lims)
    ax.set_ylim(x_2_lims)
    x = np.arange(-10, 10, 0.1)
    ax.plot(x, m*x+b, color, label=label,linewidth=linewidth)
    ax.set_yticklabels([])
    ax.set_xticklabels([])
    return ax

In [0]:

def plot_points_on_ax(ax,X, y,alpha):
    dogs = X[np.argwhere(y==1)]
    cats = X[np.argwhere(y==0)]
    ax.scatter([s[0][0] for s in cats], [s[0][1] for s in cats], s = 25, \
               color = 'red', edgecolor = 'k', label='cats',alpha=alpha)
    ax.scatter([s[0][0] for s in dogs], [s[0][1] for s in dogs], s = 25, \
               color = 'blue', edgecolor = 'k', label='dogs',alpha=alpha)
    return ax

In [0]:

def create_decision_boundary_anim(nn,X,y,epochs=300,learn_rate=1e-3,snap_every=100,dpi=100,print_error=False):
    
    fig = plt.figure(figsize=(9,9),dpi=dpi)

    gs = fig.add_gridspec(3, 3)

    sol_bound = fig.add_subplot(gs[:2, :])
    sol_bound.set_yticklabels([])
    sol_bound.set_xticklabels([])

    h_1 = fig.add_subplot(gs[2, 0])

    h_2 = fig.add_subplot(gs[2, 1])

    h_3 = fig.add_subplot(gs[2,2])

    hidden_layer_axes = [h_1, h_2, h_3]
    
    
    camera = Camera(fig)

    h = .02
    
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
    
    
    for e in range(ceil(epochs/snap_every)):
        train(nn,X,y,snap_every,learn_rate,print_error=print_error)
        Z = (nn.feed_forward(np.c_[xx.ravel(), yy.ravel()]) >0.5).astype(int)
        Z = Z.reshape(xx.shape)
        sol_bound.contourf(xx, yy, Z, cmap=plt.cm.RdBu)
        
        for i,ax in enumerate(hidden_layer_axes):

            w1,w2 = nn.weights_input_to_hidden[:,i][0],nn.weights_input_to_hidden[:,i][1]
            b = nn.bias_input_to_hidden[i]
        
            ax = display_linear_boundary(ax, -w1/w2, -b/w2, [x_min, x_max], [y_min, y_max])
            
        text_box = AnchoredText(f'Epoch: {snap_every*(e+1)}', frameon=True, loc=4, pad=0.5)
        plt.setp(text_box.patch, facecolor='white', alpha=0.5)
        sol_bound.add_artist(text_box)

        camera.snap()

    
    sol_bound = plot_points_on_ax(sol_bound,X, y,alpha=.6)
    sol_bound.set_title(f'Decision boundary of MLP')
    
    for i,ax in enumerate(hidden_layer_axes):
        ax = plot_points_on_ax(ax,X, y,alpha=.6)
        ax.set_title(f'Hidden neuron {i+1}')
    
    plt.tight_layout()
    animation = camera.animate()

    return animation

In [0]:

np.random.seed(10)
nn = MLPNeuralNet(N_input, N_hidden, N_output)
anim = create_decision_boundary_anim(nn,training_set_features,training_set_labels,\
                                     epochs=2000,dpi=120,\
                                     learn_rate=1e-3,snap_every=50,\
                                     print_error=False)
plt.close()

In [0]:

HTML(anim.to_html5_video())

Let's increase the learning rate to 5e-5. This will speed up the training process as evident below.

In [ ]:

np.random.seed(10)
nn = MLPNeuralNet(N_input, N_hidden, N_output)
anim = create_decision_boundary_anim(nn,training_set_features,training_set_labels,\
                                     epochs=500,dpi=120,\
                                     learn_rate=5e-3,snap_every=50,\
                                     print_error=False)
plt.close()

In [ ]:

HTML(anim.to_html5_video())

The larger cell shows the evolution of the MLP's decision boundary during the training process. The bottom 3 cells show the decision boundaries of the 3 neurons in the hidden layer.

It can be seen in the animation above that the decision boundaries of the 3 neurons come together to create the non-linear decision boundary of the MLP. The curviness of the decision boundary increases as those of neurons 1 and 3 become steeper.

This is how neurons — when properly stacked together — can create neural networks that can learn to segment datasets that are not linearly separable.

dhruv's space

ML Basics #2: Multilayer Perceptron

Table of Contents

Setup¶

Multilayer Perceptron¶

Feedforward¶

Hidden outputs¶

Output¶

Error Function¶

Backpropagation¶

Partial derivative wrt prediction¶

Partial derivate wrt hidden output¶

Partial derivative wrt output layer weights¶

Partial derivative wrt output layer biases¶

Partial derivative wrt hidden layer weights¶

Partial derivative wrt hidden layer biases¶

Training¶

Decision Boundary¶

References¶