Thursday, 24 April 2014

Multi Class Logistic Regression Training and Testing

Introduction

In this article we will look at training and testing of a Multi-class Logistic Classifier
• Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability.
• Mathematically this can be expressed as \begin{eqnarray*} & P(Y=i|x, W,b) =\frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}
• Corresponding to each class $y_i$ logistic classifier is paramemterized by a set of parameters $W_i,b_i$.
• These parameters are used to compute the class probability.
• Given a unknown vector x,The prediction is performed as \begin{eqnarray*} & y_{pred} = argmax_i P(Y=i|x,W,b) \\ & y_{pred} = argmax_i \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \end{eqnarray*}
• Given a set of labelled training data ${X_i,Y_i}$ where $i\ in {1,\ldots,N}$ we need to estimate these parameters.

Loss Function

• Ideally we would like to compute the parameters so that the $0-1$ loss is minimized \begin{eqnarray*} & \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}} \\ & f(x)= argmax_k P(Y=y_k |x,\theta) \end{eqnarray*}
• $P(Y=y_k |x,\theta)$ is modelled using logistic function.
• The $0-1$ loss function is not differentiable ,hence optimizing it for large modes is computationally infesible.
• Instead we maximize the log-likelyhood of the classifier given the training data $\mathcal{D}$.
• Maximum Likelyhood estimation is used to perform this operation.
• Estimate the parameters so that likelyhood of training data $\mathcal{D}$ is maximized under the model parameters
• It is assumed that the data samples are independent ,so the probability of the set is the product of probabilities of individual examples. \begin{eqnarray*} & L(\theta={W,b},\mathcal{D}) =argmax \prod_{i=1}^N P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = argmax \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = - argmin \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ \end{eqnarray*}
• It should be noted that Likelyhood of correct class is not same as number of right predictions.
• Log Likelyhood function can be considered as differential version of the $0-1$ loss function.
• In the present application negative log-likelyhood is used as the loss function
• Optimal parameters are learned by minimizing the loss function.
• In the present application gradient based methods are used for minimization.
• Specifically stochastic gradient descent and conjugated gradient descent are used for minimization of the loss function.
• The cost function is expressed as \begin{eqnarray*} & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log {e^{W_i x + b_i}}- log {\sum_j e^{W_j x + b_j}} \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N {W_i x + b_i} + log \frac{1}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}
• The first part of the sum is affine,second is a log of sum of exponentials which is convex Thus the loss function is convex.
• Thus we can compute the parameters corresponding to global maxima of the loss function using gradient descent methods.
• Thus we compute the derivatives of the loss function $L(\theta,\mathcal{D})$ with respect to $\theta$,$\partial{\ell}/\partial{W}$ and $\partial{\ell}/\partial{b}$

Theano

• Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
• It is a expression compiler,and can evaluate symbolic expression when executed.Typically programs which are implemented in C/C++ can be written concisely and efficiently in Theano.
• computing the gradients in most programming languages (C/C++, Matlab, Python), involves manually deriving the expressions for the gradient of the loss with respect to the parameters $\partial{\ell}/\partial{W}$, and $\partial{\ell}/\partial{b}$,
• This approah not only involves manual coding but the derivatives can get difficult to compute for complex models,
• With Theano, this work is greatly simplified as it performs automatic differentiation .

Example

• For demonstration,we will use MNIST dataset The MNIST dataset consists of handwritten digit images and it is divided in 60,000 examples for the training set and 10,000 examples for testing. The official training set of 60,000 is divided into an actual training set of 50,000 examples and 10,000 validation examples All digit images have been size-normalized and centered in a fixed size image of 28 x 28 pixels. In the original dataset each pixel of the image is represented by a value between 0 and 255, where 0 is black, 255 is white and anything in between is a different shade of grey.
• The dataset can be found at http://deeplearning.net/data/mnist/mnist.pkl.gz.
• The data set is pickled can be loaded using python pickle package.
• The data set consists of training,validation and test set.
• The data set consists of feature vector of length $28x28=784$ and number of classes are 10.
• A class called Logistic Regression is defined which encapsulates the methods that are used to perform training and testing of multi-class Logistic Regression classifier.

Theano Code

• The python code for training and testing can be found in the git repository https://github.com/pi19404/OpenVision ImgML/LogisticRegression.py file.
• the ImgML/load_datasets.py contains methods to load datasets from pickel files or SVM format files.
• C/C++ code has been also written using Eigen/OpenCV and incorporated in OpenVision Library. This can be found in the files https://github.com/pi19404/OpenVision ImgML/LogisticRegression.cpp and ImgML/LogisticRegression.hpp files.