Friday, 3 October 2014

Logistic Classifier - Overfitting and Regularization

In this article we look at the logistic regression classifier and how regularization affects its performance.

Training a machine learning algorithm involves optimization techniques. However, apart from providing good accuracy on the training and validation data sets, the model is also required to have good generalization accuracy: it should perform well on unseen examples as well. The model is trained by optimizing its performance over some training data set, but its quality is determined by its ability to perform on unseen data sets.

The term over-fitting is used when a machine learning algorithm has high accuracy on the training data set but poor generalization accuracy.

Problem of Overfitting

Over-fitting generally occurs when a model is excessively complex. A model that has been overfit will generally have poor generalization capabilities, as it can make errors due to minor fluctuations in the data caused by noise and other factors that were not modeled during the training process.

Just because a model agrees with the training data does not mean that it will perform well on unseen examples, so it is not necessarily a good model. In the case of over-fitting the model just learns the peculiarities of the training set and does not work well on unseen examples that differ from the training set.

In general, over-fitting can be associated with model complexity. In the multiclass logistic classifier for MNIST digit classification there are 7850 free parameters to be optimized. This is a large number of parameters.
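The figure of 7850 can be checked with a little arithmetic. The sketch below assumes the standard MNIST dimensions (28 x 28 pixel images and 10 digit classes), which the article does not state explicitly, and counts one weight per pixel and class plus one bias per class.

```python
# Assumed MNIST dimensions: 28 x 28 grey-scale images, 10 digit classes.
n_inputs = 28 * 28      # 784 pixel features per image
n_classes = 10

n_weights = n_inputs * n_classes   # one weight per (pixel, class) pair
n_biases = n_classes               # one bias per class
n_params = n_weights + n_biases

print(n_params)  # 7850
```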

Training, Validation and Testing Datasets

Typically we can get some idea as to whether over-fitting has occurred by periodically testing the model on an unseen data set. Thus during the training process we make use of 3 types of data sets: the training data set, which is primarily used for learning; the validation data set, which is a part of the training data but is not used during learning; and the test data set, which contains a large variation of typical input data intended to represent all possible unseen examples.

The validation data set is the unseen data set against which we can test the model to check for over-fitting and generalization performance. The learning algorithm has not seen the samples from the validation data set. The validation data set is used to test the various choices of model parameters, so it is used to periodically evaluate the model during training. Based on the performance of the model we decide whether over-fitting has occurred or the learning process has saturated, and possibly tune the model adaptively.

The difference between the validation data set and the test data set is that the validation data set is not as challenging as the test data set. It contains unseen examples but may not contain a large variation, all possible corner cases or difficult examples.

The test data is primarily used to check whether the model truly performs as expected against a large variation in data. That is why the test data set is supposed to contain samples with large variations, difficult examples and corner cases which are not included in the training data. If we use the test data directly for validation we may bias the classifier to over-fit, or choose parameters according to the test data, which is again undesirable for the sake of generality.
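As a minimal sketch of the three-way split described above, the snippet below shuffles a toy data set and cuts it into training, validation and test parts. The function name `split_dataset`, its arguments and the toy data are hypothetical and chosen for illustration only. A purely random split is also a simplification: it does not make the test set deliberately harder than the validation set.

```python
import numpy as np

def split_dataset(x, y, n_valid, n_test, seed=0):
    """Shuffle the samples and split them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return ((x[train_idx], y[train_idx]),
            (x[valid_idx], y[valid_idx]),
            (x[test_idx], y[test_idx]))

# Toy data standing in for a real data set (100 samples, 4 features).
x = np.arange(400, dtype=float).reshape(100, 4)
y = np.arange(100) % 10
train, valid, test = split_dataset(x, y, n_valid=20, n_test=20)
print(len(train[0]), len(valid[0]), len(test[0]))  # 60 20 20
```

Only the training part would be passed to the optimizer; the validation part is consulted during training and the test part is held back until the end.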

Validation Curves

If we find that the training or validation accuracy is increasing while the accuracy on the test data is decreasing, we can assume that this is due to over-fitting. Intuitively, over-fitting occurs when the model fits the training data too well.

In some cases we may observe that the error on the training data is decreasing while that on the test data stays constant. This may also imply over-fitting, since the model is being updated to fit the training data better but the updates do not improve its performance on the test data set.

Avoid Overfitting

In this article we look at the logistic regression classifier and how to handle cases of over-fitting.

Increasing size of dataset

One of the ways to combat over-fitting is to increase the training data size. Let us take the case of the MNIST data set trained with 5000 and with 50000 examples, using a similar training process and parameters.
Below are the observed errors on the training, validation and testing data sets.

At the 250-th iteration the errors are as follows (the values below are error rates, although the program output labels them as accuracy):
training accuracy is 12.34  %
testing accuracy is 12.0  %
validation accuracy is 11.29  %

We can observe that the training and validation errors decrease steadily during the initial part of the learning process. However, after the 100-th iteration the training error improves at a larger rate than the validation error.

The validation error essentially remains constant. At the 120-th iteration we again see some improvement in the validation curve, but the rate of improvement on the training data set is still higher. After the 140-th iteration the validation curve is essentially a straight line, indicating no learning, though the training error keeps reducing. This may be an indication of over-fitting.

Now let us perform the same test with the complete set of 50000 samples. Of course the absolute error will be better with a larger training set, as the classifier learns to handle larger variations in the data. But what is essential is to observe the rate of change of the errors and not the absolute errors.

We can see that at the 250-th iteration the errors are
training accuracy is 11.34  %
testing accuracy is 10.0  %
validation accuracy is 10.29  %

This is almost the same as in the earlier training process. This implies that training on the 5000-sample data set for 250 iterations gives a similar performance to using the larger data set, which may lead us to conclude that the learning algorithm has learned efficiently from the smaller data set. However, what is essential to observe is that the error curves have not saturated. The error curves are dropping at a larger rate at 300 iterations in the present example than in the earlier one. Thus if we had let the training continue, the error rates would have dropped even more before saturating; in particular we would have obtained lower errors on the test and validation data sets.

Thus increasing the training sample size has avoided over-fitting at the 300-th iteration of the training process and will lead to better generalization performance.

Early stopping

Another way to combat over-fitting is to perform early stopping.

Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

Early stopping based on cross-validation combats over-fitting by monitoring the model’s performance on a validation set. The error on the validation set is used as a proxy for the generalization error in determining when over-fitting has begun.

Many ad-hoc rules for deciding when over-fitting has truly begun are used in different implementations, but few of them have any theoretical foundation.

One such rule is that if the error on the training data is decreasing while the error on the validation data set remains constant or begins to increase, we should stop the training process, since continuing might lead to over-fitting.

One of the commonly used criteria for early stopping is the rate of change of the validation error. If the validation error does not change significantly in successive iterations, we reduce the number of iterations for which gradient based learning is performed.

As mentioned earlier, a validation set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The validation examples are considered to be representative of future test examples. If the model’s performance ceases to improve sufficiently on the validation set, or even degrades with further optimization, then some heuristic can be employed to cease further optimization.

In the present article we use a strategy based on a geometrically increasing amount of patience, which is used in tutorials of stochastic gradient learning based frameworks. We will look at stochastic gradient descent algorithms in future articles. The gradient based learning algorithm runs the learning step N times.

In mathematics, a geometric progression, also known as a geometric sequence, is a sequence of numbers where each term after the first is found by multiplying the previous one by a fixed, non-zero number called the common ratio.

The heuristic is based on a geometric progression of the validation error. Let us consider a common ratio of 0.9. If the best validation error so far is 0.5 and the next one is less than 0.9*0.5=0.45, then we consider that a significant reduction in error has occurred.

If the error has reduced significantly, then we increase the number of iterations the algorithm is supposed to run by a geometric factor, and if there is no significant reduction in error we reduce the number of iterations to be run by a geometric factor. If the maximum number of iterations has been reached, then we say that the learning process has been completed and retain the parameters of the best learning iteration.

Using this criterion for the training set with 5000 samples, the learning stops at the 134-th iteration with

training accuracy is 12.82  %
testing accuracy is 11.4  %
validation accuracy is 12.09  %

We can see that even at the 250-th iteration we had not obtained a significant improvement in performance

training accuracy is 12.34  %
testing accuracy is 12.0  %
validation accuracy is 11.29  %

Thus early stopping helps prevent over-fitting.

The code for the validation heuristic is as follows

        # error3 is the validation error of the current iteration
        if error3 < self.best_validation_loss:
            # significant improvement: extend the iteration budget
            if error3 < self.best_validation_loss * self.improvement_threshold:
                self.patience = max(self.patience, self.iter * self.patience_increase)
            self.best_validation_loss = error3
            self.best_iter = self.iter
        else:
            # assumed branch (the original indentation was lost): shrink the budget
            self.patience = min(self.patience, self.iter + self.iter / self.patience_increase)

error3 is the current validation error in the above code. self.improvement_threshold is the error improvement factor, which typically takes a value between 0.9 and 1. self.patience_increase is the geometric iteration factor, typically a value greater than 2. The number of iterations is reduced by this factor every time the validation heuristic is not met.
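To see how the patience heuristic behaves, it can be replayed on a made-up sequence of validation errors. The sketch below is a standalone illustration, not the pyVision implementation: the function name `train_with_patience`, the stopping test `it >= patience`, the placement of the patience-shrinking step in an else branch and the error values are all assumptions made for this example.

```python
def train_with_patience(validation_errors, patience=10,
                        improvement_threshold=0.9, patience_increase=2):
    """Replay a sequence of validation errors through the patience heuristic.

    Returns (best_iter, best_error, stop_iter).
    """
    best_error = float("inf")
    best_iter = 0
    stop_iter = 0
    for it, err in enumerate(validation_errors, start=1):
        stop_iter = it
        if err < best_error:
            if err < best_error * improvement_threshold:
                # significant improvement: allow more iterations
                patience = max(patience, it * patience_increase)
            best_error = err
            best_iter = it
        else:
            # no improvement: shrink the remaining budget (assumed placement)
            patience = min(patience, it + it / patience_increase)
        if it >= patience:
            break  # the iteration budget is exhausted
    return best_iter, best_error, stop_iter

# Hypothetical validation errors: fast progress first, then a plateau.
errors = [0.5, 0.4, 0.3, 0.29, 0.29, 0.29, 0.29, 0.29]
print(train_with_patience(errors, patience=4))  # (4, 0.29, 6)
```

In this run the drop from 0.4 to 0.3 counts as significant and extends the patience to 6 iterations, while the drop from 0.3 to 0.29 does not, so training stops at iteration 6 and the parameters of iteration 4 would be retained.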


Regularization

Another way to combat over-fitting is through regularization. Regularization techniques can be viewed as imposing a certain prior distribution on the model parameters. Mathematically, the regularization process implies performing constrained optimization.

L2 regularization implies imposing a Gaussian prior on the weights, while L1 regularization implies imposing a Laplacian prior on the weights.
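
The correspondence between priors and penalties can be made explicit by taking the negative logarithm of the prior. As a sketch, assume independent zero-mean priors on each weight $\theta_j$, with variance $\sigma^2$ in the Gaussian case and scale $b$ in the Laplacian case (these symbols are introduced here for illustration only):

$\begin{eqnarray*} -\log \prod_j \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\theta_j^2}{2\sigma^2}\right) &=& \frac{1}{2\sigma^2} \sum_j \theta_j^2 + \mathrm{const} \\ -\log \prod_j \frac{1}{2b} \exp\left(-\frac{|\theta_j|}{b}\right) &=& \frac{1}{b} \sum_j |\theta_j| + \mathrm{const} \end{eqnarray*}$

Under these assumptions, maximizing the posterior is the same as minimizing the negative log-likelihood plus a squared L2 penalty with weight $1/(2\sigma^2)$, or an L1 penalty with weight $1/b$. A narrower prior therefore corresponds to stronger regularization.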

There are several regularization techniques. However, in this article we will look at $L_{p}$ regularization and its effect on the logistic classification process.
$L_{p}$ regularization involves adding an extra term to the cost function, which penalizes certain parameter configurations.
If the original loss function is defined as
$\begin{eqnarray*} L(\theta,\mathcal{D}) = - \sum_{i=1}^N \log P(Y=y_i | X=x_i,\theta)  \end{eqnarray*}$
then the regularized loss function is given by
$\begin{eqnarray*} E(\theta,\mathcal{D}) = L(\theta,\mathcal{D}) + \lambda R(\theta) \end{eqnarray*}$

In general, in the case of $L_{p}$ regularization

$\begin{align} R(\theta)= || \theta||_{p}^p  = \sum_{j=1}^{|\theta|}{|\theta_j|^p} \end{align}$

where $||\theta||_{p}^p$ is the $p$-th power of the $L_p$ norm of $\theta$ and $\lambda$ is a parameter which controls the relative importance of the regularization term. Generally the L1 or L2 norm is used.
Intuitively, adding the regularization term penalizes large parameter values, which decreases the amount of non-linearity that the model can represent. Adding a regularization term has the effect of simplifying the model and improves performance in the presence of over-fitting. Thus minimizing the loss function in the presence of the regularization term favors the simplest model that can fit the training data.
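
The penalty term is straightforward to compute. The following sketch assumes that $\lambda$ multiplies the penalty exactly once, as in the regularized loss $E$ above; the function names `lp_penalty` and `regularized_loss` and the example numbers are hypothetical and are not taken from the pyVision code.

```python
import numpy as np

def lp_penalty(theta, p):
    """R(theta) = sum_j |theta_j|^p, the p-th power of the L_p norm."""
    return np.sum(np.abs(theta) ** p)

def regularized_loss(data_loss, theta, lam, p):
    """E = L + lambda * R(theta), with L supplied as a number."""
    return data_loss + lam * lp_penalty(theta, p)

theta = np.array([0.5, -2.0, 0.0, 1.0])
print(lp_penalty(theta, 1))   # 3.5
print(lp_penalty(theta, 2))   # 5.25
print(regularized_loss(1.0, theta, lam=0.5, p=2))  # 3.625
```

Note how the weight of -2.0 contributes 2.0 to the L1 penalty but 4.0 to the L2 penalty: the L2 term punishes large weights more heavily than small ones.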

In the earlier article "MultiClass Logistic Regression in Python" the optimum parameters of the classifier were determined by minimizing the cost function. We had computed the gradient of the cost function with respect to the parameters.

Due to the addition of the regularization term to the cost function, the gradient of the cost function will have additional terms corresponding to the gradient of the L2 norm.

The first effect of the regularization term is on the cost function. The cost function gains a prior term on account of the regularization.

    def negative_log_likelihood(self,params):
        # args contains the training data
        sigmoid_activation = pyvision.softmax(self.W,self.b,x)
        ...
        if self.Regularization==2:
            ...  # add the L2 penalty term to the loss l
        if self.Regularization==1:
            ...  # add the L1 penalty term to the loss l
        return l
The next change is in the gradient computation.
    def compute_gradients(self,out,y,x):
        """ function to compute the gradient of parameters for a single data sample """
        ...
        if self.Regularization==2:
            ...  # add the gradient of the L2 penalty term
        if self.Regularization==1:
            ...  # add the sub-gradient of the L1 penalty term
        return res
where (i) indicates the class for which the derivative is being computed.

L2 regularization based optimization is simple since the additional cost term is continuous and differentiable. For L1 regularization we use the basic sub-gradient method to compute the derivatives.
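
As a rough illustration of the two cases, the extra gradient terms can be written as below. This is a sketch under the assumption that the penalty is $\lambda \sum_j w_j^2$ for L2 and $\lambda \sum_j |w_j|$ for L1; the function names are hypothetical and the actual pyVision code may scale the terms differently.

```python
import numpy as np

def l2_penalty_grad(w, lam):
    """Gradient of lam * sum(w**2) with respect to w."""
    return 2.0 * lam * w

def l1_penalty_subgrad(w, lam):
    """A sub-gradient of lam * sum(|w|); np.sign returns 0 where w == 0."""
    return lam * np.sign(w)

w = np.array([0.5, -2.0, 0.0])
print(l2_penalty_grad(w, 0.5))
print(l1_penalty_subgrad(w, 0.5))
```

The L2 term shrinks each weight in proportion to its size, while the L1 sub-gradient pushes every non-zero weight towards zero by the same amount. At exactly zero the absolute value is not differentiable, and choosing 0 there is one valid sub-gradient.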

First we look at L2 regularization process.

L2 Regularization

The regularization is affected by the regularization constant. Often the constant is determined empirically by running the training with various values.
A large value of the constant will lead to flattening of the error curves very soon, and generally the model will exhibit a lower accuracy. However, a small value will lead to a large number of iterations being performed and may lead to over-fitting.

This can be seen from the error curves for L2 regularization below.

[Figures: error curves for regularization parameter = 0.9 and for regularization parameter = 0.1]

We can see that with a parameter value of 0.9 learning saturates at around 18% error at 40 iterations, while for the value 0.1 learning saturates at iteration 120 with an error of 15%. Thus the regularization process in general prevents the learning algorithm from responding to outliers in the data by controlling the weight parameters.
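
The way the regularization constant controls the weights can be illustrated on a toy problem. The sketch below is not the MNIST experiment from this article: it trains a small binary logistic regression on synthetic data with plain batch gradient descent, and all names (`fit_logistic_l2`, `true_w`) and settings (learning rate, iteration count, data) are assumptions made for the example.

```python
import numpy as np

def fit_logistic_l2(x, y, lam, lr=0.1, n_iter=500):
    """Binary logistic regression trained by batch gradient descent on
    mean negative log-likelihood + lam * ||w||^2 (the bias is not penalized)."""
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))      # predicted P(y = 1)
        grad_w = x.T @ (p - y) / n + 2.0 * lam * w  # data term + L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic two-class data: 200 samples, 5 features, noisy linear boundary.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = (x @ true_w + 0.5 * rng.normal(size=200) > 0).astype(float)

norms = {}
for lam in (0.9, 0.1, 0.0):
    w, b = fit_logistic_l2(x, y, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda={lam}: ||w|| = {norms[lam]:.3f}")
```

With the larger constant the weight vector is held much closer to zero, which is the mechanism behind both the early flattening of the error curves and the reduced sensitivity to individual samples.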

Thus we can observe an inherent trade-off. Regularization prevents the model from responding to variations in the data, which in turn leads to lower accuracy of the model in learning all possible variations of the data. This is a case of the bias-variance trade-off.


Bias represents how well the model represents the training data, and variance is how strongly the model responds to variations in the data.

Variance is large if different training sets give rise to very different classifiers. Variance is small if the choice of training set has only a minor effect on the classification decisions. Variance measures how inconsistent the decisions are, that is, the variation in the predictions of the learned classifier.

Models with high variance are sensitive to noise: small changes in the training data set can lead to significantly different model parameters.

Models with high bias are simple models that do not tend to over-fit but may under-fit the data, failing to capture the variations in the data. Models with low bias are complex, enabling them to represent the training data accurately.

Models with high variance exhibit large fluctuations with respect to changes in the data; they represent the training data well, but often over-fit it. Models with low variance are not sensitive to changes in the data; they can provide good generalization performance in the presence of noise.

Ideally one wants to choose a model that captures the regularities in the training data and also generalizes well to unseen data. This however is generally not possible at the same time. There often exists a bias-variance trade-off, and many machine learning algorithms try to control this trade-off to obtain good performance.

The expected learning error can typically be decomposed into a bias term and a variance term (for squared-error loss: squared bias plus variance plus irreducible noise). During the minimization process, reducing the bias will often lead to higher variance and vice versa. The trade-off often comes in the form of selecting a model with higher bias and lower variance, or one with lower bias and higher variance. Complex models will have low bias and high variance, and simple models will have high bias and low variance. Ideally we would like the model to have low bias and low variance.
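
For squared-error loss the decomposition can be written out explicitly. As a sketch, assume the data follow $y = f(x) + \epsilon$ with zero-mean noise of variance $\sigma^2$, and let $\hat{f}(x)$ be the prediction of a model trained on a random training set, with the expectation taken over training sets and noise (all of these symbols are introduced here for illustration):

$\begin{eqnarray*} E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2 \end{eqnarray*}$

The first term is the squared bias, the second is the variance and the third is noise that no model can remove. For classification error the decomposition is not as clean, but the same qualitative trade-off applies.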

The regularization parameter gives us control over the bias-variance trade-off. A large value of the regularization parameter reduces variance, since regularization provides robustness in the case of outliers in the data. Low values of the parameter will lead to regularization having no effect on learning.
Thus regularization gives us a way of reducing variance. The base learning algorithm is responsible for providing high bias. Thus a classifier with regularization can provide a good and robust model.

L1 and L2 regularization 
We can observe that the model attains lower error rates using L1 regularization as opposed to L2 regularization at the same number of iterations using an identical regularization constant. However, the error rates remain higher than in the case with no regularization, though the learning algorithm converges faster.

[Figures: error curves for L2 regularization and for L1 regularization]

In conclusion, we have seen various methods of combating over-fitting, how they affect the performance of classifiers, and how regularization gives us a tool to control the variance of the model.


The code for the logistic regression classifier with regularization can be found at the github repository
The files implement the LogisticRegression classifier with regularization and the minibatch stochastic gradient algorithm.
The entire pyVision directory is required for running the code. Running the files will execute the training process.
It can be downloaded from