Monday, 23 September 2013

Mean Shift Algorithm

Mean Shift Algorithm

Introduction

In the article we will look at the basics of Mean Shift Algorithm.

Kernel density Estimation

let us first consider a univariate gaussian PDF and sampled data from the PDF. The kernel density estimation uses fact that the density of samples about a given point is proportional to its probability.

It approximates the probability density by estimating the local density of points as seen in figure fig:image1 is resonable.

Large density of points are observed near the maximum of PDF.
original PDF
Sampled data
Density estimate
Density estimation
fig:image3 The KDE estimate the PDF by superposition of kernel at each data point,which is equivalent to convolving the data points with a gaussian kernel.

Mean Shift

Mean shift is a tool for finding the modes in a set of data samples which is sampled from underlying PDF.The aim of the mean shift algorithm is to find the densest region in given set of data samples.

Data points density implies PDF value .

Let us consider a 2D region.The points in the 2D region are sampled from a underlying unknown PDF.

Let $X=[x,y] $be random variables associated with a multi-variate PDF $P(X)$.

Thus sampling a point $P(X)$ will give us a vector $X'=[x',y']$

For example let us consider a multi-variate gaussian distribution where the random variables x and y take values in the range -3 to 3.

Modes of Smooth function

Let us say we want to find the modes of PDF.The PDF is approximated using kernel density estimation.Modes are the points at which PDF exhibits local maximum .

Dense regions in PDF corresponds to modes or local maxima.

Since the kernel is smooth,its differentiable.It gives to a smooth PDF.The gradient of density estimate is given by \begin{eqnarray*} \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) \quad = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)\\ \nabla \hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n \nabla K\Big(\frac{x-x_i}{h}\Big) \\ for gaussian kernel \\ \nabla \hat{f}_h(x) = \frac{C}{nh}\sum_{i=1}^n \nabla exp\Big(\frac{-(x-x_i)^2}{2*h^2}\Big) \\ \nabla \hat{f}_h(x) = \frac{C}{nh*h^2}\sum_{i=1}^n exp\Big(\frac{-(x-x_i)^2}{2*h^2}\Big)*(-(x-xi)) \\ \nabla \hat{f}_h(x) = \frac{1}{nh*h^2}\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)*(-(x-xi)) \\ equating the gradient to 0 \\ \sum_{i=1}^n K\Big(\frac{x'-x_i}{h}\Big)*(x-xi)=0 \\ \sum_{i=1}^n K\Big(\frac{x'-x_i}{h}\Big)*x = \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i\\ x'=\frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)} \end{eqnarray*} The estimate is $x'=m(x)$ is called the sample mean at x with kernel K.

This will always be biased towards region with high density.

Thus if we were to move along the vector $m(x)-x$,we would reach the region with higher density.The density at $m(x)$ will be greater than density at $x$. This forms the basis of mean shift algorithm.

The vector $m(x)-x$ is called the mean shift vector which always points in the direction of local maximum or mode. \begin{eqnarray*} m(x)-x = \frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)(x_i-x)}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)} \\ m(x)-x = \frac{-h^2\nabla f_h(x)}{f_h(x)} \end{eqnarray*} This is a estimate of normalize gradient of $f_h(x)$

Given any point x,we can take a small step in the direction of vector $m(x)-x$ to reach the local maximum.

Let us consider that the present estimate of the mode is $x$, then we compute $m(x)$ at this point.

For examples let initial estimate of the location of mode be $(-0.96,-2.01)$ The density at this point can be approximated by interpolation or computed again using non parametric density estimation

The plot for this is show in fig:image5.The estimate clearly does not lie at maximum.

To find the direction of the mean shift vector we find the gradient of the normalize density estimate and take a smalll step in that direction.This is perform till gradient magnitude is very small

Mean Shift
fig:image5 A video for mean shift algorithm using KDE is shown in


In this case we scale the gradient by the estimated PDF values to obtain normalize gradient values. \begin{eqnarray*} m(x)-x = \frac{-h^2\nabla f_h(x)}{f_h(x)} \end{eqnarray*} This enabled us to adaptively change the step size based on estimated PDF value.The step size magnitude is iversly proportional to estimated PDF values. if the estimated PDF values is small,we are far away from the maximum and the step size will be large.

If the estimate PDf value is large,we are close to maximum and the step size will be small.

Using the Gradient

to find the modes of the PDF,we do not actually required to estimate the PDF,we require just the gradient of the PDF to move in the direction of the mean shift vector.

The gradient of superposition of kernels centered at each data point is equivalent to convolving the data points with gradient of the kernel.

Instead of convolving with gaussian kernel,we can convolve with gradient of gaussian kernel.

\begin{eqnarray*} k(X) = C exp\Big(-\frac{x^2+y^2}{2*h}\Big) \\ \nabla k(X) = \Big[-\frac{x}{h}*k(x) ; -\frac{y}{h}*k(x) \Big] \end{eqnarray*} Thus given a intial point ${X}$ ,we estimate the value at ${X}$ using the kernels $-\frac{x}{h}*k(x)$ and $-\frac{x}{h}*k(x)$ which gives us the direction of gradient at the point $X$

Since we do not actually estimate the PDF at a point,but estimate the gradient of PDF each time during the mean shift iteration we need to take a step in direction of mean shift vector,in the earlier case ,we used the scale the gradient by the estimated PDF values to obtained a normalized measure.

However in the present case we do not adaptively change the step size but take a step of fixed size in direction of the gradient.

This still incorporates some adaptive behavior,since mean shift vector magnitude depends on the gradient magnitude.

If gradient magnitude is large,step size take will be large else step take will be small and refined ,near the maximum.

video of mean shift algorithm using gradient estimates is shown in


This iterative algorithm is a standard gradient descent algorithm and the convergence is guranteed for infinately small step size.

Since the algorithm depends on kernel density estimate, the bandwith of kernel will play a important role in mean shift algorithm as well.

Local Maxima

If we reach a region,where local density is flat or we have reached a local maximum.The algorithm will terminate.

this is a problem in case of all algorithms trying to reach a global maximum.The animation for the same is shown in
original PDF
Mean shift
fig:image6

Code

The Code is written in matlab and available in repository https://github.com/pi19404/m19404/tree/master/meanshift the file mean_shift.m is the main file.The file kgde2 implements kernel density estimator using bivariate gaussian windows for 2D distributions.The file kgde2x implements estimation of gradient on KDE .The dim parameter decides the computation of gradient along x and y directions.