## Wednesday, 25 September 2013

### Mean Shift Tracking

# Mean Shift Tracking\\\$2cm] #### Introduction In the article we will look at the application of Mean Shift Tracking for color based tracking. #### Mean shift The object model used in mean shift tracking is color probability distribution. Now we have a object model,given an image we can compute the likelihood image Each pixel in likelihood image represents the likelihood that pixel belongs to the object model/histogram. \[ L_u(y_i) = p_u*\delta(b(y_i)-u)$ This likelihood image assigns to each pixel a similarity measure wrt object model.

It is reasonable to assume that the region in which the highest similarity measure or highest density is observed is a good estimate of object location.

Thus if we consider a small window and move towards the mean value ie along the mean shift vector we should eventually reach the region of maximum similarity.

The likelihood surface is not smooth,we can give it properties of smoothness using kernel density estimation.

Now we can find the modes of the similarity surface using standard mean shift algorithm.

Let us assume that current estimate of mode of function is at $y$. Thus we consider a small rectangular window about $y$,compute the mean shift vector and take a small step along the mean shift vector.

In principle this should enable to find the local maximum. Similarity surface is discontinuous and to do this we need to perform KDE over entire image consisting of dense grid of points ,which is a expensive operation.(We need to perform convolution at each point with Gaussian with suitably large aperture)

#### Using Similarity for tracking

The concept of similarity surface can be made useful in tracking application.

Let us consider a small region of interest about a present location y, we can compute the similarity score about this region,perform KDE on this small region,obtain a similarity surface and compute the mean shift vector.

If object is not present in region,similarity surface will be flat and mean shift vector will be zero.

If there is object present in some part of region,it will correspond to modes of similarity surface .The mean shift vector will give us direction to move along.

Now instead of trying to estimate the mode,say we translate the region of interest along direction provided by the mean shift vector. This would typically lead to large portion of object being visible and would expose the region of global similarity surface where a large maximum would lie.

This is the basis of mean shift tracking,keen on translating the region of interest ,till we reach local maximum of similarity surface.

For tracking applications ,since fast computation is required, we can consider a rectangular window with bandwidth equal to that of the region of interest.The present location of point point is the center of the rectangular region.

#### Implementation

A video of mean shift tracking is shown below.A naive object model based on color probability in HS color space using first frame of the video

Another issue is that if the object is moving too fast and significant part of the object moves out of ROI in successive frames,the object will not be tracked. This can be seen in the following videos

If we encounter a larger object or object that exhibits higher density ,tracking will be lost.In the below case the tracking is lost when object passes over a large blue background which is similar to object color.

There are many other cases where mean shift tracking will fail

As will all tracking approaches ,the performance heavily depends on the object model. The better we are able to model the object and obtain a likelyhood/similarity which does not show high probability for background or other objects in the scene,the more accurate will be the tracking

#### Code

For further image processing application a library consisting of high level interface to opencv will be used.The library is called OpenVisionLibrary. https://github.com/pi19404/OpenVision

The project cmake file is included in the repository. the build will create the library and test files in the bin directory To run demo program for mean shift run the binary meanShiftTest

The files for mean shift algorithm are meanshift.cpp and meanshift.hpp https://github.com/pi19404/OpenVision/tree/master/ImgProc/ repository.

To run the test program : Select the region of interest and click the build model button to start tracking.

For video file initially only the first frame is show ,select the ROI in the first frame and then click on build model button to start the tracking.

The button is shown upon clicking on Display properties button on the window.

# Mean Shift Algorithm

#### Introduction

In the article we will look at the basics of Mean Shift Algorithm.

#### Kernel density Estimation

let us first consider a univariate gaussian PDF and sampled data from the PDF. The kernel density estimation uses fact that the density of samples about a given point is proportional to its probability.

It approximates the probability density by estimating the local density of points as seen in figure fig:image1 is resonable.

Large density of points are observed near the maximum of PDF.
fig:image3 The KDE estimate the PDF by superposition of kernel at each data point,which is equivalent to convolving the data points with a gaussian kernel.

#### Mean Shift

Mean shift is a tool for finding the modes in a set of data samples which is sampled from underlying PDF.The aim of the mean shift algorithm is to find the densest region in given set of data samples.

Data points density implies PDF value .

Let us consider a 2D region.The points in the 2D region are sampled from a underlying unknown PDF.

Let $X=[x,y]$be random variables associated with a multi-variate PDF $P(X)$.

Thus sampling a point $P(X)$ will give us a vector $X'=[x',y']$

For example let us consider a multi-variate gaussian distribution where the random variables x and y take values in the range -3 to 3.

#### Modes of Smooth function

Let us say we want to find the modes of PDF.The PDF is approximated using kernel density estimation.Modes are the points at which PDF exhibits local maximum .

Dense regions in PDF corresponds to modes or local maxima.

Since the kernel is smooth,its differentiable.It gives to a smooth PDF.The gradient of density estimate is given by \begin{eqnarray*} \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) \quad = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)\\ \nabla \hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n \nabla K\Big(\frac{x-x_i}{h}\Big) \\ for gaussian kernel \\ \nabla \hat{f}_h(x) = \frac{C}{nh}\sum_{i=1}^n \nabla exp\Big(\frac{-(x-x_i)^2}{2*h^2}\Big) \\ \nabla \hat{f}_h(x) = \frac{C}{nh*h^2}\sum_{i=1}^n exp\Big(\frac{-(x-x_i)^2}{2*h^2}\Big)*(-(x-xi)) \\ \nabla \hat{f}_h(x) = \frac{1}{nh*h^2}\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)*(-(x-xi)) \\ equating the gradient to 0 \\ \sum_{i=1}^n K\Big(\frac{x'-x_i}{h}\Big)*(x-xi)=0 \\ \sum_{i=1}^n K\Big(\frac{x'-x_i}{h}\Big)*x = \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i\\ x'=\frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)} \end{eqnarray*} The estimate is $x'=m(x)$ is called the sample mean at x with kernel K.

This will always be biased towards region with high density.

Thus if we were to move along the vector $m(x)-x$,we would reach the region with higher density.The density at $m(x)$ will be greater than density at $x$. This forms the basis of mean shift algorithm.

The vector $m(x)-x$ is called the mean shift vector which always points in the direction of local maximum or mode. \begin{eqnarray*} m(x)-x = \frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)(x_i-x)}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)} \\ m(x)-x = \frac{-h^2\nabla f_h(x)}{f_h(x)} \end{eqnarray*} This is a estimate of normalize gradient of $f_h(x)$

Given any point x,we can take a small step in the direction of vector $m(x)-x$ to reach the local maximum.

Let us consider that the present estimate of the mode is $x$, then we compute $m(x)$ at this point.

For examples let initial estimate of the location of mode be $(-0.96,-2.01)$ The density at this point can be approximated by interpolation or computed again using non parametric density estimation

The plot for this is show in fig:image5.The estimate clearly does not lie at maximum.

To find the direction of the mean shift vector we find the gradient of the normalize density estimate and take a smalll step in that direction.This is perform till gradient magnitude is very small

fig:image5 A video for mean shift algorithm using KDE is shown in

In this case we scale the gradient by the estimated PDF values to obtain normalize gradient values. \begin{eqnarray*} m(x)-x = \frac{-h^2\nabla f_h(x)}{f_h(x)} \end{eqnarray*} This enabled us to adaptively change the step size based on estimated PDF value.The step size magnitude is iversly proportional to estimated PDF values. if the estimated PDF values is small,we are far away from the maximum and the step size will be large.

If the estimate PDf value is large,we are close to maximum and the step size will be small.

#### Using the Gradient

to find the modes of the PDF,we do not actually required to estimate the PDF,we require just the gradient of the PDF to move in the direction of the mean shift vector.

The gradient of superposition of kernels centered at each data point is equivalent to convolving the data points with gradient of the kernel.

Instead of convolving with gaussian kernel,we can convolve with gradient of gaussian kernel.

\begin{eqnarray*} k(X) = C exp\Big(-\frac{x^2+y^2}{2*h}\Big) \\ \nabla k(X) = \Big[-\frac{x}{h}*k(x) ; -\frac{y}{h}*k(x) \Big] \end{eqnarray*} Thus given a intial point ${X}$ ,we estimate the value at ${X}$ using the kernels $-\frac{x}{h}*k(x)$ and $-\frac{x}{h}*k(x)$ which gives us the direction of gradient at the point $X$

Since we do not actually estimate the PDF at a point,but estimate the gradient of PDF each time during the mean shift iteration we need to take a step in direction of mean shift vector,in the earlier case ,we used the scale the gradient by the estimated PDF values to obtained a normalized measure.

However in the present case we do not adaptively change the step size but take a step of fixed size in direction of the gradient.

This still incorporates some adaptive behavior,since mean shift vector magnitude depends on the gradient magnitude.

If gradient magnitude is large,step size take will be large else step take will be small and refined ,near the maximum.

video of mean shift algorithm using gradient estimates is shown in

This iterative algorithm is a standard gradient descent algorithm and the convergence is guranteed for infinately small step size.

Since the algorithm depends on kernel density estimate, the bandwith of kernel will play a important role in mean shift algorithm as well.

#### Local Maxima

If we reach a region,where local density is flat or we have reached a local maximum.The algorithm will terminate.

this is a problem in case of all algorithms trying to reach a global maximum.The animation for the same is shown in

#### Code

The Code is written in matlab and available in repository https://github.com/pi19404/m19404/tree/master/meanshift the file mean_shift.m is the main file.The file kgde2 implements kernel density estimator using bivariate gaussian windows for 2D distributions.The file kgde2x implements estimation of gradient on KDE .The dim parameter decides the computation of gradient along x and y directions.

# 1-D Kernel Density Estimation For Image Processing

#### Introduction

In the article we will look at the basics methods for Kernel Density Estimation.

#### Non Parametric Methods

The idea of the non-parametric approach is to avoid restrictive assumptions about the form of f(x) and to estimate this directly from the data rather than assuming some parametric form for the distribution eg gaussian,expotential,mixture of gaussian etc.

#### Kernel Density Estimation

kernel density estimation (KDE) is a non-parametric way to estimate the probability density function where the estimation about the population/PDF is performed using a finite data sample.

A general expression for non parametric density estimation is $p(x) = \frac{k}{NV}$
• where k is number of examples inside V
• V is the volume surrounding x
• N is total number of examples
Histograms are most simplest form of non-parametric method to estimate the PDF .

To construct a histogram, we divide the interval covered by the data values into equal sub-intervals, known as bins. Every time, a data value falls into a particular sub-interval/bin the count associated with bin is incremented by 1.

For histogram V can be defined $WxH$ where W is bin width and H is unbounded

In the figure fig:image1 the hue histogram of rectangular region of image is shown.

Histograms are described by bin width and range of values. In the above the range of Hue values is $0-180$ and the number of bins are 30

We can see that histograms are discontinuous ,which may not necessarily be due to underlying discontinuity of underlying PDF but also due to discretization due to bins and Inaccuracies may also exist in the histogram due to binning . Histograms are not smooth and depend on endpoints and width of the bins This can be seen in figure fig:image1 b.

Typically estimate becomes better as we increase the number of points and shrink the bin width and this is true in case of general non parametric estimation as seen in figure fig:image1 c.

In practice the number of samples are finite,thus we not observe samples for all possible values,in such case if the bin width is small,we may observe that bin does no enclose any samples and estimate will exhibit large discontinuties. For histogram we group adajcent sample values into a bin.

#### Kernel Density Estimation

Kernel density estimation provides another method to arrive at estimate of PDF under small sample size.The density of samples about a given point is proportional to its probability. It approximate the probability density by estimating the local density of points as seen in figure fig:image3

#### Parzen window technique

Parzen-window density estimation is essentially a data-interpolation technique and provide a general framework for kernel density estimation.

Given an instance of the random sample, ${\bf x}$, Parzen-windowing estimates the PDF $P(X)$ from which the sample was derived It essentially superposes kernel functions placed at each observation so that each observation $x_i$ contributes to the PDF estimate.

Suppose that we want to estimate the value of the PDF $P(X)$ at point $x$. Then, we can place a window function at $x$ and determine how many observations $x_i$ fall within our window or, rather, what is the contribution of each observation $x_i$ to this windowing

The PDF value $P (x)$ is then the sum total of the contributions from the observations to this window

Let $(x_1,x_2,\ldots, x_n)$ be an iid sample drawn from some distribution with an unknown density $\mathcal{f}$. We are interested in estimating the probability distribution $\mathcal{f}$. Its parzen window estimate is defined as $\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) \quad = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)$ Where $\mathcal{K}$ is called the kernel,$h$ is called its bandwidth,$k_h$ is called a scaled kernel

Kernel density estimates are related to histograms,but possess properties like smoothness or smoothness by using a suitable kernel.

Commonly used kernel functions are uniform,gaussian,Epanechnikov etc

Superposition of kernels centered at each data point is equivalent to convolving the data points with the kernel.we are smoothing the histogram by performing convolution with a kernel. Different kernels will produce different effects.

#### Rectangular windows

For univariate case the rectangular windows encloses k examples about a region of width h centered about x on the histogram.

To find the number of examples that fall within this region ,the kernel function is defined as

\begin{equation*} k(x) = \begin{cases} 1 & |x| \le h,\\ 0 & otherwise \end{cases} \end{equation*} hence total number of bins of histogram be 180,hence bin width is 1.Let us apply a window function with bandwidth 6,12,18 etc and observe the effect on histogram

The kernel density estimate using parzen window of bandwidth 6,12 and 18 are shown in figure fig:image2.

#### Gaussian Windwos

The kernel function for the gaussian window is defined as \begin{eqnarray*} k(x) = C*exp\Big(-\frac{x^2}{2*\sigma^2}\Big) \end{eqnarray*} Instead of a parze rectangular window let us apply a gaussian window of width 6,12 and 18 and observe the effects on the histogram It can be seen that estimate of PDF is smooth,however the bandwidth plays an important role in the estimated PDF.A small bandwidth of 6 estimates a bimodal PDF width peaks well seperated. A bandwidth of 12,still is bimodal however the peaks are no longer seperated. A larger bandwidth of 16 estimates a unimodal PDF.

The bandwidth of the kernel is a free parameter which exhibits a strong influence on estimate of the PDF.Selecting bandwidth is a tradeoff between accuracy and generality.

#### Code

The class Histogram contains methods to perform kernel density estimation for 1D histogram using rectangular and gaussian windows.The definition for Histogram class can be found in files Histogram.cpp and Histogram.hpp.The code can be found at https://github.com/pi19404/m19404/tree/master/OpenVision/ImgProc The file to test the kernel density estimation is kde_demo.cpp and can be found in https://github.com/pi19404/m19404/tree/master/OpenVision/demo To compile the code for kde_demo run command

# Histogram Comparision

#### Introduction

In this section we will look at techniques for histogram comparison .

#### Histogram Comparison

In many application histogram is used as object model .Thus to compare an unknown object with a known object we can compute the similarity between their histograms.

The histogram of image shows the frequency distribution of different pixel intensities.

The value of histogram for pixel level/intensity $\mathcal{g}$ shows the total count of pixels in the image with pixel intensity $\mathcal{g}$.Let this be denoted by $h_a(g)$

$h_a(g) = \sum_{i,j \in x} \delta(g-A(i,j))$ \subsection{Minkowski norm} The histogram can be viewed as vectors and similariy is calculated using Minkowski norm $(L_p)$.

The manhattan distance $p=1$ and euclidean distance $p=2$ are special cases of minkowski norm.

\begin{eqnarray*} d_{L_1}(h_a,h_b) = \sum_{g \in G} | h_a(g) - h_b(g) |

d_{L_2}(h_a,h_b) = \sum_{g \in G} | h_a(g) - h_b(g) |^2 \end{eqnarray*} A high value indicates a mismatch while low value indicates a better match.0 indicates a perfect mismatch while total mismatch is 1 for normalized histogram. \subsection{Histogram Intersection} For 2 histograms $h_a$ and $h_b$ the histogram intersection is defined as \begin{eqnarray*} d_I(h_a,h_b) = \sum_{g \in G} min(h_a(g),h_b(g)) \end{eqnarray*} A high score indicates a good match while low score indicates bad match.if both histograms are normalized then 1 is a perfect match while 0 indicates total mismatch indicating no regions of histogram overlap. \subsection{Correlation} For 2 histograms $h_a$ and $h_b$ the histogram correlation is defined as ,where histograms are simple considered as discrete 1-D signals. \begin{eqnarray*} d_C(h_a,h_b) = \frac{\sum_{g \in G} (h_a(g)-\bar{h_a})(h_b(g)-\bar{h_b})}{\sqrt{ \sum_{g \in G} (h_a(g)-\bar{h_a})^2 \sum_{g \in G} (h_b(g)-\bar{h_b})}}

where \bar{h_k} = \frac{1}{N} \sum_j H_k(j)

N is the total number of bins \end{eqnarray*} If the histogram is normalized ,then $\sum_j H_k(j) = 1 ,\bar{h_k} = \frac{1}{N}$ The value of 1 represents a perfect match ,-1 indicates a ideal mismatch while 0 indidates no correlation.

For correlation a high score represents a better match than a low score. \subsection{Chi-Square} For 2 histograms $h_a$ and $h_b$ the histogram similarity using Ch-Squared method is defined as $d(h_a,h_b) = \sum_{g\in G} \frac{\left(H_a(g)-H_b(g)\right)^2}{H_a(g)}$ A low score represents a better match than a high score.0 is a perfect match while mismatch is unbounded.

The chi-square test is testing the null hypothesis, which states that there is no significant difference between the expected and observed result \subsection{Bhattacharya Distance} In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations $d(h_a,h_b) = \sqrt{1 - \frac{1}{\sqrt{\bar{h_a} \bar{h_b} N^2}} \sum_{g\in G} \sqrt{h_a(g) \cdot h_b(g)}}$ For Bhattacharyya matching , low scores indicate good matches and high scores indicate bad matches. A value of 0 is perfect match while 1 indicates a perfect mismatch.

#### multi-channel images

For histogram with independ channels,values of similarity scores for each channel can be added seperately.

#### Implementation

The opencv libraries are used to perform histogram related operations.All the histogram related functions are encapsulate in the class named histogram.

The code is present in files histogram.cpp and histogram.hpp which are wrapper classes for calling opencv histogram functions and performing comparison operations etc.

Let us consider the Hue histogram for following images and below are the similarity score using the bhattachrya distance method,chi-square,intersection and correlation methods.

Consider the sets of image

 1-2 0.5912 12.221 0.2798 0.2342 1-3 0.9841 3209.8 0.0027 -0.0775 1-4 0.7662 288.80 0.1975 0.0814

We can see that images 1 and 2 will have similar histograms ,for correlation and intersection for image pair 1-2 will be greater than 1-3 .Indicating the similarity of histograms 1-2 incomparison to 2-3.

Chi square mismatch is unbounded as can be seen by high value.A large value can be observed in 2-3 compared to 1-2

The bhattachrya distance is close to 0 for better match,and nearly 1 for mismatch, for 1-3 value is almost 1 indicating a ideal mismatch.

In case of image pair 1-2 and 1-4 are similar,but coefficients indicate that histogram of 1-2 is more similar than 1-4.
The similarity measures are crude at best,however it might be useful in case of discriminative task ,however for identification task it seems like a poor measure.

#### Code

For code refer to site http://code.google.com/p/m19404/source/browse/OpenVisionLibrary/ImgProc/ Histogram.cpp,Histogram.hpp,test_histogram.cpp files.

# Color Constancy Algorithms : Minkowski P Norm Method

#### Color Constancy

Color constancy is a mechanism of detection of color independent of light source. The light source many introduce color casts in captured digital images To solve the color constancy problem a standard method is to estimate the color of the prevailing light and then, at the second stage, remove it. Once the color of light in individual channels is obtained the each color pixel is normalized by a scaling factor . In the previous article we had looked at the gray world algorithm.The present article describes another method to achive color constancy using normalized minkowski p-norm.

#### normalized minkowski p-norm

Another variant to estimate illumination vector is to calculate the normalized minkowski p-norm for each color channel color constancy algorithm which is based on Minkowski norm - for each color channel, the Minkowski p-norm is calculated and the normalized result forms the estimated illumination vector.

The gray world algorithm is a special case of with norm =1.

$e_i = \left( \frac{1}{N} \sum_i p_i \right)^{1/p}$ where summation if over all the pixels.

The average illuminatio Gray world algorithm is obtained by setting p=1 The shades of gray, is given by Finlayson and Trezzi concluded that using Minkowski norm with p = 6 gave the best estimation results on their data set. One of the methods of normalization is that the mean of the three components is used as illumination estimate of the image. To normalize the image of channel i ,the pixel value is scaled by $s_1 = \frac{avg}{avg_i}$ where $avg_i$ is the channel mean and $avg$ is the illumination estimate .

Another method of normalization is normalizing to the maximum channel by scaling by $s_i$ $r_i = \frac{max(avg_R,avg_G,avg_B)}{avg_i}$

Another method of normalization is normalizing to the maximum channel by scaling by norm $m_i$ $m_i = \sqrt{(avg_r*avg_r+avg_g*avg_g+avg_b*avg_b)}$ $r_i = \frac{max(m_R,m_G,m_B)}{m_i}$

Below are some output images on which norm 6 was applied. Some of the images are taken from http://research.edm.uhasselt.be/~oancuti/Underwater_CVPR_2012/ image set

#### Code

For code refer to site http://www.gihub.com/pi19404/m19404/master/ColorConstancy/ The files are color_constancy.cpp and color_constancy.hpp.

The class for applying color constancy using minkowski p norm technique

The norm factor is $p=1$ for gray world algorithm and $p=6$ for shades of gray algorithm . Various normalization techniques can be passed as $m=(1,2,3)$.