Thursday, 24 April 2014

Adaptive Skin Color Detector

Adaptive Skin Color Detection


In this article we will look at Adaptive Skin Color Detection technique described in the paper \cite{Dadgostar:2006:ARS:1151986.1151991}.
  • Skin Color can be efficiently represented using Hue channel of HSV color space.
  • A static Skin color detector/Global Skin color detection can be specified by low and higher hue thresholds.
  • The hue range specified global skin color detector should detect the aactual skin colored pixels ,but it would also falsely detect some non skin colored pixels belonging to background or objects with similar hue color as skin like wood etc
  • The ammount of falsely detected pixels may be large in some situtations compared to actual skin pixels,if a significant ammount of scence contains objects with hue similar to the skin color.
  • The choise of image aquisition system,lighting conditions,pre-processing etc affect the choice of hue thresholds.
  • Hence the optimum thresholds needs to be decided adaptively.
  • In HCI applications hand or face regions are used to communicate with the computer and assuming that dominant motion in the scene belongs to hand and skin pixels.
  • One of the ways to detect the regions belonging to skin regions is motion tracking.
  • The Hue thresholds are adaptively changed by observing the motion of skin colored pixels in the image.
  • Thus the first step is to determine the in motion skin colored pixels.

    Global Skin Colored Detector

  • A Global Skin Colored Detector is specified by lower and upper Hue thresholds and lower and higher intensity thresholds.
  • Initals lower and upper hue thresholds are chosen as 3 and 33.
  • The Hue range provided is a generic thresholds that will cover all the possible skin colors.
  • Due to the generic nature of the skin threshold,some background objects whose hue is similar to skin or falls within the specified threshold may also be detected.
  • The initial lower and higher intensity thresholds are chosen as 15 and 250.
  • The thresholds are choosen such to avoid over or under-exposed regions in the images.

    Skin Color Histogram

  • The filtered skin colored pixels can be used to construct a skin colored histogram which represents the statistical distribution of the skin colored pixels in the scene.
  • In case of global Skin colored detector,this histogram also accumulates the data due to background pixles.
  • Let us known assume that we known the pixels that belong to hand or face pixels.
  • We again compute the histogram of skin colored pixels.
  • The Actual Skin Colored histogram is merged with original skin colored histogram
  • The histograms are combined using a weighted average

  • A good result is obtained by choosing a value in the range 0.02-0.05 for a.
  • For each frame the range of hue thresholds are re-calculated based on new histogram.
  • The criteria used for selection of lower and upper thresholds is such that area under the histogram covers f%.
  • In the paper a criteria of 90-96\% was used

    Global Threshold
    new Threshold
  • Without any additional information ,the criteria used for selection of lower and upper threholds ,simply is instrumental in removal of outliers.
  • In the figures shown ,some background pixels,pixels belonging to hair,lips etc are also shown as skin colored pixels.
  • Now let is consider pixels which belong to face,This in given manually by specifying a mask.Pixels in ROI (168,63,50,50) is explicitly specified as skin colored pixels.
  • A histogram is computed by considering only the pixels in ROI.
  • The global histogram and newly constructed histogram are combined by performing a weighted average.
  • Obviously the pixels in histogram computed over entire image are large than ones computed in small ROI,to avoid bias due to count of pixels used to build the histogram, the histograms are normalized between 0 to 1,before computing the linear combination.
  • Then we determine the pixels between which 90\% of pixels lie.
  • The hue range corresponding to this is (13,16)
  • The skin color detected considering the new range in shown in figure fig:image5

    New threshold
    Skin Image
  • Thus incorporating the cue's about skin color,enhances the detection performane of skin colored pixels.
  • In the above example the cue has been encorporated manually,however if we can incorporate the cur automatically then we have make the process of skin color detection completely adaptive.
  • Some techniques suggested in the paper were based on using motion based cue like frame differencing and optical flow tracking.
  • These techniques are suitable for HCI application assuming the object of interest is in motion.
  • Frame differencing provides a simple method to determine the region which encountered motion and use these pixels .


    The code for the same can be found at OpenVision Repository The class AdaptiveSkinDetector encapsulates the methods for implementing the skin detector. The code for the same can be found in files ImgProc/adaptiveskindetector.cpp IgmProc/adaptiveskindetector.hpp. For histogram computation the class Histogram is used which can be found in the filesImgProc/Histogram.cpp,ImgProc/Histogram.hpp
    1#include "ImgProc/adaptiveskindetector.hpp"
    3AdaptiveSkinDetector ss1;
    5Mat hmask;,hmask);

  1. Farhad Dadgostar and Abdolhossein Sarrafzadeh. An Adaptive Real-time Skin Detector Based on Hue Thresholding: A Comparison on Two Motion Tracking Methods . In: Pattern Recogn. Lett. 27.12 (Sept. 2006), pp. 1342 1352.  issn:0167-8655. doi: 10.1016/j.patrec.2006.01.007. url: http://dx.doi. org/10.1016/j.patrec.2006.01.007.

The PDF version of the document can be found below

Multi Class Logistic Regression Training and Testing


In this article we will look at training and testing of a Multi-class Logistic Classifier
  • Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability.
  • Mathematically this can be expressed as \begin{eqnarray*} & P(Y=i|x, W,b) =\frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}
  • Corresponding to each class $y_i$ logistic classifier is paramemterized by a set of parameters $W_i,b_i$.
  • These parameters are used to compute the class probability.
  • Given a unknown vector x,The prediction is performed as \begin{eqnarray*} & y_{pred} = argmax_i P(Y=i|x,W,b) \\ & y_{pred} = argmax_i \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \end{eqnarray*}
  • Given a set of labelled training data ${X_i,Y_i}$ where $i\ in {1,\ldots,N}$ we need to estimate these parameters.

    Loss Function

  • Ideally we would like to compute the parameters so that the $0-1$ loss is minimized \begin{eqnarray*} & \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}} \\ & f(x)= argmax_k P(Y=y_k |x,\theta) \end{eqnarray*}
  • $P(Y=y_k |x,\theta)$ is modelled using logistic function.
  • The $0-1$ loss function is not differentiable ,hence optimizing it for large modes is computationally infesible.
  • Instead we maximize the log-likelyhood of the classifier given the training data $\mathcal{D}$.
  • Maximum Likelyhood estimation is used to perform this operation.
  • Estimate the parameters so that likelyhood of training data $\mathcal{D} $ is maximized under the model parameters
  • It is assumed that the data samples are independent ,so the probability of the set is the product of probabilities of individual examples. \begin{eqnarray*} & L(\theta={W,b},\mathcal{D}) =argmax \prod_{i=1}^N P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = argmax \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = - argmin \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ \end{eqnarray*}
  • It should be noted that Likelyhood of correct class is not same as number of right predictions.
  • Log Likelyhood function can be considered as differential version of the $0-1$ loss function.
  • In the present application negative log-likelyhood is used as the loss function
  • Optimal parameters are learned by minimizing the loss function.
  • In the present application gradient based methods are used for minimization.
  • Specifically stochastic gradient descent and conjugated gradient descent are used for minimization of the loss function.
  • The cost function is expressed as \begin{eqnarray*} & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log P(Y=y_i | X=x_i,W,b) \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log \frac {e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N log {e^{W_i x + b_i}}- log {\sum_j e^{W_j x + b_j}} \\ & L(\theta,\mathcal{D}) = - \sum_{i=1}^N {W_i x + b_i} + log \frac{1}{\sum_j e^{W_j x + b_j}} \\ \end{eqnarray*}
  • The first part of the sum is affine,second is a log of sum of exponentials which is convex Thus the loss function is convex.
  • Thus we can compute the parameters corresponding to global maxima of the loss function using gradient descent methods.
  • Thus we compute the derivatives of the loss function $L(\theta,\mathcal{D})$ with respect to $\theta$,$\partial{\ell}/\partial{W}$ and $\partial{\ell}/\partial{b}$


  • Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
  • It is a expression compiler,and can evaluate symbolic expression when executed.Typically programs which are implemented in C/C++ can be written concisely and efficiently in Theano.
  • computing the gradients in most programming languages (C/C++, Matlab, Python), involves manually deriving the expressions for the gradient of the loss with respect to the parameters $\partial{\ell}/\partial{W}$, and $\partial{\ell}/\partial{b}$,
  • This approah not only involves manual coding but the derivatives can get difficult to compute for complex models,
  • With Theano, this work is greatly simplified as it performs automatic differentiation .


  • For demonstration,we will use MNIST dataset The MNIST dataset consists of handwritten digit images and it is divided in 60,000 examples for the training set and 10,000 examples for testing. The official training set of 60,000 is divided into an actual training set of 50,000 examples and 10,000 validation examples All digit images have been size-normalized and centered in a fixed size image of 28 x 28 pixels. In the original dataset each pixel of the image is represented by a value between 0 and 255, where 0 is black, 255 is white and anything in between is a different shade of grey.
  • The dataset can be found at
  • The data set is pickled can be loaded using python pickle package.
  • The data set consists of training,validation and test set.
  • The data set consists of feature vector of length $28x28=784$ and number of classes are 10.

  • A class called Logistic Regression is defined which encapsulates the methods that are used to perform training and testing of multi-class Logistic Regression classifier.

    Theano Code

  • The python code for training and testing can be found in the git repository ImgML/ file.
  • the ImgML/ contains methods to load datasets from pickel files or SVM format files.
  • C/C++ code has been also written using Eigen/OpenCV and incorporated in OpenVision Library. This can be found in the files ImgML/LogisticRegression.cpp and ImgML/LogisticRegression.hpp files.

The PDF Version of the document can be found below

ARM Neon Optimization for image interleaving and deinterleaving

ARM Neon Optimization InterLeaving/De-Interleaving


  • In this article we will look at basic interleaving and de-interleaving operations using ARM Neon optimization and evaluate the performance improvements on android based mobile device in comparison with standard opencv code

    ARM Neon

  • ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of a piece of code.
  • SIMD technology allows process multiple data with one instruction call, saving time for other computations A set of pixels will be processed at a time.
  • One way to achieve this is to write assembly code ,which requires a steep learning curve and requires knowledge of processor architecture,instruction set etc.
  • Instead of using low-level instructions directly. There are special functions, called intrinsic, which can be treated as regular functions but they works with input data simultaneously.

    Deinterleaving and Interleaving channels of Image

  • NEON structure loads read data from memory into 64-bit NEON registers, with optional deinterleaving. Stores work similarly, reinterleaving data from registers before writing it to memory.
  • A set of neon intrinsic instruction set are provided for deinterleaving data.
  • The simultaneously pull data from the memory and seperate the data into different registers This is called deinterleaving .
  • The Neon structure loads the data from the memory into 64 bit neon registers with optional interleaving.
  • The opencv funtions split and merge are ported to arm neon and performance comparision with opencv code is performed.
  • Data loads interleaves elements based on the size specified in the instruction .


  • The de-interleave seperates the pairs of adjacenet elements in the memory into seperate registers.
  • the VLD3 instruction seperates/de-interleaves the BGR channels of the image and sperates them into 3 different registers.The BGR values are stored in adjacent memory locations.
  • The result of vld instruction is then stored to registers which point to destination memory location


  • Since the application is being developed for android applications,the android NDK toolchain is used for cross compilation.
  • ndk-build utility is used to build the application.
  • The ndk-build utility requires that based build directory contain a directory called \textbf{jni}
  • The jni directory contains all the source files as well as \textbf{,} which are the makefile for build process.
  • In the present application the jni directory contains the files \textbf{neon.cpp,helloneon-intrinsics.c,helloneon-intrinsics.h} source files.
  • To initiated the build process in verbose mode execute the command
  • This generates the helloneon binary in the \textbf{libs/armeabi-v7a} directory
  • The directory is transferred to the directory \textbf{/data/tmp/local/NEON_TEST} on android mobile device
  • The \textbf{/data/tmp/local} directory and files created under this directory can contains files with execute permission.I could not find any other sud-directory under the file system which provided execute permission for binaries or ability to provide execute permissions for binaries.
  • The script a.ksh being called below exports basic variables and then executes the binary.
  • The performance of neon intrinsic function is compared with standard opencv split function
    OPENCV : 15ms
    NEON : 11ms
  • There is not a very significant improvement seen due to neon optimization.
  • As per many references and by viewing the disassembly output of the compiler it can be seen that the main reason was found that the arm compiler is not able to generate optimized assembly code .
  • The compiler generates heavily unoptimized code that results in larger number of cycles than required.
  • The compilation commands were taken from the ndk-build verbose build output and the -c flag was replaced with -s to generate the assembly code
  • The above command will generate the the file \textbf{helloneon-intrinsics.s} in the present directory
  • A lot of unecessary instruction can be observed in the assembly code.
  • The assembly level code corresponding to the functions were optimized and compiled
  • For compilation again the debug build output observed from ndk-build process as modified so that \textbf{helloneon-intrinsics.o} object file is compiled from \textbf{helloneon-intrinsics.s} and helloneon binary file is compiled and linked from all source files.
  • The results of the optimization process is as follows
    OPENCV : 15ms
    NEON : 8ms
  • Thus a speedup factor of 1.4 and total performance improvement of 2.5x was observed.
  • Thus it can be seen that atleast 2.5x improvement is observed after optimizing the assembly code.
  • This still does not motivate the use of assembly level coding since the developement effort may outweight the optimization benifits.


  • The interleaving operation corresponds to combining 3 independent channels of a image into multi-channel image.
  • Each element of idependent channels are stored in adjacent locations in the multi-channel image.
  • The performance is as follows :
    OPENCV : 9ms
  • The interleaving process shows a performance improvement of about 3x.
  • Thus by using neon intrinsics we can achieve performance improvements wrt standard C code and by optimizing the assembly code further performance benifits can be achived.
  • It is to be noted that OPENCV code is compiled with SSE optimization which may also be in play hence the actual code speedup may be higher.
  • However a large speedup was not observed in the interleaving and de-interleaving operation due to optimizing the assembly code .


  • The code for the same can be found in the git repository in the POC/ARM subdirectory.
  • The jni subdirectory consists of the source files as well as the make files.
  • The files \textbf{generate_assembly.ksh} generate the helloneon-intrinsis.s files in the ARM directory.After modifying the file copy it to the jni sub-directory,
  • \textbf{compile_assembly.ksh} compiles the helloneon-intrinsis.s and also the binary file
  • The binary requires the opencv library files which needs to be transferred to the android mobile device
  • The link Execution Cycle computation : shows the number of execution cycles taken by ARM assembly code ,which can be used to check the performance of compiler generated and optimized code.
The PDF version of the document can be found below

Dense Optical Flow Expansion Based On Polynomial Basis Approximation

Dense Motion Estimation based on Polynomial expansion


In this article we will look at dense motion estimation based on polymonial repsentation of image.The polynomial basis representation of the image is obtained by approximating the local neighborhood of image using quadratic polynomial basis.The displacement between adjacent frames can be obtained by equating the coefficients of the basis.


  • This article describes a fast dense optical flow computation algorithm by [4] . In the earlier articles it was seen that a local neighborhood of image can be represented using polynomial basis. Using this representation estimation of dense optical flow is obtained at each point in the image.

    2D basis function
    Polynomial basis
  • Assuming how a polynomial transforms under translation using the polynomial expansion coefficients derived from current and previous frames and estimate of displacement vector is obtained.
  • The idea of polynomial expansion is to approximate a neighborhood of a point in a 2D function with a polynomial.Considering quadratic polynomial basis $1,x^2,y^2,x,y,xy$,pixel values in a neighborhood of image is represented by \[ f(x) ~ x^T A x + b^T x + c \] where A is a symmetric matrix,b is a vector and c is a scalar
  • The coefficients can be estimated by weighted least square estimate of pixel values about neighborhood as seen in the earlier article. As with all optical flow algorithm,the brightness constancy assumption is made. The brighness of a path of image in adjacent frames is constant.
  • Consider a translational motion $\mathbf{d}$ ecountered at point $\mathbf{(x,y)}$ in the image. \begin{eqnarray*} \hline \\ & f_1(x) = x^T A_1 x + b_{1}^Tx +c_1 \\ & f_2(x) = f_1(x-d) = (x-d)^T A_1 (x-d) + b_{1}^T(x-d) +c_1 \\ & f_2(x) = f_1(x-d) = (x)^T A_1 (x) + (b_{1}-2A_1d)^T(x) \\ & + d^TAd-b_1^Td+c_1 \\ & f_2(x) = f_1(x-d) = (x)^T A_1 (x) + b_2^T(x) + c_2 \\ & equating coefficients in the two polynomials \\ & Due to brightness constancy assumption \\ & A_2 = A_1 \\ & b_2=(b_{1}-2A_1d) \\ & c_2=d^TAd-b_1^Td+c_1 \\ & Assuming $A$ is non-singular \\ & d=-\frac{1}{2}A^-1(b_2-b_1) \\ \hline \end{eqnarray*} Thus by equating the coefficients of the polynomial the displacement vector can be obtained at each point in the image assuming there is overlap between the region of interest ie image neighborhood in adjacent frames.
  • Let us say we have the estimate of displacement $\bar{d}$ We extract the ROI about the neighborhood at point $P(x,y)$ and point $P(x+d.x,y+d.y)$ The polynomial basis are extracted and the computation that is shown above is performed.
  • The total displacement can be estimated as \begin{eqnarray*} \hline \\ &\bar{b}_2=b_{1}-2A_1 (\bar{d}) \\ & \text { we know $\bar{d}$,$A_1$,$b_1$} \\ & d = -0.5*A^{-1} (b_2+\bar{b}_2 - b_1) \\ \hline \end{eqnarray*}
  • Thus an iterative scheme can be used where in every successive iteration a better estimate of displacement vector is obtained.The iterations can be terminated when change is displacement vector is below is threshold in successive iterations or specific number of iterations have been completed. The intial estimate of displacement vector is assumed to be $\mathbf{(0,0)}$ Thus ROI or the image patch in the current and previous frames are at the same.
  • The method $\textbf{EstimateFlow}$ computes the coefficients $A,b_2,b_1$ required for displacentt field computation.The $\textbf{EstimateFlow}$ functions call the method $\textbf{UpdatePoly}$ for each pixel in the image.
  • The displacement field obtained may be discontinuous and contain noise and other atrifacts.Since it is reasonable to assume that if motion is encounted at a point, the neighborhood pixels also encounter the same motion. The displacement vector can be averaged over a neighborhood to get a better estimate of the displacement field.
  • The method $\textbf{AverageFlow}$ computes the average of coefficients $A_1,b_2,b_1$ and then computes the displacement flow field. \\
  • This approach may in case of large displaement.Hence a multi scale estimation is performed. The estimation of flow field is performed as the smallest resolution.The displacement computed at the lower resolution is used as estimate for peform displacement field computation at higher resolution.
  • Dense optial flow computed for two frames is shown in figure 2b
    Frame 1
    Frame 2
    Optical Flow
    Displacement field


    This article describes the theory and implementation details of the dense optical flow algorithm based on paper by [4].This code for the algorithm can be found at github repository in files DenseOf.cpp and DenseOf.hpp files. In the future article we will look at optimizing the code using SSE,NEOM and OpenCL optimizations to enable real time computation of the dense optical flow fields

  1. Kenneth Andersson and Hans Knutsson.  Continuous normalized convolution.In: ICME (1). IEEE, 2002, pp. 725728. isbn: 0-7803-7304-9. url:
  2. Kenneth Andersson, Carl-Fredrik Westin, and Hans Knutsson.  Prediction from o-grid samples using continuous normalized convolution. In: Signal Processing 87.3 (Mar. 22, 2007), pp. 353365. url: http://dblp.uni-
  3. Gunnar Farnebäck.  Motion-based Segmentation of Image Sequences. LiTH-ISY-EX-1596. MA thesis. SE-581 83 Linköping, Sweden: Linköping University, 1996.
  4. Gunnar Farnebäck.  Polynomial Expansion for Orientation and Motion Estimation. Dissertation No 790, ISBN 91-7373-475-6. PhD thesis. SE-581 83 Linköping,Sweden: Linköping University, Sweden, 2002.
  5. Gunnar Farneback.  Two-Frame Motion Estimation Based on Polynomial Expan-sion. In: SCIA. LNCS 2749. Gothenburg, Sweden, 2003, pp. 363370.

OpenVision Library Gaussian Mixture Model Implementation

OpenVision Library Gaussian Mixture Implementation


  • In this article we will look at gaussian mixture model.
  • Mixture Models are a type of density model which comprise a number of component functions, usually Gaussian.
  • These component functions are combined to provide a multimodal density.
  • GMM is a weighted average of gaussians where gaussian by its own mean can covariance matrix matrix
  • The mixture density is given by \begin{eqnarray*} & P(X|\mu,\Sigma,w) = \sum_k w_k \mathcal{N}(X,\mu_k,\Sigma_k) \end{eqnarray*}
  • To compute the likelyhood that a random variable is sampled from the mixture model we just need to compute the probability of individual gaussian component and then find the weighted average.
  • Instead of probability we can compute the log likelyhood of gaussian components which would simply be the log of sum of weighted exponentials. \begin{eqnarray*} & L(X|\mu,\Sigma,w) = log \sum_k exp (log w_k +log \mathcal{N}(X,\mu_k,\Sigma_k) ) \end{eqnarray*}
  • A gaussian mixture model is characterized by
    • Number of mixture components
    • Weights of mixtures
    • Mean and covariances of individual mixture components
  • Thus given the model parameters computation of probability that a vector is sampled from the gaussian mixture model is fairly simple
  • For example if we consider the following mixture models
  • The probability that vector X belongs to GMM is computed as $0.0580374$ using the method Prob
  • The probability that vector X belongs to individual mixture components is computed as $0.145093:9.5455e-17$
  • Both the above results can be verified by manual calculation or other packages etc.


  • The code for the same can be found in OpenVision git repository in files ImgMl/gaussianmixture.cpp and ImgMl/gaussianmixture.hpp files.
  • OpenVision is attempt at a opensource C/C++ Library developed in C/C++ using OpenCV,Eigen etc providing modular interface for image processing,computer vision,machine learning applications.
The PDF Version of the document can be found  below

Implementation of Discrete Hidden Markov Model for Sequence Classification in C++ using Eigen

Discrete Hidden Markov Models for Sequence Classification


In this article we will look at hidden Markov models and its application in classification of discrete sequential data.
  • Markov processes are examples of stochastic processes that generate random sequences of outcomes or states according to certain probabilities.
  • Markov chains can be considered mathematical descriptions of Markov models with a discrete set of states.
  • In Markov chain the observed variables are states ,now we will consider the case that the observed symbol is not directly the states but some random variable related to the underlying state sequence.
  • The true state of the system is latent and unobserved.

    Hidden Latent Variables

  • Let $\mathcal{Y}$ denote a sequence of random variables taking the values in a euclidean space $\mathcal{0}$
  • A simplistic model is where each observation state is associated with a single hidden state,Each hidden state emits a unique observation symbol. \\ In this case if we know exactly if a observed variable corresponds to a hidden state then the problem reduces to Markov chain. \\
  • However it may happen that observed state corresponds to more than one hidden state with certain probabilities. For example \begin{eqnarray*} & P(X=x_0|Y=y_1) = 0.9 \\ & P(X=x_0|Y=y_2) = 0.1 \\ \end{eqnarray*}
  • Thus if the hidden state $x_0$ ,it is likely to emit a observed state $Y=y_1$ with probability 0.9 and observed state $y_1$ with probability 0.1
  • Thus we can consider than 90\% of time if we observe the state $y_1$ and 10\% of times we will observe the $y_2$
  • The random variables ${Y}$ are not independent nor do they represent samples from a Markov process.
  • However given a realization ${X_i=x_i}$ the random variable $Y_i$ is independent of other random variables ${X}$ and therefore other ${Y}$
  • Given that we know the hidden/latent state,the observation probabilities are independent. \begin{eqnarray*} P(Y_1 Y_2 | X=x_i) = P(Y_1 | X=x_i) P(Y_2 | X=x_i) \end{eqnarray*}
  • The sequence of observation/emission and corresponding probabilities that emission corresponds to hidden state is specified by a emission matrix.
  • If N is the number of state and M is number of observations the emission matrix is $NxM$ matrix.
  • The model is called discrete hidden Markov model since the emission probabilities are discrete in nature,another class of hidden Markov models exist which are called continuous hidden Markov model wherein the emission probabilities are continuous and modeled using parametric distribution.

    Forward algorithm

  • The probability of observed sequence given we are in a hidden state $z_n$ can be computed using forward algorithm
  • In forward algorithm we compute the probability of observed sequence from first element of sequence to last element of sequence as opposed to backward algorithm where we compute the same starting from last sequence moving backward for 1st element of sequence.
  • The forward and backward algorithm can be used to compute the probability of observed sequence
  • The idea is to estimate the underlying hidden states which are Markov chains.
  • Since we have N hidden states we can compute NXL matrix containing the probabilities of observing the hidden state ,where L is length of sequence.
  • The result is stored in the matrix $_\alpha$
  • Each column of matrix $_\alpha$ ie $_\alpha(j,i)$ provides the probabilities of observing the sequence at each increment of the sequence.
  • The Forward algorithm computes a matrix of state probabilities which can be used to assess the probability of being in each of the states at any given time in the sequences. as $\sum_j \alpha(j,len-1)$ \begin{eqnarray*} Let \\ & \alpha(z_n) =p(x_1,\ldots,x_n,z_n) \\ & \alpha(z_1) =p(x_1,z_n)=p(x_1|z_1)p(z_1) \\ & \alpha(z_{n-1}) =p(x_1,\ldots,x_{n-1},z_{n-1}) \\ & \alpha(z_n) =\sum_{z_{n-1}}p(x_1,\ldots,x_{n-1},x_n,z_{n-1},z_n) \\ & \alpha(z_n) =\sum_{z_{n-1}}p(x_1,\ldots,x_{n-1},z_{n-1})P(x_n,z_n|x_1,\ldots,x_{n-1},z_{n-1}) \\ & \alpha(z_n) =\sum_{z_{n-1}}p(x_1,\ldots,x_{n-1},z_{n-1})P(x_n|z_n,x_1,\ldots,x_{n-1},z_{n-1})\\ & P(z_n|z_n,x_1,\ldots,x_{n-1},z_{n-1}) \\ & \text{Since $x_n$ is independent of all other states given $z_n$}\\ & \text{Since $z_n$ is independent of all other $z_i$ given $z_{n-1}$} \\ &=\sum_{z_{n-1}}p(x_1,\ldots,x_{n-1},z_{n-1})P(x_n|z_n)P(z_n|z_{n-1}) \\ & \alpha(z_n) =\sum_{z_{n-1}}\alpha(z_{n-1}) P(x_n|z_n) P(z_n | z_{n-1})\\ \end{eqnarray*}

    Observation Sequence Likelyhood

  • This if we are given a random sequence of observations $x_1,\ldots,x_n$
  • Starting from $\alpha(z_1)$ computer $\alpha(z_n)$
  • After which $P(X={x_1,\ldots,x_n})=\sum_{z_n} \alpha(z_n)$ can be computed.
  • This gives us a estimate of probability of observing the sequence $x_1,\ldots,x_n$ using the forward algorithm.
  • The probability estimate is often computed as log probability \begin{eqnarray*} & P(x_1,\ldots,x_n) = \sum_{z_n} \alpha(z_n) \\ & prob= \prod_{i=1}^n \sum_{z_n} \alpha(z_n) \\ & logprob = \sum_n log(\sum{z_n} \alpha(z_n)) \\ \end{eqnarray*}

    Sequence Classification

  • Again for sequence classification assume we have two hidden markov models $\theta_1 = (\pi_!,trans_1,emission_2)$ and $\theta_1 = (\pi_2,trans_2,emission_2)$
  • Given a observation sequence ${X}={x_1,\ldots,x_n}$ we compute the probability of observing the sequence or probability that models have generate the observation sequence.
  • The observation sequence is estimated to be produced by the model which exhibits the highest probability to have generated the sequence. \begin{eqnarray*} y = argmax_{n=1}^N prob({X} | \theta_n) \end{eqnarray*}


  • code for discrete hidden Markov models can be found at git repository in files \ImgML/hmm.hpp and ImgML/hmm.cpp