Saturday 31 December 2016

Visualizing the effects of neural network architectural choices on the prediction surface

Neural networks (NNs) have seen unparalleled success in a wide range of applications. They have ushered in a revolution of sorts in industries such as transportation and grocery shopping by facilitating technologies like self-driving cars and just-walk-out stores. This has partly been made possible by a deluge of mostly empirically verified tweaks to NN architectures that have made training neural networks easier and more sample efficient. The approach used for empirical verification has almost always been to take a task and dataset (for instance, CIFAR for visual classification), apply the proposed architecture to the problem, and report the final accuracy. This final number, while validating the merit of the proposed approach, reveals only part of the story. This post attempts to add to that story by asking the following question - what effect do different architectural choices have on the prediction surface of a neural network?

How is this blog different from other efforts to visually understand neural networks? 


There are 2 important differences:
  1. Complexity and Dimensionality of Datasets: Most works try to understand what the network has learned on large, complex, high dimensional datasets like ImageNet, CIFAR or MNIST. In contrast, this post deals only with 2D data and mathematically well defined decision boundaries. 
  2. Visualizing the learned decision function vs visualizing indirect measures of it: Because of the dimensionality and size of those datasets, researchers are forced to visualize features, activations or filters (see this tutorial for an overview). However, these visualizations are an indirect measure of the functional mapping from input to output that the network represents. This post takes the most direct route - use simple 2D data with well defined decision boundaries, and directly visualize the prediction surface.

Source Code


The code used for this post is available on Github. It's written in Python using Tensorflow and NumPy. One of the purposes of this post is to provide readers with an easy to use, easily extensible code base that allows further experimentation along this direction. Whenever you come across a new paper on arXiv proposing a new activation function, initialization scheme, or other tweak, you can try it out quickly and see results within minutes. The code can easily be run on a laptop without a GPU.

Experimental setup


Data: A fixed number of samples are uniformly sampled from a 2D domain, $[-2,2] \times [-2,2]$, for all experiments presented here. Each sample $(x,y)$ is assigned a label using $l=\operatorname{sign}(f(x,y))$, where $f:\mathbb{R}^2 \rightarrow \mathbb{R}$ is a function of your choice (a minimal sketch of this sampling step follows the list). In this post, all results are computed for the following 4 functions of increasing complexity:
  1. $y-x$
  2. $y-|x|$
  3. $1-\frac{x^2}{4}-y^2$ 
  4. $y-\sin(\pi x)$
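As a concrete illustration, here is a minimal NumPy sketch of the sampling and labeling step described above (illustrative only, not the exact code from the repository; the function names are my own):

```python
import numpy as np

def sample_data(f, n_samples=10000, low=-2.0, high=2.0, seed=0):
    """Uniformly sample 2D points and label them with sign(f(x, y))."""
    rng = np.random.RandomState(seed)
    xy = rng.uniform(low, high, size=(n_samples, 2))
    labels = np.sign(f(xy[:, 0], xy[:, 1]))   # values in {-1, 0, +1}
    labels[labels == 0] = 1                   # break ties arbitrarily
    return xy, labels

# The four decision functions used in this post
functions = {
    'line':    lambda x, y: y - x,
    'abs':     lambda x, y: y - np.abs(x),
    'ellipse': lambda x, y: 1 - x**2 / 4 - y**2,
    'sine':    lambda x, y: y - np.sin(np.pi * x),
}

xy, labels = sample_data(functions['ellipse'])
```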
NN Architecture: We will use a simple feedforward architecture with:
  1. an input layer of size $2$,
  2. $n$ hidden layers with potentially different number of units in each layer, 
  3. a non-linear activation function on each unit,
  4. optionally, batch normalization and residual connections (used only in some experiments),
  5. an output layer of size $2$ that produces the logits, 
  6. a softmax layer to convert logits to probability of each label $l \in \{1,-1\}$.
Training: The parameters are learned by minimizing a binary cross entropy loss using SGD. Unless specified otherwise, assume $10000$ training samples, a mini-batch size of $1000$, and $100$ epochs. All experiments use a learning rate of $0.01$. Xavier initialization is used to initialize weights in all layers.  
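For concreteness, here is a minimal sketch of the barebones model and training setup using the Keras API (the accompanying repository uses lower-level TensorFlow, so treat this as an assumption-laden illustration; it reuses `xy` and `labels` from the sampling sketch above):

```python
import tensorflow as tf

def build_model(hidden=(10, 10, 10, 10), activation='relu'):
    """Feedforward net: 2 inputs -> hidden layers -> 2 logits."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(hidden[0], activation=activation, input_shape=(2,),
                                    kernel_initializer='glorot_uniform'))  # Xavier init
    for width in hidden[1:]:
        model.add(tf.keras.layers.Dense(width, activation=activation,
                                        kernel_initializer='glorot_uniform'))
    model.add(tf.keras.layers.Dense(2))  # logits; the softmax is folded into the loss
    return model

model = build_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Map labels from {-1, +1} to {0, 1} for the cross entropy loss.
model.fit(xy, ((labels + 1) // 2).astype(int), batch_size=1000, epochs=100, verbose=0)
```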

Experiments: We will explore the following territory:
  1. Effect of activations
  2. Effect of depth
  3. Effect of width
  4. Effect of dropout
  5. Effect of batch normalization
  6. Effect of residual connections 
Each experiment will try to vary as few factors as possible while keeping others fixed. Unless specified otherwise, the architecture consists of 4 hidden layers with 10 units each and relu activation. No dropout or batch normalization is used in this barebones model.
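The prediction surface itself is just the predicted class probability evaluated on a dense grid over the domain. A minimal matplotlib sketch of the visualization, continuing from the model sketch above (illustrative, not the repository's plotting code):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_prediction_surface(model, low=-2.0, high=2.0, resolution=200):
    """Evaluate P(label = +1) on a dense grid and draw filled contours."""
    xs = np.linspace(low, high, resolution)
    xx, yy = np.meshgrid(xs, xs)
    grid = np.stack([xx.ravel(), yy.ravel()], axis=1)
    logits = model.predict(grid, verbose=0)          # model outputs logits
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    plt.contourf(xx, yy, probs[:, 1].reshape(xx.shape), levels=20, cmap='RdBu')
    plt.colorbar(label='P(label = +1)')
    plt.show()

plot_prediction_surface(model)
```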

Effect of activation




This experiment compares 4 activation functions - sigmoid, tanh, relu, and elu. There are a number of interesting things to observe:

  • Sigmoid performs poorly. This is because of its large saturation regions which yield very small gradients. It is possible that results could be improved by using a higher learning rate.
  • Given that sigmoid fails miserably, it's surprising that tanh works so well since it has a shape similar to sigmoid but shifted to lie between $[-1,1]$ instead of $[0,1]$. In fact tanh works better than relu in this case. It might simply be because it receives larger gradients than sigmoid due to its larger slope around $0$ (the derivative of tanh at $0$ is $1$, compared to $0.25$ for sigmoid; see the small slope check after this list). The elu paper also points to some references that provide theoretical justification for using centered activations.
  • Many people don't realize that NNs with relus, no matter how deep, produce a piecewise linear prediction surface. This can be clearly observed in the jagged contour plots for the ellipse and sine functions. 
  • It is also interesting to compare elu and relu. While relus map a negative input $x$ to $0$, elus instead use $e^x-1$. See the original elu paper for the theoretical justification of this choice.
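To make the slope comparison concrete, here is a quick check of the four activations and their one-sided slopes at the origin (plain NumPy, purely illustrative):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1.0))

eps = 1e-4
for name, f in [('sigmoid', sigmoid), ('tanh', tanh), ('relu', relu), ('elu', elu)]:
    slope = (f(eps) - f(0.0)) / eps   # ~0.25 for sigmoid, ~1.0 for the rest
    print(f'{name:8s} slope at 0+ = {slope:.3f}')
```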

Effect of depth




As mentioned in the previous section, relus produce piecewise linear functions. From the figure above we observe that the approximation becomes increasingly accurate with higher confidence predictions and crisper decision boundaries as the depth increases.

Effect of width




All networks in this experiment have 4 hidden layers but the number of hidden units vary:
  • Narrow Uniform: $[5,5,5,5]$
  • Medium Uniform: $[10,10,10,10]$
  • Wide Uniform: $[20,20,20,20]$
  • Very Wide Uniform: $[40,40,40,40]$
  • Decreasing: $[40,20,10,5]$
  • Increasing: $[5,10,20,40]$
  • Hour Glass: $[20,10,10,20]$
As with depth, increasing the width improves performance. However, comparing Very Wide Uniform with the 8-hidden-layer network of the previous section (which has the same width as the Medium Uniform network), increasing depth seems to be a significantly more efficient use of parameters (~5k vs ~1k). This result is theoretically proved in the paper Benefits of depth in neural networks. One might want to reduce the parameter count of Very Wide Uniform by reducing the width with depth (layers closer to the output are narrower). The effect of this can be seen in Decreasing, and the effect of reversing the order in Increasing. I have also included results for an Hour Glass architecture whose width first decreases and then increases. More experiments are needed to comment on the effect of the last three configurations.

Effect of dropout


Dropout was introduced as a means to regularize neural networks, and it does exactly that in the above results. The amount of regularization is inversely related to the keep probability. Comparing the last row with the 4-hidden-layer network in the Effect of Depth section, we see a quite significant regularization effect even with a high keep probability. This comparison is not entirely fair, though, since there is no noise in the data.

Effect of Batch Normalization




Batch normalization is reported to speed up training by a factor of 13 for ImageNet classification tasks. The above figure shows the benefit gained by using batch normalization. You will find that this model with batch normalization beats the 8 hidden layer network without batch normalization in the Effect of Depth section. You can also compare it to elu in Effect of Activations section. elu was proposed as a cheaper alternative to Batch normalization, and indeed they compare well.

Effect of Residual Connection




The above figure compares networks with and without residual connections trained on different numbers of training examples. Whenever the input and output size of a layer match, the residual connection adds the input back to the output before applying the activation and passing the result on to the next layer. Recall that the residual block used in the original ResNet paper used 2 convolutional layers with a relu in between; here a residual block consists of a single fully connected layer with no intermediate relu activation. Even then, residual connections noticeably improve performance. The other purpose of this experiment was to see how the prediction surface changes with the number of training samples available. A mini-batch size of 100 was used for this experiment since the smallest training set size used is 100.
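For reference, here is a minimal sketch of the fully connected residual block just described, written with the Keras functional API (an illustration of the idea, not the repository's implementation):

```python
import tensorflow as tf

def dense_residual_layer(x, units):
    """Single fully connected layer; add the input back before the activation
    whenever the input and output widths match."""
    pre_activation = tf.keras.layers.Dense(units)(x)
    if x.shape[-1] == units:
        pre_activation = tf.keras.layers.Add()([pre_activation, x])
    return tf.keras.layers.Activation('relu')(pre_activation)

inputs = tf.keras.Input(shape=(2,))
h = dense_residual_layer(inputs, 10)      # 2 -> 10: sizes differ, no skip here
for _ in range(3):
    h = dense_residual_layer(h, 10)       # 10 -> 10: skip connection active
logits = tf.keras.layers.Dense(2)(h)
residual_model = tf.keras.Model(inputs, logits)
```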

Conclusion


Neural networks are complicated. But so are the high dimensional datasets that are used to understand them. Naturally, trying to understand what deep networks learn on complex datasets is an extremely challenging task. However, as shown in this post, a lot of insight can be gained quite easily by using simple 2D data, and we have barely scratched the surface here. For instance, we did not even touch on the effects of noise, generalization, training procedures, and more. Hopefully, readers will find the provided code useful for exploring these issues, for building intuitions about the numerous architectural changes that are proposed on arXiv every day, and for understanding currently successful practices better.

BibTeX Citation

@misc{gupta2016nnpredsurf,
  author = {Gupta, Tanmay},
  title = {Visualizing the effects of neural network architectural choices},
  year = {2016},
  howpublished = {http://ahumaninmachinesworld.blogspot.in/2016/12/visualizing-effects-of-neural-network.html}
}

Sunday 3 July 2016

CVPR 2016 Papers that shined brighter than the bling of Vegas

This year CVPR saw a staggering attendance of over 3600 participants. As I sat in one of those chandelier-lit colossal ballrooms of Caesars Palace (check out some cool photos from my Vegas trip at the end of this blog :-)), I had only one aim in mind - to keep an eye out for honest research that is of some interest to me. Needless to say, publications featuring CNNs and RNNs were in abundance. However, it seems like the vision community as a whole has matured slightly (very slightly) in dealing with these obscure learning machines, and the earlier fascination with the tool itself has been replaced by a growing interest in its creative and meaningful usage. In this post I will highlight some of the papers that seem interesting based on a quick glance. This blog post is very similar in spirit to the excellent posts by Tomasz Malisiewicz, where he highlights a few papers from every conference he visits. I think it is a good exercise for anyone trying to keep up with the breadth and depth of a fast moving field like Computer Vision. As always, many a grad student (including me) and possibly other researchers are waiting for Tomasz's highlights of CVPR 2016 (which, based on his recent tweet, is tentatively named 'How deep is CVPR 2016?'). But while we are waiting for it, let us dive in -

Vision and Language

One of the major attractions for me were the oral and spotlight sessions on Vision and Language and the VQA workshop. There were some good papers from Trevor Darrell's group in UC Berkeley -
  • Neural Module Network: This work proposes composing neural networks for VQA from smaller modules, each of which performs a specific task. The architecture of the network comes from a parse of the question, which specifies the sequence in which operations need to be performed on the image to answer the question. A follow up work is also available here where, instead of committing to a single network layout, multiple layouts are proposed and a ranking over them is learned using the REINFORCE rule in a policy gradient framework. 
  • Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data: Most image captioning models require paired image-caption data for training and cannot generate captions describing novel object categories that aren't mentioned in any of the captions in the paired data. The main contribution of the paper is a method to transfer knowledge of novel object categories from object classification datasets.
  • Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding: Most deep learning approaches to tasks involving image and language require combining the features from two modalities. The paper proposes MCB as an alternative to simple concatenation/addition/elementwise multiplication of the image and language features. MCB is essentially an efficient way of performing an outer product between two vectors and so results in a straightforward feature space expansion. 

There was also this work, born out of a collaboration between Google, UCLA, Oxford and Johns Hopkins, that deals with referring expressions, which are expressions used to uniquely identify an object or a region in an image -
  • Generation and Comprehension of Unambiguous Object Descriptions: The main idea of the paper is to jointly model generation and interpretation of referring expressions. A smart thing about this is that unlike independent image caption generation which is hard to evaluate, this joint model can be evaluated using simple pixel distance to the object being referred to. 

Shifting the attention from caption generation to use of captions as training data for zero-shot learning, researchers from University of Michigan and Max-Planck Institute introduced a new task and dataset in -
  • Learning Deep Representations of Fine-Grained Visual Descriptions: The task is to learn to classify novel categories of birds given exemplar captions that describe them and paired image caption data for other bird categories. The proposed model learns to score captions and images and classifies any new bird image as belonging to the category of the nearest neighbor caption. 

During the VQA workshop Jitendra Malik brought something unexpected to the table by pointing out that vision and language, while being the most researched pillars of AI, aren't the only ones. There is a third pillar which has to do with embodied cognition, the idea that an agent's cognitive processing is deeply dependent, in a causal way, on the characteristics of the agent's physical, beyond-the-brain body. He argued that in order for a robot to ingrain the intuitive concepts of physics which allow humans to interact with the environment on a daily basis, the robot needs to be able to interact with the world and see the effect of its actions. This theme reflects in some of the recent research efforts led by his student Pulkit Agrawal at UC Berkeley - 
  • Learning to Poke by Poking: Experiential Learning of Intuitive Physics: The main idea here is to let a robot run amok in the world and learn to perform specific actions in the process. In this case the robot repeatedly pokes different objects and records the images before and after the interaction collecting 400hrs of training data. Next a network is trained to produce the action parameters given the initial and final image. In this way the robot learns certain rules of physics like the relation between force, mass, acceleration and other factors that affect the change like friction due to a particular surface texture. At test time, this allows the robot to predict an action that will produce a desired effect (in this case specified by initial and final images). Also refer to their previous work Learning to See by Moving where egomotion is used as supervision for feature learning. 

Some other invited speakers at VQA workshop were - Ali Farhadi, Mario Fritz, Margaret Mitchell, Alex Berg, Kevin Murphy and Yuandong Tian. Do checkout their websites for their latest works on Vision and Language.   


Object Detection

Collaborations between FAIR, CMU and AI2 brought us some interesting advances in object detection - 
  • Training Region-Based Object Detectors with Online Hard Example Mining: This work is a reformulation of the classic hard negative mining framework that was once popular for training object detectors with far more negative/background samples than positive examples. The training involved beginning with an active and possibly balanced set of positives and negatives. A classifier trained on this active set was applied to the complete training set to identify hard false positives which were then added to the active set and the process was repeated. This paper presents an online version of this methodology which rhymes well with current object detectors like Fast-RCNN which are trained using SGD with mini-batches. 
  • You Only Look Once: Unified, Real-Time Object Detection: A little pony used its magical powers to create a really fast and simple object detector. The key idea is to use a single forward pass through a convolutional network and produce a fixed number of bounding box proposals with corresponding confidence values, and class probabilities for every cell on a grid. The pony also took great risk but delivered a mind blowing live demo of YOLO Object Detector during its oral session at CVPR.

Folks from ParisTech also presented their work on improving localization of object detectors -
  • LocNet: Improving Localization Accuracy for Object Detection: This work proposes an approach to improve object localization. First the initial bounding box is enlarged. Then a network is used to predict for each row/column the probability of it belonging inside the bounding box (or alternatively the probability of that row/column being an edge of the final bounding box). 

Recognition and Parsing in 3D

Object detection in 3D seems to be gradually catching up with its 2D counterpart -
  • Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images:  This paper from Jianxiong Xiao's group in Princeton shows how to extend the Region Proposal Network (RPN) from Faster-RCNN to do 3D region proposal. This region proposal is amodal meaning that the proposed bounding box includes the entire object in 3D volume irrespective of truncation or occlusion. For recognition they combine deep learned 2D appearance features from image with 3D geometric features from depth extracted using 2D/3D ConvNets respectively.
     
  • 3D Semantic Parsing of Large-Scale Indoor Spaces: This paper, from a collaboration between Stanford, Cornell and the University of Cambridge, deals with parsing of large scale 3D scenes. For instance, consider parsing an entire building into individual rooms and hallways, and then further identifying the components of each room such as walls, floor, ceiling, doors and furniture. The paper also suggests novel applications that can result from this kind of parsing - generating building statistics, analysing workspace efficiency, and manipulation of space such as removing a wall partition between two adjacent rooms. 
  • GOGMA: Globally-Optimal Gaussian Mixture Alignment: Gaussian mixture alignment can be used for aligning point clouds where both the source and target point clouds are first parameterized by a GMM. The alignment is then posed as minimizing a discrepancy measure between the 2 GMMs. This paper presents a globally optimal solution when the discrepancy measure used is L2 distance between the densities. 







Weakly Supervised and Unsupervised Learning

With the availability of good deep learning frameworks like Tensorflow, Caffe etc., supervised learning in many cases is a no-brainer given the data. But collecting and curating data is a huge challenge in itself. There are even startups like Spare5 (they also had a booth at CVPR) which provide data annotation as a service. Nonetheless, sometimes collecting data is not even an option. Take for example the task of finding dense correspondence between images. A dataset for this task would have a huge impact on other tasks like optical flow computation or depth estimation. However, we are yet to see any dataset for this purpose. The following paper has an interesting take on solving this problem in an unsupervised fashion -
  • Learning Dense Correspondence via 3D-guided Cycle Consistency: The idea in this paper is to match both source and target real images with images rendered from a synthetic mesh. This establishes a consistency cycle where the correspondence between the matched synthetic images is known from the rendering engine by construction. This is used as a supervisory signal for training a ConvNet that produces a flow field given source and target images. 

It wouldn't be wrong to say - Piotr Dollar : Edge Detection :: Ross Girshick : Object Detection. In this next paper Piotr and collaborators from Georgia Tech introduce an unsupervised method for edge detection -
  • Unsupervised Learning of Edges: The key insight is that motion discontinuities always correspond to edges even though the relation does not hold in the other direction. Their algorithm alternates between computing optical flow and using it to train their edge detector. The output of the improved detector is then further used to improve the optical flow and the process continues. Over time both optical flow and edge detection benefit from each other.

Denoising Autoencoders have been used before for unsupervised feature learning. However depending on the noise characteristics the encoder may not need to capture the semantics and often the clean output can be produced from low level features. The next paper shows how to use an encoder-decoder framework to do unsupervised feature learning in a way that forces features to capture higher level semantics and as a byproduct produces reasonable inpainting results -
  • Context Encoders: Feature Learning by Inpainting: Given an image with a region cropped out from it, the encoder produces a feature representation. These features are passed on to a decoder that produces the missing region. One of their results is that pretraining a recognition network in this fashion achieves better performance than random initialization of weights. 

The next couple of papers are about weakly supervised learning, where annotation is available though not in the most direct form.
  • Deep Structured Scene Parsing by Learning with Image Descriptions:  The paper proposes a method for hierarchically parsing an image using sentence descriptions. For example given a sentence - A man holding a red bottle sits on the chair standing by a monitor on the table, the task is to parse the scene into something like 
  • WarpNet: Weakly Supervised Matching for Single-view Reconstruction: Consider the problem of reconstructing a bird from a single image. One way to do this is to use a fine grained birds dataset and use key-point matches across instances to do non-rigid structure from motion. However traditional key-point detectors like SIFT would fail to give good correspondence due to appearance variations. This paper proposes WarpNet, a network that produces a Thin Plate Splines warping that is used as a shape prior for achieving robust matches. Furthermore the network is trained in an unsupervised manner without any part annotations.


Semantic Segmentation

FCNs and Hypercolumn have been the dominant approaches for semantic segmentation since CVPR 2015. But it is not trivial to adapt these to instance segmentation which is the goal of the next paper- 

In a previous work from Vladlen Koltun's group inference over fully connected CRFs was proposed as an effective and efficient (using filtering in a high dimensional space with permutohedral lattice) way to do semantic segmentation. The following is an extension of that work to semantic segmentation in videos -
  • Feature Space Optimization for Semantic Video Segmentation: For semantic segmentation in still images the fully connected CRFs could operate in a very simple feature space - RGB color channels and pixel coordinates. An obvious extension to video using time as an extra feature does not work, since moving objects pull apart pixels belonging to the same object in consecutive frames. The proposed approach learns an optimal feature space over which fully connected CRFs with gaussian edge potentials can be applied as before. 

The following paper from UCLA and Google proposes an alternative to using fully connected CRFs -


Image Captioning

Andrej Karpathy has been a pioneer in image caption generation technology (even though it is far from perfect yet). In the following paper he and Justin Johnson from Fei-Fei Li's group at Stanford extend their captioning algorithm to generate captions densely over an image -
  • DenseCap: Fully Convolutional Localization Networks for Dense Captioning: An image is fed into a network that produces convolutional feature maps. A novel localization layer then takes these feature maps and produces region proposals. Bilinear interpolation is used to extract region features from the full size feature maps, which are fed into another network with an RNN at the end that generates a description for that region. 


Stereo and Monocular Vision

The following paper from Noah Snavely and team at Google shows us how to generate high quality images from deep nets -
  • Deep Stereo: Learning to predict new views from the world's imagery: The task is that of novel view synthesis from a given pair of images from different views. A plane sweep volume is fed into a network with a carefully designed architecture that consists of a Color Tower, which learns to warp pixels, and a Selection Tower, which learns to select pixels from the warped image produced by the Color Tower. 

Current stereo depth estimation algorithms work very well for images of static scenes with sufficient baseline. However, depth estimation in low baseline conditions where the foreground and background move relative to each other is still a hard problem. The next paper from Vladlen Koltun's group talks about dense depth estimation of scenes with moving objects from consecutive frames captured using a monocular camera.
  • Dense Monocular Depth Estimation in Complex Dynamic Scenes: Due to moving objects present in the scene, the optical flow field is first segmented. Depth is estimated using separate epipolar geometry of each segment while enforcing the condition that each segment is connected to the scene.
     

But there is no disentangling fun from work ...


Friday 11 December 2015

Gaussian Process Regression for Object Detection

Object detection by sliding window is now a relic of the past. Modern approaches to object detection such as R-CNN rely on scoring region proposals using a CNN. This approach is faster as it requires scoring a few hundred bounding boxes compared to an exhaustive search over the entire image. However, such an object detector can only be as good as the region proposal mechanism under the hood. Zhang et al. present a novel approach that overcomes this challenge by using the scores evaluated at these region proposals to iteratively regress to better locations of the object using Bayesian inference and Gaussian Process Regression (GPR). I will be making simplifying assumptions to bring out the main idea. The actual implementation can be looked up in their publication.


Approach of Zhang et al. in "Improving Object Detection with CNN via Bayesian Optimization and Structured Prediction"


Let $\mathcal{Y}$ be the set of all possible locations in the image and $f:\mathcal{Y} \rightarrow \mathbb{R}$ be a scoring function that computes our object detector's confidence $f(y)$ for a given region $y$ in the image. Ideally we would like to search for local maxima of $f$, but since we can compute the scores at only selected regions $\{y_1, \cdots, y_N\}$ in the image, we need to somehow regress the locations of the local maxima given the training data $\{(y_1,f(y_1)),\cdots,(y_N,f(y_N))\}$. This is where GPR comes into the picture.

Gaussian Process Regression

A Gaussian process is a set of random variables such that any finite subset of them follows a multivariate gaussian distribution. In the case of object detection these random variables happen to be the scores of different regions in the image, one random variable per region. The score $S_y$ we observe for a region $y$ also has some noise $\epsilon_y$ mixed in

\begin{equation}
S_y = f(y) + \epsilon_y
\end{equation}

where $\epsilon_y \sim \mathcal{N}(0,\beta^{-1})$. In a usual regression setting we would estimate the parameters of $f(\cdot)$ given a set of observations, so that for a new $y'$ we can compute its score. In GPR, instead of directly specifying a parametric form for $f$, we provide a prior over the set of all possible functions

\begin{equation}
f \sim P(f|\theta)
\end{equation}

Note that here a function is being interpreted as an infinite dimensional vector, where each element of the vector specifies the function's value at a particular point of its domain. However, we only ever deal with a finite sample from this distribution, namely the observed data. As the name suggests, in GPR the prior is a Gaussian process

\begin{equation}
(f(y_1),\cdots,f(y_N)) \sim \mathcal{N}(\mathbf{m}_0,\mathbf{K}_0)
\end{equation}

where $\mathbf{m}_0$ is the mean and $\mathbf{K}_0$ is a Gram matrix defined using a kernel $k(y_i,y_j)$. The kernel used in the paper is the squared exponential covariance kernel with automatic relevance determination, which looks like this

\begin{equation}
k(y_i,y_j) = \eta \exp(-(y_i-y_j)^T \Lambda (y_i-y_j))
\end{equation}

But we are interested in the distribution of $\mathbf{S}=(S_{y_1},\cdots,S_{y_N})$, which using Bayes' rule and marginalizing over $\mathbf{f}=(f(y_1),\cdots,f(y_N))$ can be written as

\begin{equation}
P(\mathbf{S}) = \int P(\mathbf{S}|\mathbf{f})P(\mathbf{f}) d\mathbf{f}
\end{equation}

To compute this we shall use the following result from Section 2.3.3 of "Pattern Recognition and Machine Learning" by Christopher M. Bishop, which says that if

\begin{align}
P(x) &= \mathcal{N}(x|\mu,\Sigma)\\
P(y|x) &= \mathcal{N}(y|Ax+b,\Lambda)
\end{align}

then $P(y) = \mathcal{N}(y|A\mu + b, \Lambda + A\Sigma A^T)$. In our case 

\begin{align}
P(\mathbf{f}) &= \mathcal{N}(\mathbf{f}|\mathbf{m}_0,\mathbf{K}_0) \\
P(\mathbf{S}|\mathbf{f}) &= \mathcal{N}(\mathbf{S}|\mathbf{f},\beta^{-1}\mathbf{I}_N)
\end{align}

which implies that $P(\mathbf{S}) = \mathcal{N}(\mathbf{m}_0,\mathbf{K}_N)$ where $\mathbf{K}_N = \mathbf{K}_0 + \beta^{-1}\mathbf{I}_N$. Furthermore, the mean used in the paper is constant across dimensions, i.e. $\mathbf{m}_0 = m_0 \mathbf{1}_N$.

Now let us consider a new region $y_{N+1}$ for which we want to compute a distribution over its score $S_{y_{N+1}}$. To get this we first compute the joint distribution of $\mathbf{S}_{N+1} = (\mathbf{S},S_{y_{N+1}})$, which by a straightforward extension is given by

\begin{equation}
P(\mathbf{S}_{N+1}) = \mathcal{N}(\mathbf{m}_0,\mathbf{K}_{N+1})
\end{equation}

where $\mathbf{K}_{N+1}$ can be written in terms of $\mathbf{K}_N$ as 

\begin{equation}
\mathbf{K}_{N+1} =
\left[
\begin{array}{cc}
\mathbf{K}_N & \mathbf{k}_{N+1}\\
\mathbf{k}_{N+1}^T & k_{N+1}
\end{array}
\right]
\end{equation}

where 

\begin{align}
\mathbf{k}_{N+1} &= [k(y_{N+1},y_1), \cdots, k(y_{N+1},y_N)]^T \\
k_{N+1} &= k(y_{N+1},y_{N+1}) + \beta^{-1}
\end{align}

Finally we condition on all the previous scores $\mathbf{S}$ to get the distribution of $S_{y_{N+1}}$. For this we refer to the following property of Gaussian distributed random variables from Section 2.3.1 of Bishop's book - let $x_a,x_b$ partition the variables distributed according to a multivariate Gaussian distribution, with a corresponding partitioning of the mean and covariance matrix as

\begin{align}
\mathbf{\mu} &=
\left[
\begin{array}{c}
\mathbf{\mu}_{a}\\
\mathbf{\mu}_{b}
\end{array}
\right] \\
\mathbf{\Sigma} &=
\left[
\begin{array}{cc}
\mathbf{\Sigma}_{aa} & \mathbf{\Sigma}_{ab} \\
\mathbf{\Sigma}_{ba} & \mathbf{\Sigma}_{bb}
\end{array}
\right]
\end{align}

then

\begin{equation}
P(\mathbf{x}_b|\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_b|\mathbf{\mu}_b+\mathbf{\Sigma}_{ba} \mathbf{\Sigma}_{aa}^{-1} (\mathbf{x}_a-\mathbf{\mu}_a), \: \mathbf{\Sigma}_{bb} - \mathbf{\Sigma}_{ba}\mathbf{\Sigma}_{aa}^{-1}\mathbf{\Sigma}_{ab})
\end{equation}

Substituting the terms we get

\begin{equation}
P(S_{y_{N+1}}|\mathbf{S}) = \mathcal{N}(S_{y_{N+1}}|\mu,\sigma^2)
\end{equation}

where $\mu = m_0 + \mathbf{k}_{N+1}^T\mathbf{K}_N^{-1}[S_{y_j}-m_0]_{j=1,\cdots,N}$ and $\sigma^2 = k_{N+1}-\mathbf{k}_{N+1}^T\mathbf{K}_N^{-1}\mathbf{k}_{N+1}$. This is the primary result of GPR - it gives a distribution over the score at any new location given the scores at all previously evaluated locations.
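A small NumPy sketch of this posterior computation (my own illustration with made-up function and parameter names, not the authors' code):

```python
import numpy as np

def ard_kernel(Y1, Y2, eta, Lam):
    """Squared exponential ARD kernel k(y_i, y_j) from above; Y1, Y2 are (n, d) arrays."""
    diff = Y1[:, None, :] - Y2[None, :, :]
    return eta * np.exp(-np.einsum('ijk,kl,ijl->ij', diff, Lam, diff))

def gp_posterior(Y, S, y_new, m0, beta, eta, Lam):
    """Posterior mean and variance of the score at y_new given observed scores S."""
    K_N = ard_kernel(Y, Y, eta, Lam) + np.eye(len(Y)) / beta
    k_vec = ard_kernel(Y, y_new[None, :], eta, Lam)[:, 0]
    k_new = ard_kernel(y_new[None, :], y_new[None, :], eta, Lam)[0, 0] + 1.0 / beta
    K_inv = np.linalg.inv(K_N)
    mu = m0 + k_vec @ K_inv @ (S - m0)
    var = k_new - k_vec @ K_inv @ k_vec
    return mu, var
```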

Fine grained search using Bayesian Optimization

Ideally we would like to pick a new location $y_{N+1}$ that leads to the maximum improvement in score over the best scoring location observed so far. However, since we only have a distribution over the score at any given location, we instead pick the next location that maximizes the expected improvement in score, which is given by

\begin{align}
a_{EI}(y_{N+1}|\mathbf{S}) = \int_{\hat{S}_N}^{\infty} (S_{y_{N+1}}-\hat{S}_N) \, P(S_{y_{N+1}}|\mathbf{S}) \, dS_{y_{N+1}}
\end{align}

where $\hat{S}_N = \max_{j=1,\cdots,N}S_{y_j}$. The paper provides a closed form solution for this acquisition function in terms of $\mu$, $\sigma$ and the standard normal CDF; please refer to equation 6 in the paper for the exact expression. In practice, instead of analytically maximizing $a_{EI}$, it is used to score the remaining region proposals.
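For reference, the standard closed form of expected improvement under a Gaussian predictive distribution (which, as far as I can tell, is what equation 6 of the paper amounts to) can be sketched as:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_score):
    """EI of a candidate region whose score is Gaussian with mean mu and std sigma."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (mu - best_score) / sigma
    return (mu - best_score) * norm.cdf(z) + sigma * norm.pdf(z)
```

Here `sigma` is the predictive standard deviation, i.e. the square root of the variance returned by the posterior sketch above.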

In summary the steps are:
  1. Region Proposal: Propose regions using selective search or similar techniques
  2. Pruning: Discard proposals with low classification scores and select some top ranked regions
  3. Find local maxima of the score; each such location potentially corresponds to a separate object instance
  4. For each local maxima select $N$ regions in its vicinity whose scores correspond to $\mathbf{S}$
  5. Fit a GP model using these sample scores
  6. Select one of the remaining regions using $a_{EI}$ as the scoring criterion
  7. Repeat until there are no more acceptable proposals


Experimental result


The above graph shows a detection performance comparison, in terms of mean average precision (mAP), among variants of Selective Search (SS) and the proposed approach (SS+FTS) for an oracle scoring function on the PASCAL VOC 2007 detection benchmark. An oracle scoring function is one that gives us the true score, which in the case of object detection means that it always gives a higher score to a better localization. Bounding box intersection over union (IOU) with the ground truth was used as the oracle scorer. SS + Local random search follows steps 1 through 3 but then randomly samples bounding boxes in the vicinity of the detected local maxima. On the horizontal axis is the increasingly strict criterion (IOU with the ground truth region) for declaring a predicted region a correct detection. 

Clearly, for lower IOU thresholds all methods perform quite well, meaning that Selective Search quite often predicts a detection region somewhere in the vicinity of the ground truth. However, as we demand more accurate localizations, most methods drop to an mAP of around 30%, except the proposed approach which maintains a performance of 95%. It is no surprise that increasing the number of region proposals from SS from 2000 to 10000 does better, hence an mAP of 50% for SS quality. But what is surprising is that SS+FTS with just 2100 boxes is able to beat a region-proposal-driven detection approach by a margin of 45% at an IOU of 0.9. 


Mean Average Precision (mAP)

A little about the metric itself. It is important to understand metrics in order to understand how algorithms compare to each other. A good metric should reflect the expectations from the algorithm. For detection we expect two things from our detector:
  1. If there is an object it should detect it which is quantified by Recall
  2. If it believes there is an object at a location then there actually should be an object there which is quantified by Precision
Note that achieving one does not necessarily mean achieving the other. For example, a detector that says there is an object at every location has perfect recall but extremely low precision. Similarly, a detector that almost never fires can have near-perfect precision but close to 0 recall. Hence there is a sweet spot to be identified.

So now assume our algorithm produces a list of detection regions along with corresponding confidence scores for each image. These detections are ranked in order of decreasing confidence across all images. Each detection is then marked as a true positive (TP) or a false positive (FP) by assigning it to the closest ground truth and comparing the IOU against a threshold (the x axis of the above plot). For each position $k$ in the ranked list one can compute the precision and recall at top $k$ as $\frac{\#TP}{k}$ and $\frac{\#TP}{\#P}$ respectively. Note that recall is monotonically non-decreasing in $k$ but precision may increase or decrease. So to avoid the wiggle, we interpolate the precision as follows

\[
p_{interp}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})
\]

where $p(\tilde{r})$ is the precision at recall $\tilde{r}$. Note that we only have discrete measurements at $\tilde{r} = r(k)$ for some $k$. The precision-recall curve is then summarized by the average precision, defined as

\[
AP = \frac{1}{11}\sum_{r\in\{0,0.1,\cdots,1\}} p_{interp}(r)
\]

Finally, since detectors for multiple contending classes are evaluated simultaneously, the APs of the different classes are averaged together and reported as mAP.
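A compact sketch of this 11-point interpolated AP computation (illustrative, following the PASCAL VOC convention described above):

```python
import numpy as np

def eleven_point_ap(recall, precision):
    """Average precision from precision/recall values computed at each rank k."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = precision[recall >= r]
        ap += (above.max() if above.size else 0.0) / 11.0   # interpolated precision
    return ap
```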


Conclusion

Poor recall from region proposal algorithms can directly hamper the performance of even the best of detection scoring functions. Gaussian Process Regression along with Bayesian optimization seems to be a very good strategy for efficiently overcoming this issue while keeping the computational advantage of dealing with a sparse set of locations instead of an exhaustive search. 

Saturday 11 July 2015

DTAM: Dense Tracking and Mapping

Update: While working on a project I implemented and introduced several modifications to DTAM. For more information please refer to https://arxiv.org/abs/1606.05002. A video for the same submission is available here.

While most VSLAM systems live off robust feature detection and tracking, DTAM is a novel pixel-wise approach which uses the availability of many low baseline frames to its advantage, both in creating dense maps and in using them for robust camera pose estimation. The original paper by Richard Newcombe can be found here. The following video shows the depth maps created by DTAM and demonstrates tracking quality superior to PTAM.


Like PTAM, it has 2 distinct parts - pose tracking and 3D mapping. But unlike PTAM, which just keeps track of a sparse set of 3D points, DTAM maintains dense depth maps for selected keyframes.

Dense Mapping

The construction of a depth map for a reference image $r$ from a set of short baseline frames, denoted by $\mathcal{I}(r)$, is posed as a global minimization problem. Central to this optimization is the definition of a cost volume, $C_r$.

Cost Volume

The cost volume $C_r$ stores the photometric error of every pixel in the reference frame with respect to the frames $m \in \mathcal{I}(r)$ as a function of inverse depth. In practice the inverse depth is discretized into $L$ values, so that the cost volume can be calculated and stored as an $M \times N \times L$ array, where $M$ and $N$ are the number of rows and columns in the image. More specifically, under the brightness constancy assumption, $C_r(u, d)$ measures how different the intensity of a pixel $u$ in image $I_r$ is from that of its projection into the images of $\mathcal{I}(r)$, assuming that the inverse depth of the pixel in $I_r$ is $d$. This can be written mathematically as
\begin{equation}
C_r(u,d) = \frac{1}{|\mathcal{I}|}\sum_{m \in \mathcal{I}} \|\rho(I_m,u,d) \|_1
\end{equation}
where
\begin{equation}
\rho(I_m,u,d)  = I_r(u) - I_m(\pi(KT_{mr}\pi^{-1}(u,d)))
\end{equation}
While the above equation looks terrifying, it's really simple once broken down into pieces. Here goes -
1. Project pixel $u$ with inverse depth $d$ from pixel space to camera space using camera intrinsic matrix $K$

\begin{equation}
\pi^{-1}(u,d) = \frac{1}{d}K^{-1}
\left[
\begin{array}{c}
u_x \\
u_y \\
1
\end{array}
\right]
\end{equation}
2. Transform the 3D point (using rigid body transformation) from the coordinate system of camera $r$ to camera $m$

\begin{equation}
T_{mr}\pi^{-1}(u,d)
\end{equation}
3. Project the 3D point in camera $m$'s coordinate system to its pixel coordinates
\begin{equation}
\pi(KT_{mr}\pi^{-1}(u,d))
\end{equation}
where $\pi(x,y,z) = (x/z, y/z)$ is a dehomogenization operation
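Putting the three steps together, a per-pixel sketch of this photometric error in NumPy (purely illustrative: nearest-neighbour sampling, no bounds checking, and $T_{mr}$ taken as a 4x4 rigid body transform) could be:

```python
import numpy as np

def photometric_error(I_r, I_m, K, T_mr, u, d):
    """|rho(I_m, u, d)| for a single reference pixel u = (ux, uy) at inverse depth d."""
    ux, uy = u
    # 1. Back-project the pixel into the reference camera frame.
    X_r = (1.0 / d) * np.linalg.inv(K) @ np.array([ux, uy, 1.0])
    # 2. Rigid-body transform into the coordinate frame of camera m.
    X_m = T_mr[:3, :3] @ X_r + T_mr[:3, 3]
    # 3. Project into frame m and dehomogenize.
    x = K @ X_m
    px, py = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
    return abs(float(I_r[uy, ux]) - float(I_m[py, px]))

# The cost volume entry C_r(u, d) is the average of this error over all frames m.
```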


Note that the cost volume is an average of  photometric error over multiple frames because using just one frame would produce noisy estimates. The following figure shows this effect.

The thin colored lines in the 3 plots show the photometric error at 3 different pixels a, b and c for different frames $m \in \mathcal{I}(r)$ as a function of inverse depth. The bold red line shows the averaged cost.
An important point to note in the above figure is that the minima of the averaged costs are located correctly relative to each other in terms of the depth ordering of pixels a, b and c. Also, pixels with more discriminative patches are better localized.

Optimization objective function

The objective function for obtaining a dense depth map is a regularized version of the cost described in the previous section. The regularization is a penalty term that encourages smoothness. Note that while a depth map is smooth in general, there are discontinuities arising from occlusion boundaries, and image gradients are often an important cue for such discontinuities. To incorporate this, the regularization is chosen to be a weighted Huber norm of the inverse depth map gradient. So if the inverse depth map is denoted by $d = \xi(u)$, then the regularization term is
\begin{equation}
g(u)\|\nabla \xi(u)\|_{\epsilon}
\end{equation}
where $\|.\|_{\epsilon}$ is the Huber norm, a robust loss function that does not penalize outliers too harshly, thereby reducing their effect on the optimization. The Huber norm is defined by
\begin{equation}
\|x\|_{\epsilon} = \left\{
\begin{array}{ll}
\frac{\|x\|_2^2}{2\epsilon} & \|x\|_2 \leq \epsilon \\
\|x\|_1 - \frac{\epsilon}{2} & otherwise
\end{array}
\right.
\end{equation}
$g$ is an intensity-gradient-based weight which adaptively reduces the regularization at intensity edges
\begin{equation}
g(u) = e^{-\alpha \|\nabla I_r(u)\|_2^{\beta}}
\end{equation}

So the regularized objective function is
\begin{equation}
E(\xi) = \int_u \{ g(u)\|\nabla \xi(u)\|_{\epsilon} + \lambda C_r(u,\xi(u))  \} du
\end{equation}
So now we have a cost function that we need to minimize. As is well known, coming up with an optimization problem is easy; solving it, however, is an entirely different beast.

Solving the minimization problem

For computation the inverse depth is represented by a $MN\times 1 $ vector $d$ and regularization weights by a $MN\times 1$ vector $g$.

The first step is to approximate the regularization term by $\|AGd\|_{\epsilon}$, where $A$ is the matrix that computes the gradient ($Ax = \nabla x$) and $G = diag(g)$. Now our optimization problem looks like
\begin{equation}
d^* = \underset{d}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \lambda \mathbb{1}^T C(d)
\end{equation}
But optimizing this directly is hopeless because $C(d)$ is just a lookup table: given the inverse depth at every pixel, we look up the corresponding value in the cost volume and arrange these values in an $MN\times 1$ vector, so the data term has no convenient analytic form. The two terms therefore need to be decoupled somehow.

The decoupling uses 2 tricks.
a) Introduction of an auxiliary variable, or rather a vector of variables, $a$ of size $MN\times 1$:
\begin{equation}
d^*, a^* = \underset{d,a}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
Here $\theta$ controls how tightly coupled $d$ and $a$ are. $\theta$ is initially kept large but is decreased with every iteration as we trust the values of $a$ and $d$ more and more.
At this point a coordinate-descent-style iterative convex optimization is already possible, as follows:
Beginning with an initial value of $d$, for each pixel do a point-wise exhaustive search for $a$ over the range of inverse depth values, i.e.
\begin{equation}
\hat{a_i} = \underset{a_i}{\operatorname{argmin}} \frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)
\end{equation}
Now using $\hat{a}$ solve the convex optimization problem
\begin{equation}
\hat{d} = \underset{d}{\operatorname{argmin}}  \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-\hat{a})^2
\end{equation}
Note that this problem is convex because the Huber norm is convex and so is the quadratic term.
To get the solution, iterate between the above 2 steps until convergence. However, to avoid solving this large optimization problem at every iteration, the authors use a second trick.

b) Legendre-Fenchel Transform
To understand this we need some background in convex optimization.
A convex function $f(x)$ can be written as
\begin{equation}
f(x) =  \underset{y}{\operatorname{sup}} x^Ty - f^*(y)
\end{equation}
where $f^*$ is the convex conjugate of $f$. The convex conjugate is given by exactly the same equation above but with $f$ and $f^*$ exchanged; in fact $f^{**} = f$ for a convex function. Now let us compute the convex conjugate of the Huber norm $H(x)$.
\begin{equation}
H^*(x) = \underset{y}{\operatorname{sup}} x^Ty - H(y)
\end{equation}
Taking the derivative and setting it to $0$ gives us
\begin{align}
& x - \mathbb{I}(\|y\|_2\leq \epsilon) \frac{y}{\epsilon} - \mathbb{I}(\|y\|_2 > \epsilon) sign(y) = 0 \\
\implies & y = x\epsilon \; \text{, when } \|x\|_2\leq 1 \\
\implies & H^*(x) = \frac{\epsilon\|x\|_2^2}{2} \; \text{, when } \|x\|_2\leq 1
\end{align}
Note that for $\|x\|_2>1$, drawing an analogy with the 1-dimensional case where $x,y\in \mathbb{R}$, the gradient stays positive, making $x^Ty - H(y)$ a monotonically increasing function of $y$, and thus
\begin{equation}
H^*(x) = \infty \; \text{, when } \|x\|_2 > 1
\end{equation}
This can be written more concisely as $H^*(x) = \epsilon\|x\|^2/2 + \delta(x)$, where $\delta(x) = \infty$ when $\|x\|_2>1$ and $0$ otherwise. Using the Legendre-Fenchel Transform (LFT) we can then write
\begin{equation}
H(x) = \underset{y}{\operatorname{max}} x^Ty - \epsilon\|y\|^2/2 - \delta(y)
\end{equation}
Equipped with this information we can replace $\|AGd\|_\epsilon$ in our energy function with its LFT
\begin{equation}
\underset{q}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2 - \delta(q)\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
and our final optimization problem can be written as
\begin{equation}
\underset{d,a}{\operatorname{min}} \; \underset{\|q\| \leq 1}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
But how does this math wizardry help us? Well, now given $\hat{a}$ we can do gradient descent over $d$ and gradient ascent over $q$, since we are minimizing with respect to $d$ and maximizing with respect to $q$ simultaneously. For more details on the derivatives, refer to Section 2.2.3 in the paper. Given $\hat{d}$ and $\hat{q}$, $a$ is obtained as before using a point-wise search. As if all this wasn't enough, the authors of DTAM go on to describe a way to speed up the point-wise search, described in the following section.

Accelerating Point-wise Search
From the cost volume it is trivial to obtain, for every pixel $u$, bounds on the data cost, $C_u^{min}=\min_d C_u(d)$ and $C_u^{max}=\max_d C_u(d)$. Now for $\underset{a_i}{\operatorname{min}}\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)$, the optimal value is at most $\lambda C_u^{max}$ (obtained by setting $a_i=d_i$ in the worst case). Using this fact we can bound the search range for $a_i$ about $d_i$ through the following constraint
\begin{align}
\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C_u^{min} &\leq \lambda C_u^{max} \\
\implies a_i &\in \left[ d_i-\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})}, d_i+\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})} \right]
\end{align}
Note that since $\theta$ is decreased with every iteration, this bound becomes narrower and narrower allowing accelerated search for optimal $a$. The following image shows this phenomenon

The white band denotes the search region

Tracking

Tracking is done in 2 stages. The first stage is rotational odometry assuming no translation, using "S. J. Lovegrove and A. J. Davison. Real-time spherical mosaicing using whole image alignment. ECCV 2010", which is resilient to motion blur. The second stage is a full $6DOF$ 2.5D alignment. The two pose estimates are composed to get the final pose. I will only discuss the second stage because it differs from the way I would have approached it: I would have simply back-projected the inverse depth map to a point cloud and solved for the rotation and translation using the Iterative Closest Point technique in 3D space. Their way, however, seems to be more robust as it doesn't rely on nearest neighbor point correspondences. The DTAM approach is as follows:

The rotational odometry gives us a pose $T_{wv}$. The dense model is projected into this pose to give an image $I_v$ and an inverse depth map $\xi_v$. If the current frame is denoted by $l$, then we need to estimate $T_{vl}$ so that $T_{wv}T_{vl}$ gives the pose of the current camera. Let $T_{vl}(\psi)$ denote a parametrized representation of the $6DOF$ camera pose. Then we want to minimize the pixel-wise photometric error
\begin{equation}
F(\psi) = \frac{1}{2} \sum_u (f_u(\psi))^2 = \frac{1}{2}\|f\|^2\\
\end{equation}
where,
\begin{equation}
f_u(\psi) = I_l(\pi(KT_{lv}(\psi)\pi^{-1}(u,\xi_v(u)))) - I_v(u)
\end{equation}
The above equation can be interpreted in exactly the same way as the photometric error used to define the cost volume above. To solve this, $f_u$ is first approximated by $\hat{f}_u(\psi)$, its first order Taylor series expansion about $0$, i.e. $\hat{f}_u(\psi) = f_u(0) + \nabla f_u(0) \psi$. Now, to minimize $F$ we require its gradient to be $0$. This implies
\begin{align}
&\nabla F = 0 \\
\implies &\nabla \hat{f}(\psi)^{T}\hat{f}(\psi) = 0 \\
\implies &\nabla f(0)(f(0) + \nabla f(0)\psi) = 0
\end{align}
One solution of the last equation is obtained by setting $f(0) + \nabla f(0)\psi = 0$. This system of linear equations can be solved to get $\psi$, which is then used to update the virtual camera pose $T_{wv}$, and the process is iterated until convergence.

This completes our technical discussion of DTAM.

Conclusion

My first thought when I read the DTAM paper was that it was incredibly complex to implement and get right, let alone getting it to run in real time. But at the same time it is a very novel approach to the problem of dense depth map construction and also camera pose estimation using dense per pixel photometric error.

I have tried my best to understand, simplify and explain the techniques used in the paper while also at times talking about alternate methods and why they are less preferable over the ones used in DTAM.

This blog took me many evenings to complete but I am glad I saw it through from start to finish :)

Thursday 18 June 2015

PTAM: Parallel Tracking and Mapping


SLAM or Simultaneous Localization and Mapping refers to a range of techniques that are used for localizing the position of a moving object in a 3D world coordinate system while simultaneously generating a 3D map of its environment based on real-time data gathered from various sensors placed on the object. Sensors used in this application could include an Inertial Measurement Unit (gyroscope, accelerometer, magnetometer etc., often used in mobile phones and aircraft), range scanners such as LIDAR (which is used in Google's self-driving car) and SONAR (which is used in submarines). However, with advancements in Computer Vision, images taken from a camera mounted on the moving object have been shown to be good enough for the task. Not surprisingly, these techniques are studied under the banner of Visual SLAM. Unlike Structure From Motion (SFM), which is also a method for computing the orientation and location of cameras and the position of sparse feature points from multiple images of a scene, SLAM assumes spatio-temporal continuity. In places where high precision is a necessity, such as the landing system installed on NASA's Mars rover, multiple sensors need to be used together, giving rise to the domain of Sensor Fusion. The following video is an example of a Visual SLAM system





PTAM is a real-time Visual SLAM system that was published by Georg Klein and David Murray in 2007. As the name suggests, the system splits the two components of VSLAM - tracking points across images, and mapping, i.e. adding novel points to the 3D map - into 2 separate parallel threads on a dual core computer. The reasons for doing this are multiple:

1. Mapping involves triangulating and localizing new points and cameras, and refining already triangulated points and localized cameras. This is a computationally expensive process, known as bundle adjustment (BA), and cannot be done for every video frame. Tracking on the other hand is, by definition, knowing where a point is located in each of the frames and where the camera is located and how it is oriented in each of the images.

2. The time that is saved on the tracking thread, by not having to carry the burden of bundle adjustment can be used to do a more exhaustive search for matching points across frames which is crucial for accuracy.

3. Often consecutive video frames carry redundant information due to high frame rates. Since tracking is no longer dependent on mapping, the mapping thread could choose to run BA on selected key-frames saving some time to interleave global BA with local BA steps. This helps prevent the problem of drift that occurs when BA is only performed locally. 

Map and track representation
The map is a collection of $M$ 3D points in the world coordinate system $\mathcal{W}$. Each point represents a locally planar textured patch. The coordinates of the $j^{th}$ point are represented as $p_{j \mathcal{W}} = \left(x_{j \mathcal{W}}, y_{j \mathcal{W}}, z_{j \mathcal{W}}, 1 \right)$. Each point also has a unit normal and a reference to the source pixel. 

The map also consists of $N$ keyframes, each represented as a 4-level gaussian pyramid. Each keyframe has an associated camera centered coordinate system denoted by $K_i$, and the transformation between this coordinate system and the world is given by $E_{K_i \mathcal W}$. So the reference to the source pixel comprises the index of the image in which the point first appears, the level of the pyramid, and the coordinates of the point. 

Tracking Thread
When a new frame is acquired, its pose is predicted using a decaying velocity motion model. The authors don't clarify the details of the motion model but say that it is similar to an alpha-beta constant velocity model without any new measurements. The alpha-beta constant velocity model is given by

\begin{align}
\hat{x}_t^- &= x_{t-1} + v \Delta T \\
\hat{v}_t^- &= v_{t-1} \\
\hat{x}_t^+ &= \hat{x}_t^- + \alpha (x_t - \hat{x}_t^-) \\
\hat{v}_t^+ &= \hat{v}_t^- + \beta (x_t - \hat{x}_t^-) / \Delta T
\end{align}

The first 2 equations are the equations of kinematics without acceleration, and the last 2 equations incorporate the measurement $x_t$ to make a correction. Since we don't have any measurement of the camera pose, we can just set the correction to $0$. This is perhaps what is meant by the decaying velocity model, but I might be wrong.
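A tiny sketch of one step of this filter (my reading of the model; the decay factor is a made-up illustration of how the velocity might be damped when no measurement is available):

```python
def alpha_beta_step(x_prev, v_prev, dt, alpha=0.5, beta=0.1,
                    measurement=None, decay=0.9):
    """One step of an alpha-beta constant velocity model.
    With no measurement the correction vanishes and the velocity is damped."""
    x_pred = x_prev + v_prev * dt
    v_pred = v_prev
    if measurement is None:
        return x_pred, decay * v_pred
    residual = measurement - x_pred
    return x_pred + alpha * residual, v_pred + beta * residual / dt
```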

Once the pose is predicted, the current 3D map can be projected onto the new image plane (the camera intrinsics are assumed to be known) and the neighborhood of each projected location is searched for a patch correspondence. First only a few points are projected and the camera pose is updated from these matches. Then a larger number of points are projected and searched, and the pose is refined again. 

Tricks Worth Noting: Patch Search
The paper describes some rather nifty tricks for patch search. Some salient features were:

1. An 8x8 patch in the source frame is searched in the target frame around a neighborhood of the predicted projection of the point. 

2. The template patch is first affine warped to accommodate view point changes. The affine warp is given by

\begin{align}
A =
\left[
\begin{array}{cc}
\frac{\partial{u_t}}{\partial{u_s}} & \frac{\partial{u_t}}{\partial{v_s}} \\
\frac{\partial{v_t}}{\partial{u_s}} & \frac{\partial{v_t}}{\partial{v_s}}
\end{array}
\right]
\end{align}

where $(u_s,v_s)$ are the coordinates of the source pixel and $(u_t,v_t)$ are the coordinates of the target pixel. The matrix is computed by measuring the change in the target pixel location caused by a unit change in the source pixel location.

3. The patch is searched for in the pyramid level whose scale matches that of the original patch. The level $l$ is chosen such that $\frac{|A|}{4^l}$ is closest to unity. The idea is that $|A|$ is a measure of the area occupied by the source pixel in the target image: if in the highest-resolution image the pixel's area is measured as $4$, then at level $1$ of the pyramid its area would be $1$, i.e., comparable to the size of the source pixel. Note that the template patch is still selected from its designated source level.

4. Instead of searching over all points in the neighborhood of the predicted projection, SSD scores are calculated only for FAST corners (generated without non-maximum suppression) lying inside a circular region around the point. A sketch of tricks 3 and 4 follows below.
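Here is a hedged Python sketch of tricks 3 and 4: choosing the pyramid level from $|A|$ and scoring only FAST corners inside a circular search region. The function names and thresholds are mine, and patch extraction is simplified.

```python
import numpy as np

def choose_search_level(A, num_levels=4):
    # Trick 3: pick the pyramid level l at which |det(A)| / 4**l is closest to 1,
    # i.e. where the warped source patch covers roughly one target pixel per pixel.
    scores = [abs(abs(np.linalg.det(A)) / 4**l - 1.0) for l in range(num_levels)]
    return int(np.argmin(scores))

def ssd(patch_a, patch_b):
    diff = patch_a.astype(np.float64) - patch_b.astype(np.float64)
    return float(np.sum(diff * diff))

def match_patch(template, image, fast_corners, predicted_uv, radius, max_ssd):
    # Trick 4: evaluate the SSD score only at FAST corners inside a circular
    # region around the predicted projection, not at every pixel.
    h, w = template.shape
    best_score, best_uv = max_ssd, None
    for (u, v) in fast_corners:
        if (u - predicted_uv[0])**2 + (v - predicted_uv[1])**2 > radius**2:
            continue
        top, left = v - h // 2, u - w // 2
        if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
            continue                       # skip corners too close to the border
        candidate = image[top:top + h, left:left + w]
        score = ssd(template, candidate)
        if score < best_score:
            best_score, best_uv = score, (u, v)
    return best_uv, best_score
```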

Pose Update
For updating the pose, the following robust objective function of the reprojection errors $e_j$ is minimized with respect to the new camera's pose

\begin{equation}
\min_{\mu} \sum_{j \in S} Obj(\frac{|e_j|}{\sigma_j},\sigma_T)
\end{equation}
where $\mu$ is the camera pose being optimized over, $S$ is the set of patches matched in the previous step, $\sigma_j$ is the assumed measurement noise, and $Obj(r,\sigma_T)$ is the Tukey biweight objective function as defined here. The camera pose relative to the source image's camera is represented by 6 parameters: 3 for translation and 3 for the rotation expressed in exponential-map form. The exponential map is a mapping from the axis-angle representation (the Lie algebra $\mathfrak{so}(3)$) of a 3D rotation to a rotation matrix (the Lie group $SO(3)$). Rodrigues' formula is a compact expression for this mapping and is given by

\begin{equation}
R = \exp(\zeta) = I + \sin\theta \, K + (1-\cos\theta) \, K^2
\end{equation}
with $\zeta = \theta \mathbf{v}$, where $\theta$ is the angle of rotation about the axis represented by the unit vector $\mathbf{v}$, and $K = [\mathbf{v}]_\times$ is the cross-product matrix of $\mathbf{v}$.
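For illustration, Rodrigues' formula translates directly into a few lines of NumPy (this is only a sketch, not PTAM's implementation):

```python
import numpy as np

def skew(v):
    # Cross-product matrix [v]_x for a 3-vector v.
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(zeta):
    # Rodrigues' formula: map an axis-angle vector zeta = theta * v (so(3))
    # to a rotation matrix (SO(3)).
    theta = np.linalg.norm(zeta)
    if theta < 1e-12:
        return np.eye(3)                 # exp(0) = I
    K = skew(zeta / theta)               # K = [v]_x with v the unit axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```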

But do we really need such a complicated parametrization? Yes! The objective function above is minimized by iterative techniques such as gradient descent, which means that gradient steps get added to the pose parameters. This would cause a problem with the usual 9-parameter representation of the rotation matrix by its elements, because $SO(3)$ is not closed under addition: adding two rotation matrices (or a rotation and a small update matrix) does not in general yield a rotation matrix. The Lie algebra $\mathfrak{so}(3)$, on the other hand, is a vector space, so adding increments there is perfectly valid.

Useful Information: Derivative of Rodrigues Formula
The last question for the pose update is: how do you calculate the gradient of this complicated objective function? No need to freak out! A closed-form expression for the derivative of the Rodrigues formula is provided in equation 9 here, and the derivative of the objective function can then be obtained by the chain rule.
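If you do implement that closed-form derivative, a quick finite-difference comparison (reusing `exp_so3` from the sketch above) is a handy sanity test. This is only for testing; the optimizer itself should use the analytical form.

```python
import numpy as np

def numerical_jacobian_exp_so3(zeta, eps=1e-6):
    # Finite-difference approximation of d vec(R) / d zeta, useful only as a
    # unit test for a hand-derived analytical Jacobian.
    base = exp_so3(zeta).ravel()
    J = np.zeros((9, 3))
    for k in range(3):
        dz = np.zeros(3)
        dz[k] = eps
        J[:, k] = (exp_so3(zeta + dz).ravel() - base) / eps
    return J
```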

Mapping Thread
The mapping thread first initializes the map using two-view stereo, where the two views are selected with user cooperation. The map is then refined and expanded whenever a new keyframe is added.

Steps for initialization:

1. The user places the camera above the scene and presses a key. This selects the first keyframe, on which 1000 FAST corners are detected. These points are tracked (perhaps using a KLT tracker, since the tracking thread assumes an initial map and hence cannot be used) until the acquisition of the second keyframe.

2. The user then smoothly moves the camera to a slightly offset position and presses a key again to select the second keyframe.

3. Now that we have two keyframes and correspondences between them, an essential matrix can be computed using the five-point algorithm and RANSAC. The essential matrix can then be decomposed to obtain a rotation matrix and a translation vector for the second keyframe with respect to the first, and this pose estimate can be used to triangulate the points. The authors don't mention whether the triangulation is done linearly or non-linearly.

4. Since the translation vector and triangulated point positions obtained in the previous step are only defined up to a scale factor, the scale is chosen by assuming that the camera translated 10 cm between the stereo pair. The map is also oriented so that the dominant detected plane lies at $z=0$. (A sketch of this pipeline follows below.)
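The same initialization pipeline can be sketched with OpenCV's built-in five-point/RANSAC routines. PTAM does not use OpenCV, and the 10 cm baseline below is simply the assumption described in step 4, so treat this as an illustration of the steps rather than the paper's implementation.

```python
import numpy as np
import cv2

def initialize_map(pts1, pts2, K, assumed_baseline_m=0.10):
    # pts1, pts2: (N, 2) float32 arrays of corresponding corner locations in the
    # two keyframes; K: 3x3 camera intrinsics.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Triangulate with the first camera at the origin; t from recoverPose has
    # unit norm, so the reconstruction is only defined up to scale.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous points
    X = (X_h[:3] / X_h[3]).T

    # Fix the scale freedom by assuming the camera moved ~10 cm between keyframes.
    scale = assumed_baseline_m / np.linalg.norm(t)
    return R, t * scale, X * scale
```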

Map Expansion: Keyframe acquisition

Keyframes are added whenever the following conditions are met:

1. Tracking quality is good (where tracking quality is measured by the fraction of feature measurements that were successful).

2. The frame is separated from the last keyframe by at least 20 frames.

3. The camera is a minimum distance away from the nearest keypoint already in the map. This ensures that there is enough baseline for triangulation to be well conditioned. The minimum distance of course depends on the mean distance of the points from the camera: if the mean depth is small the minimum distance can be kept small, and vice versa. (A toy version of this decision rule follows below.)
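A toy decision function capturing these three criteria might look as follows; only the 20-frame gap is stated explicitly in the paper summary above, so the remaining thresholds are illustrative guesses.

```python
def should_add_keyframe(tracking_quality, frames_since_last_kf,
                        dist_to_nearest, mean_scene_depth,
                        min_quality=0.5, min_frame_gap=20, baseline_ratio=0.1):
    # Criterion 1: tracking quality (fraction of successful feature measurements).
    good_tracking = tracking_quality > min_quality
    # Criterion 2: at least 20 frames since the last keyframe.
    enough_time = frames_since_last_kf >= min_frame_gap
    # Criterion 3: required baseline scales with the mean depth of the scene.
    enough_baseline = dist_to_nearest > baseline_ratio * mean_scene_depth
    return good_tracking and enough_time and enough_baseline
```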

Map Expansion: Measurement of existing 3d points in new keyframes

Due to real-time constraints, the tracking thread may have measured only some of the tracked points in the new keyframe. The remaining points are re-projected and measured in the mapping thread.

Map Expansion: Acquisition of new 3d points

The tracking thread has already computed FAST corners on the new keyframe. Non-maximum suppression and thresholding on Shi-Tomasi scores are used to select the most salient ones, and points that are close to existing measurements are discarded. The remaining points are candidates for new 3D map points.

To triangulate the new points, a second view is selected, namely the keyframe closest to the current keyframe. Correspondence is established using an epipolar search: patches around corner points near the epipolar lines are matched using zero-mean SSD. This time, however, no affine warping is performed and patches are searched only in corresponding levels of the pyramid. If a match is found, the new point is triangulated and inserted into the map.
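A hedged sketch of the geometry behind the epipolar search: given the relative pose $(R, t)$ between the new keyframe and its closest neighbour, and the intrinsics $K$, the fundamental matrix yields, for each candidate corner, a line in the neighbouring keyframe near which the zero-mean SSD matching is restricted. The helper names and the 2-pixel band are my own choices.

```python
import numpy as np

def fundamental_from_pose(R, t, K):
    # F = K^-T [t]_x R K^-1, where (R, t) maps points from the new keyframe's
    # camera frame into the neighbouring keyframe's camera frame, and the same
    # intrinsics K are used for both views.
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])
    K_inv = np.linalg.inv(K)
    return K_inv.T @ tx @ R @ K_inv

def corners_near_epipolar_line(x_new, F, corners, max_dist=2.0):
    # Epipolar line in the neighbouring keyframe for pixel x_new = (u, v) in the
    # new keyframe, and the candidate corners within max_dist pixels of it.
    a, b, c = F @ np.array([x_new[0], x_new[1], 1.0])
    norm = np.hypot(a, b)
    return [(u, v) for (u, v) in corners
            if abs(a * u + b * v + c) / norm <= max_dist]
```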

Bundle Adjustment: Global

Finally, all current 3D points and camera poses are refined using an iterative minimization procedure known as bundle adjustment. When the camera is not exploring, i.e., no new keyframes are being added, a global bundle adjustment is carried out over all existing cameras and 3D points. If $\mu = \{\mu_1, \cdots, \mu_N\}$ denotes the camera poses and $P = \{p_1, \cdots, p_M\}$ the 3D points, with $S_i$ being the set of measurements in frame $i$, then the optimization problem is

\begin{equation}
\min_{\mu, P} \sum_{i=1}^{N} \sum_{j\in S_i} Obj\left(\frac{|e_{ij}|}{\sigma_{ij}},\sigma_T\right)
\end{equation}

Observe that this problem is very similar to the previous pose refinement problem, except that the optimization is now over the camera poses as well as the 3D point locations. It is solved using Levenberg-Marquardt (more on this in the next blog post). Interested readers are referred to the Ceres Solver, a C++ library developed by Google for solving such non-linear least squares problems. Moreover, one no longer has to freak out over providing analytical gradients (the Jacobian matrix): the solver uses automatic differentiation to compute gradients up to working precision. Note that automatic differentiation is not numerical differentiation (a.k.a. the finite difference method) and is not crippled by rounding or truncation errors; even higher-order derivatives can be computed precisely. It's just magic!
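To stay with this blog's Python setup, here is a toy sketch of the structure of the bundle adjustment problem using `scipy.optimize.least_squares` (reusing `exp_so3` from the Rodrigues sketch above). It is not what PTAM or Ceres does: SciPy here uses a trust-region solver with finite-difference Jacobians by default, and the built-in soft-L1 loss rather than the Tukey biweight, so this only illustrates how residuals are stacked over cameras and points.

```python
import numpy as np
from scipy.optimize import least_squares

def ba_residuals(params, n_cams, n_pts, K, observations):
    # params packs, per camera, 3 rotation (axis-angle) + 3 translation numbers,
    # followed by 3 numbers per 3D point.
    cam_params = params[:6 * n_cams].reshape(n_cams, 6)
    points = params[6 * n_cams:].reshape(n_pts, 3)
    res = []
    for cam_idx, pt_idx, u_obs, v_obs in observations:
        w, t = cam_params[cam_idx, :3], cam_params[cam_idx, 3:]
        p_c = exp_so3(w) @ points[pt_idx] + t      # world -> camera
        uvw = K @ p_c
        res.extend([uvw[0] / uvw[2] - u_obs, uvw[1] / uvw[2] - v_obs])
    return np.asarray(res)

# observations: list of (camera_index, point_index, u, v) measurements.
# result = least_squares(ba_residuals, x0, loss='soft_l1',
#                        args=(n_cams, n_pts, K, observations))
```

Note that SciPy's `method='lm'` (Levenberg-Marquardt) only supports the plain linear loss, which is one reason real systems turn to libraries like Ceres for robust, sparse bundle adjustment.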

Bundle Adjustment: Local

Global optimization may take tens of seconds as the map grows, so while the camera is exploring, a local bundle adjustment is performed instead. Here only the poses of the most recent keyframe and its 4 closest keyframes are refined, along with the locations of all points visible in these keyframes, using all the measurements of those 3D points ever made. In essence, one simply limits the set of variables being optimized over in the bundle adjustment problem above.

With that, I think we've covered most of the important points in this paper. Hopefully this summary helps readers new to the problem. Enjoy the demo video of PTAM in action below, and please refer to the PTAM project page for more details. Your feedback is invaluable, so feel free to comment below or reach me via email.