Friday, 11 December 2015

Gaussian Process Regression for Object Detection

Object detection by sliding window is now a relic of the past. Modern approaches to object detection such as R-CNN rely on scoring region proposals using a CNN. This approach is faster as it requires scoring only a few hundred bounding boxes instead of an exhaustive search over the entire image. However, such an object detector can only be as good as the region proposal mechanism under the hood. Zhang et al. present a novel approach that overcomes this challenge by using the scores evaluated at these region proposals to iteratively regress to better locations of the object using Bayesian inference and Gaussian Process Regression (GPR). I will be making simplifying assumptions to bring out the main idea. The actual implementation can be looked up in their publication.


Approach of Zhang et al. in "Improving Object Detection with CNN via Bayesian Optimization and Structured Prediction"

Let $\mathcal{Y}$ be the set of all possible locations in the image and $f:\mathcal{Y} \rightarrow \mathbb{R}$ be a scoring function that computes our object detector's confidence $f(y)$ for a given region $y$ in the image. Ideally we would like to search for local maxima of $f$, but since we can compute the scores at only selected regions $\{y_1, \cdots, y_N\}$ in the image, we need to somehow regress the locations of the local maxima given the training data $\{(y_1,f(y_1)),\cdots,(y_N,f(y_N))\}$. This is where GPR comes into the picture.

Gaussian Process Regression

A Gaussian process is a set of random variables such that any finite subset of them follows a multivariate Gaussian distribution. In the case of object detection these random variables happen to be the scores of different regions in the image frame, one random variable per region. The score $S_y$ we observe for a region $y$ also has some noise $\epsilon_y$ mixed in:

\begin{equation}
S_y = f(y) + \epsilon_y
\end{equation}

where $\epsilon_y \sim \mathcal{N}(0,\beta^{-1})$. In a usual regression setting we would estimate the parameters of $f(\cdot)$ given a set of observations, so that for a new $y'$ we can compute its score. In GPR, instead of directly specifying a parametric form for $f$, we provide a prior over the set of all possible functions

\begin{equation}

f \sim P(f|\theta)

\end{equation}

Note that here a function is being interpreted as an infinite dimensional vector where each element of the vector specifies the function's value at a particular point in its domain. However, we only ever deal with a finite sample from this distribution, namely the function values at the observed regions. As the name suggests, in GPR the prior is a Gaussian process:

\begin{equation}
(f(y_1),\cdots,f(y_N)) \sim \mathcal{N}(\mathbf{m}_0,\mathbf{K}_0)
\end{equation}

where $\mathbf{m}_0$ is the mean and $\mathbf{K}_0$ is a Gram matrix defined using a kernel $k(y_i,y_j)$. The kernel used in the paper is the squared exponential covariance kernel with automatic relevance determination, which looks like this

\begin{equation}
k(y_i,y_j) = \eta \exp(-(y_i-y_j)^T \Lambda (y_i-y_j))
\end{equation}
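To make this concrete, here is a minimal numpy sketch of such a kernel. The hyperparameters `eta` and the diagonal entries of `Lambda` are placeholders rather than values learned in the paper, and regions are assumed to be encoded as fixed-length coordinate vectors.

```python
import numpy as np

def ard_se_kernel(Y1, Y2, eta=1.0, lam=None):
    """k(y_i, y_j) = eta * exp(-(y_i - y_j)^T Lambda (y_i - y_j)) with a diagonal Lambda.

    Y1: (N1, D) and Y2: (N2, D) arrays of region coordinates (e.g. box parameters).
    lam: length-D vector holding the diagonal of Lambda (per-coordinate relevance).
    Returns the (N1, N2) Gram matrix.
    """
    Y1, Y2 = np.atleast_2d(Y1), np.atleast_2d(Y2)
    if lam is None:
        lam = np.ones(Y1.shape[1])
    diff = Y1[:, None, :] - Y2[None, :, :]                    # pairwise differences (N1, N2, D)
    sqdist = np.einsum('ijd,d,ijd->ij', diff, lam, diff)      # (y_i - y_j)^T Lambda (y_i - y_j)
    return eta * np.exp(-sqdist)
```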

But we are interested in the distribution of $\mathbf{S}=\{S_{y_1},\cdots,S_{y_N}\}$, which, using Bayes' rule and marginalizing over $\mathbf{f}=\{f(y_1),\cdots,f(y_N)\}$, can be written as

\begin{equation}
P(\mathbf{S}) = \int P(\mathbf{S}|\mathbf{f})P(\mathbf{f}) d\mathbf{f}
\end{equation}

To compute this we shall use the following result from Section 2.3.3 of Christopher M. Bishop's "Pattern Recognition and Machine Learning": if

\begin{align}
P(x) &= \mathcal{N}(x|\mu,\Sigma)\\
P(y|x) &= \mathcal{N}(y|Ax+b,\Lambda)
\end{align}

then $P(y) = \mathcal{N}(y|A\mu + b, \Lambda + A\Sigma A^T)$. In our case 

\begin{align}
P(\mathbf{f}) &= \mathcal{N}(\mathbf{f}|\mathbf{m}_0,\mathbf{K}_0) \\
P(\mathbf{S}|\mathbf{f}) &= \mathcal{N}(\mathbf{S}|\mathbf{f},\beta^{-1}\mathbf{I}_N)
\end{align}

which implies that $P(\mathbf{S}) = \mathcal{N}(\mathbf{S}|\mathbf{m}_0,\mathbf{K}_N)$ where $\mathbf{K}_N = \mathbf{K}_0 + \beta^{-1}\mathbf{I}_N$. Furthermore, the mean used in the paper is constant across dimensions, i.e. $\mathbf{m}_0 = m_0 \mathbf{1}_N$.

Now let us consider a new region $y_{N+1}$ for which we want to compute a distribution over its score $S_{y_{N+1}}$. In order to get this we shall first compute the joint distribution of $\mathbf{S}_{N+1} = \{\mathbf{S},S_{y_{N+1}}\}$ which by straightforward extension is given by

\begin{equation}

P(\mathbf{S}_{N+1}) = \mathcal{N}(\mathbf{m}_0,\mathbf{K}_{N+1})

\end{equation}

where $\mathbf{K}_{N+1}$ can be written in terms of $\mathbf{K}_N$ as 

\begin{equation}
\mathbf{K}_{N+1} =
\left[
\begin{array}{cc}
\mathbf{K}_N & \mathbf{k}_{N+1}\\
\mathbf{k}_{N+1}^T & k_{N+1}
\end{array}
\right]
\end{equation}

where 

\begin{align}
\mathbf{k}_{N+1} &= [k(y_{N+1},y_1), \cdots, k(y_{N+1},y_N)]^T \\
k_{N+1} &= k(y_{N+1},y_{N+1}) + \beta^{-1}
\end{align}

Finally we condition on the observed scores to get the distribution of $S_{y_{N+1}}$ given all the previous scores $\mathbf{S}$. For this we refer to the following property of Gaussian distributed random variables from Section 2.3.1 of Bishop's book. Let $x_a,x_b$ partition the variables distributed according to a multivariate Gaussian distribution, with a corresponding partitioning of the mean and covariance matrix as

\begin{align}
\mathbf{\mu} &=
\left[
\begin{array}{c}
\mathbf{\mu}_{a}\\
\mathbf{\mu}_{b}
\end{array}
\right] \\
\mathbf{\Sigma} &=
\left[
\begin{array}{cc}
\mathbf{\Sigma}_{aa} & \mathbf{\Sigma}_{ab} \\
\mathbf{\Sigma}_{ba} & \mathbf{\Sigma}_{bb}
\end{array}
\right]
\end{align}

then

\begin{equation}
P(\mathbf{x}_b|\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_b|\mathbf{\mu}_b+\mathbf{\Sigma}_{ba} \mathbf{\Sigma}_{aa}^{-1} (\mathbf{x}_a-\mathbf{\mu}_a), \: \mathbf{\Sigma}_{bb} - \mathbf{\Sigma}_{ba}\mathbf{\Sigma}_{aa}^{-1}\mathbf{\Sigma}_{ab})
\end{equation}

Substituting the terms we get

\begin{align}
P(S_{y_{N+1}}|\mathbf{S}) = \mathcal{N}(S_{y_{N+1}}|\mu,\sigma^2)
\end{align}

where $\mu = m_0 + \mathbf{k}_{N+1}^T\mathbf{K}_N^{-1}[S_{y_j}-m_0]_{j=1,\cdots,N}$ and $\sigma^2 = k_{N+1}-\mathbf{k}_{N+1}^T\mathbf{K}_N^{-1}\mathbf{k}_{N+1}$. This is the primary result of GPR: it gives a distribution over the score at any new location given the scores at all previously observed locations.
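A minimal numpy sketch of this predictive computation, reusing the `ard_se_kernel` sketch from above; `m0`, `beta` and the kernel hyperparameters are illustrative placeholders, and a more careful implementation would use a Cholesky factorization rather than repeated linear solves.

```python
import numpy as np

def gp_predict(Y_train, S_train, y_new, m0=0.0, beta=10.0, eta=1.0, lam=None):
    """Posterior mean and variance of the score at y_new given observed scores S_train.

    Implements mu = m0 + k_{N+1}^T K_N^{-1} (S - m0) and
    sigma^2 = k_{N+1} - k_{N+1}^T K_N^{-1} k_{N+1}, with K_N = K_0 + (1/beta) I.
    """
    Y_train = np.atleast_2d(Y_train)
    S_train = np.asarray(S_train, dtype=float)
    K_N = ard_se_kernel(Y_train, Y_train, eta, lam) + np.eye(len(S_train)) / beta
    k_vec = ard_se_kernel(Y_train, y_new, eta, lam).ravel()              # k_{N+1} vector
    k_new = ard_se_kernel(y_new, y_new, eta, lam)[0, 0] + 1.0 / beta     # k_{N+1} scalar
    mu = m0 + k_vec @ np.linalg.solve(K_N, S_train - m0)
    var = k_new - k_vec @ np.linalg.solve(K_N, k_vec)
    return mu, var
```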

Fine-grained search using Bayesian Optimization

Ideally we would like to pick a new location $y_{N+1}$ that would lead to the maximum improvement in score over the best scoring location observed so far. However, since we only have a distribution over the score at any given location, we instead pick the next location that maximizes the expected improvement in score, which is given by

\begin{align}
a_{EI}(y_{N+1}|\mathbf{S}) = \int_{\hat{S}_N}^{\infty} (S_{y_{N+1}}-\hat{S}_N)\,P(S_{y_{N+1}}|\mathbf{S})\,dS_{y_{N+1}}
\end{align}

where $\hat{S}_N = \max_{j=1,\cdots,N}S_{y_j}$. The paper provides a closed form solution of this acquisition function in terms of $\mu$, $\sigma$ and the cumulative distribution function of the normal distribution; please refer to equation 6 in the paper for the exact expression. In practice, instead of analytically maximizing $a_{EI}$, it is used to score the remaining region proposals.
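For reference, the textbook closed form of expected improvement for a Gaussian predictive distribution is sketched below; I am not claiming it is literally identical to equation 6 of the paper, only that it is the standard expression in terms of $\mu$, $\sigma$ and the normal CDF.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, var, s_best):
    """a_EI for a predictive distribution N(mu, var) and current best score s_best."""
    sigma = np.sqrt(max(float(var), 1e-12))
    z = (mu - s_best) / sigma
    return (mu - s_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

The remaining region proposals would then be ranked by this value and the highest-scoring one evaluated by the CNN next, as in the summary that follows.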

In summary the steps are:
  1. Region Proposal: Propose regions using selective search or similar techniques
  2. Pruning: Discard proposals with low classification scores and select some top ranked regions
  3. Find local maxima of the score among the remaining proposals; each such location potentially corresponds to a different object instance
  4. For each local maximum select $N$ regions in its vicinity whose scores correspond to $\mathbf{S}$
  5. Fit a GP model using these sample scores
  6. Select one of the remaining regions using $a_{EI}$ as the scoring criterion
  7. Repeat until there are no more acceptable proposals


Experimental results


The above graph shows a detection performance comparison in terms of mean average precision (mAP) between variants of Selective Search (SS) and the proposed approach (SS+FTS), using an oracle scoring function on the PASCAL VOC 2007 detection benchmark. An oracle scoring function is one that gives us the true score, which in the case of object detection means it always assigns a higher score to a better localization; bounding box intersection over union (IOU) with the ground truth was used as the oracle scorer. SS + Local random search follows steps 1 through 3 but then randomly samples bounding boxes in the vicinity of the detected local maxima. The horizontal axis is the increasingly strict criterion (IOU with the ground truth region) for declaring a predicted region a correct detection.

Clearly, for lower IOU thresholds all methods perform quite well, meaning that Selective Search quite often predicts a detection region somewhere in the vicinity of the ground truth. However, as we demand more accurate localizations, most methods drop to an mAP of around 30%, except the proposed approach, which maintains a performance of 95%. It is no surprise that increasing the number of SS region proposals from 2000 to 10000 does better, hence the mAP of 50% for SS quality. What is surprising is that SS+FTS with just 2100 boxes beats a purely region-proposal-driven detection approach by a margin of 45% at an IOU of 0.9.


Mean Average Precision (mAP)

A little about the metric itself. It is important to understand metrics in order to understand how algorithms compare to each other. A good metric should reflect the expectations from the algorithm. For detection we expect two things from our detector:
  1. If there is an object it should detect it which is quantified by Recall
  2. If it believes there is an object at a location then there actually should be an object there which is quantified by Precision
Note that achieving one does not necessarily mean achieving the other. For example, a detector that says there is an object at every location has perfect recall but extremely low precision. Similarly, a detector that always says there is no object at any location has perfect precision but zero recall. Hence there is a sweet spot to be identified.

So now assume our algorithm produces a list of detection regions along with corresponding confidence scores for each image. These detections are ranked in order of decreasing confidence across all images. Each detection is then marked as a true positive (TP) or a false positive (FP) by assigning it to the closest ground truth and comparing the IOU with a threshold (the x axis of the above plot). For each position $k$ in the ranked list one can compute the precision and recall at top $k$ as $\frac{\#TP}{k}$ and $\frac{\#TP}{\#P}$ respectively. Note that recall is monotonically non-decreasing with $k$ but precision may increase or decrease. So, to avoid the wiggle, we interpolate the precision as follows

\[
p_{interp}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})
\]

where $p(\tilde{r})$ is the precision at recall $\tilde{r}$. Note that we only have discrete measurements at $\tilde{r} = r(k)$ for some $k$. A summary of the precision-recall curve is the average precision, defined as

\[
AP = \frac{1}{11}\sum_{r\in\{0,0.1,\cdots,1\}} p_{interp}(r)
\]

Finally, since detectors for multiple contending classes are evaluated simultaneously, the APs of the different classes are averaged together and reported as mAP.
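As a concrete illustration, here is a minimal sketch of the 11-point interpolated AP described above; it assumes the detections have already been matched to ground truth and flagged as TP or FP.

```python
import numpy as np

def average_precision_11pt(scores, is_tp, num_gt):
    """11-point interpolated AP in the PASCAL VOC 2007 style.

    scores: confidence of each detection, is_tp: 1 for TP and 0 for FP (after IOU matching),
    num_gt: total number of ground-truth objects (#P).
    """
    order = np.argsort(-np.asarray(scores))            # rank detections by decreasing confidence
    tp_cum = np.cumsum(np.asarray(is_tp)[order])
    k = np.arange(1, len(tp_cum) + 1)
    precision = tp_cum / k                              # #TP / k at each rank
    recall = tp_cum / float(num_gt)                     # #TP / #P at each rank
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0   # max precision at recall >= r
        ap += p_interp / 11.0
    return ap
```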


Conclusion

Poor recall from region proposal algorithms can directly hamper the performance of even the best of detection scoring functions. Gaussian Process Regression along with Bayesian optimization seems to be a very good strategy for efficiently overcoming this issue while keeping the computational advantage of dealing with a sparse set of locations instead of an exhaustive search. 

Saturday, 11 July 2015

DTAM: Dense Tracking and Mapping

Update: While working on a project I implemented and introduced several modifications to DTAM. For more information please refer to https://arxiv.org/abs/1606.05002. A video for the same submission is available here.

While most VSLAM systems live off robust feature detection and tracking, DTAM is a novel pixel-wise approach which uses the availability of many low baseline frames to its advantage in both creating dense maps and using them for robust camera pose estimation. The original paper by Richard Newcombe can be found here. The following video shows the depth maps created by DTAM and demonstrates tracking quality superior to PTAM.


Like PTAM, it has two distinct parts - pose tracking and 3D mapping. But unlike PTAM, which keeps track of only a sparse set of 3D points, DTAM maintains dense depth maps for selected keyframes.

Dense Mapping

The construction of the depth map for a reference image $r$ from a set of short baseline frames denoted by $\mathcal{I}(r)$ is posed as a global minimization problem. Central to this optimization is the definition of a cost volume, $C_r$.

Cost Volume

The cost volume $C_r(I_m)$ defines the photometric error of every pixel in the reference frame with respect to a frame $m \in \mathcal{I}(r)$ as a function of inverse depth. In practice, of course, the inverse depth is discretized into $L$ values so that the cost volume can be calculated and stored as an $M \times N \times L$ matrix, where $M$ and $N$ are the number of rows and columns in the image. More specifically, under the brightness constancy assumption, $C_r(u, d)$ is a measure of how different the intensity of a pixel $u$ in image $I_r$ would be from that of its projection in any of the images in $\mathcal{I}(r)$, assuming that the inverse depth of the pixel in $I_r$ is $d$. This can be written mathematically as
\begin{equation}
C_r(u,d) = \frac{1}{|\mathcal{I}|}\sum_{m \in \mathcal{I}} \|\rho(I_m,u,d) \|_1
\end{equation}
where
\begin{equation}
\rho(I_m,u,d)  = I_r(u) - I_m(\pi(KT_{mr}\pi^{-1}(u,d)))
\end{equation}
While the above equation looks terrifying, it's really simple once broken down into pieces. Here goes:
1. Project pixel $u$ with inverse depth $d$ from pixel space to camera space using camera intrinsic matrix $K$

\begin{equation}
\pi^{-1}(u,d) = \frac{1}{d}K^{-1}
\left[
\begin{array}{c}
u_x \\
u_y \\
1
\end{array}
\right]
\end{equation}
2. Transform the 3D point (using rigid body transformation) from the coordinate system of camera $r$ to camera $m$

\begin{equation}
T_{mr}\pi^{-1}(u,d)
\end{equation}
3. Project the 3D point in camera $m$'s coordinate system to its pixel coordinates
\begin{equation}
\pi(KT_{mr}\pi^{-1}(u,d))
\end{equation}
where $\pi(x,y,z) = (x/z, y/z)$ is a dehomogenization operation
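Here is a heavily simplified numpy sketch of this projection chain for one frame $m$ and one inverse depth value; the full cost volume averages the result over all frames in $\mathcal{I}(r)$ and stacks the $L$ depth samples. Nearest-neighbour sampling, no boundary handling, and all names are my own.

```python
import numpy as np

def photometric_error_slice(I_r, I_m, K, T_mr, inv_depth):
    """|I_r(u) - I_m(pi(K T_mr pi^{-1}(u, d)))| for every pixel u at a single inverse depth d.

    I_r, I_m: (M, N) grayscale images; K: 3x3 intrinsics; T_mr: 4x4 transform from frame r to m.
    Returns an (M, N) array of L1 photometric errors (NaN where the projection leaves I_m).
    """
    M, N = I_r.shape
    v, u = np.mgrid[0:M, 0:N]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(M * N)])          # homogeneous pixels, (3, MN)
    X_r = np.linalg.inv(K) @ pix / inv_depth                        # pi^{-1}(u, d) = (1/d) K^{-1} u
    X_m = (T_mr @ np.vstack([X_r, np.ones(M * N)]))[:3]             # rigid transform into frame m
    proj = K @ X_m
    u_m = np.round(proj[0] / proj[2]).astype(int)                   # dehomogenize pi(.)
    v_m = np.round(proj[1] / proj[2]).astype(int)
    ok = (u_m >= 0) & (u_m < N) & (v_m >= 0) & (v_m < M)
    err = np.full(M * N, np.nan)
    err[ok] = np.abs(I_r.ravel()[ok] - I_m[v_m[ok], u_m[ok]])
    return err.reshape(M, N)
```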


Note that the cost volume is an average of  photometric error over multiple frames because using just one frame would produce noisy estimates. The following figure shows this effect.

The thin colored lines in the 3 images show the photometric errors at 3 different pixels a, b and c for different frames $m \in \mathcal{I}(r)$ as a function of inverse depth. The bold red line shows the averaged cost.
An important point to note in the above figure is that the minima in all 3 plots are located at the correct locations relative to each other in terms of the depth ordering of the pixels a, b and c. Also, pixels with more discriminative patches are better localized.

Optimization objective function

The objective function for obtaining a dense map is a regularized version of the cost that we described in the previous section. The regularization is a penalty term to ensure smoothness. Note that while a depth map is smooth in general, there would be discontinuities arising due to occlusion boundaries. Image gradients are often an important cue for such discontinuities. To incorporate such cases the regularization is chosen to be a weighted Huber norm of the inverse depth map gradient. So if the inverse depth map is denoted by $d = \xi(u)$, then the regularization term is
\begin{equation}
g(u)\|\nabla \xi(u)\|_{\epsilon}
\end{equation}
where $\|.\|_{\epsilon}$ is the Huber loss, a robust loss function that does not penalize outliers too harshly, thereby reducing their effect on the optimization. The Huber loss is defined by
\begin{equation}
\|x\|_{\epsilon} = \left\{
\begin{array}{ll}
\frac{\|x\|_2^2}{2\epsilon} & \|x\|_2 \leq \epsilon \\
\|x\|_1 - \frac{\epsilon}{2} & otherwise
\end{array}
\right.
\end{equation}
$g$ is an intensity-gradient-based weight, which adaptively reduces the regularization at intensity edges
\begin{equation}
g(u) = e^{-\alpha \|\nabla I_r(u)\|_2^{\beta}}
\end{equation}
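For completeness, a small numpy sketch of these two ingredients; the values of `eps`, `alpha` and `beta` are placeholders, not the ones used in the paper.

```python
import numpy as np

def huber_norm(x, eps):
    """||x||_eps as defined above: quadratic in ||x||_2 below eps, L1 minus eps/2 above.

    x: (..., 2) array of per-pixel gradient vectors. Returns a per-pixel penalty.
    """
    n2 = np.linalg.norm(x, axis=-1)
    n1 = np.abs(x).sum(axis=-1)
    return np.where(n2 <= eps, n2 ** 2 / (2 * eps), n1 - eps / 2)

def edge_weight(I_r, alpha=1.0, beta=1.0):
    """g(u) = exp(-alpha * ||grad I_r(u)||_2^beta), reducing regularization at intensity edges."""
    gy, gx = np.gradient(I_r.astype(float))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return np.exp(-alpha * grad_mag ** beta)
```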

So the regularized objective function is
\begin{equation}
E(\xi) = \int_u \{ g(u)\|\nabla \xi(u)\|_{\epsilon} + \lambda C_r(u,\xi(u))  \} du
\end{equation}
So now we have a cost function that we need to minimize. As is well known, coming up with an optimization problem is easy; solving it, however, is an entirely different beast.

Solving the minimization problem

For computation, the inverse depth map is represented by an $MN\times 1$ vector $d$ and the regularization weights by an $MN\times 1$ vector $g$.

The first step is to approximate the regularization term by $\|AGd\|_{\epsilon}$, where $A$ is the matrix such that $Ax = \nabla x$ and $G = \mathrm{diag}(g)$. Now our optimization problem looks like
\begin{equation}
d^* = \underset{d}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \lambda \mathbb{1}^T C(d)
\end{equation}
But this optimization is hard to attack directly because $C(d)$ is just a lookup table: given the inverse depth at every pixel, we look up the corresponding values from the cost volume and arrange them in an $MN\times 1$ vector, so the data term has no convenient analytic form. The two terms therefore need to be decoupled somehow.

The decoupling uses two tricks.
a) Introduction of an auxiliary variable, or rather a vector of variables, $a$ of size $MN\times 1$:
\begin{equation}
d^*, a^* = \underset{d,a}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
Here $\theta$ controls how tightly coupled $d$ and $a$ are. $\theta$ is initially kept large but is decreased with every iteration as we trust the values of $a$ and $d$ more and more.
At this point a coordinate-descent style iterative convex optimization is already possible, as follows:
Beginning with an initial value of $d$, for each pixel do a point-wise exhaustive search for $a$ over the range of inverse depth values, i.e.
\begin{equation}
\hat{a_i} = \underset{a_i}{\operatorname{argmin}} \frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)
\end{equation}
Now using $\hat{a}$ solve the convex optimization problem
\begin{equation}
\hat{d} = \underset{d}{\operatorname{argmin}}  \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-\hat{a})^2
\end{equation}
Note that this problem is convex because the Huber norm is convex and so is the quadratic term.
To get the solution, iterate between the above two steps until convergence. However, to avoid solving the large optimization problem directly, the authors use a second trick.

b) Legendre-Fenchel Transform
To understand this we need some background in convex optimization.
A convex function $f(x)$ can be written as
\begin{equation}
f(x) =  \underset{y}{\operatorname{sup}} x^Ty - f^*(y)
\end{equation}
where $f^*(x)$ is the convex conjugate of $f$. The convex conjugate is given by the exact same equation above but with $f$ and $f^*$ exchanged; in fact $f^{**} = f$ for a convex function. Now let us compute the conjugate of the Huber norm $H(x)$.
\begin{equation}
H^*(x) = \underset{y}{\operatorname{sup}} x^Ty - H(y)
\end{equation}
Taking the derivative and setting it to $0$ gives us
\begin{align}
& x - \mathbb{I}(\|y\|_2\leq \epsilon) \frac{y}{\epsilon} - \mathbb{I}(\|y\|_2 > \epsilon) sign(y) = 0 \\
\implies & y = x\epsilon \; \text{, when } \|x\|_2\leq 1 \\
\implies & H^*(x) = \frac{\epsilon\|x\|_2^2}{2} \; \text{, when } \|x\|_2\leq 1
\end{align}
Note that for $\|x\|_2>1$ if we draw an analogy with the 1 dimensional case where $x,y\in \mathbb{R}$, it can be seen that the gradient becomes positive making $x^Ty - H(y)$ a monotonically increasing function of $y$ and thus
\begin{equation}
H^*(x) = \infty \; \text{, when } \|x\|_2 > 1
\end{equation}
This can be written more concisely as $H^*(x) = \epsilon\|x\|^2/2 + \delta(x)$, where $\delta(x) = \infty$ when $\|x\|_2>1$ and $0$ otherwise. Using the Legendre-Fenchel Transform (LFT) we can then write the Huber norm as
\begin{equation}
H(x) = \underset{y}{\operatorname{max}} x^Ty - \epsilon\|y\|^2/2 - \delta(y)
\end{equation}
Equipped with this information we can replace $\|AGd\|_\epsilon$ in our energy function with its LFT
\begin{equation}
\underset{q}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2 - \delta(q)\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
and our final optimization problem can be written as
\begin{equation}
\underset{d,a}{\operatorname{min}} \; \underset{q : \|q\| \leq 1}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)
\end{equation}
But how does this math wizardry help us? Well, now, given $\hat{a}$, we can do gradient descent over $d$ and gradient ascent over $q$, since we are minimizing with respect to $d$ and maximizing with respect to $q$ simultaneously. For more details on the derivatives, refer to Section 2.2.3 in the paper. Given $\hat{d}$ and $\hat{q}$, $a$ is obtained as before using a point-wise search. As if all this wasn't enough, the authors of DTAM go on to describe a way to speed up the point-wise search, covered in the following section.

Accelerating Point-wise Search
From the cost volume it is trivial to obtain, for every pixel $u$, the bounds on the data cost $C_u^{min}=C_u(d^{min})$ and $C_u^{max}=C_u(d^{max})$. Now for $\underset{a}{\operatorname{min}}\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)$, the minimum value can never exceed $\lambda C_u^{max}$, since setting $a_i=d_i$ already achieves a value of at most $\lambda C_u^{max}$. Using this fact we can bound the search range of $a_i$ about $d_i$ through the following constraint
\begin{align}
\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C_u^{min} &\leq \lambda C_u^{max} \\
\implies a_i &\in \left[ d_i-\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})}, d_i+\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})} \right]
\end{align}
Note that since $\theta$ is decreased with every iteration, this bound becomes narrower and narrower allowing accelerated search for optimal $a$. The following image shows this phenomenon

The white band denotes the search region
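Putting the bound to work, here is a sketch of the accelerated point-wise search for a single pixel; `depth_samples` is the discretized inverse-depth range of the cost volume and `cost_u` the corresponding slice of $C$ for that pixel (the names are mine).

```python
import numpy as np

def pointwise_search(d_i, cost_u, depth_samples, theta, lam):
    """argmin_a (1/(2 theta)) (d_i - a)^2 + lam * C(u, a), restricted to the band around d_i."""
    c_min, c_max = cost_u.min(), cost_u.max()
    radius = np.sqrt(2.0 * theta * lam * (c_max - c_min))     # half-width of the search band
    mask = np.abs(depth_samples - d_i) <= radius
    if not np.any(mask):                                      # band narrower than the grid spacing
        return d_i
    a_cand = depth_samples[mask]
    energy = (d_i - a_cand) ** 2 / (2.0 * theta) + lam * cost_u[mask]
    return a_cand[np.argmin(energy)]
```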

Tracking

Tracking is done in two stages. The first stage performs rotation-only odometry, assuming no translation, using "S. J. Lovegrove and A. J. Davison. Real-time spherical mosaicing using whole image alignment. ECCV 2010", which is resilient to motion blur. The second stage is a full 6 DOF 2.5D alignment. The two pose estimates are composed to get the final pose. I will only be discussing the second stage because it is different from the way I would have approached it. I would have simply back-projected the inverse depth map to a point cloud and solved for the rotation and translation using the Iterative Closest Point technique in 3D space. Their way, however, seems more robust as it doesn't rely on nearest neighbor point correspondences. The DTAM approach is as follows:

The rotational odometry gives us a pose $T_{wv}$. The dense model is projected into this pose to give us an image $I_v$ and inverse depth map $\xi_v$. If the current frame is denoted by $l$, then we need to estimate $T_{vl}$ so that $T_{wv}T_{vl}$ gives us the pose of the current camera. Let $T_{vl}(\psi)$ denote a parametrized representation of the 6 DOF camera pose. Then we want to minimize the pixel-wise photometric error
\begin{equation}
F(\psi) = \frac{1}{2} \sum_u (f_u(\psi))^2 = \frac{1}{2}\|f\|^2\\
\end{equation}
where,
\begin{equation}
f_u(\psi) = I_l(\pi(KT_{lv}(\psi)\pi^{-1}(u,\xi_v(u)))) - I_v(u)
\end{equation}
The above equation can be interpreted in exactly the same way as the photometric error used to define the cost volume above. To solve this, $f_u$ is first approximated by $\hat{f}_u(\psi)$, its first order Taylor series expansion about $0$, i.e. $\hat{f_u}(\psi) = f_u(0) + \nabla f_u(0) \psi$. Now, to minimize $F$ we require its gradient to be $0$. This implies
\begin{align}
&\nabla F = 0 \\
\implies &\nabla \hat{f}(\psi)^{T}\hat{f}(\psi) = 0 \\
\implies &\nabla f(0)^T(f(0) + \nabla f(0)\psi) = 0
\end{align}
One solution to the last equation is obtained by setting $f(0) + \nabla f(0)\psi = 0$. This system of linear equations can be solved to get $\psi$. The $\psi$ is used to update the virtual camera pose $T_{wv}$ and the process can be iterated until convergence.
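A generic sketch of one such Gauss-Newton style step is given below; the residual function `f` (the stacked per-pixel photometric errors as a function of the 6-vector $\psi$) is assumed to be supplied by the caller, and the Jacobian is approximated numerically rather than with the analytic derivatives used in the paper.

```python
import numpy as np

def gauss_newton_step(f, dim=6, h=1e-6):
    """Linearize f about psi = 0 and solve f(0) + J psi = 0 in the least-squares sense."""
    psi0 = np.zeros(dim)
    r0 = f(psi0)                                       # f(0), shape (num_pixels,)
    J = np.zeros((r0.size, dim))
    for k in range(dim):                               # numerical Jacobian, one column at a time
        dp = np.zeros(dim)
        dp[k] = h
        J[:, k] = (f(dp) - r0) / h
    psi, *_ = np.linalg.lstsq(J, -r0, rcond=None)      # least-squares solution of J psi = -f(0)
    return psi
```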

This completes our technical discussion of DTAM.

Conclusion

My first thought when I read the DTAM paper was that it was incredibly complex to implement and get right, let alone getting it to run in real time. But at the same time it is a very novel approach to the problem of dense depth map construction and also camera pose estimation using dense per pixel photometric error.

I have tried my best to understand, simplify and explain the techniques used in the paper, while also at times discussing alternative methods and why they are less preferable to the ones used in DTAM.

This blog took me many evenings to complete but I am glad I saw it through from start to finish :)

Thursday, 18 June 2015

PTAM: Parallel Tracking and Mapping


SLAM, or Simultaneous Localization and Mapping, refers to a range of techniques used for localizing a moving object in a 3D world coordinate system while simultaneously generating a 3D map of its environment based on real-time data gathered from various sensors placed on the object. Sensors used for this include inertial measurement units (gyroscope, accelerometer, magnetometer, etc., often used in mobile phones and aircraft) and range scanners such as LIDAR (used in Google's self-driving car) and SONAR (used in submarines). However, with advancements in Computer Vision, images taken from a camera mounted on the moving object have been shown to be good enough for the task. Not surprisingly, these techniques are studied under the banner of Visual SLAM. Unlike Structure From Motion (SFM), which is also a method for computing the orientation and location of cameras and the positions of sparse feature points from multiple images of the scene, SLAM assumes spatio-temporal continuity between frames. In places where high precision is a necessity, such as the landing system installed on NASA's Mars rover, multiple sensors need to be used together, giving rise to the domain of Sensor Fusion. The following video is an example of a Visual SLAM system





PTAM is a real-time Visual SLAM system published by Georg Klein and David Murray in 2007. As the name suggests, the system splits the two components of VSLAM - tracking points across images and mapping, i.e. adding novel points to the 3D map - into two separate parallel threads on a dual core computer. The reasons for doing this are several:

1. Mapping involves triangulating new points, localizing new cameras, and refining already triangulated points and camera poses. This refinement, known as bundle adjustment (BA), is computationally expensive and cannot be done for every video frame. Tracking, on the other hand, only needs to know where each point is located in each frame and where the camera is located and how it is oriented in each of the images.

2. The time saved on the tracking thread by not having to carry the burden of bundle adjustment can be used for a more exhaustive search for matching points across frames, which is crucial for accuracy.

3. Consecutive video frames often carry redundant information due to high frame rates. Since tracking is no longer dependent on mapping, the mapping thread can choose to run BA only on selected keyframes, saving enough time to interleave global BA with local BA steps. This helps prevent the drift that occurs when BA is only performed locally.

Map and track representation
The map is a collection of $M$ 3D points in the world coordinate system $\mathcal{W}$. Each point represents a locally planar textured patch. The coordinates of the $j^{th}$ point are represented as $p_{j \mathcal{W}} = \left(x_{j \mathcal{W}}, y_{j \mathcal{W}}, z_{j \mathcal{W}}, 1 \right)$. Each point also has a unit normal and a reference to its source pixel.

The map also consists of $N$ keyframes, each represented as a 4-level Gaussian pyramid. Each keyframe has an associated camera-centered coordinate system denoted by $K_i$, and the transformation between this coordinate system and the world is given by $E_{K_i \mathcal W}$. The reference to the source pixel thus consists of the index of the keyframe in which the point first appears, the level of the pyramid, and the coordinates of the point.

Tracking Thread
When a new frame is acquired, its pose is predicted using a decaying velocity motion model. The authors don't spell out the details of the motion model but say that it is similar to an alpha-beta constant velocity model without any new measurements. The alpha-beta constant velocity model is given by

\begin{align}
\hat{x}_t^- &= x_{t-1} + v \Delta T \\
\hat{v}_t^- &= v_{t-1} \\
\hat{x}_t^+ &= \hat{x}_t^- + \alpha (x_t - \hat{x}_t^-) \\
\hat{v}_t^+ &= \hat{v}_t^- + \beta (x_t - \hat{x}_t^-) / \Delta T
\end{align}

The first two equations are the equations of kinematics without acceleration, and the last two incorporate the measurement $x_t$ to make a correction. Since we don't have any measurement of the camera pose during prediction, we can just set the correction term to $0$ and instead decay the velocity estimate over time. This is perhaps what is meant by the decaying velocity model, but I might be wrong.
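A toy sketch of this reading, treating the pose as a flat vector for simplicity (PTAM of course works on SE(3)); the decay factor is an assumption of mine, not a value from the paper.

```python
import numpy as np

class DecayingVelocityModel:
    """Constant-velocity prediction whose velocity is decayed instead of measurement-corrected."""

    def __init__(self, decay=0.9):
        self.decay = decay              # assumed decay factor
        self.velocity = None

    def predict(self, pose, dt=1.0):
        """x_t^- = x_{t-1} + v * dt; with no velocity estimate yet, return the previous pose."""
        if self.velocity is None:
            return np.array(pose, dtype=float)
        return pose + self.velocity * dt

    def update(self, prev_pose, refined_pose, dt=1.0):
        """After tracking refines the pose, blend in the observed velocity and decay it."""
        v_obs = (refined_pose - prev_pose) / dt
        self.velocity = v_obs if self.velocity is None else 0.5 * (self.velocity + v_obs)
        self.velocity = self.decay * self.velocity
```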

Once the pose is predicted, the current 3D map can be projected onto the new image plane (the camera intrinsics are assumed to be known) and the neighborhood of each projected location is searched for a patch correspondence. First only a few points are projected and the camera pose is updated from these matches. Then a larger number of points are projected and searched, and the pose is refined again.

Tricks Worth Noting: Patch Search
The paper describes some rather nifty tricks for patch search. Some salient features were:

1. An 8x8 patch in the source frame is searched in the target frame around a neighborhood of the predicted projection of the point. 

2. The template patch is first affine warped to accommodate viewpoint changes. The affine warp is given by

\begin{align}
A =
\left[
\begin{array}{cc}
\frac{\partial{u_t}}{\partial{u_s}} & \frac{\partial{u_t}}{\partial{v_s}} \\
\frac{\partial{v_t}}{\partial{u_s}} & \frac{\partial{v_t}}{\partial{v_s}}
\end{array}
\right]
\end{align}

where $(u_s,v_s)$ are the coordinates of the source pixel and $(u_t,v_t)$ are the coordinates of the target pixel. The matrix is computed by measuring the change in the target pixel location for a unit change in the source pixel location.

3. The patch is searched for in the pyramid level whose scale matches that of the original patch. The level $l$ is chosen such that $\frac{|A|}{4^l}$ is closest to unity. The idea is that $|A|$ is a measure of the area occupied by the source pixel in the target image. So if in the highest resolution image the pixel's area is measured as 4, then at the first pyramid level the pixel's area would be 1, i.e. comparable to the size of the source pixel. Note that the template patch is selected from its designated source level. A small sketch of this rule is given after this list.

4. Instead of searching all points in the neighborhood of the predicted projection, SSD scores are calculated only for FAST corners (generated without non-maximum suppression) lying inside a circular region around the point.
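As referenced in point 3, here is a tiny sketch of the level selection rule, assuming a 4-level pyramid:

```python
import numpy as np

def search_level(A, num_levels=4):
    """Pick the pyramid level l for which |det A| / 4^l is closest to unity."""
    det_a = abs(np.linalg.det(A))                           # area of the warped source pixel
    ratios = [det_a / 4 ** l for l in range(num_levels)]
    return int(np.argmin([abs(r - 1.0) for r in ratios]))
```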

Pose Update
For updating the pose, the following robust objective function of the reprojection error $e_j$ is minimized with respect to the new camera's pose

\begin{equation}
\min_{\mu} \sum_{j \in S} Obj(\frac{|e_j|}{\sigma_j},\sigma_T)
\end{equation}
where $\mu$ is the current camera's pose, which is being optimized over, $S$ is the set of patches matched in the previous step, $\sigma_j$ is the assumed measurement noise and $Obj(r,\sigma_T)$ is the Tukey biweight objective function as defined here. The camera pose relative to the source image's camera is represented by 6 parameters - 3 for the translation and 3 for the rotation expressed in exponential map form. The exponential map is a mapping from the axis-angle representation (the Lie algebra so(3)) of a 3D rotation to a rotation matrix (the Lie group SO(3)). Rodrigues' formula is a compact expression for this mapping and is given by

\begin{equation}
R = \exp(\zeta) = I + \sin \theta \, K + (1-\cos \theta) K^2
\end{equation}
with $\zeta = \theta \mathbf{v} $ where $\theta $  is the angle of rotation about the axis represented by unit vector $\mathbf{v}$ and $K = [\mathbf{v}]_\times$ is the cross product matrix for vector $\mathbf{v}$.
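Here is a direct numpy transcription of this formula; the small-angle guard is my own addition.

```python
import numpy as np

def rodrigues(zeta):
    """Exponential map so(3) -> SO(3): R = I + sin(theta) K + (1 - cos(theta)) K^2."""
    theta = np.linalg.norm(zeta)
    if theta < 1e-12:
        return np.eye(3)                                # (near-)zero rotation
    v = zeta / theta                                    # unit rotation axis
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])                  # cross-product matrix [v]_x
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```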

But do we really need such a complicated parametrization? Yes! This is because the above objective function is minimized by iterative techniques such as gradient descent, which means that gradient updates get added to the pose parameters. This would cause a problem with the usual 9-parameter representation of the rotation matrix, because addition is not a valid operation on SO(3): simply put, adding two rotation matrices (or a rotation and a non-rotation matrix) does not necessarily yield a rotation matrix. Addition is, however, perfectly valid in the vector space of the Lie algebra.

Useful Information: Derivative of Rodrigues Formula
The last question for the pose update is how to calculate the gradient of this complicated objective function. No need to freak out! A closed form expression for the derivative of the Rodrigues formula is provided in equation 9 here, and the derivative of the objective function can be obtained by applying the chain rule.

Mapping Thread
The mapping thread first initializes the map using two-view stereo, where the views are selected with user cooperation. The map is then refined and expanded whenever a new keyframe is added.

Steps for initialization:

1. The user places the camera above the scene and presses a key. This selects the first keyframe. 1000 FAST corners are computed on this frame. These points are tracked (perhaps using a KLT tracker, since the tracking thread assumes an initial map and hence cannot be used) until the acquisition of the second keyframe.

2. The user then smoothly moves the camera to a slightly offset position and presses a key again to select the second keyframe.

3. Now that we have two keyframes and correspondences between them, an essential matrix can be computed using the 5-point algorithm and RANSAC. The essential matrix can be decomposed to get a rotation matrix and translation vector for the second keyframe with respect to the first. This pose estimate can be used to triangulate the points. The authors don't mention whether the triangulation is done linearly or non-linearly. A rough sketch of this step is given after the list below.

4. Since the translation vector and the triangulated point positions obtained in the previous step are arbitrary up to a scale factor, the scale is chosen by assuming that the camera translated 10 cm between the stereo pair. The map is also oriented so that the dominant detected plane lies at $z=0$.
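As referenced in step 3, here is a rough sketch of the two-view initialization using OpenCV; PTAM uses its own five-point/RANSAC implementation, so this is only an illustration of the pipeline. `pts1`, `pts2` are the tracked corner correspondences (N x 2 float arrays) and `K` the camera intrinsics.

```python
import cv2
import numpy as np

def init_map(pts1, pts2, K):
    """Essential matrix via 5-point + RANSAC, pose recovery, triangulation, and scale fixing."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])        # first keyframe at the origin
    P2 = K @ np.hstack([R, t])                               # second keyframe (scale arbitrary)
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)      # homogeneous 3D points, (4, N)
    X = X_h[:3] / X_h[3]
    scale = 0.10 / np.linalg.norm(t)                         # assume a 10 cm stereo baseline
    return R, t * scale, (X * scale).T                       # pose of keyframe 2 and the 3D points
```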

Map Expansion: Keyframe acquisition

Keyframes are added whenever the following conditions are met:

1. Tracking quality is good (where tracking quality is measured by the fraction of feature measurements that were successful)

2. The frame should be separated from the last keyframe by at least 20 frames

3. The camera is a minimum distance away from the nearest keyframe already in the map. According to the authors, this ensures that there is enough baseline for the triangulation to be well conditioned. The minimum distance of course depends on the mean distance of the points from the camera: if the mean depth is small then the minimum distance can be kept small, and vice versa.

Map Expansion: Measurement of existing 3d points in new keyframes

Due to real time constraints the tracking thread might have measured only some of the tracked points in the new frame. The remaining points are re-projected and measured in the mapping thread. 

Map Expansion: Acquisition of new 3d points

The tracking thread has already computed FAST corners on the new keyframe. Non-maximal suppression and thresholding based on Shi-Tomasi scores are used to select the most salient points. Points that are close to existing measurements are discarded. The remaining points are candidates for new 3D map points.

To triangulate the new points, a second view is selected, namely the keyframe closest to the current keyframe. Correspondence is established using an epipolar search: patches around corner points near the epipolar lines are matched using zero-mean SSD. This time, however, no affine warping is performed and patches are searched only in the corresponding levels of the pyramid. If a match is found, the new point is triangulated and inserted into the map.

Bundle Adjustment: Global

Finally, all current 3D points and camera poses are refined using an iterative minimization procedure known as bundle adjustment. When the camera is not exploring, i.e. no new keyframes are being added, a global bundle adjustment is carried out over all existing cameras and 3D points. If $\mu = \{\mu_1, \cdots, \mu_N\}$ denotes the camera poses and $P = \{p_1, \cdots, p_M\}$ denotes the 3D points, with $S_i$ being the set of measurements in frame $i$, then the optimization problem is

\begin{equation}
\min_{\mu, P} \sum_{i=1}^{N} \sum_{j\in S_i} Obj(\frac{|e_{ij}|}{\sigma_{ij}},\sigma_T)
\end{equation}

Observe that this problem is very similar to the previous pose refinement problem, but now the optimization is over the 3D point locations in addition to the camera poses. It is solved using Levenberg-Marquardt (more on this in the next blog post). Interested readers are referred to the Ceres Solver, a C++ library developed by Google for solving such non-linear least squares problems. Moreover, one no longer has to freak out over providing analytical gradients (the Jacobian matrix): the solver uses automatic differentiation to compute gradients up to working precision. Note that automatic differentiation is not numerical differentiation (a.k.a. the finite difference method) and is not crippled by rounding or quantization errors. Even higher order derivatives can be computed precisely using automatic differentiation. It's just magic!

Bundle Adjustment: Local

Global optimization may take tens of seconds as the map size increases, hence when the camera is exploring, local bundle adjustment is done instead. Here only the poses of the most recent keyframe and its 4 closest keyframes are refined, along with the locations of all points visible in these keyframes, using all the measurements of these 3D points ever made. In essence, we just limit the variables being optimized over in the bundle adjustment problem above.

With that I think we've covered most of the important points in this paper. Hopefully the summarization helps readers new to the problem. Enjoy the demo video of PTAM in action below and please refer to the PTAM project page for more details. Your feedback is invaluable, so feel free to comment below or reach me via email.


 

Wednesday, 27 May 2015

Kernel Smoothing

The generic regression problem is to learn a function $f$ which maps an input $X$ to its desired output $Y$. However, in many cases we require $f$ to be smooth, meaning that a continuously varying input should produce a continuously varying output. Kernel smoothing methods are directed towards this subset of regression problems.

Regression can be thought of as computing an expectation of the output conditioned on the input
\begin{equation}
\hat{f}(X) = \mathbb{E}[Y|X]
\end{equation}

An approximation to this expectation is the k-nearest-neighbor average: simply find the $k$ inputs in the training data set closest to the query input and average their corresponding outputs. This estimate is piecewise constant; it stays constant while the set of nearest neighbors remains unchanged but can change abruptly when even one of the $k$ neighbors changes.

Nadaraya and Watson came up with a solution to the problem in 1964 (perhaps independently) and to no surprise their approach is now known as Nadaraya-Watson kernel regression. The idea behind this was simply to do a weighted average where training data points with inputs closer to the query input would have a larger say as to what the output should be. 

This heuristic has been motivated mathematically in Section 6.3.1 of Bishop's "Pattern Recognition and Machine Learning". A sketch of the result is presented below:

We have training input-output pairs $\{x_n,t_n\}$. Using this data, the joint distribution of the input and output variables is modeled by a Parzen density estimator $p(x,t)$, which is a smooth non-parametric density estimation model

\begin{equation}
p(x,t) = \frac{1}{N}\sum_n q(x-x_n,t-t_n)
\end{equation}

where $q$ is a component density function. The Parzen density estimator approximates the density as a sum of component densities, one centered at each training data point.

Now the conditional expectation can be written down in terms of $p(x,t)$, which can in turn be replaced by the Parzen density estimate. An assumption that the component density functions have zero mean, and a quick change of variables later, we get the Nadaraya-Watson model:

\begin{equation}
f(x) = \frac{\sum_n g(x-x_n)t_n}{\sum_m g(x-x_m)}
\end{equation}

where $g(x)=\int_{-\infty}^{\infty} q(x,t)dt$. This can be written more succinctly as

\begin{equation}
f(x) = \sum_n k(x,x_n)t_n
\end{equation}

where $k(x,x_n) = \frac{g(x-x_n)}{\sum_m g(x-x_m)}$.
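A small numpy sketch of this estimator with a Gaussian component density (the bandwidth `h` is a free parameter, and the Gaussian choice is mine; the comparison referenced below uses the Epanechnikov kernel):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, t_train, h=1.0):
    """f(x) = sum_n k(x, x_n) t_n with k(x, x_n) = g(x - x_n) / sum_m g(x - x_m)."""
    g = np.exp(-0.5 * ((x_query - np.asarray(x_train)) / h) ** 2)   # Gaussian component density
    weights = g / g.sum()                                           # normalized kernel weights
    return float(np.dot(weights, t_train))
```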

A visual comparison of k-nearest-neighbor and kernel smoothing using Nadaraya-Watson model with Epanechnikov Kernel is presented in Figure 6.1 in the second edition of "The Elements of Statistical Learning".