**Update: While working on a project I implemented and introduced several modifications to DTAM. For more information please refer to https://arxiv.org/abs/1606.05002. A video for the same submission is available here.**

**While most VSLAM systems rely on robust feature detection and tracking, DTAM is a novel pixel-wise approach that exploits the availability of many low-baseline frames, both to create dense maps and to use them for robust camera pose estimation. The original paper by Richard Newcombe can be found here. The following video shows the depth maps created by DTAM and demonstrates tracking quality superior to PTAM.**

Like PTAM, DTAM has two distinct parts: pose tracking and 3D mapping. But unlike PTAM, which keeps track of only a sparse set of 3D points, DTAM maintains dense depth maps for selected keyframes.

## Dense Mapping

The construction of a depth map for a reference image $r$ from a set of short-baseline frames denoted by $\mathcal{I}(r)$ is posed as a global minimization problem. Central to this optimization is the definition of a *cost volume* $C_r$.

### Cost Volume

The cost volume $C_r$ stores the photometric error of every pixel in the reference frame with respect to the frames $m \in \mathcal{I}(r)$ as a function of inverse depth. In practice, of course, the inverse depth is discretized into $L$ values so that the cost volume can be computed and stored as an $M \times N \times L$ array, where $M$ and $N$ are the number of rows and columns in the image. More specifically, under the brightness constancy assumption, $C_r(u, d)$ measures how different the intensity of a pixel $u$ in image $I_r$ would be from that of its projection into the images in $\mathcal{I}(r)$, assuming that the inverse depth of the pixel in $I_r$ is $d$. This can be written mathematically as

\begin{equation}

C_r(u,d) = \frac{1}{|\mathcal{I}|}\sum_{m \in \mathcal{I}} \|\rho(I_m,u,d) \|_1

\end{equation}

where

\begin{equation}

\rho(I_m,u,d) = I_r(u) - I_m(\pi(KT_{mr}\pi^{-1}(u,d)))

\end{equation}

While the above equation looks terrifying, it's really simple once broken down into pieces. Here goes:

1. Project pixel $u$ with inverse depth $d$ from pixel space to camera space using camera intrinsic matrix $K$

\begin{equation}

\pi^{-1}(u,d) = \frac{1}{d}K^{-1}

\left[

\begin{array}{c}

u_x \\

u_y \\

1

\end{array}

\right]

\end{equation}

2. Transform the 3D point (using rigid body transformation) from the coordinate system of camera $r$ to camera $m$

\begin{equation}

T_{mr}\pi^{-1}(u,d)

\end{equation}

3. Project the 3D point in camera $m$'s coordinate system to its pixel coordinates

\begin{equation}

\pi(KT_{mr}\pi^{-1}(u,d))

\end{equation}

where $\pi(x,y,z) = (x/z, y/z)$ is the dehomogenization operation.
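The three steps above can be sketched in code. The following is a minimal NumPy sketch of my own (not from the paper): it uses nearest-neighbour pixel lookup instead of interpolation and omits bounds checking, and the helper names `backproject`, `project` and `photometric_error` are hypothetical.

```python
import numpy as np

def backproject(u, d, K_inv):
    # Step 1: pi^{-1}(u, d) -- pixel (u_x, u_y) at inverse depth d
    # becomes a 3D point in the reference camera's frame
    return (1.0 / d) * (K_inv @ np.array([u[0], u[1], 1.0]))

def project(K, X):
    # Step 3: pi(K X) -- 3D point to pixel coordinates (dehomogenization)
    x = K @ X
    return x[:2] / x[2]

def photometric_error(I_r, I_m, u, d, K, K_inv, T_mr):
    # rho(I_m, u, d): intensity difference between pixel u in I_r and
    # its reprojection into I_m, assuming inverse depth d at u
    X_r = backproject(u, d, K_inv)
    X_m = (T_mr @ np.append(X_r, 1.0))[:3]      # Step 2: rigid transform r -> m
    u_m = np.round(project(K, X_m)).astype(int)  # nearest-neighbour lookup
    return abs(float(I_r[u[1], u[0]]) - float(I_m[u_m[1], u_m[0]]))
```

Averaging this error over all frames in $\mathcal{I}(r)$, for each of the $L$ inverse depth samples, fills in one column of the cost volume for pixel $u$.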

The *cost volume* is an average of the photometric error over multiple frames, because using just one frame would produce noisy estimates; the figure in the paper (panels **a**, **b** and **c**) shows this effect. Also, pixels with more discriminative patches are better localized.

### Optimization objective function

The objective function for obtaining a dense map is a regularized version of the cost described in the previous section. The regularization is a penalty term that ensures smoothness. Note that while a depth map is smooth in general, there are discontinuities arising at occlusion boundaries, and image gradients are often an important cue for such discontinuities. To handle such cases, the regularization is chosen to be a weighted Huber norm of the inverse depth map gradient. So if the inverse depth map is denoted by $d = \xi(u)$, then the regularization term is

\begin{equation}

g(u)\|\nabla \xi(u)\|_{\epsilon}

\end{equation}

where $\|\cdot\|_{\epsilon}$ is the Huber norm, a robust loss that does not penalize outliers too harshly, thereby reducing their effect on the optimization. The Huber norm is defined by

\begin{equation}

\|x\|_{\epsilon} = \left\{

\begin{array}{ll}

\frac{\|x\|_2^2}{2\epsilon} & \|x\|_2 \leq \epsilon \\

\|x\|_1 - \frac{\epsilon}{2} & \text{otherwise}

\end{array}

\right.

\end{equation}
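As a quick sanity check, here is a direct NumPy transcription of the Huber norm (a sketch; the name `huber_norm` and the default $\epsilon$ are my own choices, and I use the $\ell_2$ norm in both branches, which keeps the function continuous):

```python
import numpy as np

def huber_norm(x, eps=1e-4):
    # ||x||_eps: quadratic near the origin, linear (minus eps/2) beyond eps;
    # both branches equal eps/2 at ||x||_2 = eps, so the function is continuous
    n = np.linalg.norm(x)
    return n * n / (2.0 * eps) if n <= eps else n - eps / 2.0
```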

$g$ is an intensity-gradient-based weight, which adaptively reduces the regularization at intensity edges:

\begin{equation}

g(u) = e^{-\alpha \|\nabla I_r(u)\|_2^{\beta}}

\end{equation}

So the regularized objective function is

\begin{equation}

E(\xi) = \int_u \{ g(u)\|\nabla \xi(u)\|_{\epsilon} + \lambda C_r(u,\xi(u)) \} du

\end{equation}

So now we have a cost function that we need to minimize. As is well known, coming up with an optimization problem is easy; solving it, however, is an entirely different beast.

### Solving the minimization problem

For computation, the inverse depth map is represented by an $MN \times 1$ vector $d$ and the regularization weights by an $MN \times 1$ vector $g$. The first step is to approximate the regularization term by $\|AGd\|_{\epsilon}$, where $A$ is the matrix such that $Ax = \nabla x$ and $G = \operatorname{diag}(g)$. Now our optimization problem looks like

\begin{equation}

d^* = \underset{d}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \lambda \mathbb{1}^T C(d)

\end{equation}
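For concreteness, a dense finite-difference version of $A$ can be built as follows. This is a sketch under my own assumptions (row-major flattening, forward differences, zero gradient at the image border); a real implementation would use a sparse matrix or compute $\nabla$ on the fly.

```python
import numpy as np

def diff_matrix(n):
    # 1D forward-difference matrix with zero gradient at the last sample
    D = np.eye(n, k=1) - np.eye(n)
    D[-1, :] = 0.0
    return D

def gradient_operator(M, N):
    # A (2MN x MN): stacks x- and y-derivatives of a row-major flattened image
    Dx = np.kron(np.eye(M), diff_matrix(N))   # differences along each row
    Dy = np.kron(diff_matrix(M), np.eye(N))   # differences across rows
    return np.vstack([Dx, Dy])
```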

But attacking this optimization directly is insane, because $C(d)$ is just a lookup table: given the inverse depth at every pixel, we look up the corresponding values from the cost volume and arrange them in an $MN \times 1$ vector, so the data term is non-convex and has no useful analytic form. The two terms therefore need to be decoupled somehow.

The decoupling uses 2 tricks.

a) Introduction of an auxiliary variable, or rather a vector of variables, $a$ of size $MN \times 1$

\begin{equation}

d^*, a^* = \underset{d,a}{\operatorname{argmin}} \; \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)

\end{equation}

Here $\theta$ controls how tightly coupled $d$ and $a$ are. $\theta$ is initially kept large but is decreased with every iteration, as we trust the values of $a$ and $d$ more and more.

At this point, a coordinate-descent-style iterative convex optimization is already possible, as follows:

Beginning with an initial value of $d$, for each pixel do a point-wise exhaustive search for $a_i$ over the range of inverse depth values, i.e.

\begin{equation}

\hat{a_i} = \underset{a_i}{\operatorname{argmin}} \frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)

\end{equation}
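This exhaustive search is embarrassingly parallel over pixels. A vectorized NumPy sketch, under my own data layout (`C` holds one row of $L$ cost values per pixel, `depths` the $L$ inverse-depth samples; `pointwise_search` is a hypothetical name):

```python
import numpy as np

def pointwise_search(d, C, depths, theta, lam):
    # For every pixel i, pick a_i minimizing (d_i - a)^2/(2 theta) + lam * C[i, a]
    # by evaluating the objective at all L inverse-depth samples at once
    E = (d[:, None] - depths[None, :]) ** 2 / (2.0 * theta) + lam * C
    return depths[np.argmin(E, axis=1)]
```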

Now using $\hat{a}$ solve the convex optimization problem

\begin{equation}

\hat{d} = \underset{d}{\operatorname{argmin}} \|AGd\|_{\epsilon} + \frac{1}{2\theta}(d-\hat{a})^2

\end{equation}

Note that this problem is convex because Huber norm is convex and so is the quadratic term.

To get the solution, iterate between the above two steps until convergence. However, solving the large convex optimization problem at every iteration is expensive, so the authors use a second trick.

b) Legendre-Fenchel Transform

To understand this we need some background in convex optimization.

A convex function $f(x)$ can be written as

\begin{equation}

f(x) = \underset{y}{\operatorname{sup}} x^Ty - f^*(y)

\end{equation}

where $f^*$ is the convex conjugate of $f$. The convex conjugate is given by the exact same equation above but with $f$ and $f^*$ exchanged; in fact $f^{**} = f$ for a convex function. Now let us compute the convex conjugate of the Huber norm $H(x)$.

\begin{equation}

H^*(x) = \underset{y}{\operatorname{sup}} x^Ty - H(y)

\end{equation}

Taking the derivative and setting it to $0$ gives us

\begin{align}

& x - \mathbb{I}(\|y\|_2\leq \epsilon) \frac{y}{\epsilon} - \mathbb{I}(\|y\|_2 > \epsilon) sign(y) = 0 \\

\implies & y = x\epsilon \; \text{, when } \|x\|_2\leq 1 \\

\implies & H^*(x) = \frac{\epsilon\|x\|_2^2}{2} \; \text{, when } \|x\|_2\leq 1

\end{align}

Note that for $\|x\|_2>1$, drawing an analogy with the one-dimensional case where $x,y\in \mathbb{R}$, the derivative stays positive, making $x^Ty - H(y)$ a monotonically increasing function of $y$, and thus

\begin{equation}

H^*(x) = \infty \; \text{, when } \|x\|_2 > 1

\end{equation}

This can be written more concisely as $H^*(x) = \epsilon\|x\|^2/2 + \delta(x)$, where $\delta(x) = \infty$ when $\|x\|_2>1$ and $0$ otherwise. Using the Legendre-Fenchel Transform (LFT) we can then write

\begin{equation}

H(x) = \underset{y}{\operatorname{max}} x^Ty - \epsilon\|y\|^2/2 - \delta(y)

\end{equation}
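As a one-dimensional sanity check of this identity, take scalars $x, y \in \mathbb{R}$ and maximize $xy - \epsilon y^2/2$ over $|y| \leq 1$; the unconstrained maximizer is $y = x/\epsilon$:

\begin{align}
|x| \leq \epsilon: \quad & y = x/\epsilon \text{ is feasible, giving } \; x\cdot\frac{x}{\epsilon} - \frac{\epsilon}{2}\cdot\frac{x^2}{\epsilon^2} = \frac{x^2}{2\epsilon} \\
|x| > \epsilon: \quad & \text{the maximum is at the boundary } y = \operatorname{sign}(x), \text{ giving } \; |x| - \frac{\epsilon}{2}
\end{align}

which are exactly the two branches of the Huber norm.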

Equipped with this information we can replace $\|AGd\|_\epsilon$ in our energy function with its LFT

\begin{equation}

\underset{q}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2 - \delta(q)\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)

\end{equation}

and our final optimization problem can be written as

\begin{equation}

\underset{d,a}{\operatorname{min}} \; \underset{\|q\|_2 \leq 1}{\operatorname{max}} \; \{(AGd)^Tq - \epsilon\|q\|^2/2\} + \frac{1}{2\theta}(d-a)^2 + \lambda \mathbb{1}^T C(a)

\end{equation}

But how does this math wizardry help us? Well, now, given $\hat{a}$, we can do gradient descent over $d$ and gradient ascent over $q$, because we are minimizing with respect to $d$ and maximizing with respect to $q$ simultaneously. For more details on the derivatives, refer to Section 2.2.3 of the paper. Given $\hat{d}$ and $\hat{q}$, $a$ is obtained as before using a point-wise search. As if all this weren't enough, the authors of DTAM go on to describe a way to speed up the point-wise search, described in the following section.
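A heavily simplified sketch of one such descent/ascent update, under my own assumptions (my own step sizes, $q$ flattened to one scalar per row of $AG$, and an element-wise projection; the paper's Section 2.2.3 uses semi-implicit updates with a per-pixel 2-vector $q$):

```python
import numpy as np

def primal_dual_step(d, q, a, AG, theta, eps, sigma_q, sigma_d):
    # Gradient ascent in the dual variable q, damped by the eps*||q||^2/2
    # term, then projection onto ||q|| <= 1 (element-wise, for simplicity)
    q = (q + sigma_q * (AG @ d)) / (1.0 + sigma_q * eps)
    q = q / np.maximum(1.0, np.abs(q))
    # Gradient descent in the primal variable d of (AGd)^T q + (d-a)^2/(2 theta)
    d = d - sigma_d * (AG.T @ q + (d - a) / theta)
    return d, q
```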

*Accelerating Point-wise Search*

From the cost volume it is trivial to obtain, for every pixel $u$, the bounds on the data cost, $C_u^{min}=C_u(d^{min})$ and $C_u^{max}=C_u(d^{max})$. Now consider $\underset{a_i}{\operatorname{min}}\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C(u_i , a_i)$: its optimal value can be at most $\lambda C_u^{max}$, since setting $a_i=d_i$ already achieves no more than that. Using this fact we can bound the search range of $a_i$ about $d_i$ through the following constraint

\begin{align}

\frac{1}{2\theta}(d_i - a_i)^2 + \lambda C_u^{min} &\leq \lambda C_u^{max} \\

\implies a_i &\in \left[ d_i-\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})}, d_i+\sqrt{2\theta \lambda(C_u^{max}-C_u^{min})} \right]

\end{align}

Note that since $\theta$ is decreased with every iteration, this bound becomes narrower and narrower allowing accelerated search for optimal $a$. The following image shows this phenomenon

*The white band denotes the search region.*
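In code the bound is one line per pixel. A NumPy sketch (same hypothetical `C` layout as before: one row of $L$ cost values per pixel):

```python
import numpy as np

def search_bounds(d, C, theta, lam):
    # Per-pixel search radius: a_i must lie within r_i of d_i, where
    # r_i = sqrt(2 theta lam (C_max - C_min)); shrinks as theta decreases
    r = np.sqrt(2.0 * theta * lam * (C.max(axis=1) - C.min(axis=1)))
    return d - r, d + r
```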

## Tracking

Tracking is done in two stages. The first stage is a rotation-only odometry estimate, assuming no translation, using the method of "S. J. Lovegrove and A. J. Davison. Real-time spherical mosaicing using whole image alignment. ECCV 2010", which is resilient to motion blur. The second stage is a full $6DOF$ 2.5D alignment. The two pose estimates are composed to get the final pose.

I will only discuss the second stage, because it is different from the way I would have approached it. I would have simply back-projected the inverse depth map to a point cloud and solved for the rotation and translation using the Iterative Closest Point technique in 3D space. Their way, however, seems more robust, as it doesn't rely on nearest-neighbor point correspondences.

The DTAM approach is as follows. The rotational odometry gives us a pose $T_{wv}$. The dense model is projected into this pose to give us an image $I_v$ and an inverse depth map $\xi_v$. If the current frame is denoted by $l$, then we need to estimate $T_{vl}$ so that $T_{wv}T_{vl}$ gives us the pose of the current camera. Let $T_{lv}(\psi) = T_{vl}(\psi)^{-1}$ denote a parametrized representation of the $6DOF$ camera pose. Then we want to minimize the pixel-wise photometric error

\begin{equation}

F(\psi) = \frac{1}{2} \sum_u (f_u(\psi))^2 = \frac{1}{2}\|f\|^2

\end{equation}

where,

\begin{equation}

f_u(\psi) = I_l(\pi(KT_{lv}(\psi)\pi^{-1}(u,\xi_v(u)))) - I_v(u)

\end{equation}

The above equation can be interpreted in exactly the same way as the photometric error used to define the cost volume above. To solve this, $f_u$ is first approximated by $\hat{f}_u(\psi)$, its first-order Taylor expansion about $0$, i.e. $\hat{f}_u(\psi) = f_u(0) + \nabla f_u(0) \psi$. Now, to minimize $F$ we require its gradient to be $0$. This implies

\begin{align}

&\nabla F = 0 \\

\implies &\nabla \hat{f}(\psi)^{T}\hat{f}(\psi) = 0 \\

\implies &\nabla f(0)^T(f(0) + \nabla f(0)\psi) = 0

\end{align}

One solution to the last equation is obtained by setting $f(0) + \nabla f(0)\psi = 0$. This system of linear equations can be solved to get $\psi$. The $\psi$ is then used to update the virtual camera pose $T_{wv}$, and the process is iterated until convergence.
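This is a Gauss-Newton-style normal-equations solve. A minimal sketch, assuming the per-pixel residuals $f(0)$ and the Jacobian $\nabla f(0)$ have already been stacked into arrays (`tracking_step` is a hypothetical name):

```python
import numpy as np

def tracking_step(J, f0):
    # Solve f(0) + J psi = 0 in the least-squares sense,
    # i.e. (J^T J) psi = -J^T f(0)
    psi, *_ = np.linalg.lstsq(J, -f0, rcond=None)
    return psi
```

Each iteration re-renders the model at the updated pose, recomputes $f(0)$ and $J$, and solves again.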

This completes our technical discussion of DTAM.

## Conclusion

My first thought when I read the DTAM paper was that it would be incredibly complex to implement and get right, let alone to get running in real time. But at the same time it is a very novel approach to the problems of dense depth map construction and camera pose estimation using dense per-pixel photometric error. I have tried my best to understand, simplify and explain the techniques used in the paper, while also at times discussing alternative methods and why they are less preferable to the ones used in DTAM.

This blog took me many evenings to complete but I am glad I saw it through from start to finish :)

**Comment (Deepak):** Tanmay, how does it work under varying lighting conditions?

**Reply:** It is pretty robust under varying lighting conditions, since the photometric error estimate is obtained by averaging over multiple nearby frames, as can be seen from the image following the video. Specularity at low-texture regions is a bit of a problem, which can be mitigated somewhat by using a patch-based photometric error rather than a pixel-based one. Recently I have made several modifications and some simplifications to this algorithm; the details and code will be released soon.
