Perception and Learning in Machines: PTAM: Parallel Tracking and Mapping

SLAM or Simultaneous Localization and Mapping refers to a range of techniques that are used for localizing the position of a moving object in a 3D world coordinate system while simultaneously generating a 3D map of its environment based on real-time data gathered from various sensors placed on the object. Sensors used in this application could include Inertial Measurement Unit (gyroscope, accelerometer, magnetometer etc often used in mobile phones and aircrafts), range scanners such as LIDAR (which is used in Google's Self Driving Car) and SONAR (which is used in submarines). However, with advancements in Computer Vision, images taken from a camera mounted on the moving object have been shown to be good enough for the task. Not surprisingly these techniques are studied under the banner of Visual SLAM. Unlike Structure From Motion (SFM) which is also a method for computing orientation and location of cameras and position of sparse feature points from multiple images of the scene, SLAM assumes a spatio-temporal continuity In places where high precision is a necessity such as the landing system installed on Nasa's Mars Rover, multiple sensors need to be used together giving rise to the domain of Sensor Fusion. The following video is an example of a Visual SLAM system

PTAM is a real-time Visual SLAM system that was published by George Klein and David Murray in 2006. As the name suggests the system splits the two components of VSLAM, tracking points across images and mapping, i.e. adding novel points to the 3D map, into 2 separate parallel threads on a dual core computer. The reasons for doing this are multiple:

1. Mapping involves triangulating and localizing new points and camera and refining already triangulated points and localized cameras. This is a computationally expensive process, known as bundle adjustment (BA) and can not be done for every video frame. Tracking on the other hand by definition is knowing where a point is located on each of the frames and where the camera is located and how it is oriented in each of the images.

2. The time that is saved on the tracking thread, by not having to carry the burden of bundle adjustment can be used to do a more exhaustive search for matching points across frames which is crucial for accuracy.

3. Often consecutive video frames carry redundant information due to high frame rates. Since tracking is no longer dependent on mapping, the mapping thread could choose to run BA on selected key-frames saving some time to interleave global BA with local BA steps. This helps prevent the problem of drift that occurs when BA is only performed locally.

Map and track representation
The map is a collection of $M$ 3D points in the world coordinate system $\mathcal{W}$. Each point represents a locally planar textured patch. The point coordinates of the $j^{th}$ point is represented as $p_{j \mathcal{W}} = \left(x_{j \mathcal{W}}, y_{j \mathcal{W}}, z_{j \mathcal{W}}, 1 \right)$. Each point also has a unit normal and reference to the source pixel.

The map also consists of $N$ keyframes, each being represented as a 4-level gaussian pyramid. Each keyframe has an associated camera centered coordinate system denoted by $K_i$ and transformation between this coordinate system and the world is given by $E_{K_i \mathcal W}$. So the reference to the source pixel comprises of the index of the image in which the point first appears, the level of the pyramid and the coordinates of the point.

Tracking Thread
When a new frame is acquired, its pose is predicted using a decaying velocity motion model. The authors don't clarify the details of the motion model but say that it is similar to alpha-beta constant velocity model but without any new measurements. The alpha beta constant velocity model is given by

\begin{align}
\hat{x}_t^- &= x_{t-1} + v \Delta T \\
\hat{v}_t^- &= v_{t-1} \\
\hat{x}_t^+ &= \hat{x}_t^- + \alpha (x_t - \hat{x}_t^-) \\
\hat{v}_t^+ &= \hat{v}_t^- + \beta (x_t - \hat{x}_t^-) / \Delta T
\end{align}

The first 2 equations are equations of kinematics without acceleration and the last 2 equations incorporate the measurement $x_t$ to make correction. Since we don't have any measurement of the camera pose, we can just set it to $0$. This is perhaps what is meant by the decaying velocity model but I might be wrong.

Once the pose is predicted, the current 3D map can be projected onto the new image plane (the camera intrinsics are ass) and the neighborhood of the projected location is searched for patch correspondence. First only a few points are projected and the camera pose is updated from these matches. Then a larger number of points are projected and searched and pose is refined again.

Tricks Worth Noting: Patch Search
The paper describes some rather nifty tricks for patch search. Some salient features were:

1. An 8x8 patch in the source frame is searched in the target frame around a neighborhood of the predicted projection of the point.

2. The template patch is first affine warped to accommodate view point changes. The affine warp is given by

\begin{align}
A =
\left[
\begin{array}{cc}
\frac{\partial{u_t}}{\partial{u_s}} & \frac{\partial{u_t}}{\partial{v_s}} \\
\frac{\partial{v_t}}{\partial{u_s}} & \frac{\partial{v_t}}{\partial{v_s}}
\end{array}
\right]
\end{align}

Where $(u_x,v_s)$ are the coordinates of source pixel and $(u_t,v_t)$ are coordinates of the target pixel. The matrix is computed by measuring the change in the target pixel location by a unit change in the source pixel location.

3. The patch is searched in the pyramid level whose scale matches that of the original patch. This choice of level $l$ is made such that $\frac{|A|}{4^l}$ is closes to unity. The idea is that $|A|$ is a measure of the area occupied by the source pixel in the target image. So if in the highest resolution image the pixel area is measured as 4 then in the 1st level the pixel's are would be one, i.e, comparable to the size of source pixel. Note that the template patch is selected from its designated source level.

4. Instead of searching an all points in the neighborhood of the predicted projection, SSD scores are calculated only for FAST corners (generated without non-maximum suppression) lying inside a circular region around the point.

Pose Update
For updating the pose, the following robust objective function of the reprojection error $e_j$ is minimized with respect to the new camera's pose

\begin{equation}
\min_{\mu} \sum_{j \in S} Obj(\frac{|e_j|}{\sigma_j},\sigma_T)
\end{equation}
where $\mu$ is the current camera's pose which is being optimized over, $S$ is the set of patches matched in the previous step, $\sigma_j$ is the assumed measurement noise and $Obj(r,\sigma_T)$ is the Tukey Biweight objective function as defined here. The camera pose relative to the source image's camera is represented by 6 parameters - 3 for translation and 3 for rotation matrix expressed in the exponential map form. The exponential map is a mapping from the axis-angle representation (Lie Algebra so(3)) of a 3D rotation to a rotation matrix (Lie Group SO(3)). Rodrigues' formula is a compact expression to compute this mapping and is given by

\begin{equation}
R = exp(\zeta) = I + sin \theta K + (1-cos \theta) K^2
\end{equation}
with $\zeta = \theta \mathbf{v} $ where $\theta $ is the angle of rotation about the axis represented by unit vector $\mathbf{v}$ and $K = [\mathbf{v}]_\times$ is the cross product matrix for vector $\mathbf{v}$.

But do we really need such a complicated parametrization? Yes! This is because the above objective function is minimized by iterative techniques such as gradient descent. This means that the gradients would be added to the pose parameters. This would have caused a problem with the usual 9-parameter representation of the rotation matrix with its elements because addition is not a valid operation in the vector space of SO(3). Said simply adding two rotation matrices (or a rotation and a non-rotation matrix) do not necessarily yield a rotation matrix. But this is clearly true in the vector space of Lie Algebra.

Useful Information: Derivative of Rodrigues Formula
The last question for pose update is how do you calculate the gradient of the complicated objective function? No need to freak out! A closed form solution of the derivative of the rodrigues formula is provided in equation 9 here and the derivative of the objective function can be obtained by applying chain rule.

Mapping Thread
The mapping thread first initializes the map using 2 view stereo where the views are selected with user cooperation. Then the map is refined and expanded whenever a new keyframe is added.

Steps for initialization:

1. User places the camera above the scene and presses a key. This selects the first keyframe. A 1000 FAST corners are computed on this frame. These points are tracked (perhaps using KLT tracker, since the tracking thread assumes an initial map and hence can not be used) until the acquisition of the second keyframe.

2. User then smoothly moves the camera to a slightly offset position and presses a key again to select the second keyframe.

3. Now that we have two keyframes and correspondences between them, an Essential Matrix can be computed using the 5 point algorithm and RANSAC. The essential matrix can be decomposed to get a rotation matrix and translation vector for the second keyframe with respect to the first. This pose estimate can be used to triangulate the points. The authors don't mention whether the triangulation is done linearly or non-linearly.

4. Since the translation vector and triangulated point positions obtained in the previous step are arbitrary upto a scaling factor, they choose the scaling factor by assuming that camera translated 10cm between the stereo pair. Also the map is oriented so that the dominant detected plane lies at $z=0$.

Map Expansion: Keyframe acquisition

Keyframes are added whenever the following condition are met:

1. Tracking quality is good (where tracking quality is measured by fraction of feature measurement that were successful)

2. The frame should be separated from the last keyframe by at least 20 frames

3. The camera is a minimum distance away from the nearest keypoint already in the map. According to this ensures that there is enough baseline to make the system well conditioned for triangulation. The minimum distance of course depends on the mean distance of the points from the camera. If the mean distance is small then the minimum distance could be kept small and vice versa.

Map Expansion: Measurement of existing 3d points in new keyframes

Due to real time constraints the tracking thread might have measured only some of the tracked points in the new frame. The remaining points are re-projected and measured in the mapping thread.

Map Expansion: Acquisition of new 3d points

Tracking thread has already computed FAST corners on the new keyframe. Non-maximal Suppression and thresholding based on Shi-Tomasi scores are used to select most salient points. The points that are close to existing measurements are discarded. The remaining points are candidates for new 3d map points.

To triangulate the new points, a second view is selected which is keyframe closest to the current keyframe. Correspondence is established using epipolar search. Patches around corner points near epipolar lines are matched using zero-mean SSD. This time however no affine warping is performed and patches are searched only in corresponding levels of pyramid. If match is found the new point is triangulated and inserted into the map.

Bundle Adjustment: Global

Finally, all current 3d points and camera poses are refined using an iterative minimization procedure known as bundle adjustment. When the camera is not exploring, i.e, now new keyframes are added, a global bundle adjustment is carried out with all existing cameras and 3d points. If $\mu = \{\mu_1, \cdots, \mu_N\}$ denotes camera poses and $P = \{p_1, \cdots, p_M\}$ denotes 3d points with $S_i$ being the set of measurements in frame $i$ then the optimization problem is

\begin{equation}
\min_{\mu, P} \sum_{i=1}^{N} \sum_{j\in S_i} Obj(\frac{|e_ij|}{\sigma_ij},\sigma_T)
\end{equation}

Observe, that this problem is very similar to the previous pose refinement problem but now the optimization also over camera poses besides 3d point locations. This is solved using Levenberg Marquardt (more on this in the next blog post). Interested readers are referred to Ceres Solver which is a C++ library developed by Google for solving such Non-linear Least Square optimization problems. Moreover, one no longer has to freak out over providing analytical gradients (Jacobian matrix). The solver uses Automatic Differentiation to compute gradients up to working precision. Note that Auto Differentiation is not Numerical Differentiation a.k.a Finite Difference Method and is not crippled by rounding or quantization errors. Even higher order derivatives can be computed precisely using Auto Differentiation. Its just magic!

Bundle Adjustment: Local

Global optimization may take up to 10s of seconds as the map size increases, hence when the camera is exploring local bundle adjustment is done. In this the poses of only the most recent keyframe and its 4 closest keyframes are refined along with the location of all points visible in these keyframes using all the measurements of these 3d points ever made. In essence, just limit the variables being optimized over in the bundle adjustment problem above.

With that I think we've covered most of the important points in this paper. Hopefully the summarization helps readers new to the problem. Enjoy the demo video of PTAM in action below and please refer to the PTAM project page for more details. Your feedback is invaluable, so feel free to comment below or reach me via email.

Thursday, 18 June 2015

PTAM: Parallel Tracking and Mapping