
Finding 3D pose with a monocular camera

As winner of the EMVA Young Professional Award 2014, presented at the recent EMVA business conference, Jakob Engel, a PhD student in the Computer Vision Group at the Technical University of Munich, Germany, describes the potential 3D imaging applications of his novel approach to real-time visual odometry using a monocular camera.

The Computer Vision Group of the Technical University of Munich has recently developed a novel, direct method to reconstruct the 3D environment from the video of a commodity hand-held camera, while at the same time tracking the camera's exact position in real time. Commonly referred to as monocular SLAM (Simultaneous Localisation and Mapping), such methods are widely used in robotics and autonomous driving, and as the basis for virtual and augmented reality applications.

While multi-camera setups or active sensors such as structured light or time-of-flight cameras simplify the problem, they are larger and more expensive than ordinary monocular cameras and require more power, all of which are important criteria for commercialisation. In addition, stereo setups and active sensors can provide reliable information only over a very limited range – determined by the baseline of the sensor, for example – or do not work in direct sunlight. Monocular cameras, on the other hand, are scale-independent and fully passive, which allows them to operate in all environments at very different scales.
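The link between baseline and usable range follows from the standard relation for a rectified stereo pair. As a brief sketch, with focal length f (in pixels), baseline b, disparity d and depth z (symbols introduced here for illustration, not taken from the article):

```latex
z = \frac{f\,b}{d}, \qquad |\delta z| \approx \frac{z^{2}}{f\,b}\,|\delta d|
```

For a fixed disparity error, the depth error grows with the square of the distance and shrinks as the baseline increases, which is why a small stereo rig is only reliable at close range.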

While all existing monocular SLAM algorithms are based on keypoints, the proposed method is a direct approach: instead of abstracting images to keypoint observations, the method maps and tracks directly on image intensities (see Fig. 2). This has the fundamental advantage that all information in the images can be used, including edges, for example, rather than relying only on image corners (keypoints). Especially in man-made environments, where there is often very little texture, this leads to denser and more detailed 3D reconstructions, as well as more accurate and robust camera tracking.



Figure 2: Keypoint-based methods abstract images to feature observations, and discard all other information. In contrast, the semi-dense direct approach maps and tracks directly on image intensities: this means, firstly, that all information, including edges, is used, and secondly, that rich, semi-dense information about the geometry of the scene is directly obtained.

For autonomous navigation of robots or cars in unknown terrain, a fundamental requirement is knowledge of the robot’s current position and the position of potential obstacles. Like humans and most animals, robots can use vision as a primary sensor to acquire this information. This SLAM technique has potential uses in the navigation of unmanned micro-aerial vehicles, where the size and power consumption of the sensor are subject to severe limitations. Deployed as a swarm, nano-quadrotors, which fit in the palm of a hand and weigh less than 25 grams, can be equipped with a nano-camera to navigate autonomously using this technique.

Another area where this technique could be used is virtual or augmented reality on smartphones. With cameras present in every modern smartphone, tablet and many wearable devices, virtual and augmented reality applications are becoming more and more prominent. A fundamental requirement for many such applications is exact real-time estimation of the pose of the device, as well as the 3D structure of the environment, which is what this technique is particularly good at.

The method is based on estimating and maintaining a semi-dense depth map (containing per-pixel depth) by continuous propagation and probabilistic fusion of pixel-wise stereo comparisons to previous frames. New frames are then tracked using direct image alignment: using the estimated semi-dense depth map, the camera pose is estimated by direct minimisation of the photometric error (intensity differences) between the two frames. To maintain real-time performance, even on a smartphone, only image pixels with a sufficiently large gradient are used, as the depth can only be estimated for these pixels.
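The two ingredients described above, selecting high-gradient pixels and scoring a candidate camera pose by its photometric error, can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the pinhole intrinsics matrix K, the gradient threshold and the nearest-neighbour lookup are all assumptions made for illustration, and the sketch only evaluates the error for a given pose rather than minimising it.

```python
import numpy as np

def high_gradient_mask(image, threshold):
    """Boolean mask of pixels whose intensity gradient magnitude exceeds threshold."""
    gy, gx = np.gradient(image.astype(np.float64))  # derivatives along rows, columns
    return np.hypot(gx, gy) > threshold

def photometric_error(ref_image, new_image, depth, mask, K, R, t):
    """Mean squared intensity difference between selected reference pixels and
    the pixels they map to in the new frame under the candidate pose (R, t).

    Assumes a pinhole camera with intrinsics K (hypothetical parameter) and
    uses nearest-neighbour lookup instead of sub-pixel interpolation.
    """
    v, u = np.nonzero(mask)                         # rows (v) and columns (u)
    z = depth[v, u]
    # Back-project the selected pixels to 3D points in the reference frame.
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)
    points = (np.linalg.inv(K) @ pix) * z
    # Move the points into the new camera frame; keep those in front of it.
    moved = R @ points + t.reshape(3, 1)
    front = moved[2] > 1e-9
    moved, u, v = moved[:, front], u[front], v[front]
    # Project with the pinhole model and round to the nearest pixel.
    proj = K @ moved
    ui = np.rint(proj[0] / proj[2]).astype(int)
    vi = np.rint(proj[1] / proj[2]).astype(int)
    h, w = new_image.shape
    inside = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    if not inside.any():
        return float("inf")                         # no overlap between frames
    residuals = (new_image[vi[inside], ui[inside]].astype(np.float64)
                 - ref_image[v[inside], u[inside]])
    return float(np.mean(residuals ** 2))
```

A tracker built on this idea would search for the pose (R, t) that makes the returned error as small as possible, typically with an iterative least-squares solver rather than by evaluating poses one at a time.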

Combined with a scale-drift aware pose-graph framework, the method can reconstruct large 3D scenes accurately using only an ordinary hand-held monocular camera or a mobile phone camera.
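Scale drift arises because a single camera cannot observe absolute scale, so the estimated scale can change slowly along the trajectory. One common way to make a pose graph aware of this, given here as an illustrative assumption rather than as the formulation stated in this article, is to let each constraint between keyframes be a similarity transform with rotation R, translation t and scale factor s (all symbols introduced here):

```latex
\mathbf{S} = \begin{pmatrix} s\,\mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{pmatrix} \in \mathrm{Sim}(3),
\qquad \mathbf{R} \in \mathrm{SO}(3),\quad \mathbf{t} \in \mathbb{R}^{3},\quad s > 0
```

With s fixed to 1 this reduces to an ordinary rigid-body pose; leaving s free allows accumulated scale error to be corrected when the camera revisits a previously mapped area.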

More information as well as videos can be found at:

