
Finding 3D pose with a monocular camera


As winner of the EMVA Young Professional Award 2014, presented at the recent EMVA business conference, Jakob Engel, a PhD student in the Computer Vision Group at the Technical University of Munich, Germany, discusses the potential 3D imaging applications of his novel approach to real-time visual odometry using a monocular camera

The Computer Vision Group of the Technical University of Munich has recently developed a novel, direct method to reconstruct the 3D environment from the video of a commodity hand-held camera, while at the same time tracking its exact position in real time. Commonly referred to as monocular SLAM (Simultaneous Localisation and Mapping), such methods are widely used in robotics and autonomous driving, and as the basis for virtual and augmented reality applications.

While multi-camera setups or active sensors such as structured light or time-of-flight cameras simplify the problem, compared to ordinary monocular cameras they are larger, more expensive and require more power, all of which are important criteria for commercialisation. In addition, stereo setups and active sensors have a very limited range at which they can provide reliable information – determined by the baseline of the sensor, for example – or do not work in direct sunlight. Monocular cameras, on the other hand, are scale-independent and fully passive, which allows them to operate in all environments and at very different scales.

While all existing monocular SLAM algorithms are based on keypoints, the proposed method is a direct approach: instead of abstracting images to keypoint observations, the method maps and tracks directly on image intensities (see Fig. 2). This has the fundamental advantage that all information in the images can be used, including edges, instead of relying only on image corners (keypoints). Especially in man-made environments, where there is often very little texture, this leads to denser and more detailed 3D reconstructions, as well as more accurate and robust camera tracking.

Figure 2: Keypoint-based methods abstract images to feature observations and discard all other information. In contrast, the semi-dense direct approach maps and tracks directly on image intensities: this means, firstly, that all information, including edges, is used and, secondly, that rich, semi-dense information about the scene geometry is obtained directly.
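To make the idea of a semi-dense map concrete, the minimal Python/NumPy sketch below selects the pixels a direct method can actually exploit: those with a sufficiently large intensity gradient. The function name and threshold value are illustrative, not taken from the original system.

```python
import numpy as np

def semi_dense_mask(image, grad_threshold=12.0):
    """Select pixels with a sufficiently large intensity gradient.

    Only these pixels carry usable depth information for a direct
    method; textureless regions are skipped, which is what makes
    the resulting depth map 'semi-dense' rather than dense.
    """
    # Finite-difference gradients along rows (y) and columns (x)
    gy, gx = np.gradient(image.astype(np.float32))
    grad_mag = np.sqrt(gx**2 + gy**2)
    return grad_mag > grad_threshold
```

In a typical indoor image, such a mask keeps edges and textured regions while discarding blank walls, which is precisely where keypoint-based methods also struggle to find corners.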

For the autonomous navigation of robots or cars in unknown terrain, a fundamental requirement is knowledge of the vehicle’s current position and the positions of potential obstacles. Like humans and most animals, robots can use vision as a primary sensor to acquire this information. This SLAM technique has potential uses in the navigation of unmanned micro-aerial vehicles, where the size and power consumption of the sensor are subject to severe limitations. Deployed as a swarm, nano-quadrotors, which fit in the palm of a hand and weigh less than 25 grams, can be equipped with a nano-camera to navigate autonomously using this technique.

Another example of where this technique could be used is in virtual or augmented reality on smartphones. With cameras present in every modern smartphone, tablet and many wearable devices, virtual and augmented reality applications are becoming more and more prominent. A fundamental requirement for many such applications is exact real-time estimation of the pose of the device, as well as of the 3D structure of the environment, which is exactly what this technique provides.

The method is based on estimating and maintaining a semi-dense depth map (containing per-pixel depth) by continuous propagation and probabilistic fusion of pixel-wise stereo comparisons with previous frames. New frames are then tracked using direct image alignment: using the estimated semi-dense depth map, the camera pose is estimated by direct minimisation of the photometric error (intensity differences) between the new frame and the reference frame. To maintain real-time performance, even on a smartphone, only image pixels with a sufficiently large gradient are used, as depth can only be estimated for these pixels.
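The sketch below illustrates the photometric error being minimised, under several simplifications that are mine rather than the original system’s: it uses plain depth instead of inverse depth, a nearest-neighbour intensity lookup instead of sub-pixel interpolation, and an unweighted sum of squares instead of a robust norm. All names are hypothetical.

```python
import numpy as np

def photometric_error(I_ref, I_new, mask, depth, K, R, t):
    """Photometric cost of a candidate camera motion (R, t).

    Each selected reference pixel is back-projected using its
    estimated depth, transformed into the new camera frame,
    re-projected, and the intensity difference is accumulated.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    vs, us = np.nonzero(mask)                # selected pixel coordinates
    z = depth[vs, us]
    # Back-project selected pixels to 3D points in the reference frame
    pts = np.stack([(us - cx) / fx * z, (vs - cy) / fy * z, z], axis=1)
    # Rigid-body transform into the new camera frame
    pts = pts @ R.T + t
    # Discard points that end up behind the camera
    front = pts[:, 2] > 0
    pts, vs, us = pts[front], vs[front], us[front]
    # Project into the new image (nearest-neighbour lookup for brevity)
    u2 = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v2 = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    h, w = I_new.shape
    ok = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    r = (I_ref[vs[ok], us[ok]].astype(np.float32)
         - I_new[v2[ok], u2[ok]].astype(np.float32))
    return np.sum(r ** 2)
```

Tracking then amounts to searching for the (R, t) that minimises this cost, which in practice is done with Gauss-Newton iterations in a coarse-to-fine image pyramid.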

Combined with a scale-drift-aware pose-graph framework, large 3D scenes can be reconstructed accurately with only an ordinary hand-held monocular camera or a mobile phone camera.
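Because a monocular camera cannot observe absolute scale, the scale of the reconstruction can drift over long trajectories; a scale-drift-aware pose graph therefore operates on similarity transforms (rotation, translation and scale) rather than rigid-body motions. A toy sketch of such transforms, with illustrative function names of my own:

```python
import numpy as np

def sim3_compose(a, b):
    """Compose two similarity transforms, each given as (s, R, t),
    acting on points as x -> s * R @ x + t."""
    sa, Ra, ta = a
    sb, Rb, tb = b
    return (sa * sb, Ra @ Rb, sa * (Ra @ tb) + ta)

def sim3_inverse(a):
    """Invert a similarity transform (s, R, t)."""
    s, R, t = a
    return (1.0 / s, R.T, -(R.T @ t) / s)

def edge_error(pose_i, pose_j, measured_ij):
    """Discrepancy between a measured keyframe-to-keyframe similarity
    and the one implied by the current pose estimates. Minimising
    these errors over the whole graph corrects accumulated scale
    drift as well as positional drift when a loop is closed."""
    predicted_ij = sim3_compose(sim3_inverse(pose_i), pose_j)
    s, R, t = sim3_compose(sim3_inverse(measured_ij), predicted_ij)
    rot = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    return np.linalg.norm(t), rot, abs(np.log(s))  # trans, rot, log-scale
```

When a loop closure adds an edge between two distant keyframes, the extra degree of freedom in scale lets the optimiser distribute the accumulated drift over the whole trajectory.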

More information as well as videos can be found at: http://vision.in.tum.de/research/semidense.
