Finding 3D pose with a monocular camera


As winner of the EMVA Young Professional Award 2014, presented at the recent EMVA business conference, Jakob Engel, a PhD student in the Computer Vision Group at the Technical University of Munich, Germany, describes the potential 3D imaging applications of his novel approach to real-time visual odometry using a monocular camera

The Computer Vision Group at the Technical University of Munich has recently developed a novel, direct method to reconstruct the 3D environment from the video of a commodity hand-held camera, while simultaneously tracking the camera's exact position in real time. Commonly referred to as monocular SLAM (simultaneous localisation and mapping), such methods are widely used in robotics and autonomous driving, and as the basis for virtual and augmented reality applications.

While multi-camera setups or active sensors such as structured light or time-of-flight cameras simplify the problem, they are larger, more expensive and require more power than ordinary monocular cameras – all important criteria for commercialisation. In addition, stereo setups and active sensors have a very limited range at which they can provide reliable information – determined, for example, by the baseline of the sensor – or do not work in direct sunlight. Monocular cameras, on the other hand, are scale-independent and fully passive, which allows them to operate in all environments and at very different scales.

While existing monocular SLAM algorithms are almost exclusively based on keypoints, the proposed method is a direct approach: instead of abstracting images to keypoint observations, it maps and tracks directly on image intensities (see Figure 2). This has the fundamental advantage that all the information in the images can be used – including edges, for example – rather than relying only on image corners (keypoints). Especially in man-made environments, where there is often very little texture, this leads to denser and more detailed 3D reconstructions, as well as more accurate and robust camera tracking.

Figure 2: Keypoint-based methods abstract images to feature observations and discard all other information. In contrast, the semi-dense direct approach maps and tracks directly on image intensities: this means, firstly, that all information, including edges, is used, and secondly, that rich, semi-dense information about the geometry of the scene is obtained directly.

Take the autonomous navigation of robots or cars in unknown terrain: a fundamental requirement is knowledge of the vehicle's current position and of the positions of potential obstacles. Like humans and most animals, robots can use vision as a primary sensor to acquire this information. The SLAM technique has potential uses in the navigation of unmanned micro-aerial vehicles, where the size and power consumption of the sensor are subject to severe limitations. Deployed as a swarm, nano-quadrotors, which fit in the palm of a hand and weigh less than 25 grams, can be equipped with a nano-camera and navigate autonomously using this technique.

Another area where this technique could be used is virtual and augmented reality on smartphones. With cameras present in every modern smartphone, tablet and wearable device, virtual and augmented reality applications are becoming increasingly prominent. A fundamental requirement for many such applications is exact real-time estimation of the pose of the device, as well as of the 3D structure of the environment – which is precisely what this technique provides.

The method is based on estimating and maintaining a semi-dense depth map (containing per-pixel depth) by continuously propagating and probabilistically fusing pixel-wise stereo comparisons with previous frames. New frames are then tracked using direct image alignment: given the estimated semi-dense depth map, the camera pose is found by directly minimising the photometric error (the intensity differences) between the two frames. To maintain real-time performance, even on a smartphone, only image pixels with a sufficiently large gradient are used, as depth can only be estimated for these pixels.
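To make the tracking step concrete, below is a minimal sketch of direct image alignment in Python. The camera intrinsics, the gradient threshold and the function names are illustrative assumptions, not part of the published system; a real tracker would also interpolate intensities sub-pixel, weight residuals robustly and optimise the pose iteratively rather than merely evaluating the error for one candidate pose.

```python
import numpy as np

# Hypothetical pinhole intrinsics and gradient threshold; the real
# system calibrates these per camera and tunes the threshold online.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5
GRAD_THRESHOLD = 5.0

def select_semi_dense_pixels(image):
    """Keep only pixels with a sufficiently large intensity gradient,
    since depth can only be estimated where the image has structure."""
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    return np.argwhere(magnitude > GRAD_THRESHOLD)  # (row, col) pairs

def photometric_error(ref_img, ref_depth, new_img, R, t):
    """Mean squared intensity difference between the reference frame and
    the new frame for a candidate camera rotation R and translation t.
    Direct tracking minimises this error over the pose parameters."""
    error, count = 0.0, 0
    for v, u in select_semi_dense_pixels(ref_img):
        d = ref_depth[v, u]
        if not np.isfinite(d) or d <= 0:
            continue  # no depth hypothesis for this pixel yet
        # Back-project the pixel into 3D using the pinhole model...
        p = d * np.array([(u - CX) / FX, (v - CY) / FY, 1.0])
        # ...transform it into the new camera frame...
        q = R @ p + t
        if q[2] <= 0:
            continue  # point ended up behind the camera
        # ...and project it into the new image.
        u2 = FX * q[0] / q[2] + CX
        v2 = FY * q[1] / q[2] + CY
        if 0 <= u2 < new_img.shape[1] - 1 and 0 <= v2 < new_img.shape[0] - 1:
            # Nearest-neighbour lookup; a real tracker interpolates.
            residual = float(ref_img[v, u]) - float(new_img[int(v2), int(u2)])
            error += residual * residual
            count += 1
    return error / max(count, 1)
```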

Combined with a scale-drift aware pose-graph framework, large 3D scenes can be reconstructed accurately with only an ordinary hand-held monocular camera or from a mobile phone camera.
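As a rough illustration of why scale awareness matters in the pose graph, consider representing each keyframe pose as a similarity transform (scale, rotation, translation) rather than a rigid-body one. The sketch below is a simplification under that assumption – the published system works with Sim(3) elements and a proper graph optimiser – but it shows the two operations a scale-drift aware pose graph is built from: composing similarity transforms, and measuring the residual of a loop-closure constraint.

```python
import numpy as np

def compose_sim3(a, b):
    """Compose two similarity transforms, each given as (scale s,
    rotation matrix R, translation t), acting as x -> s * R @ x + t.
    Chaining many noisy monocular estimates lets the scale drift."""
    sa, Ra, ta = a
    sb, Rb, tb = b
    return (sa * sb, Ra @ Rb, sa * (Ra @ tb) + ta)

def invert_sim3(p):
    """Inverse of a similarity transform: y -> (1/s) * R.T @ (y - t)."""
    s, R, t = p
    return (1.0 / s, R.T, -(1.0 / s) * (R.T @ t))

def loop_closure_residual(pose_i, pose_j, measured_ij):
    """Mismatch between the relative transform implied by the current
    absolute pose estimates and an independently measured one (e.g. a
    loop closure). A pose-graph optimiser adjusts all poses, scales
    included, to shrink such residuals across the whole trajectory."""
    est_ij = compose_sim3(invert_sim3(pose_i), pose_j)
    s_err = np.log(est_ij[0] / measured_ij[0])  # scale-drift term
    R_err = measured_ij[1].T @ est_ij[1]        # rotation mismatch
    t_err = est_ij[2] - measured_ij[2]          # translation mismatch
    return s_err, R_err, t_err
```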

More information as well as videos can be found at: http://vision.in.tum.de/research/semidense.

Related analysis & opinion

09 October 2018

A group at the University of Bologna is trying to make images from Grand Theft Auto more realistic so that they can act as training data for neural networks. Greg Blackman listens to Pierluigi Zama Ramirez’s presentation at the European Machine Vision Forum in Bologna in September

24 May 2018

Data is now a fiercely guarded asset for most companies and, as the European General Data Protection Regulation (GDPR) comes into force, Framos’ Dr Christopher Scheubel discusses potential new business models based on 3D vision data, following a talk he gave at the Embedded Vision Summit in Santa Clara this week

11 December 2018

Dr Guillaume Girardin, at Yole Développement, sets out some of the forces driving the growth of 3D imaging and sensing technologies

28 August 2018

Technology that advances 3D imaging, makes lenses more resistant to vibration, turns a CMOS camera virtually into a CCD, and makes SWIR imaging less expensive, are all innovations shortlisted for this year’s Vision Award, to be presented at the Vision show in Stuttgart

22 June 2018

Robot bin picking has been worked on for a number of years, and while it has been shown to be possible it’s only now that the technology is coming to fruition. Greg Blackman looks at what was on display at Automatica

Related features and analysis & opinion

16 November 2018

Online retail sales in the US exceeded $453 billion in 2017, according to the US Department of Commerce. Although this may seem like a substantial amount, it only accounts for 13 per cent of the total retail sales made in the region throughout the year, meaning the majority of transactions still take place via the millions of customers walking through their doors and aisles every day.

09 October 2018

A group at the University of Bologna is trying to make images from Grand Theft Auto more realistic so that they can act as training data for neural networks. Greg Blackman listens to Pierluigi Zama Ramirez’s presentation at the European Machine Vision Forum in Bologna in September

24 May 2018

Data is now a fiercely guarded asset for most companies and, as the European General Data Protection Regulation (GDPR) comes into force, Framos’ Dr Christopher Scheubel discusses potential new business models based on 3D vision data, following a talk he gave at the Embedded Vision Summit in Santa Clara this week

19 February 2019

The agri-food industry is on the verge of a revolution thanks to advances in precision farming. Machine vision plays a crucial role in these advances, as Keely Portway finds out

19 February 2019

Greg Blackman explores some novel ways of imaging glass, including a 3D technique to measure the flatness of glass panels

11 December 2018

Dr Guillaume Girardin, at Yole Développement, sets out some of the forces driving the growth of 3D imaging and sensing technologies

15 November 2018

Recent years have seen 3D imaging grow in importance within the machine vision industry. The shortlist for this year’s Vision Awards at the Stuttgart trade fair is testament to how the technology behind 3D has advanced – and this looks set to continue and present a number of new use cases.

According to Raymond Boridy, product manager at Teledyne Dalsa, many applications that were previously not possible have been opened up thanks to 3D.

28 August 2018

Technology that advances 3D imaging, makes lenses more resistant to vibration, turns a CMOS camera virtually into a CCD, and makes SWIR imaging less expensive, are all innovations shortlisted for this year’s Vision Award, to be presented at the Vision show in Stuttgart