Scientists construct 3D models from 2D images

Scientists have 'taught' a computer program to construct 3D models from single 2D images, using the same visual cues that humans rely on to infer 3D information. Most 3D imaging systems calculate depth by comparing details across several images taken from different positions, or by projecting a structured array of lighting to indicate the shape of an object.

The new technique, dubbed Make3D, needs no special equipment, and can even reconstruct 3D information from pictures taken by household cameras. To do this, the software first tries to differentiate the planes that make up the surfaces within the image, deconstructing it into a number of 3D 'super-pixels'.
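The idea of splitting an image into small regions of coherent appearance can be illustrated with a toy sketch. The flood-fill grouping below, which merges neighbouring pixels of similar intensity, is only a stand-in for the proper over-segmentation algorithm Make3D uses; the image, threshold, and function names are illustrative assumptions.

```python
# Toy 'superpixel' grouping: flood-fill pixels whose intensities are
# within a threshold of their neighbours. Illustrative only -- Make3D
# uses a more sophisticated over-segmentation, not this simple sketch.
from collections import deque

def superpixels(image, threshold=10):
    """Label connected regions of similar intensity in a 2D grid."""
    h, w = len(image), len(image[0])
    labels = [[None] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] is not None:
                continue
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] is None
                            and abs(image[ny][nx] - image[y][x]) <= threshold):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label

# A tiny image with two flat surfaces: a dark left half, a bright right half.
img = [[20, 22, 200, 205],
       [21, 23, 198, 202],
       [19, 20, 201, 203]]
labels, count = superpixels(img)
print(count)  # two regions recovered
```

Each labelled region then stands in for one small planar patch whose depth and orientation the model estimates.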

It then infers the depth and orientation of these planes using the visual cues that humans normally rely on to understand 2D images. These include texture variations, gradients of surfaces, and whether objects are in focus or not. 'For example, the texture of surfaces appears different when viewed at different distances or orientations,' says Ashutosh Saxena from Stanford University in the US, who worked on Make3D. 'A tiled floor with parallel lines will appear to have tilted lines in an image, such that distant regions will have larger variations in the line orientations, and nearby regions will have smaller variations in line orientations. We capture some of these features in our model.'  
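The texture cue Saxena describes can be sketched numerically: on a striped floor receding from the camera, the stripes appear closer together with distance, so image regions with more intensity transitions per pixel read as farther away. The stripe widths and the transition-counting feature below are illustrative assumptions, not Make3D's actual feature set.

```python
# Texture-gradient cue sketch: rows of a receding striped floor.
# Distant rows pack more stripes into the same number of pixels,
# so a simple 'transitions per pixel' feature rises with distance.

def transitions_per_pixel(row):
    """Fraction of adjacent-pixel pairs whose intensity changes."""
    changes = sum(1 for a, b in zip(row, row[1:]) if a != b)
    return changes / (len(row) - 1)

def make_striped_row(width, period):
    """Binary stripe pattern with the given stripe width in pixels."""
    return [1 if (x // period) % 2 == 0 else 0 for x in range(width)]

near = make_striped_row(60, period=10)  # wide stripes: near the camera
far = make_striped_row(60, period=2)    # narrow stripes: far away

print(transitions_per_pixel(near) < transitions_per_pixel(far))  # True
```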

Using a machine learning algorithm, the software learnt to recognise these features and associate them with depth measurements: it was 'trained' on 400 2D images paired with 3D data gleaned from laser scans of different scenes. The team then tested Make3D on the remaining 134 images with corresponding 3D laser scan data, and on 588 images collected from the internet, with independent observers judging whether the results were realistic. Overall, 64.9 per cent were considered qualitatively accurate. For the times when it does make mistakes, Saxena has recently built a tool that allows users to tweak the constructed model; the algorithm also learns from the alteration, so that it produces better results in future.
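The training step pairs image features with ground-truth depths from the laser scans. As a heavily simplified sketch, the code below fits a one-feature linear model by least squares and predicts depth for a held-out example; Make3D's real model is far richer (many features, estimated jointly across the scene), and the numbers here are invented for illustration.

```python
# Supervised depth learning reduced to a one-feature linear fit:
# given (feature, depth) training pairs, fit depth = a*feature + b
# by ordinary least squares, then predict on an unseen feature value.
# Make3D's actual model is much richer; this shows only the idea of
# learning a feature-to-depth mapping from scanned training data.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Invented training data: texture feature vs. laser-measured depth.
features = [0.1, 0.2, 0.3, 0.4]
depths = [1.0, 2.0, 3.0, 4.0]

a, b = fit_linear(features, depths)
print(round(a * 0.25 + b, 2))  # predicted depth for unseen feature 0.25
```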

'This is very similar to the case when a kid makes a mistake and parents give the kid a few hints on how to do the task so that he does it better the next time,' says Saxena. The research was presented at the Siggraph conference in Los Angeles, California, which took place on 11-15 August.

The team suggests that the technique could serve many day-to-day applications where expensive 3D vision equipment is not feasible. For example, estate agents could easily create 3D models of the properties they are selling to show prospective buyers. It could also be used in machine vision: the precision of stereovision depends on the distance between the two cameras, and if the cameras are too close together it is difficult to estimate the depth of a distant object. This limits robot navigation, since fast-moving autonomous robots and cars need to predict depth accurately to avoid obstacles. 'Combining stereo and monocular cues gives better results than either alone,' says Saxena.
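The stereo limitation can be made concrete with the standard pinhole relation depth = focal_length x baseline / disparity: a one-pixel disparity error corresponds to a depth error that grows with distance and shrinks as the cameras move apart. The focal length and baselines below are assumed values for illustration, not figures from the research.

```python
# Stereo depth sketch: Z = f * B / d, with f the focal length in
# pixels, B the camera baseline and d the disparity in pixels.
# A fixed one-pixel disparity error causes a depth error that grows
# with distance and shrinks with baseline -- the limitation above.
# All numbers are illustrative assumptions.

def depth_from_disparity(f, baseline, disparity):
    return f * baseline / disparity

def depth_error(f, baseline, depth, pixel_error=1.0):
    """Depth change caused by a one-pixel disparity error."""
    d = f * baseline / depth  # true disparity at this depth
    return depth_from_disparity(f, baseline, d - pixel_error) - depth

f = 700.0                 # focal length in pixels (assumed)
narrow, wide = 0.1, 0.5   # camera baselines in metres (assumed)

for z in (2.0, 20.0):     # near and far objects, in metres
    print(z, round(depth_error(f, narrow, z), 3),
             round(depth_error(f, wide, z), 3))
```

Running the loop shows the depth error for the narrow baseline blowing up at 20 m, which is why monocular cues are a useful complement for fast-moving robots.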

In the future, the team will investigate how to make the software recognise the contents of an image and identify the type of scene, which could improve a robot’s ability to manoeuvre around its environment.