Turning Grand Theft Auto into a deep learning dataset
A group at the University of Bologna is trying to make images from Grand Theft Auto more realistic so that they can act as training data for neural networks. Greg Blackman listens to Pierluigi Zama Ramirez’s presentation at the European Machine Vision Forum in Bologna in September
What can computer vision learn from video games? Researchers at the University of Bologna in Italy have trained neural networks using images from the video game Grand Theft Auto (GTA). The idea is to see whether the computer graphics from GTA can be made to seem like real images, at least from the neural network’s perspective.
Pierluigi Zama Ramirez, a PhD student at the University of Bologna, described the work at the European Machine Vision Association’s machine vision forum, held in Bologna from 5 to 7 September.
One of the big problems with deep learning, Zama Ramirez explained, is annotating the large amount of image data needed to get an accurate output from a neural network. He said that a task like semantic segmentation – classifying each pixel of the image – mostly has to be done manually, which can take two to six hours for each image.
The advantage of training a neural network on synthetic data, such as that produced by computer graphics, is that ‘you can obtain the labels almost for free’, Zama Ramirez said, as well as having access to a lot of images.
The downside is that the models trained on synthetic data cannot achieve the same performance as models trained on real data.
The researchers therefore set about trying to make GTA images look more realistic using generative adversarial networks (GANs). This is a framework that consists of two neural networks: a generator and a discriminator. The generator takes a synthetic image from the video game and tries to transform it into a realistic image. The discriminator then takes the adapted images and a real image dataset, and tries to classify which is real and which is fake. Over time the generator gets better at producing realistic images, while the discriminator becomes more adept at flagging synthetic data. Trained to completion, the generator turns out images that the discriminator can no longer reliably tell apart from real ones.
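The adversarial loop described above can be sketched in miniature. The toy below is an assumption for illustration only, not the Bologna group’s model: the “images” are single numbers, the generator is one learnable shift applied to noise, and the discriminator is a one-dimensional logistic classifier, but the alternating update structure is the same as in a full image GAN.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for "real images": samples clustered around 4.0.
def sample_real(n):
    return rng.normal(4.0, 0.5, n)

theta = 0.0        # generator: a single learnable shift applied to noise
w, b = 0.1, 0.0    # discriminator: logistic classifier D(x) = sigmoid(w*x + b)
lr, batch = 0.05, 64

for _ in range(2000):
    z = rng.normal(0.0, 0.5, batch)
    fake = z + theta                      # generator output
    real = sample_real(batch)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    grad_logit = np.concatenate([d_real - 1.0, d_fake])   # dLoss/dlogit
    xs = np.concatenate([real, fake])
    w -= lr * np.mean(grad_logit * xs)
    b -= lr * np.mean(grad_logit)

    # Generator step: push D(fake) towards 1 (non-saturating GAN loss).
    d_fake = sigmoid(w * (z + theta) + b)
    theta -= lr * np.mean(-(1.0 - d_fake) * w)
```

After training, the generator’s shift has moved close to the real data’s centre of 4.0: the generator has learned to imitate the real distribution because fooling the discriminator is the only way to reduce its loss.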
There are two broad branches of GAN-based adaptation: pixel-level approaches, like Cycle-GAN, and feature-level methods. Pixel-level approaches don’t exploit any semantic information, i.e. the context of the image, so a framework like Cycle-GAN can introduce a lot of artefacts, such as trees sitting in the sky.
Zama Ramirez worked with a new pixel-level GAN approach that exploits semantic information during the generation process. Here, the discriminator not only classifies whether the image is real or fake, but also performs semantic segmentation of the image. This leads the generator to produce images that have the same semantic content as the source synthetic images.
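One common way to build such a semantics-aware discriminator is to give it K + 1 per-pixel output channels: one per semantic class, plus an extra channel for “fake”. The sketch below illustrates that formulation; the shapes, class convention, and loss combination are assumptions for the example and may not match the exact objective the Bologna group used.

```python
import numpy as np

def per_pixel_ce(logits, target):
    """Mean cross-entropy per pixel. logits: (C, H, W); target: (H, W) ints."""
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    h, w = target.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    return -log_probs[target, rows, cols].mean()

K = 3                                       # semantic classes; index K is "fake"
H, W = 4, 4
rng = np.random.default_rng(1)
d_logits = rng.normal(size=(K + 1, H, W))   # discriminator's per-pixel output
labels = rng.integers(0, K, size=(H, W))    # semantic ground-truth label map

# Discriminator objective: predict the true semantic class on real pixels,
# and the extra "fake" class on generated pixels.
fake_target = np.full((H, W), K)
d_loss = per_pixel_ce(d_logits, labels) + per_pixel_ce(d_logits, fake_target)

# Generator objective: make the discriminator predict the *source semantic
# labels* on adapted pixels - which is what keeps trees out of the sky.
g_loss = per_pixel_ce(d_logits, labels)
```

Because the generator is rewarded only when the discriminator assigns each adapted pixel its original semantic class, the adaptation cannot scramble the scene layout the way a plain real/fake objective can.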
The group worked with a training dataset of 20,000 semantically labelled images from Grand Theft Auto V, and a validation set of 2,975 unlabelled images from the Cityscapes dataset of real urban street scenes. A network was trained on the adapted GTA images, and its performance was then evaluated against the Cityscapes validation set.
The performance of the network trained on adapted GTA images increased from 18.23 per cent to 31.4 per cent mean intersection over union (mIoU), and from 60.43 per cent to 80 per cent pixel accuracy, compared with training on the raw GTA synthetic data.
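Both figures come from the standard segmentation confusion matrix: pixel accuracy is the fraction of pixels given the correct class, while mIoU averages, over classes, the overlap between predicted and true regions divided by their union. A minimal sketch of how the two metrics are computed from a pair of label maps:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Return (mIoU, pixel accuracy) for integer label maps of equal shape."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(gt.ravel(), pred.ravel()):
        conf[t, p] += 1
    tp = np.diag(conf).astype(float)
    # Union of class c: pixels labelled c in gt or in pred, minus the overlap.
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1)          # guard against empty classes
    miou = iou[union > 0].mean()             # average only classes present
    pixel_acc = tp.sum() / conf.sum()
    return miou, pixel_acc

# Toy example: 2x3 label maps with three classes, four of six pixels correct.
gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
miou, acc = segmentation_metrics(pred, gt, num_classes=3)
# miou -> 0.5, acc -> 0.667 (to three decimal places)
```

mIoU is the stricter of the two: a class that covers few pixels, such as a traffic sign, weighs as much as the road surface, which is why the mIoU figures quoted above are so much lower than the pixel accuracies.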
‘Training a network on our adapted images can achieve almost double that from training a network on just synthetic data,’ Zama Ramirez commented. ‘The adapted images belong much more to the real distribution than the synthetic images.’
However, he added that the adapted images ‘still can’t reach the same accuracy of performance as when trained on real data.’
The group is now employing a Cycle-GAN approach, consisting of two generators and two discriminators, to try to achieve even better performance. Whether Grand Theft Auto can be made to appear completely real remains to be seen.