Deep learning for embedded vision highlighted at EMVA conference
Matthew Dale reports from EMVA’s debut Embedded Vision Europe conference in Stuttgart, where deep learning ‘at the edge’ was discussed
Deep learning for embedded vision was one of the big talking points at the first Embedded Vision Europe conference, which took place in Stuttgart, Germany on 12-13 October.
The first Embedded Vision Europe conference (EVE) closed on Friday at the ICS Stuttgart with a fully booked attendance of about 200 participants. (Credit: EMVA)
The conference, organised by the European Machine Vision Association (EMVA) and Messe Stuttgart, looked to address the latest advances in embedded vision processing, an area taking hold because of the availability of low-cost, low-power embedded processing boards. These advances are happening largely outside of industrial vision, although they have the potential to greatly impact the machine vision market.
During her opening address, Gabriele Jansen, one of the founders of EMVA, commented that the hardware and software of embedded vision are set to boost the machine vision industry far beyond traditional growth rates and expectations.
Both keynote speakers discussed image processing using deep learning, a type of artificial intelligence based on neural networks. Alex Myakov of Intel, in his keynote, made it clear that many of the neural networks currently available are not suitable for embedded devices.
‘At the edge you have lower platform compute and a smaller amount of memory available,’ he explained. ‘On top of that a large amount of accuracy is required for different use cases.’
As a result, out of the current networks shown by Myakov – including AlexNet, GoogLeNet, ResNet, Inception, VGG, BN-NIN and ENet – only ENet and GoogLeNet were candidates for running at the edge, as ‘there’s no way you can run any of those other nets on existing edge platforms’, he stated. Edge processing refers to image processing carried out close to where the images are captured, in other words onboard an embedded device.
In efforts to adapt the larger existing networks to embedded platforms, researchers began looking at which connections and synapses could be removed in order to make the networks smaller.
‘It’s a trade-off game,’ commented Myakov. ‘You have a large net on the server and you try to squeeze that into your little device.’
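The idea of removing connections to shrink a network can be sketched in a few lines. The article does not say which criterion the researchers used, so the example below assumes one common choice, magnitude pruning, where the weights closest to zero are dropped first. The function name, the `keep_fraction` parameter and the toy weights are all hypothetical.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights in a flat list.

    Assumption: magnitude is used as the importance criterion; the
    article does not state how connections were actually selected.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    n_keep = round(len(weights) * keep_fraction)
    # Rank connection indices from largest to smallest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    # Connections outside the keep set are removed (set to zero).
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


# Toy layer with six connections; keep the strongest half.
layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(layer, keep_fraction=0.5)
remaining = sum(1 for w in pruned if w != 0.0)
print(pruned, remaining)
```

The trade-off Myakov describes shows up in `keep_fraction`: the lower it is set, the smaller the network becomes, but the more of the original behaviour is thrown away.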
These attempts to move heavier server-side topologies to the edge failed for the most part, according to Myakov. ‘You started with a 95 per cent detection rate and ended up with 30 per cent, which wouldn’t be good for anything,’ he said. ‘But that research was extremely useful, as people understood how to actually create a proper feature extraction part of a network and a proper detection part of a network.’
This led to a wave of creativity about a year ago that saw researchers start to design networks especially for embedded platforms. ‘It is [now] safe to conclude that researchers are working on specific edge networks for detection and classification,’ Myakov said.
‘Two of our researchers solved the problem of creating a lightweight network for detection at the edge,’ he continued. ‘In comparison to other networks, ours achieved 95 per cent average precision at 1.5 gigaflops and 1.1 million parameters.’ This places the new embedded network in a similar region of capability to GoogLeNet and ENet, making it more suitable for edge-based processing than many server-based networks.
The Embedded Vision Europe organisers, from left: Thomas Walter, Landesmesse Stuttgart; Gabriele Jansen, Vision Ventures; Thomas Lübkemeier, EMVA general secretary; Florian Niethammer, Landesmesse Stuttgart. (Credit: EMVA)
In the opening keynote presentation, Qualcomm Technologies’ Raj Talluri proposed a two-pronged approach to handling image data in embedded environments, using both traditional computer vision and deep learning.
Intelligent cameras are now using deep learning for object recognition, detection, tracking, identification, re-identification and classification, according to Talluri. The problem, however, is that in order for these cameras to detect and recognise objects from further away, higher resolution and therefore more processing power is required for deep learning applications.
‘The right way to solve this problem is to not just use brute force and apply deep learning on the whole image, but use the traditional computer vision techniques to actually detect the object, find what is moving and what’s not moving, put a box around it and run the deep learning network on [only] that part,’ he explained.
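A minimal sketch of the two-stage idea Talluri describes might look as follows. It assumes simple frame differencing as the traditional computer vision step and uses a placeholder in place of a real neural network; Qualcomm's actual algorithms are not described in the talk, so `moving_box`, `classify_moving_region`, `toy_classifier` and the threshold value are hypothetical.

```python
def moving_box(prev, curr, threshold=25):
    """Return (top, left, bottom, right) around changed pixels, or None."""
    changed = [
        (r, c)
        for r, (prev_row, curr_row) in enumerate(zip(prev, curr))
        for c, (p, q) in enumerate(zip(prev_row, curr_row))
        if abs(q - p) > threshold
    ]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def classify_moving_region(prev, curr, classifier):
    """Run the (expensive) classifier only on the region that moved."""
    box = moving_box(prev, curr)
    if box is None:
        return None, None  # nothing moved, so the network never runs
    top, left, bottom, right = box
    crop = [row[left:right] for row in curr[top:bottom]]
    return box, classifier(crop)


def toy_classifier(crop):
    """Stand-in for a deep learning classifier (assumption, not a real net)."""
    pixels = [p for row in crop for p in row]
    return "object" if sum(pixels) / len(pixels) > 128 else "background"


# Two 6x8 greyscale frames; a bright 2x3 patch appears in the second one.
previous = [[0] * 8 for _ in range(6)]
current = [row[:] for row in previous]
for r in range(2, 4):
    for c in range(4, 7):
        current[r][c] = 200

box, label = classify_moving_region(previous, current, toy_classifier)
crop_pixels = (box[2] - box[0]) * (box[3] - box[1])
total_pixels = len(current) * len(current[0])
print(box, label, f"{crop_pixels}/{total_pixels} pixels sent to the classifier")
```

In this toy case the classifier sees only the boxed patch rather than the full frame, which is the source of the savings: the cost of the network scales with the area of the crop, not with the resolution of the camera.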
Talluri showed an example video of pedestrians located and tracked while walking on a business park campus. Using a pure deep learning solution comprising a detector and classifier, the footage could be processed at a rate of three frames per second, requiring 20 giga-MACs (multiply accumulate operations), 1,000MB/s memory bandwidth, and eight CPU cores running at 90 per cent usage.
By first using a traditional computer vision processor to highlight the pedestrians before applying a lightweight deep learning classifier, Qualcomm was able to process footage just as effectively at 25fps, using only 20 mega-MACs, 450MB/s memory bandwidth, and two CPU cores running at 60 per cent capacity.
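Setting the two sets of quoted figures side by side makes the size of the gain easier to see. The sketch below only forms ratios of like quantities, because the report does not say whether the MAC counts are per frame or per second. The dictionary keys and the 'busy core' measure (number of cores multiplied by utilisation) are illustrative choices, not terms used in the talk.

```python
# Figures as quoted in the talk; CPU load is kept as an integer percentage.
pure_deep_learning = {
    "fps": 3,
    "macs": 20_000_000_000,   # 20 giga-MACs
    "bandwidth_mb_s": 1000,
    "cores": 8,
    "cpu_percent": 90,
}
hybrid = {
    "fps": 25,
    "macs": 20_000_000,       # 20 mega-MACs
    "bandwidth_mb_s": 450,
    "cores": 2,
    "cpu_percent": 60,
}


def busy_cores(cfg):
    """Cores multiplied by utilisation: a rough measure of CPU work."""
    return cfg["cores"] * cfg["cpu_percent"] / 100


mac_reduction = pure_deep_learning["macs"] / hybrid["macs"]
fps_gain = hybrid["fps"] / pure_deep_learning["fps"]
bandwidth_reduction = pure_deep_learning["bandwidth_mb_s"] / hybrid["bandwidth_mb_s"]
cpu_reduction = busy_cores(pure_deep_learning) / busy_cores(hybrid)

print(f"MACs: {mac_reduction:.0f}x fewer")
print(f"Frame rate: {fps_gain:.1f}x higher")
print(f"Memory bandwidth: {bandwidth_reduction:.1f}x lower")
print(f"CPU load: {cpu_reduction:.1f}x lower")
```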
The hybrid approach was also demonstrated using footage from a car park. The traditional algorithm would highlight cars as they entered the car park, and place boxes around the vehicles as well as the drivers once they got out of the car. Deep learning would then activate and keep track of the drivers as they moved around, even after they got back into their cars. ‘This was all done at extremely low power inside the camera on an on-board processor,’ commented Talluri. The technique therefore shows promise for embedded applications where processing power is limited to what is available on the device.
‘That is what I think we’re going to see moving forward, this concept of machine vision applications using computer vision techniques, but also using deep learning to augment them,’ said Talluri. ‘It would have been very hard to believe that you can do this [a few years ago] on such a small processor with a little bit of memory.’
Deep learning for industry
After being given multiple examples of how deep learning could benefit drone, security and automotive applications, the conference’s attention turned to industrial machine vision when Olivier Despont of Cognex addressed how deep learning could be implemented in factory environments using only a fraction of the training data normally required.
Despont said that, with the assistance of deep learning, object inspection and counting can now take place in a wider range of complex visual circumstances, code recognition can be achieved despite deformed characters, and product classification can be done in bulk numbers.
Cognex acquired Swiss firm Vidi Systems earlier this year for its deep learning software designed for industrial vision. The Vidi deep learning library is able to locate, extract and classify differences between multiple images, and can be used to address the majority of machine vision problems, according to Despont.
He said the software can build an application in less than five minutes by being fed representative data of both positive and negative outcomes. It requires only a single GPU to run, with the option to scale up to four GPUs to meet the processing times required by its users.
The software is able to analyse between 10 and 15 megapixels per second, per GPU, according to Despont, and can generalise and conceptualise the standard and differing parts of images up to 49 megapixels in resolution.
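As a rough guide to what these throughput figures mean for cycle times, the calculation below estimates how long a single image would take to analyse. It assumes throughput scales linearly with the number of GPUs, which the talk did not state, and the function name is hypothetical.

```python
def seconds_per_image(megapixels, mp_per_second_per_gpu, gpus=1):
    """Estimated analysis time for one image.

    Assumption: throughput scales linearly with GPU count, which is an
    idealisation and was not claimed in the talk.
    """
    if megapixels <= 0 or mp_per_second_per_gpu <= 0 or gpus < 1:
        raise ValueError("inputs must be positive")
    return megapixels / (mp_per_second_per_gpu * gpus)


# Largest supported image (49 megapixels) at the quoted 10 to 15 MP/s per GPU.
slowest_one_gpu = seconds_per_image(49, 10)
fastest_one_gpu = seconds_per_image(49, 15)
fastest_four_gpus = seconds_per_image(49, 15, gpus=4)

print(f"One GPU: {fastest_one_gpu:.2f} to {slowest_one_gpu:.2f} s per image")
print(f"Four GPUs, best case: {fastest_four_gpus:.2f} s per image")
```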
The software algorithm was originally developed at a Fraunhofer institute, and was worked on for six years before being acquired by Vidi Systems and industrialised for a further five years. The algorithm reduces the amount of data required to train deep learning applications from thousands of images to tens by preparing it in an innovative way – undisclosed by Cognex – before feeding it into the neural network. This reduces the time taken to prepare inspection processes, an important factor in competitive industrial environments.
To demonstrate the software’s capability, Despont explained how, using only 60 images, Cognex was able to produce a deep learning application that identified defects in screws, despite their specular appearance. A similar result was achieved inspecting distorted characters in OCR applications, using initial sets of 30 to 50 images for each character to train the software. ‘This covers stamped characters into metal, moulded characters in plastic, inkjet printed text or text found very close to barcodes,’ Despont commented. ‘This can all be done in half a day without writing a single line of code.’
Despont went on to highlight a number of other applications addressable by the software, including textile inspection, automotive part inspection, metal surface inspection and welding inspection. For cases where a new type of defect starts to occur on a production line, the software offers an ‘unsupervised’ mode, in which it is trained only on data free of anomalies rather than on examples of previous defects. Any differences that then appear in incoming data are highlighted as anomalies. This is particularly useful in industries such as textiles, where little variation is allowed between products, Despont said.
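The principle of training only on defect-free data can be illustrated with a deliberately simple statistical stand-in. This is not the Vidi algorithm, whose details Cognex has not disclosed; it just learns the normal range of a few measured features from good parts and flags anything far outside that range. The feature values, function names and threshold of 4.0 are all hypothetical.

```python
from statistics import mean, pstdev


def fit_normal_model(good_samples):
    """Per-feature mean and standard deviation from defect-free samples only."""
    columns = list(zip(*good_samples))
    return [(mean(col), pstdev(col)) for col in columns]


def anomaly_score(model, sample, eps=1e-9):
    """Largest absolute z-score across features; eps avoids division by zero."""
    return max(abs(x - mu) / (sigma + eps) for (mu, sigma), x in zip(model, sample))


def is_anomaly(model, sample, threshold=4.0):
    """Flag a sample whose score exceeds the (hypothetical) threshold."""
    return anomaly_score(model, sample) > threshold


# Training set: measurements from good parts only, with no defect examples.
good = [[10.0, 0.50], [10.2, 0.52], [9.8, 0.48], [10.1, 0.51], [9.9, 0.49]]
model = fit_normal_model(good)

typical_part = [10.05, 0.50]
unseen_defect = [10.0, 0.70]  # a kind of deviation never shown in training

print(is_anomaly(model, typical_part), is_anomaly(model, unseen_defect))
```

Because the model never sees a defect during training, it needs no prior knowledge of what a defect looks like, which is why this style of approach suits production lines where new fault types can appear without warning.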
Lastly, Despont emphasised how the deep learning software could be used in industry to address image classification involving many classes, even with large variations within each class. In food production, for example, a box containing a number of the same produce could be arranged in an almost limitless number of ways. Again, using only a relatively small amount of data, the algorithm could be used to infer and identify the multiple arrangements as the same class of produce.
While the Vidi software is still PC-based, Cognex is working towards it being usable on a smart camera platform.