More than Moore's law can provide
‘Nothing is certain except death and taxes,’ said Benjamin Franklin. If Franklin were alive today, he might be tempted to add: ‘…but next year’s computers will almost certainly be faster than this year’s computers.’ Because of the exponential progression of PC performance, image processing for most applications can be handled by standard hardware. However, sometimes standard hardware just isn’t fast enough: ‘Ever since I started in imaging 20 years ago, there have always been applications for which the PC wasn’t fast enough,’ says Mark Williamson, sales and marketing director at Stemmer Imaging. ‘The old consensus used to be that PCs are getting faster, and [imaging-specific] hardware acceleration is becoming unnecessary. In actual fact, as PCs have gotten faster, algorithms have become more advanced and users still want to run these processing applications as fast as possible.’
For the handful of applications that need more processing power than an off-the-shelf PC can provide, additional image-processing hardware accelerators can make up for the shortfall. These applications are usually quite specialist, as Williamson observes: ‘Hardware acceleration is normally only needed for high-end, high processing speed applications; it’s not really needed on general machine vision applications. It’s probably only one of our customers out of every 20 or 30 that need to make use of this acceleration, and it’s usually for the complex, expensive and difficult applications.’
One such application cited by Williamson is the automated inspection of flat-panel displays during manufacture: ‘These flat-panels are getting bigger and bigger, and the factories cannot say “ok it’s a bigger flat-panel, and therefore it’s going to take longer to inspect”,’ he notes. Williamson goes on to explain that Moore’s law, which states that the number of transistors on a silicon chip doubles every two years, is applicable to many areas of the semiconductor industry. As such, the area occupied by each feature on an integrated circuit is roughly halved every two years. To keep up with these shrinking features, says Williamson, manufacturers ‘need to increase the resolution of their inspections, and when increasing the resolution, the inspection can’t take any longer – they won’t let the inspection take longer, because it has to be cheaper, and therefore it has to go faster’. With high-resolution camera technology progressing with a similar relentlessness, processing capacity frequently becomes the bottleneck in such applications. ‘These might not necessarily be high-throughput applications in terms of units per hour, but they are the applications in which resolution has to be so fine that the resulting data is produced in a large volume.’
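The arithmetic behind Williamson's point can be sketched in a few lines. The panel size, pixel sizes and inspection time below are purely hypothetical figures chosen for illustration, and `required_pixel_rate` is an invented helper, not part of any vendor's software. The sketch shows that halving the pixel size while holding the inspection time fixed quadruples the data rate the processing hardware must sustain.

```python
def required_pixel_rate(width_mm, height_mm, pixel_size_um, inspection_time_s):
    """Pixels per second needed to cover a panel in a fixed inspection time."""
    pixels_per_mm = 1000.0 / pixel_size_um
    total_pixels = (width_mm * pixels_per_mm) * (height_mm * pixels_per_mm)
    return total_pixels / inspection_time_s


# Hypothetical 1000 mm x 500 mm panel inspected in 10 seconds.
coarse = required_pixel_rate(1000, 500, pixel_size_um=20, inspection_time_s=10)
fine = required_pixel_rate(1000, 500, pixel_size_um=10, inspection_time_s=10)

print(f"20 um pixels: {coarse:.3g} pixels/s")
print(f"10 um pixels: {fine:.3g} pixels/s")
print(f"ratio: {fine / coarse:.1f}x")
```

Because resolution applies in both dimensions, each halving of pixel size multiplies the data volume by four, which is why processing rather than throughput in units per hour tends to become the limit.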
There are three main hardware approaches to accelerating an image-processing application. Pierantonio Boriero, imaging product line manager at Matrox Imaging, describes the ‘processing trinity’: ‘We have multicore CPUs, we have GPUs, and we also have FPGAs to off-load processing.’ The three types of hardware are very different, and each is suited to particular kinds of processing tasks. DSP- or CPU-based accelerator boards were the main products until recently, usually sitting on an expansion card that plugs into a PC. ‘They allowed load to be taken away from the host processors,’ explains Boriero, ‘meaning that tasks were completed more quickly than the host system could manage otherwise.’
The high speeds of GPUs (graphics processing units), as featured on graphics cards, are usually associated with the needs of computer gamers. The rationale behind the use of multicore processors and GPUs is the same; the difference is that where a multicore CPU may contain between two and eight processing cores, a modern GPU can have 256 or more identical processing pipelines on one chip. Both the multicore CPU and the GPU are well suited to image processing tasks that can be parallelised, and the GPU in particular is ‘dramatically faster than a normal processor’, according to Williamson, although he notes that not every application is suitable for GPU processing.
‘Several companies have tried to use GPUs to accelerate image processing,’ he explains, adding that there are three main approaches to the problem. First, customers could potentially make use of the proprietary GPU programming environments, such as Nvidia’s CUDA framework. This approach requires users to write software specifically for the GPU; ‘there aren’t any imaging libraries based on CUDA, so you would have to do it from first principles,’ says Williamson.
The second option is to make the GPU acceleration transparent to the user. This is the route Matrox has chosen with its MIL (Matrox Imaging Library) products. Matrox’s Boriero explains: ‘In MIL, we have a library of functions that is quite comprehensive. We went through that library and determined which functions would benefit from acceleration on multicore processors or on the GPU,’ he says. ‘But not all algorithms can benefit from parallel processing; some are inherently serial in nature, meaning that they have to execute in a serial fashion, and really cannot benefit from parallel execution.’
Williamson, however, believes that this approach has other limitations: ‘From the user’s point of view, their tasks will just run quicker, but the issue is the I/O,’ he says, referring to inputs and outputs in the sense of moving data between the GPU and the host CPU managing the application. Matrox’s Boriero explains that this limitation arises due to the GPU’s display-only origins. The GPU’s I/O was not designed to be symmetrical, as its output would normally go to a display rather than back into the system’s main memory. The performance in this respect is being addressed by the manufacturers.
According to Williamson, it is preferable to create a whole processing pipeline that will run on the GPU, without needing to move data back to the host frequently. Stemmer’s approach comes in the form of its Common Vision Blox (CVB) GPU product. CVB-GPU makes use of the same High-Level Shader Language (HLSL) used by games developers, which works with Microsoft’s DirectX and Direct3D frameworks. All memory management is automated by the product. ‘The advantage is that you’re not tied to any particular hardware, and the language is similar to C – we can basically write any function,’ he says.
Matrox has a similar line of reasoning, although Boriero states that DirectCompute and OpenCL are the frameworks of choice for the company: an open interface to which anybody can code ‘is certainly the future,’ he says. Furthermore, OpenCL is not Windows-centric, and so customers using Mac OS and Linux are able to use it.
The applications Williamson cites as most suitable for GPU-accelerated techniques are those requiring the same operation to be applied to every pixel in an image. ‘This is where a GPU is a very good choice,’ he explains. ‘They’re designed to run an algorithm across the whole image, and so they’re good at doing the same thing many times.’ Example applications include processing x-ray images, in which several images would first be averaged, before some distortions are added. Williamson states that this type of processing required specialist hardware until recently. The High-Level Shader Language will run on any graphics card, and the language allows the user to turn various components of the processing pipeline on and off while it is running, allowing real-time experimentation with process parameters. ‘High-Level Shader Language generically supports as many cards as you can install,’ says Williamson.
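To illustrate what ‘the same operation applied to every pixel’ looks like in practice, here is a minimal sketch in plain Python of the frame averaging Williamson mentions. The function names and the tiny 2×2 frames are invented for illustration; this is not CVB-GPU code, and a real shader would express the same per-pixel rule in HLSL.

```python
def average_frames(frames):
    """Average a stack of equally sized greyscale frames, pixel by pixel."""
    count = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[y][x] for frame in frames) / count for x in range(width)]
        for y in range(height)
    ]


def apply_per_pixel(image, op):
    """Apply the same function to every pixel independently."""
    return [[op(pixel) for pixel in row] for row in image]


# Three hypothetical noisy exposures of the same 2x2 scene.
frames = [
    [[10, 20], [30, 40]],
    [[14, 20], [30, 44]],
    [[12, 26], [36, 42]],
]

averaged = average_frames(frames)
brightened = apply_per_pixel(averaged, lambda p: min(255, p * 2))
print(averaged)
print(brightened)
```

Each output pixel depends only on the pixels at the same position in the inputs, so no pixel has to wait for any other. That independence is what lets hundreds of GPU pipelines work on the image at once.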
‘GPUs can’t be used to accelerate every application,’ points out Matrox’s Boriero. ‘Customers shouldn’t say that they want to use GPU-based processing just because of all the hoo-ha about it. They need to look at their algorithm, see where the bottlenecks are, and see how much it would benefit from parallel execution; that’s all a GPU is effectively – hundreds of cores working in parallel.’
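One common way to estimate ‘how much it would benefit from parallel execution’ is Amdahl's law, which is not named by Boriero but captures his point. The sketch below assumes the fraction of run time that can be parallelised is known from profiling; the function name and the example fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speed-up when only part of a task can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical algorithms running on 256 GPU pipelines.
for fraction in (0.5, 0.95, 1.0):
    speedup = amdahl_speedup(fraction, 256)
    print(f"{fraction:.0%} parallel -> at most {speedup:.1f}x faster")
```

An algorithm that is only half parallelisable can never run more than twice as fast, however many cores are added, which is why the bottleneck analysis has to come before the hardware purchase.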
Levelling the playing field
Parallelisation is not the best way to accelerate every part of a processing problem, as Mike Bailey, senior systems engineer at National Instruments (NI) explains: ‘If, for example, we were trying to threshold an image and then detect a blob on it, because there are two steps in the algorithm, that task would split very well. We could have one processor that’s purely going to do the threshold, and then we’d pass the image on to the next processor, which would then do the blob detection. This is an example of pipelining. If, however, we split the problem and use parallelism, you would split the image in half. For the threshold, this would work very well, because it’s an easy point-by-point algorithm, but if we wanted to detect the blob on the image, we’d have to detect it in both halves in case it falls across the split, and so it becomes more tricky,’ he says. NI has approached the problem by developing ways for its LabVIEW software product to split the company’s vision algorithms automatically across multiple processors. Aside from optimising the workload for multicore CPUs, NI uses FPGAs (field-programmable gate arrays – a type of reprogrammable logic chip) as co-processors or as image processors. ‘FPGAs help out by bringing the data back to a manageable level,’ says Bailey.
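Bailey's threshold-and-blob example can be reproduced on a toy image. The sketch below is plain Python with invented function names and a hypothetical 4×6 image; it is not NI's implementation. It uses a simple flood fill to count 4-connected blobs, and shows that thresholding the two halves separately gives the same result as thresholding the whole image, whereas counting blobs in each half counts a blob that straddles the split twice.

```python
def threshold(image, level):
    """Point-by-point threshold: 1 where the pixel is >= level, else 0."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]


def count_blobs(binary):
    """Count 4-connected regions of 1s using an iterative flood fill."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] and not seen[y][x]:
                blobs += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return blobs


# One bright blob that lies across the vertical midline.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 9, 0],
    [0, 9, 9, 9, 9, 0],
    [0, 0, 0, 0, 0, 0],
]
left = [row[:3] for row in image]
right = [row[3:] for row in image]

whole_binary = threshold(image, 5)
stitched = [l + r for l, r in zip(threshold(left, 5), threshold(right, 5))]
split_count = count_blobs(threshold(left, 5)) + count_blobs(threshold(right, 5))

print("threshold identical after split:", whole_binary == stitched)
print("blobs in whole image:", count_blobs(whole_binary))
print("blobs counted per half:", split_count)
```

A parallel blob detector therefore needs an extra merge step along the seam, which is the added trickiness Bailey describes, while a pipelined arrangement avoids the problem by keeping each stage working on the whole image.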
Matrox’s Boriero believes that the FPGA-based approach is best suited to in-stream processing. He explains which tasks are best-suited to the technique: ‘They’re very demanding tasks, but kind of dumb in that they’re very repetitive. Either the tasks don’t have complex heuristics, or they don’t have heuristics at all. The performance is not data dependent, because they’re doing the same thing over and over again regardless of what the data is. These are tasks such as applying a spatial filter to remove noise, or to enhance an image to make sure the edges stand out – to sharpen the image for example.
‘We’ve determined that the best place to do FPGA processing is at the input,’ says Boriero, ‘where the data and images are being acquired.’ Although FPGA pre-processors are sometimes available on high-end cameras, Boriero believes that the high power consumption and warm running temperature of processing devices has a negative effect on the imaging device, and so the two should be kept separate.
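The spatial filters Boriero describes can be sketched as a 3×3 neighbourhood operation. The code below is plain Python for illustration only; the function name, kernels and test images are assumptions, and an FPGA would implement the same arithmetic in logic as pixels stream in from the camera. Border pixels are simply copied, which is one of several possible edge policies.

```python
def filter3x3(image, kernel, divisor=1):
    """Apply a 3x3 kernel to interior pixels; border pixels are copied as-is."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
            out[y][x] = total / divisor
    return out


BOX_BLUR = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]     # use with divisor=9
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]  # use with divisor=1

# A single noisy pixel is smoothed away by the box filter.
noisy = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
smoothed = filter3x3(noisy, BOX_BLUR, divisor=9)

# A soft vertical edge (10 -> 20) gains contrast under the sharpening kernel.
edge = [[10, 10, 20, 20]] * 3
sharpened = filter3x3(edge, SHARPEN)

print(smoothed[1][1])
print(sharpened[1])
```

The loop performs exactly nine multiplications and additions per pixel whatever the image contains, which is the sense in which Boriero says the performance of such tasks is not data dependent.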
When it comes to its hardware, Stemmer too uses FPGAs. Williamson explains that FPGAs were previously programmed in a specialised hardware description language (VHDL), suggesting that this difficulty may account for a lack of FPGA-accelerated products. ‘What you might get is a frame grabber with an FPGA put on there,’ he says, ‘or a camera that might do averaging or flat-field correction and maybe some filtering. The tasks done will be very specific in terms of what they’re coded as, and this is because that coding takes a lot of effort.’ For this reason, Stemmer distributes a product made by Silicon Software. The software creates a flowchart environment in which a user can program FPGAs visually.
A processing trinity
‘Our Supersight platform moves away from the add-on-based solution, towards more of a system-level solution,’ says Matrox’s Boriero. ‘Existing technology can’t put all of the necessary processing elements onto a single expansion board when carrying out video capture, video processing, and also display. To meet our target speeds, we want to be able to capture the image as directly as possible, and send it to the accelerator. This isn’t really feasible anymore [using standard accelerator cards], because we can’t fit the necessary processing elements onto the same card as the acquisition hardware – there’s just not enough real estate, and we can’t provide enough power for the devices. This is why we’ve come up with a system-level approach – the Supersight.’ The system is an industrial PC whose architecture accommodates a combination of multicore CPU cards, GPU cards, and FPGA-accelerated video capture cards in a single chassis. The separate boards are connected by PCIe x16 throughout, meaning that communication between component processors has high throughput. Boriero believes that these PCIe interconnects in particular make the system unique.
According to Stemmer’s Williamson, camera data rates are doubling every two or three years, meaning that there is still a ‘bottleneck at the top end in which we never have enough processing power’. He cites applications such as line scan and web inspection, stating that, in some cases, the fastest image processors are unable to keep up with the production speeds.
Progress is spurred on as much by the whole semiconductor industry’s adherence to Moore’s law as by the efforts of the imaging companies, and the latter are happy for the help: ‘Personally, I think that multicore is going to be the most scalable technology,’ says NI’s Bailey. ‘Intel is working on it, we’re still following Moore’s law, and we’re doubling processing power every two years. It’s almost a freebie in that we can rely upon the power of the PC to enable us to do more with the measurements.’
Similarly, Stemmer hopes to follow gamers to higher performance: ‘We hope to ride the wave of the gaming market,’ says Williamson. ‘The performance of GPUs doubles every couple of years, and we’ll be adding more and more functionality [to CVB-GPU], and more pre-written algorithms, so that we can do more and more – and faster.’