Computer vision to speed police investigations similar to Boston bombing

Police investigating the Boston Marathon bombing drew heavily on the vast amount of video data from the scene to build their case. The latest video analytic approaches could have helped speed the investigation even further, says Dr James Orwell, leader of the Visual Surveillance Research Group at the Digital Imaging Research Centre, Kingston University.

The alleged bombers of the Boston Marathon appear to have been swiftly identified and located. The scene of the atrocity was exceptional in several respects, not least the quantity of video recordings being made at the time, and the speed with which these were made available to the investigation. Videos and still pictures taken by loved ones and broadcasting professionals were willingly cast alongside regular CCTV recordings in an unruly mosaic of data. The analysis of this data would also have been on an exceptional scale, with professionals seeking to establish the answers to some ultimately simple questions: ‘who’, ‘what’, ‘where’ and ‘when’. No resource would have been spared. At the present time this work would be undertaken by experts, but machine analysis of video evidence is an increasingly realistic prospect. Even so, it is hard to envisage a fully automatic analysis; more plausible is a short-listing process that could dramatically improve efficiency, just as internet search engines allow fast browsing by presenting a shortlist of candidate pages.

No doubt the methodology of the Boston investigation would be blurred by its own urgent agenda and priorities, but broadly there would be four stages of image analysis to undertake. The first stage is to locate each media item in time and space. For installed security cameras this would hopefully be trivial; for the volunteered medley of media, it is a potentially complex task that combines scene recognition with inaccurate device timestamps, owners’ recollections about where the media was recorded and, increasingly, GPS data automatically inserted into the media by smartphones. This catalogue of the data builds up a picture of which elements of the past have been recorded. In most investigations these amount only to thin slivers here and there, but most areas around the Boston finishing line would have been recorded many times over, giving investigators the luxury of multiple angles from which to inspect crowded elements and occluding corners.
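
As a concrete illustration of the metadata involved, the sketch below reads the capture timestamp and GPS coordinates that smartphones embed in a photo’s EXIF header. It is a minimal example using the Python Pillow library, not the investigation’s actual tooling; the file name is hypothetical, and many volunteered media items will carry no GPS data at all.

```python
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def extract_time_and_place(path):
    """Read the capture timestamp and GPS coordinates (if any) from a JPEG's EXIF."""
    exif = Image.open(path)._getexif() or {}
    named = {TAGS.get(tag, tag): value for tag, value in exif.items()}

    timestamp = named.get("DateTimeOriginal")  # e.g. '2013:04:15 14:49:43'

    gps_raw = named.get("GPSInfo", {})
    gps = {GPSTAGS.get(tag, tag): value for tag, value in gps_raw.items()}

    def to_degrees(dms, ref):
        # EXIF stores degrees/minutes/seconds; south and west are negative
        degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -degrees if ref in ("S", "W") else degrees

    coords = None
    if "GPSLatitude" in gps and "GPSLongitude" in gps:
        coords = (to_degrees(gps["GPSLatitude"], gps.get("GPSLatitudeRef", "N")),
                  to_degrees(gps["GPSLongitude"], gps.get("GPSLongitudeRef", "E")))

    return timestamp, coords

# Hypothetical file name, purely for illustration:
# print(extract_time_and_place("spectator_clip_frame.jpg"))
```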

Techniques for image and video processing can assist with this first analysis stage, especially in the rich structure of an urban scene. Buildings can often be recognised via their distinctive arrangement of visual ‘interest points’, using comparisons that are invariant to the viewing angle. Once the structure is known, the viewing angle itself can then be estimated. In this way, the relative and then absolute positions of the various media sources can be built up. Similarly, methods have been developed for synchronising sources, for example by maximising the mutual information between them. Another important aspect is the capacity to visualise all this data easily in a suitable environment.
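
A minimal sketch of interest-point matching of the kind described above, using OpenCV’s ORB detector: two views are compared via descriptors that tolerate changes of viewpoint, and a RANSAC-fitted homography then relates the query frame to the reference view. The function name and thresholds are illustrative assumptions, not a production pipeline.

```python
import cv2
import numpy as np

def match_to_scene(query_path, reference_path, min_matches=10):
    """Match interest points between a volunteered frame and a reference view,
    then estimate the homography relating the two viewpoints."""
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    reference = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)  # interest-point detector and descriptor
    kp_q, des_q = orb.detectAndCompute(query, None)
    kp_r, des_r = orb.detectAndCompute(reference, None)

    # Hamming-distance matching with a cross-check to discard ambiguous pairs
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)

    if len(matches) < min_matches:
        return None  # not enough evidence that the two views share a scene

    src = np.float32([kp_q[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects outlier correspondences; H maps query pixels into the reference view
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, int(inliers.sum())
```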

A second analysis stage aims to enumerate a list of the people present at the event, and to cross-reference the various observations of each person. This allows an understanding of where each person went during the critical period. It also suggests which observation provides the clearest opportunity for ‘recognition’, which comes at a later stage. The cross-referencing of observations can be a very time-consuming task, in which accuracy of synchronisation and placement plays a vital role, and each individual needs to be re-identified in the relevant subset of media. In the Boston investigation, two factors made this easier than usual. First, the density of the coverage implies fewer gaps in the views of each individual’s trajectory: investigators spend less time guessing which way a subject turned while off-camera. Second, for events such as these, people tend to wear brightly coloured, distinctive clothing, so it is generally more straightforward to tell them apart. In contrast, at rush hour on a working day, investigators would need to use more sophisticated cues to re-identify each individual.
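
The sketch below shows why distinctive clothing helps: a crude re-identification cue built from hue-saturation histograms, using OpenCV. Real re-identification systems use far richer descriptors; this is purely illustrative, and the function names and parameters are assumptions.

```python
import cv2
import numpy as np

def clothing_signature(bgr_crop, bins=16):
    """Hue-saturation histogram of a person crop, a crude stand-in for the
    colour cues that make marathon crowds comparatively easy to tell apart."""
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def rank_by_appearance(query_crop, gallery_crops):
    """Rank gallery detections by histogram similarity to the query person."""
    query_sig = clothing_signature(query_crop)
    scores = [cv2.compareHist(query_sig, clothing_signature(crop), cv2.HISTCMP_CORREL)
              for crop in gallery_crops]
    return np.argsort(scores)[::-1]  # most similar first
```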

There is much active research into methods to automatically re-identify people between cameras, with steady improvement of results on standard test datasets. Currently, on far-view CCTV data, one can apply a 90/10 rule: 90 per cent of the subjects can be located in the top 10 per cent of the search gallery. This means that we cannot currently rely on this technology to obtain the right answer, but instead we must use it to filter the field in order to make the human process more efficient.
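
In code, that filtering step is simply a ranked shortlist. The sketch below, under the assumption of precomputed appearance descriptors, returns the top 10 per cent of a gallery for human review; per the 90/10 rule quoted above, the true match would fall inside that shortlist roughly nine times out of ten.

```python
import numpy as np

def shortlist(query, gallery, fraction=0.10):
    """Rank a gallery of appearance descriptors by distance to the query and
    return the indices of the closest `fraction`: the analyst's shortlist."""
    distances = np.linalg.norm(gallery - query, axis=1)
    k = max(1, int(round(len(gallery) * fraction)))
    return np.argsort(distances)[:k]

# Example with random stand-in descriptors (real ones would come from a
# re-identification model):
gallery = np.random.rand(1000, 128)     # 1,000 observed people, 128-D descriptors
query = np.random.rand(128)
candidates = shortlist(query, gallery)  # 100 observations for a human to review
```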

The third stage would be the identification of persons of interest. Following an analysis of the Boston video data, the investigators disclosed various cues about the alleged bombers: they abandoned a rucksack and then calmly walked away from the scene of the blast (everybody else ran). This interpretation of action, behaviour and even intent and purpose is an extremely nuanced skill that humans (and especially surveillance experts) are well adapted to execute. It is also the hardest to emulate with artificial processing mechanisms. In this stage of analysis, automatic methods are still in their infancy. Although techniques for abandoned baggage detection exist, currently they are not sufficiently robust to be deployed in arbitrary environments.
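
For a sense of how such detectors work, the sketch below implements the core heuristic with OpenCV background subtraction: foreground pixels that remain unchanged for several seconds are flagged as candidate static objects. It is deliberately naive; in practice the background model gradually absorbs stationary objects, dense crowds trigger the same test, and deployed systems use more elaborate dual-background schemes, which is exactly why robustness in arbitrary environments remains hard.

```python
import cv2
import numpy as np

def candidate_static_objects(video_path, stationary_frames=150, min_area=500):
    """Flag foreground regions that stay unchanged for ~stationary_frames frames,
    the core heuristic behind many abandoned-baggage detectors."""
    capture = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
    persistence = None  # per-pixel count of consecutive foreground frames

    while True:
        ok, frame = capture.read()
        if not ok:
            break
        mask = subtractor.apply(frame)  # 255 where the pixel differs from the background
        if persistence is None:
            persistence = np.zeros(mask.shape, dtype=np.int32)
        # Count consecutive foreground frames; reset wherever the pixel returns to background
        persistence = np.where(mask > 0, persistence + 1, 0)
        static = (persistence >= stationary_frames).astype(np.uint8) * 255

        # Contours of the static mask are candidate abandoned objects for human review
        contours, _ = cv2.findContours(static, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            if cv2.contourArea(contour) > min_area:  # ignore small noise blobs
                x, y, w, h = cv2.boundingRect(contour)
                print(f"static object candidate at ({x},{y}), size {w}x{h}")
    capture.release()
```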

The final stage is to associate the images of any given person of interest with an actual identity, i.e. to ‘recognise’ them as having a particular name and address. This, too, is a specialised skill of humans: we have an evolutionary adaptation to remember and recall the faces of people we have met before. It was therefore a natural step for the investigators to publish these images.

Looking to the future, we see an ever-increasing capacity and propensity to record what is around us. The density of media now reserved for a marathon finishing line may one day be an everyday occurrence. It may be possible to scrutinise everyday incidents to the same degree as this horrific tragedy, but only if the tools are developed to do the bulk of the work automatically.
