Computer vision to speed police investigations similar to Boston bombing

Police investigating the Boston Marathon bombing drew heavily on the vast amount of video data from the scene to build their case. The latest video analytic approaches could have helped speed the investigation even further, says Dr James Orwell, leader of the Visual Surveillance Research Group at the Digital Imaging Research Centre at Kingston University.

The alleged bombers of the Boston Marathon appear to have been swiftly identified and located. The scene of the atrocity was exceptional in several respects, not least the quantity of video recordings being made at the time, and the speed with which these were made available to the investigation. Videos and still pictures taken by loved ones and broadcasting professionals were willingly cast alongside regular CCTV recordings in an unruly mosaic of data. The analysis of this data would also have been on an exceptional scale, with professionals seeking to establish the answers to some ultimately simple questions: ‘who’, ‘what’, ‘where’ and ‘when’. No resource would have been spared. At present such work is undertaken by experts, but machine analysis of video evidence is an increasingly realistic prospect. It is hard to envisage a fully automatic analysis; more plausible is a short-listing process that may dramatically improve efficiency, just as internet search engines allow fast browsing by presenting a shortlist of candidate pages.

No doubt the methodology of the Boston investigation would be blurred by its own urgent agenda and priorities, but broadly there would be four stages of image analysis to undertake. The first stage is to locate each media item in time and space. For installed security cameras this would hopefully be trivial; for the volunteered medley of media, this is a potentially complex task that combines scene recognition with inaccurate device timestamps, owners’ recollections about where the media was recorded, and increasingly, GPS data automatically inserted into the media by smart-phones. This catalogue of the data builds up a picture of what elements of the past have been recorded. In most investigations, these amount only to thin slivers here and there, but most areas around the Boston finishing line would have been recorded many times over, giving investigators the luxury of multiple angles from which to inspect crowded elements and occluding corners.

Techniques for image and video processing can assist with this first analysis stage, especially in the rich structure of an urban scene. Buildings can often be recognised via their distinctive arrangement of visual ‘interest points’, since comparisons can be made that are invariant to the viewing angle. Furthermore, once the structure is known, the viewing angle itself can be estimated. In this way, the relative and then absolute positions of the various media sources can be built up. Similarly, methods have been developed for synchronising sources, for example by maximising the mutual information between them. Another important aspect is the capacity to visualise all this data easily in a suitable environment.
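The synchronisation idea can be illustrated in a few lines. The Python sketch below is a toy, not the method used in any real investigation: it aligns two per-frame brightness signals by searching for the frame offset that maximises their mutual information. The synthetic signals, bin count and offset range are all illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate mutual information (in nats) from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                   # avoid log of zero cells
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def best_offset(a, b, max_offset=30):
    """Return the frame offset of b relative to a that maximises MI."""
    n = min(len(a), len(b))
    best_k, best_mi = 0, -np.inf
    for k in range(-max_offset, max_offset + 1):
        xa, xb = (a[:n - k], b[k:n]) if k >= 0 else (a[-k:n], b[:n + k])
        mi = mutual_information(xa, xb)
        if mi > best_mi:
            best_k, best_mi = k, mi
    return best_k

# Demo: two cameras observe the same scene activity, camera B running
# seven frames behind camera A (all values are synthetic).
rng = np.random.default_rng(0)
activity = rng.normal(size=450)                    # shared scene "brightness"
cam_a = activity[7:407] + rng.normal(scale=0.1, size=400)
cam_b = activity[:400] + rng.normal(scale=0.1, size=400)
```

Running `best_offset(cam_a, cam_b)` recovers the seven-frame lag; real footage would of course use richer per-frame statistics than a single brightness value.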

A second analysis stage aims to enumerate a list of the people present at the event, and to cross-reference the various observations of each person. This allows an understanding of where each person went during the critical period. It also suggests which observation provides the clearest opportunity for ‘recognition’, which comes at a later stage. The cross-referencing of observations can be a very time-consuming task, for which accuracy of synchronisation and placement play a vital role, and each individual needs to be re-identified in the relevant subset of media. In the Boston investigation, two factors made this easier than usual. First, the density of the coverage implies fewer gaps in the views of each individual’s trajectory; investigators spend less time guessing which way a subject turned while off-camera. Secondly, for events such as these, people tend to wear brightly coloured, distinctive clothing, so generally it is more straightforward to tell them apart. In contrast, at rush-hour on a working day, investigators would need to use more sophisticated cues to re-identify each individual.
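As a rough illustration of why distinctive clothing helps, the hypothetical sketch below (not any deployed system) reduces each person ‘crop’ to a quantised colour histogram and compares crops by histogram intersection. The synthetic crops, bin counts and colour values are assumptions for the demo.

```python
import numpy as np

def colour_descriptor(crop, bins=4):
    """Quantise RGB pixels into a joint colour histogram, normalised to sum to 1."""
    q = (crop.reshape(-1, 3).astype(int) * bins // 256).clip(0, bins - 1)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def similarity(d1, d2):
    """Histogram intersection: 1.0 means identical colour distributions."""
    return float(np.minimum(d1, d2).sum())

# Demo: synthetic 20x10 "crops" of two runners, one in red, one in blue.
rng = np.random.default_rng(1)
def synth_crop(base_rgb):
    noise = rng.integers(-20, 20, size=(20, 10, 3))
    return np.clip(np.array(base_rgb) + noise, 0, 255)

red_view_1 = synth_crop([200, 40, 40])   # same runner, two sightings
red_view_2 = synth_crop([200, 40, 40])
blue_view = synth_crop([40, 40, 200])    # a different runner
```

Two sightings of the red runner score far higher than a red/blue comparison; the rush-hour case the text mentions, where everyone wears similar dark clothing, is precisely where such a simple colour cue breaks down.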

There is much active research into methods to automatically re-identify people between cameras, with steady improvement of results on standard test datasets. Currently, on far-view CCTV data, one can apply a 90/10 rule: 90 per cent of the subjects can be located in the top 10 per cent of the search gallery. This means that we cannot currently rely on this technology to obtain the right answer, but instead we must use it to filter the field in order to make the human process more efficient.
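This filtering role can be sketched as a ranking step: score every gallery descriptor against the probe and hand only the top 10 per cent to a human operator. The code below is a minimal illustration with random histogram descriptors; the descriptor format and the similarity measure (histogram intersection) are assumptions for the demo, not a statement of how any benchmark system works.

```python
import numpy as np

def shortlist(probe, gallery, fraction=0.10):
    """Rank gallery descriptors against the probe and return the indices
    of the best `fraction` -- the subset passed on for human review."""
    sims = np.minimum(gallery, probe).sum(axis=1)   # histogram intersection
    order = np.argsort(-sims)                       # most similar first
    k = max(1, int(np.ceil(fraction * len(gallery))))
    return order[:k].tolist()

# Demo: 50 gallery descriptors; the probe is a noisy re-observation of
# gallery entry 17 (all descriptors are random normalised histograms).
rng = np.random.default_rng(2)
gallery = rng.random((50, 64))
gallery /= gallery.sum(axis=1, keepdims=True)
probe = np.clip(gallery[17] + rng.normal(scale=0.002, size=64), 0, None)
probe /= probe.sum()
```

The operator then inspects five candidates instead of fifty; under the 90/10 rule quoted above, the true match would appear in that shortlist for roughly nine probes out of ten.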

The third stage would be the identification of persons of interest. Following an analysis of the Boston video data, the investigators disclosed various cues about the alleged bombers: they abandoned a rucksack and then calmly walked away from the scene of the blast (everybody else ran). This interpretation of action, behaviour and even intent and purpose is an extremely nuanced skill that humans (and especially surveillance experts) are well adapted to execute. It is also the hardest to emulate with artificial processing mechanisms. In this stage of analysis, automatic methods are still in their infancy. Although techniques for abandoned baggage detection exist, currently they are not sufficiently robust to be deployed in arbitrary environments.
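One common family of abandoned-baggage detectors compares a long-term background model with recent frames: a region that differs from the old background yet is currently static is a candidate abandoned object. The sketch below is a deliberately simplified version of that idea, with synthetic greyscale frames and hand-picked thresholds; it illustrates the principle rather than a robust, deployable detector.

```python
import numpy as np

def detect_static_foreground(frames, split=0.5, diff_thresh=30, motion_thresh=5):
    """Flag pixels that differ from the long-term background yet are stable
    across recent frames -- the signature of a newly abandoned static object."""
    cut = int(len(frames) * split)
    long_bg = frames[:cut].mean(axis=0)             # scene before the drop
    recent = frames[cut:]
    changed = np.abs(recent.mean(axis=0) - long_bg) > diff_thresh
    static = recent.std(axis=0) < motion_thresh     # not currently moving
    return changed & static

# Demo: 40 synthetic 16x16 greyscale frames. A bright static "bag" appears
# at frame 20; a bright point also moves along the top row (a passer-by).
rng = np.random.default_rng(3)
frames = 100 + rng.normal(scale=2, size=(40, 16, 16))
frames[20:, 4:8, 4:8] = 200                         # bag dropped, then static
for t in range(20, 36):
    frames[t, 0, t - 20] = 255                      # moving bright point
```

The mask picks out only the bag region: the passer-by changes the recent frames too, but fails the stability test, while the bag is both new and motionless. Real scenes, with lighting changes and dense crowds, are exactly where such simple models lose their robustness, as the paragraph above notes.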

The final stage is to associate the images of any given person of interest with an actual identity, i.e. to ‘recognise’ them as having a particular name and address. This, too, is a specialised skill of humans: we have an evolutionary adaptation to remember and recall the faces of people we have met before. It was therefore a natural step for the investigators to publish these images.

Looking to the future, we see an ever-increasing capacity and propensity to record what is around us. The density of media now seen at a marathon finishing line may one day be an everyday occurrence. It may be possible to scrutinise everyday incidents to the same degree as this horrific tragedy, but only if the tools are developed to do the bulk of the work automatically.

11 February 2022