Right on time: new algorithm to determine video playback direction


At the IEEE Conference on Computer Vision and Pattern Recognition, which ran from 20-27 June 2014, an international group of researchers presented a new algorithm that can determine, with 80 per cent accuracy, whether a video is running forward or backward. By probing the arrow of time in this way, the research could help create more realistic graphics for the entertainment industry and also further the understanding of the visual world.

‘It’s kind of like learning what the structure of the visual world is,’ said William Freeman, a professor of computer science and engineering at MIT and one of the authors of the paper. ‘To study shape perception, you might invert a photograph to make everything that’s black white, and white black, and then check what you can still see and what you can’t. Here we’re doing a similar thing, by reversing time, then seeing what it takes to detect that change. We’re trying to understand the nature of the temporal signal.’

Freeman and his team wrote three separate algorithms that tackle the problem in different ways. All three were trained on a series of short videos that had already been labelled as playing either forwards or backwards.

The algorithm that performed best begins by dividing a frame of video into a grid of hundreds of thousands of squares; then it divides each of those squares into a smaller, four-by-four grid. For each square in the smaller grid, it determines the direction and distance that clusters of pixels move from one frame to the next.
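The per-square step can be pictured as summarising a dense motion field cell by cell. Below is a minimal sketch, assuming a per-pixel optical-flow field is already available; the function name, cell size, and toy input are illustrative, not the authors' code:

```python
import numpy as np

def patch_descriptor(flow, cells=4):
    """Summarise a dense optical-flow patch as a cells x cells grid of
    mean motion vectors: one direction-and-distance estimate per cell."""
    h, w, _ = flow.shape
    ch, cw = h // cells, w // cells
    desc = np.zeros((cells, cells, 2))
    for i in range(cells):
        for j in range(cells):
            cell = flow[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            desc[i, j] = cell.reshape(-1, 2).mean(axis=0)
    return desc

# toy flow field: uniform rightward motion of 3 pixels per frame
flow = np.zeros((16, 16, 2))
flow[..., 0] = 3.0
print(patch_descriptor(flow)[0, 0])  # -> [3. 0.]
```

Each square of the larger grid would yield one such four-by-four descriptor per pair of consecutive frames.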

The algorithm logs approximately 4,000 distinct four-by-four grids, where each square in a grid represents a particular direction and degree of motion. These grids are chosen to offer a good approximation of all the grids found in the training data. Finally, the algorithm combs through the example videos to determine whether particular combinations of grids tend to indicate forward or backward motion.
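This vocabulary of representative grids behaves like a bag-of-words model over motion descriptors. A minimal sketch, assuming flattened descriptors and a pre-built codebook (the real system uses roughly 4,000 codewords; three here, with made-up values, for brevity):

```python
import numpy as np

def assign_words(descriptors, codebook):
    """Map each flattened grid descriptor to its nearest codeword
    (Euclidean distance)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def word_histogram(words, vocab_size):
    """Normalised count of each codeword across one video's patches;
    a classifier can then learn which histograms indicate forward play."""
    return np.bincount(words, minlength=vocab_size) / len(words)

# toy 2-D descriptors and a 3-word codebook (illustrative values)
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.9], [1.1, 0.0]])
words = assign_words(descriptors, codebook)
print(words)  # -> [0 1 2 1]
print(word_histogram(words, 3))
```

The final forward-versus-backward decision would then be learned from which codeword combinations co-occur in forward-playing training videos.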

Following standard practice in the field, the researchers divided their training data into three sets, sequentially training the algorithm on two of the sets and testing its performance against the third. The algorithm’s success rates were 74, 77, and 90 per cent.
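That protocol is ordinary three-fold cross-validation. A generic sketch, where `fit` and `predict` stand in for whatever classifier is used (the toy majority-label model below is purely illustrative):

```python
def three_fold_accuracies(examples, labels, fit, predict):
    """Split the data into three folds; for each fold, train on the
    other two and report accuracy on the held-out fold."""
    n = len(examples)
    folds = [list(range(i, n, 3)) for i in range(3)]
    scores = []
    for held_out in range(3):
        train_idx = [i for f in range(3) if f != held_out
                     for i in folds[f]]
        model = fit([examples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        test_idx = folds[held_out]
        correct = sum(predict(model, examples[i]) == labels[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# toy model: always predict the majority training label
fit = lambda xs, ys: max(set(ys), key=ys.count)
predict = lambda model, x: model
print(three_fold_accuracies(list(range(9)), [1] * 6 + [0] * 3,
                            fit, predict))
```

The three reported success rates (74, 77, and 90 per cent) correspond to the three held-out folds.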

One vital aspect of the algorithm is that it can identify the specific regions of a frame that it is using to make its judgments. The types of visual cues that the algorithm is using could indicate the types of cues that the human visual system uses as well.

The next-best-performing algorithm was about 70 per cent accurate. It was based on the assumption that, in forward-moving video, motion tends to propagate outward rather than contracting inward. In a video of a break in pool, for instance, the cue ball is, initially, the only moving object. After it strikes the racked balls, motion begins to appear in a wider and wider radius from the point of contact.
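One way to operationalise that cue, assuming per-frame binary masks of moving pixels are available (an assumption for illustration, not the paper's stated pipeline), is to track how far motion lies from where it first appeared:

```python
import numpy as np

def motion_spread(masks):
    """Mean distance of moving pixels from the centroid of the motion in
    the first frame; in forward-playing video this tends to increase."""
    ys, xs = np.nonzero(masks[0])
    cy, cx = ys.mean(), xs.mean()
    spread = []
    for m in masks:
        ys, xs = np.nonzero(m)
        spread.append(float(np.hypot(ys - cy, xs - cx).mean()))
    return spread

# toy sequence: motion expanding outward from the centre of a 9x9 frame
masks = []
for r in (0, 1, 2):
    m = np.zeros((9, 9), bool)
    m[4 - r:4 + r + 1, 4 - r:4 + r + 1] = True
    masks.append(m)
print(motion_spread(masks))        # increasing in the forward order
print(motion_spread(masks[::-1]))  # decreasing when time is reversed
```

A video whose spread shrinks over time would, under this heuristic, be judged to be playing backwards.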

The third algorithm was the least accurate, but it may be the most philosophically interesting. It attempts to offer a statistical definition of the direction of causation.

‘There’s a research area on causality,’ Freeman says. ‘And that’s actually really quite important, medically even, because in epidemiology, you can’t afford to run the experiment twice, to have people experience this problem and see if they get it and have people do that and see if they don’t. But you see things that happen together and you want to figure out: ‘Did one cause the other?’ There’s this whole area of study within statistics on, ‘How can you figure out when something did cause something else?’ And that relates in an indirect way to this study as well.’

If a ball is recorded travelling down a slope and striking a bump, it is launched into the air on contact. When the video is played in reverse, however, the ball takes flight with no apparent cause, since it leaves the ground before reaching the bump. The researchers were able to model this intuitive distinction, whether a cause is present, as a statistical relationship between a mathematical model of an object’s motion and the ‘noise,’ or error, in the visual signal.
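The flavour of that statistical test can be illustrated with a simple linear process driven by non-Gaussian noise: fitted in the true time direction, the model's residuals look like independent noise; fitted in the reversed direction, they carry a tell-tale dependence on the signal. This is a generic additive-noise illustration under assumed dynamics, not the authors' actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, a = 50000, 0.8
# skewed, non-Gaussian noise driving a linear motion model
noise = rng.exponential(1.0, n_steps) - 1.0
x = np.zeros(n_steps)
for t in range(1, n_steps):
    x[t] = a * x[t - 1] + noise[t]

def residual_dependence(cause, effect):
    """Fit effect = b * cause + residual by least squares, then score a
    higher-order dependence between residual and cause; this stays near
    zero only when the residual really is independent noise."""
    b = np.dot(cause, effect) / np.dot(cause, cause)
    resid = effect - b * cause
    return np.mean(resid * cause ** 2)

forward = residual_dependence(x[:-1], x[1:])   # true time direction
backward = residual_dependence(x[1:], x[:-1])  # reversed direction
print(abs(forward) < abs(backward))  # True: reversal leaves a signature
```

In the reversed direction the fitted model's 'noise' is no longer independent of the motion it is supposed to explain, which is the statistical counterpart of the ball taking flight with no visible cause.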

However, the approach works only if the object’s motion can be described by a linear equation, and that’s rarely the case with motions involving human agency. The algorithm can determine, however, whether the video it’s being applied to meets that criterion. And in those cases, its performance is much better.

 
