New activity-recognition algorithm to be unveiled
Hamed Pirsiavash, a postdoc at MIT, and his former thesis advisor, Deva Ramanan of the University of California at Irvine, will present a new activity-recognition algorithm at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 24-27 June.
One advantage of the algorithm is that its execution time scales linearly with the size of the video file it's searching. Another is that it can make predictions partway through an incomplete video, so it can handle streaming video, issuing at any point a probability that the action under way is the one it's looking for. Finally, the amount of memory the algorithm requires is fixed, regardless of how many frames of video it has already reviewed. That means that, unlike many of its predecessors, it can handle video streams of any length.
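Those two properties, linear time and fixed memory, are what a recursive per-frame update buys you. The following sketch (an assumed illustration, not the authors' implementation) keeps only a fixed-size probability vector over hypothesised subactions and updates it once per incoming frame, so memory use never grows with stream length:

```python
# Minimal sketch (not the paper's actual model): a streaming
# recogniser whose entire state is one fixed-size probability vector
# over subactions.  Each frame triggers one Viterbi-style update, so
# total work is linear in stream length and memory use is constant.

def update(state, transition, frame_score):
    """One best-path step: carry each subaction's probability through
    the allowed transitions, then weight by how well the current
    frame matches each subaction, and renormalise."""
    n = len(state)
    new_state = [0.0] * n
    for j in range(n):
        new_state[j] = max(state[i] * transition[i][j] for i in range(n))
        new_state[j] *= frame_score[j]
    total = sum(new_state) or 1.0
    return [p / total for p in new_state]

# Toy example: three subactions with a left-to-right ordering.
transition = [
    [0.7, 0.3, 0.0],   # subaction 0 may persist or advance to 1
    [0.0, 0.7, 0.3],   # subaction 1 may persist or advance to 2
    [0.0, 0.0, 1.0],   # subaction 2 is terminal
]
state = [1.0, 0.0, 0.0]  # the stream starts in subaction 0
for frame_score in [[0.9, 0.1, 0.1], [0.2, 0.8, 0.1], [0.1, 0.2, 0.9]]:
    state = update(state, transition, frame_score)

# state[-1] now approximates the probability that the final
# subaction has been reached; it can be read out mid-stream.
print(state)
```

Because `state` is the only thing carried between frames, the recogniser never needs to revisit old video, which is the essence of the fixed-memory claim.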
The algorithm draws on techniques used in natural language processing. ‘One of the challenging problems they try to solve is, if you have a sentence, you want to basically parse [scan and analyse] the sentence, saying what is the subject, what is the verb, what is the adverb,’ said Pirsiavash. ‘We see an analogy here, which is, if you have a complex action — like making tea or making coffee — that has some subactions, we can basically stitch together these subactions and look at each one as something like verb, adjective, and adverb.’
These subactions follow grammar-like rules: sometimes their order is interchangeable, while at other times they must occur in a fixed sequence, much as different word types must be organised within a sentence. To learn these rules, Pirsiavash and Ramanan feed their algorithm training examples of videos depicting a particular action, and specify the number of subactions that the algorithm should look for. But they don’t give it any information about what those subactions are, or what the transitions between them look like.
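That training setup — sequences of frames plus only a count K of subactions — means the subactions themselves are latent and must be discovered. As a hedged toy illustration (the paper's actual learning procedure is more sophisticated), one simple way to uncover K unlabelled subactions is to split each training sequence into K contiguous segments that minimise within-segment variance:

```python
# Hypothetical sketch of discovering K latent subactions: split a
# 1-D per-frame feature sequence into K contiguous segments by
# dynamic programming, minimising within-segment squared deviation.
# No segment labels are needed, only the count K.

def segment(features, k):
    """Return the end indices of k segments covering `features`."""
    n = len(features)

    def cost(i, j):  # cost of one segment covering features[i:j]
        seg = features[i:j]
        mean = sum(seg) / len(seg)
        return sum((x - mean) ** 2 for x in seg)

    # dp[m][j]: best cost of covering the first j frames with m segments
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = dp[m - 1][i] + cost(i, j)
                if c < dp[m][j]:
                    dp[m][j], cut[m][j] = c, i
    # Walk the cut table backwards to recover segment boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        bounds.append(j)
        j = cut[m][j]
    return sorted(bounds)

# Toy "video": a feature that jumps between three regimes,
# suggesting three unlabelled subactions.
frames = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 9.1, 9.0]
print(segment(frames, 3))  # → [3, 6, 8]
```

The three recovered boundaries fall exactly where the feature regime changes, even though the learner was never told what any subaction looks like.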
The rules relating subactions are the key to the algorithm’s efficiency. As a video plays, the algorithm constructs a set of hypotheses about which subactions are being depicted where, and it ranks them according to probability. It can’t limit itself to a single hypothesis, as each new frame could require it to revise its probabilities. But it can eliminate hypotheses that don’t conform to its grammatical rules, which dramatically limits the number of possibilities it has to canvass.
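The pruning step described above can be sketched as a small beam search (an assumed illustration, not the paper's code): every hypothesis is extended by each candidate subaction for the new frame, but extensions that the grammar forbids are discarded before they multiply.

```python
# Illustrative sketch of grammar-based hypothesis pruning.  The
# grammar here is an invented toy for "making tea"; each entry lists
# which subactions may legally follow a given subaction.

follows = {
    "boil": {"boil", "steep"},
    "steep": {"steep", "pour"},
    "pour": {"pour"},
}

def extend(hypotheses, frame_scores, beam=3):
    """Extend every (subaction-sequence, score) hypothesis by one
    frame, keeping only grammatical continuations and then only the
    `beam` highest-scoring survivors."""
    candidates = []
    for actions, score in hypotheses:
        for sub, p in frame_scores.items():
            if sub in follows[actions[-1]]:          # grammar check
                candidates.append((actions + [sub], score * p))
    candidates.sort(key=lambda h: h[1], reverse=True)
    return candidates[:beam]

# Per-frame scores: how well each subaction matches each new frame.
hyps = [(["boil"], 1.0)]
for frame_scores in [{"boil": 0.6, "steep": 0.3, "pour": 0.1},
                     {"boil": 0.2, "steep": 0.7, "pour": 0.1},
                     {"boil": 0.1, "steep": 0.2, "pour": 0.7}]:
    hyps = extend(hyps, frame_scores)

best, score = hyps[0]
print(best)  # → ['boil', 'boil', 'steep', 'pour']
```

Without the grammar check, the number of candidate sequences would triple every frame; with it, ungrammatical orderings such as ‘pour’ before ‘steep’ are never even generated, which is why the search stays tractable.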
The researchers tested their algorithm on eight different types of athletic activity with training videos taken from the internet. They found that, according to metrics standard in the field of computer vision, their algorithm identified new instances of the same activities more accurately than its predecessors.
Pirsiavash is particularly interested in possible medical applications of action detection. The proper execution of physical-therapy exercises, for instance, could have a grammar that’s distinct from improper execution; similarly, the return of motor function in patients with neurological damage could be identified by its unique grammar. Action-detection algorithms could also help determine whether, for instance, elderly patients remembered to take their medication — and issue alerts if they didn’t.