Towards Sequence-based Gesture Recognition
I reviewed some literature on gesture recognition, concentrating on recent publications to find out which methods are currently most common. I also tried to learn more about the terminology used in this field.
Isolated vs. Continuous Gesture Recognition
If you are given a data sequence that contains exactly one gesture, the recognition task (isolated gesture recognition) becomes somewhat easier. The other scenario is continuous gesture recognition: you are given a data sequence that might contain any number of gestures, possibly none. Before performing the actual recognition, you first have to segment the gestures. For example, Chai et al. [1] propose a framework that handles both the segmentation and the recognition for continuous data. There are also approaches that aim at doing segmentation and recognition all at once, such as Hoai et al. [2] and Pitsikalis et al. [3].
For the intended application, I reckon continuous gesture recognition is the more realistic problem description.
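To make the segmentation step concrete, here is a minimal sketch of the simplest possible approach for the continuous case: thresholding a per-frame "motion energy" signal and treating contiguous runs above the threshold as candidate gesture segments. The signal, threshold, and minimum length below are made up for illustration; none of the cited papers use exactly this scheme.

```python
# Hypothetical threshold-based gesture segmentation sketch:
# split a continuous stream into candidate segments wherever a
# motion-energy signal stays above a threshold.

def segment(signal, threshold=0.5, min_len=2):
    """Return (start, end) index pairs of runs where signal > threshold."""
    segments, start = [], None
    for i, v in enumerate(signal):
        if v > threshold and start is None:
            start = i                      # segment begins
        elif v <= threshold and start is not None:
            if i - start >= min_len:       # drop very short blips
                segments.append((start, i))
            start = None                   # segment ends
    if start is not None and len(signal) - start >= min_len:
        segments.append((start, len(signal)))
    return segments

stream = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.9, 0.2]
print(segment(stream))  # [(2, 5), (7, 9)]
```

Each candidate segment could then be passed to an isolated-gesture classifier, which is roughly the two-stage pipeline the segmentation-then-recognition papers describe.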
Using Dynamic Time Warping
Chai et al. [1] also mention dynamic time warping (DTW) as a method used in the context of gesture recognition:
"In the early time, gesture recognition borrowed some models from speech recognition, such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM) and Conditional Random Fields (CRF). Besides these traditional models, there are some other methods for gesture recognition. [...]"
This makes DTW sound like a somewhat outdated approach. However, the paper Chai et al. reference in this context is from 2013 [4], so it might not be quite the stone-age technique this quote makes it sound like.
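For reference, the core of DTW is a small dynamic program that finds the cheapest monotonic alignment between two sequences. The sketch below computes the plain DTW distance between two 1-D toy sequences; real gesture systems compare multi-dimensional feature vectors per frame (e.g. joint positions), and variants like the weighted DTW of Celebi et al. [4] modify the per-frame cost.

```python
# Minimal DTW distance between two 1-D sequences (toy data, for
# illustration only; gesture features are usually vectors per frame).

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of a[:i] and b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a advances
                                 D[i][j - 1],      # b advances
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]

# The same shape performed at different speeds aligns with zero cost:
print(dtw([0, 1, 2, 3], [0, 1, 1, 2, 2, 3]))  # 0.0
```

This speed-invariance is exactly why DTW was attractive for gestures: the same gesture performed slowly or quickly still matches its template.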
Recurrent Neural Networks
Among the more recent papers on gesture recognition I found, several are based on some variation of recurrent neural networks (RNNs). Since "vanilla" RNNs often suffer from exploding/vanishing gradients, long short-term memory (LSTM) seems to have become the de facto standard. I haven't delved very deeply into RNNs yet, but at first glance they seem like a good approach to the general problem. The recurrent 3D convolutional neural network (R3DCNN) proposed by Molchanov et al. even "supported online recognition with zero or negative lag." [1] Negative lag seems like something you could impress people with :D
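To see why the LSTM helps with vanishing gradients, it is useful to write out a single cell step. The sketch below is a plain NumPy implementation of the standard LSTM update; the key point is that the cell state is updated additively (`c = f*c + i*g`), giving gradients a path through time that is not repeatedly squashed. The weight shapes and the toy sequence are illustrative, not tied to any particular library or paper.

```python
# One LSTM step in NumPy, unrolled over a toy input sequence.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One step. W has shape (4*H, X+H), b has shape (4*H,)."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input/forget/output gates
    g = np.tanh(g)                                 # candidate values
    c_new = f * c + i * g          # additive cell update (gradient "highway")
    h_new = o * np.tanh(c_new)     # gated hidden output
    return h_new, c_new

rng = np.random.default_rng(0)
X, H = 3, 4                        # input and hidden sizes (arbitrary)
W = rng.standard_normal((4 * H, X + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                 # run over a random 5-step sequence
    h, c = lstm_step(rng.standard_normal(X), h, c, W, b)
print(h.shape)  # (4,)
```

In a gesture recognizer, `x` would be per-frame features (e.g. skeleton joints or CNN activations) and the final `h` would feed a classifier over gesture labels.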
Skipping the Pose Estimation
I have seen a lot of recent papers that skip the pose estimation entirely by using convolutional neural networks. To learn both the spatial and the temporal information from a sequence of images, they often use 3D-CNNs, e.g. Camgoz et al. [5] or Molchanov et al. [6].
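The idea behind a 3D-CNN is that the convolution kernel spans time as well as space, so a single filter can respond to motion patterns directly from pixels. Below is a naive valid-mode 3D convolution over a (frames, height, width) volume; the "video" and the hand-built temporal-difference kernel are toy examples, whereas real networks learn many such filters and stack them with nonlinearities.

```python
# Naive 3D convolution over a (T, H, W) video volume (illustration
# only; real 3D-CNNs use learned filters, channels, and GPU kernels).
import numpy as np

def conv3d(volume, kernel):
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # each output value pools a spatio-temporal patch
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

video = np.ones((8, 5, 5))            # 8 identical 5x5 "frames"
temporal_diff = np.zeros((2, 3, 3))   # filter that detects change over time
temporal_diff[0] = -1.0 / 9
temporal_diff[1] = 1.0 / 9
response = conv3d(video, temporal_diff)
print(response.shape)  # (7, 3, 3)
```

Because the toy video is static, this motion-sensitive filter responds with zeros everywhere, which is exactly the spatio-temporal behavior a 2D convolution applied frame-by-frame cannot express.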
References:
1. Chai, Xiujuan, et al. "Two streams recurrent neural networks for large-scale continuous gesture recognition." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.
2. Hoai, Minh, Zhen-Zhong Lan, and Fernando De la Torre. "Joint segmentation and classification of human actions in video." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
3. Pitsikalis, Vassilis, et al. "Multimodal gesture recognition via multiple hypotheses rescoring." The Journal of Machine Learning Research 16.1 (2015): 255-284.
4. Celebi, Sait, et al. "Gesture recognition using skeleton data with weighted dynamic time warping." VISAPP (1). 2013.
5. Camgoz, Necati Cihan, et al. "Using convolutional 3d neural networks for user-independent continuous gesture recognition." Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016.
6. Molchanov, Pavlo, et al. "Hand gesture recognition with 3D convolutional neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2015.