ABSTRACT
Enabling computers to recognize human actions in video has the potential to revolutionize many areas that benefit society, such as clinical diagnosis, human-computer interaction, and social robotics. Human action recognition, however, is tremendously challenging for computers due to the subtlety of human actions and the complexity of video data. Critical to the success of any human action recognition algorithm is its ability to attend to the relevant information at both training and prediction time.
In the first part of this talk, I will describe a novel approach for training human action classifiers, one that explicitly factorizes human actions from their co-occurring context. Our approach utilizes conjugate samples: video clips that are contextually similar to human action samples but do not contain the actions themselves. Conjugate samples enable the classifier to attend to the relevant information and improve its performance in recognizing human actions under varying contexts.
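To make the training idea concrete, here is a minimal PyTorch sketch, not the method presented in the talk: it assumes precomputed clip features, and the classifier, the margin value, and the hypothetical conjugate_loss are all illustrative. The intent is only to show how a term over conjugate clips can discourage the classifier from scoring well on context alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionClassifier(nn.Module):
    """Toy classifier over precomputed clip features (illustrative only)."""
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        return self.head(feats)  # (batch, num_classes) class scores

def conjugate_loss(model, action_feats, labels, conjugate_feats, margin=1.0):
    # Standard cross-entropy on clips that actually contain the action.
    ce = F.cross_entropy(model(action_feats), labels)
    # Margin term: class scores on conjugate clips (same context, no
    # action) should stay below -margin, penalizing context shortcuts.
    suppress = torch.clamp(model(conjugate_feats) + margin, min=0).mean()
    return ce + suppress
```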
In the second part of this talk, I will describe a method for early recognition of human actions, one that takes advantage of multiple cameras. To account for limited communication bandwidth and processing power, we learn a camera selection policy so that the system can attend to the most relevant information at each time step. This problem is formulated as a sequential decision process, and the attention policy is learned with reinforcement learning. Experiments on several datasets demonstrate the effectiveness of this approach for early recognition of human actions.
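As an illustration of how such a camera selection policy could be trained as a sequential decision process, the sketch below uses a REINFORCE-style update. Everything here is an assumption rather than the talk's implementation: the state encoding, the reward design (which should favor early, correct recognition), and the names CameraSelectionPolicy and reinforce_loss are hypothetical.

```python
import torch
import torch.nn as nn

class CameraSelectionPolicy(nn.Module):
    """Maps the recognizer's current state to a distribution over cameras."""
    def __init__(self, state_dim=256, num_cameras=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_cameras),
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    """REINFORCE: weight log-probs of the cameras actually chosen during
    a rollout by discounted returns-to-go; rewards are assumed larger for
    earlier correct predictions, so attending well ends episodes sooner."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    log_probs = policy(states).log_prob(actions)  # states: (T, state_dim)
    return -(log_probs * returns).mean()
```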