(:youtube mPKDVDaegP0 :) Inference of human intention from movements may be an essential step towards understanding human actions and is hence important for realizing efficient human-robot interaction. In this paper, we propose the Intention-Driven Dynamics Model (IDDM), a nonparametric Bayesian model for inferring unknown human intentions. We train the model based on observed human movements and we introduce approximate inference algorithms to efficiently infer the human's intention from an ongoing movement. We verify the feasibility of the IDDM in two scenarios, i.e., target inference in robot table tennis and action recognition for interactive humanoid robots. In both tasks, the IDDM achieves substantial improvements over state-of-the-art regression and classification algorithms.
Inferring the meaning of human movements can be an essential step towards efficient interaction between humans and intelligent systems. Recent advances in sensors and algorithms have greatly improved the perception of human body motion; for example, body poses can be tracked in real time using depth cameras. Much less attention, however, has been paid to understanding a movement's meaning. An important problem is to infer the underlying factor that directs the movement, such as a goal, target, desire, or plan, which we loosely refer to as an intention. Intention inference allows intelligent systems to react in a proactive manner. For instance, a robot's ability to play table tennis can be enhanced by anticipating the opponent's target from the stroke movement.
Intention inference can be achieved via inverse modeling of intention-driven movements. We consider the generative process of an intention-driven movement, in which the transition of internal states is directed by the intention. However, the internal states of the human are often not identifiable from a time series of observations. Instead, prior work has successfully learned latent state variables that are most likely to generate the observations and modeled the transition in the latent state space. The intention, although an important driver of the dynamics, is usually not observed and hence not represented by these latent variables. To enable intention inference, we advocate explicit modeling of the intention in the transition, resulting in the Intention-Driven Dynamics Model (IDDM).
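The idea of inverse modeling can be illustrated with a deliberately simplified sketch. It is not the IDDM itself: the state is one-dimensional and directly observed, the transition is a hand-written linear pull towards the intention, and the candidate intentions, the gain of 0.2, and the noise level are all assumptions chosen for illustration. The sketch only shows how a posterior over intentions follows from a transition model that depends on the intention.

```python
import numpy as np

GAIN = 0.2    # assumed strength of the pull towards the intention
NOISE = 0.05  # assumed standard deviation of the transition noise


def simulate(intention, steps=20, seed=0):
    """Roll out a toy 1-D trajectory whose drift depends on the intention."""
    rng = np.random.default_rng(seed)
    x = np.zeros(steps + 1)
    for t in range(steps):
        # Toy transition: the state is pulled towards the intended target value.
        x[t + 1] = x[t] + GAIN * (intention - x[t]) + NOISE * rng.standard_normal()
    return x


def log_likelihood(x, intention):
    """Gaussian log-likelihood of the transitions in x under one candidate intention."""
    predicted = x[:-1] + GAIN * (intention - x[:-1])
    residual = x[1:] - predicted
    n = len(residual)
    return -0.5 * np.sum(residual**2) / NOISE**2 - n * np.log(NOISE * np.sqrt(2.0 * np.pi))


def intention_posterior(x, candidates):
    """Posterior over candidate intentions, assuming a uniform prior."""
    logp = np.array([log_likelihood(x, g) for g in candidates])
    logp -= logp.max()  # subtract the maximum for numerical stability
    p = np.exp(logp)
    return p / p.sum()


candidates = [-1.0, 1.0]
trajectory = simulate(intention=1.0)
# Infer the intention from the first five transitions of the ongoing movement.
posterior = intention_posterior(trajectory[:6], candidates)
print(dict(zip(candidates, posterior.round(3))))
```

In the IDDM the states are latent and the transition is learned rather than hand-written, which is what makes the corresponding inference problem hard.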
Parametric modeling of the transition is difficult due to the complexity of human motion and the lack of an interpretation of the latent states. Bayesian nonparametric models, e.g., Gaussian processes (GPs), have become popular in modeling human motion. For example, the Gaussian process dynamical model (GPDM) uses GPs to model both the transition function and the measurement mapping. However, the use of GPs renders exact intention inference intractable in the proposed IDDM, as the latent variables need to be integrated out. In this paper, we propose approximate approaches to intention inference and discuss further approximations for efficient inference.
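To make the role of a GP as a transition function concrete, the following is a minimal sketch of standard GP regression in NumPy, where the input is a (state, intention) pair and the output is the next state. The toy data, the squared-exponential kernel, and its fixed hyperparameters are assumptions for illustration; the sketch does not reproduce the IDDM's training or its approximate inference, and it treats the states as observed rather than latent.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_predict(X_train, y_train, X_test, noise=0.05):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    L = np.linalg.cholesky(K)
    # Solve K alpha = y with two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(X_test, X_test).diagonal() - np.sum(v**2, axis=0)
    return mean, var


# Toy training data (assumed): next state = state pulled towards the intention.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=60)
intentions = rng.choice([-1.0, 1.0], size=60)
next_states = states + 0.2 * (intentions - states) + 0.05 * rng.standard_normal(60)

X_train = np.column_stack([states, intentions])
# Predict the next state from state 0 under each of the two intentions.
X_test = np.array([[0.0, 1.0], [0.0, -1.0]])
mean, var = gp_predict(X_train, next_states, X_test)
print(mean, var)
```

Because the intention is part of the GP input, the same starting state yields different predicted transitions for different intentions, which is the property that intention inference exploits.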
We consider human-robot table tennis games. The robot's hardware constraints impose strong limitations on its flexibility. It requires sufficient time to execute a ball-hitting plan: movement initiation to an appropriate preparation pose is needed before the opponent returns the ball, to achieve the required velocity for returning the ball. The robot player uses different preparation poses for forehand and backhand hitting plans. Hence, it is necessary to choose between them based on inference of the opponent's target location.
Playing table tennis is a challenging task for robots, as it requires accurate prediction of the ball's movement and a very fast response. Hence, robot table tennis has been used by many groups as a benchmark task in robotics. Thus far, none of the groups that have worked on robot table tennis has reached the level of a young child, despite having robots that could see and move faster and more accurately than humans. Likely explanations for this performance gap are (i) the human ability to predict hitting points from opponent movements and (ii) the robustness of human hitting movements. We used a Barrett WAM robot arm to play table tennis against human players. The robot's hardware constraints impose strong limitations on its flexibility. The robot requires sufficient time to execute a ball-hitting plan: to achieve the required velocity for returning the ball, movement initiation to an appropriate preparation pose is needed before the opponent hits the ball. The robot player uses different preparation poses for forehand and backhand hitting plans. Hence, it is necessary to choose between them based on modeling the opponent's preference and inferring the opponent's target location for the ball.
Our results demonstrated that the IDDM can improve target prediction in robot table tennis and choose the correct hitting plan. We have verified the model in a simulated environment, using data from real movements recorded from a human playing against another human. The simulation showed that the robot could successfully return the ball when given a prediction by the IDDM.
In this setting, we use our technique to improve the interaction capabilities of a NAO humanoid robot. In order to realize natural and compelling interactions, the robot needs to correctly recognize the actions of its human partner. This ability, in turn, allows it to act in a proactive manner. We show that the IDDM has the potential to identify the intention of action from movements in a simplified scenario.
To realize safe and meaningful human-robot interaction (HRI), it is important that robots can recognize the human's action. The advent of robust, marker-less motion capture techniques has provided us with the technology to record the full skeletal configuration of the human during HRI. Yet, recognition of the human's action from this high-dimensional data set poses serious challenges. In this paper, we show that the IDDM has the potential to recognize the intention of action from movements in a simplified scenario. Using a Kinect camera, we recorded the 32-dimensional skeletal configuration of a human during the execution of a set of actions, namely crouching (C), jumping (J), kick-high (KH), kick-low (KL), defense (D), punch-high (PH), punch-low (PL), and turn-kick (TK). For each type of action we collected a training set consisting of ten repetitions and a test set of three repetitions. The system downsampled the output of the Kinect and processed three skeletal configurations per second.
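The downsampling step can be sketched as follows. The source frame rate of 30 Hz is an assumption (the text above only states the resulting rate of three configurations per second), and the function name and the placeholder data are hypothetical.

```python
import numpy as np


def downsample(frames, source_hz=30, target_hz=3):
    """Keep every (source_hz // target_hz)-th row of a (T, D) array of frames."""
    if source_hz % target_hz != 0:
        raise ValueError("source_hz must be an integer multiple of target_hz")
    step = source_hz // target_hz
    return frames[::step]


# Placeholder recording: 3 seconds of 32-dimensional skeletal configurations
# at the assumed source rate of 30 Hz.
frames = np.zeros((90, 32))
reduced = downsample(frames)
print(reduced.shape)  # (9, 32)
```

Simple frame dropping is only one possible choice; averaging over each window would reduce sensor noise at the cost of smoothing fast movements.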