Abstract
In the standard reinforcement learning setting, the agent is assumed to learn solely from the state transitions and rewards provided by the environment. We consider an extended setting in which a trainer additionally provides behavioral feedback to the agent, indicating whether the executed action was desirable or not. The agent thus has access to extra information about how to act optimally, but must also cope with noise in the feedback signal, since the trainer's feedback is not necessarily accurate. In this talk, I present a Bayesian approach to reinforcement learning with behavioral feedback. Specifically, we extend Kalman temporal difference learning to compute the posterior distribution over Q-values given the state transitions and rewards from the environment as well as the feedback signals from the trainer. Through experiments on standard reinforcement learning tasks, I will show that the algorithm can significantly improve learning performance.
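To give a feel for the kind of update involved, below is a minimal Python sketch of Kalman temporal difference learning with linear Q-value features, extended with a trainer-feedback step. The class name KTDWithFeedback, all parameter values, and especially the feedback observation model (treating +1/-1 feedback as a noisy scalar observation of the executed action's advantage) are illustrative assumptions for this sketch, not the specific model presented in the talk.

```python
import numpy as np

class KTDWithFeedback:
    """Minimal Kalman TD sketch with linear Q-value features.

    The feedback update is an assumed, simplified observation model:
    positive feedback (+1) is treated as a noisy observation that the
    executed action's advantage over the greedy action is zero, and
    negative feedback (-1) that it falls short by a fixed margin.
    """

    def __init__(self, n_features, gamma=0.99, obs_noise=1.0,
                 process_noise=1e-3, feedback_noise=4.0, margin=1.0):
        self.theta = np.zeros(n_features)    # posterior mean over Q weights
        self.P = np.eye(n_features)          # posterior covariance
        self.gamma = gamma
        self.R = obs_noise                   # reward-observation noise
        self.Q_proc = process_noise * np.eye(n_features)
        self.R_fb = feedback_noise           # feedback-observation noise
        self.margin = margin

    def _kalman_update(self, h, target, noise):
        """Scalar-observation Kalman step: target ~ h @ theta + N(0, noise)."""
        self.P = self.P + self.Q_proc        # prediction step (process noise)
        s = h @ self.P @ h + noise           # innovation variance
        k = self.P @ h / s                   # Kalman gain
        self.theta = self.theta + k * (target - h @ self.theta)
        self.P = self.P - np.outer(k, h @ self.P)

    def td_update(self, phi_sa, reward, phi_next_sa, done=False):
        """Environment step: the reward observes the temporal difference."""
        h = phi_sa - (0.0 if done else self.gamma) * phi_next_sa
        self._kalman_update(h, reward, self.R)

    def feedback_update(self, phi_sa, phi_greedy_sa, feedback):
        """Trainer step: fold in feedback in {+1, -1} (assumed model)."""
        h = phi_sa - phi_greedy_sa           # advantage feature difference
        target = 0.0 if feedback > 0 else -self.margin
        self._kalman_update(h, target, self.R_fb)
```

In this sketch, environment transitions and trainer feedback are both handled as scalar Kalman observations on the same Gaussian posterior over the Q-value weights, so noisy feedback simply receives a larger observation-noise term rather than being trusted outright.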