Model-based reinforcement learning (MBRL) is an effective paradigm for sample-efficient policy learning. The predominant MBRL strategy iteratively learns the
dynamics model by performing maximum likelihood estimation (MLE) on the entire replay
buffer and trains the policy using fictitious transitions from the learned model.
Given that not all transitions in the replay buffer are equally informative about
the task or the policy’s current progress, this MLE strategy cannot be optimal and
bears no clear relation to the standard RL objective. In this work, we propose
Transition Occupancy Matching (TOM), a policy-aware model learning algorithm
that maximizes a lower bound on the standard RL objective. TOM learns a policy-aware dynamics model by minimizing an f-divergence between the distribution of
transitions that the current policy visits in the real environment and in the learned
model; then, the policy can be updated using any pre-existing RL algorithm with
log-transformed reward. TOM’s practical implementation builds on tools from dual
reinforcement learning and learns the optimal transition occupancy ratio between
the current policy and the replay buffer; leveraging this ratio as importance weights,
TOM amounts to performing MLE model learning on the correct, policy-aware
transition distribution. Crucially, TOM is a model learning sub-routine and is
compatible with any backbone MBRL algorithm that implements MLE-based
model learning. On the standard set of MuJoCo locomotion tasks, TOM is more
sample-efficient and achieves higher asymptotic performance.
POLICY-AWARE MODEL LEARNING VIA TRANSITION OCCUPANCY MATCHING
January 24th, 2023
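The abstract's key mechanism is that, given an estimated transition occupancy ratio between the current policy and the replay buffer, model learning reduces to importance-weighted MLE over buffer transitions. The sketch below illustrates this reduction on a toy linear-Gaussian dynamics model; the buffer data, the placeholder weights `w` (standing in for the learned occupancy ratio, which TOM obtains via dual RL), and all variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy replay buffer of transitions (s, a, s') from a 1-D environment
# with true dynamics s' = 0.9*s + 0.5*a + noise.
S = rng.normal(size=(256, 1))
A = rng.normal(size=(256, 1))
S_next = 0.9 * S + 0.5 * A + 0.1 * rng.normal(size=(256, 1))

# Placeholder importance weights approximating the occupancy ratio
# d^pi(s, a, s') / d^buffer(s, a, s'); in TOM these would be learned.
w = rng.uniform(0.1, 2.0, size=256)
w /= w.mean()  # normalize so the weighted loss stays on the same scale

def weighted_nll(theta, sigma=0.1):
    """Importance-weighted Gaussian negative log-likelihood of the
    dynamics model s' ~ N(theta[0]*s + theta[1]*a, sigma^2)."""
    pred = theta[0] * S + theta[1] * A
    resid = (S_next - pred).ravel()
    nll = 0.5 * (resid / sigma) ** 2 + np.log(sigma)
    return float(np.mean(w * nll))

# For a linear-Gaussian model, minimizing the weighted NLL is weighted
# least squares, so the weighted-MLE dynamics have a closed form.
X = np.hstack([S, A])
W = np.diag(w)
theta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ S_next).ravel()
```

In a full MBRL loop the closed-form solve would be replaced by gradient steps on `weighted_nll` for a neural dynamics model, but the weighting plays the same role: transitions the current policy is likely to visit dominate the model-learning objective.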