Effects of Conservatism on Offline Learning

29 Sep 2021 · Karush Suri, Florian Shkurti

Conservatism, the practice of deliberately underestimating an agent's expected value estimates, has demonstrated profound success in model-free, model-based, multi-task, safe, and other realms of offline Reinforcement Learning (RL). Recent work, on the other hand, has noted that conservatism often hinders the learning of useful behaviors. To that end, this paper asks: how does conservatism affect offline learning? The proposed answer studies conservatism through three lenses: value function optimization, approximate objectives that upper bound underestimation, and behavior cloning as an auxiliary regularization objective. Conservative agents implicitly steer their estimates away from the true value function, yielding optimization objectives with high condition numbers. Mitigating these issues calls for an upper-bounding objective; such approximate upper bounds, however, impose strong geometric assumptions on the dataset design that are rarely satisfied in practice. Driven by these theoretical observations, we show that providing an auxiliary behavior cloning objective as variational regularization of the value estimates yields accurate value estimation, well-conditioned search spaces, and expressive parameterizations. In an empirical study of discrete and continuous control tasks, we validate our theoretical insights and demonstrate the practical effects of learning underestimated value functions.
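To make the ideas in the abstract concrete, below is a minimal sketch (not the paper's implementation) of a conservative value update combined with an auxiliary behavior cloning regularizer. The network sizes, the coefficients `cql_alpha` and `bc_weight`, and the toy batch are illustrative assumptions; the conservative penalty follows the familiar CQL-style logsumexp term rather than the paper's specific objective.

```python
# Sketch: conservative Q-learning update with an auxiliary behavior-cloning
# regularizer for a discrete-action offline RL setting. All hyperparameters
# and shapes are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 3, 0.99
cql_alpha, bc_weight = 1.0, 0.5   # conservatism / BC trade-off (assumed values)

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optim = torch.optim.Adam(list(q_net.parameters()) + list(policy.parameters()), lr=3e-4)

def loss_fn(obs, act, rew, next_obs, done):
    q = q_net(obs)                                    # Q(s, .)
    q_a = q.gather(1, act.unsqueeze(1)).squeeze(1)    # Q(s, a) for dataset actions
    with torch.no_grad():
        target = rew + gamma * (1 - done) * q_net(next_obs).max(dim=1).values
    td_loss = F.mse_loss(q_a, target)

    # Conservative penalty: push Q down over all actions and up on dataset
    # actions, which underestimates values of out-of-distribution actions.
    conservative = (torch.logsumexp(q, dim=1) - q_a).mean()

    # Auxiliary behavior cloning: regularize the policy toward dataset actions.
    bc_loss = F.cross_entropy(policy(obs), act)

    return td_loss + cql_alpha * conservative + bc_weight * bc_loss

# Toy offline batch (random placeholders, for illustration only).
batch = dict(
    obs=torch.randn(32, obs_dim),
    act=torch.randint(0, n_actions, (32,)),
    rew=torch.randn(32),
    next_obs=torch.randn(32, obs_dim),
    done=torch.zeros(32),
)
loss = loss_fn(**batch)
optim.zero_grad(); loss.backward(); optim.step()
```

Raising `cql_alpha` strengthens the underestimation discussed in the abstract, while `bc_weight` controls how strongly the auxiliary cloning term anchors learning to the dataset's behavior.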
