Effects of Conservatism on Offline Learning

29 Sep 2021 · Karush Suri, Florian Shkurti

Conservatism, the practice of deliberately underestimating an agent's expected value estimates, has demonstrated profound success in model-free, model-based, multi-task, safe, and other realms of offline Reinforcement Learning (RL). Recent work, on the other hand, has noted that conservatism often hinders the learning of useful behaviors. To that end, this paper asks: how does conservatism affect offline learning? The proposed answer studies conservatism through three lenses: value function optimization, approximate objectives that upper bound underestimation, and behavior cloning as an auxiliary regularization objective. Conservative agents implicitly steer their estimates away from the true value function, yielding optimization objectives with high condition numbers. Mitigating these issues calls for an upper-bounding objective; such approximate upper bounds, however, impose strong geometric assumptions on the dataset design that are rarely satisfied in practice. Driven by these theoretical observations, we show that providing an auxiliary behavior cloning objective as variational regularization of the value estimates yields accurate value estimation, well-conditioned search spaces, and expressive parameterizations. In an empirical study of discrete and continuous control tasks, we validate our theoretical insights and demonstrate the practical effects of learning underestimated value functions.
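To make the ideas in the abstract concrete, below is a minimal sketch (not the paper's implementation) of a conservative value update combined with an auxiliary behavior cloning regularizer. The network sizes, the coefficients `cql_alpha` and `bc_weight`, and the toy batch are illustrative assumptions; the conservative penalty follows the familiar CQL-style logsumexp term rather than the paper's specific objective.

```python
# Sketch: conservative Q-learning update with an auxiliary behavior-cloning
# regularizer for a discrete-action offline RL setting. All hyperparameters
# and shapes are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 3, 0.99
cql_alpha, bc_weight = 1.0, 0.5   # conservatism / BC trade-off (assumed values)

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optim = torch.optim.Adam(list(q_net.parameters()) + list(policy.parameters()), lr=3e-4)

def loss_fn(obs, act, rew, next_obs, done):
    q = q_net(obs)                                    # Q(s, .)
    q_a = q.gather(1, act.unsqueeze(1)).squeeze(1)    # Q(s, a) for dataset actions
    with torch.no_grad():
        target = rew + gamma * (1 - done) * q_net(next_obs).max(dim=1).values
    td_loss = F.mse_loss(q_a, target)

    # Conservative penalty: push Q down over all actions and up on dataset
    # actions, which underestimates values of out-of-distribution actions.
    conservative = (torch.logsumexp(q, dim=1) - q_a).mean()

    # Auxiliary behavior cloning: regularize the policy toward dataset actions.
    bc_loss = F.cross_entropy(policy(obs), act)

    return td_loss + cql_alpha * conservative + bc_weight * bc_loss

# Toy offline batch (random placeholders, for illustration only).
batch = dict(
    obs=torch.randn(32, obs_dim),
    act=torch.randint(0, n_actions, (32,)),
    rew=torch.randn(32),
    next_obs=torch.randn(32, obs_dim),
    done=torch.zeros(32),
)
loss = loss_fn(**batch)
optim.zero_grad(); loss.backward(); optim.step()
```

Raising `cql_alpha` strengthens the underestimation discussed in the abstract, while `bc_weight` controls how strongly the auxiliary cloning term anchors learning to the dataset's behavior.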
