23 May 2024 • Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang
Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), providing better techniques for fine-tuning large language models (LLMs).
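As background for the DPO approach mentioned above, the standard DPO objective (from the original DPO formulation, not specific to this paper) scores a preference pair by the policy's log-probability ratios against a frozen reference model. A minimal sketch for a single preference pair follows; the function name and arguments are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are log-probabilities of the chosen/rejected responses under the
    policy being fine-tuned; ref_logp_* are the same quantities under the
    frozen reference model. beta scales the implicit reward.
    """
    # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio)
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin; minimized when the policy prefers
    # the chosen response more strongly than the reference model does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model exactly, the margin is zero and the loss equals log 2; it shrinks as the policy assigns relatively more probability to the chosen response.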
29 Sep 2021 • Elioth Sanabria, David Yao, Henry Lam
In this paper, we show that even for problems with a large state space, when the solution policy of the MDP can be represented by a tree-like structure, our proposed algorithm retrieves that tree in computationally tractable time.