Search Results for author: Josef Dai

Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective

Then, based on this framework, we introduce the IBN to analyze generalization in the reward modeling stage of RLHF.


However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training.

