1 code implementation • 29 Jan 2024 • Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Second, we show that helpfulness degrades quadratically with the norm of the representation engineering vector, while alignment increases only linearly with it, indicating a regime in which representation engineering is efficient to use.
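To make the scaling claim concrete, here is a minimal sketch of the trade-off, assuming alignment gain grows as `a * r` and helpfulness loss grows as `b * r**2` for a vector of norm `r`. The coefficients `a` and `b`, the function names, and the choice to combine the two terms by simple subtraction are all hypothetical and chosen for illustration; none of them come from the paper.

```python
def alignment_gain(r, a=1.0):
    """Linear alignment gain for a steering vector of norm r (a is hypothetical)."""
    return a * r


def helpfulness_cost(r, b=0.5):
    """Quadratic helpfulness loss for a steering vector of norm r (b is hypothetical)."""
    return b * r ** 2


def net_benefit(r, a=1.0, b=0.5):
    """Illustrative objective: gain minus cost. The subtraction is an assumption."""
    return alignment_gain(r, a) - helpfulness_cost(r, b)


def best_norm(a=1.0, b=0.5):
    """Norm maximizing a*r - b*r**2, found by setting the derivative a - 2*b*r to zero."""
    return a / (2 * b)


if __name__ == "__main__":
    for r in (0.1, 0.5, 1.0, 2.0, 3.0):
        print(f"r={r:.1f}  gain={alignment_gain(r):.3f}  "
              f"cost={helpfulness_cost(r):.3f}  net={net_benefit(r):.3f}")
```

Under these toy coefficients, small norms yield a gain that outweighs the cost, while large norms reverse the balance. That is one way to read the "efficient regime" described above.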
no code implementations • 19 Apr 2023 • Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua
An important aspect of developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users.