Deriving the Gradient of the DPO Objective

Lemma 1 Restated. Under the Plackett-Luce preference framework, and in particular the Bradley-Terry framework, two reward functions from the same equivalence class induce the same preference distribution.

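Proof. Two reward functions $r(x, y)$ and $r'(x, y)$ belong to the same equivalence class if $r'(x, y) = r(x, y) + f(x)$ for some function $f$. Consider the general Plackett-Luce model (the Bradley-Terry model is the special case $K = 2$) and write $p_r$ for the distribution over rankings induced by a reward function $r$. For any prompt $x$, answers $y_1, \ldots, y_K$, and ranking $\tau$,

$$
\begin{aligned}
p_{r'}(\tau \mid y_1, \ldots, y_K, x)
&= \prod_{k=1}^{K} \frac{\exp\big(r'(x, y_{\tau(k)})\big)}{\sum_{j=k}^{K} \exp\big(r'(x, y_{\tau(j)})\big)} \\
&= \prod_{k=1}^{K} \frac{\exp\big(f(x)\big)\exp\big(r(x, y_{\tau(k)})\big)}{\exp\big(f(x)\big)\sum_{j=k}^{K} \exp\big(r(x, y_{\tau(j)})\big)} \\
&= \prod_{k=1}^{K} \frac{\exp\big(r(x, y_{\tau(k)})\big)}{\sum_{j=k}^{K} \exp\big(r(x, y_{\tau(j)})\big)}
= p_r(\tau \mid y_1, \ldots, y_K, x),
\end{aligned}
$$

which completes the proof.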
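To make Lemma 1 concrete, the following minimal Python sketch evaluates the Plackett-Luce probability of every ranking of three answers under a made-up reward vector and under the same vector shifted by a prompt-only constant f(x). The helper name `plackett_luce_prob` and all numbers are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import permutations


def plackett_luce_prob(rewards, ranking):
    """Probability of `ranking` (a tuple of answer indices, best first) under
    the Plackett-Luce model with per-answer rewards `rewards`."""
    prob = 1.0
    remaining = list(ranking)
    for idx in ranking:
        # Softmax of the chosen answer over the answers not yet placed.
        denom = sum(math.exp(rewards[j]) for j in remaining)
        prob *= math.exp(rewards[idx]) / denom
        remaining.remove(idx)
    return prob


# Hypothetical rewards r(x, y_k) for K = 3 answers to a single prompt x.
r = [0.3, -1.2, 2.0]
# Equivalent reward r'(x, y) = r(x, y) + f(x); f(x) = 5.0 is an arbitrary choice.
f_x = 5.0
r_shifted = [v + f_x for v in r]

# Every ranking gets the same probability under r and r'.
for ranking in permutations(range(len(r))):
    p = plackett_luce_prob(r, ranking)
    p_shifted = plackett_luce_prob(r_shifted, ranking)
    assert math.isclose(p, p_shifted, rel_tol=1e-9)
    print(ranking, round(p, 6))
```

With two answers the same function reduces to the Bradley-Terry probability, the logistic sigmoid of r(x, y_1) - r(x, y_2), so the check covers both models named in the lemma.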

Lemma 2 Restated. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.

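Proof. Let $r'(x, y) = r(x, y) + f(x)$ be two reward functions from the same class, and let $\pi_r$ and $\pi_{r'}$ denote the corresponding optimal policies. For the KL-constrained problem with coefficient $\beta$ and reference policy $\pi_{\mathrm{ref}}$, the optimal policy is $\pi_r(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\exp\big(\tfrac{1}{\beta} r(x, y)\big) / Z(x)$ with $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\exp\big(\tfrac{1}{\beta} r(x, y)\big)$. Hence, for all $x$ and $y$,

$$
\begin{aligned}
\pi_{r'}(y \mid x)
&= \frac{\pi_{\mathrm{ref}}(y \mid x)\exp\big(\tfrac{1}{\beta} r'(x, y)\big)}{\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\exp\big(\tfrac{1}{\beta} r'(x, y')\big)} \\
&= \frac{\exp\big(\tfrac{1}{\beta} f(x)\big)\,\pi_{\mathrm{ref}}(y \mid x)\exp\big(\tfrac{1}{\beta} r(x, y)\big)}{\exp\big(\tfrac{1}{\beta} f(x)\big)\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\exp\big(\tfrac{1}{\beta} r(x, y')\big)} \\
&= \pi_r(y \mid x),
\end{aligned}
$$

which completes the proof.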
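Lemma 2 can also be checked numerically on a finite answer set. The sketch below assumes the standard closed form of the KL-constrained optimum, in which the optimal policy is proportional to pi_ref(y | x) * exp(r(x, y) / beta), and uses a hypothetical helper `optimal_policy` with made-up values for pi_ref, the rewards, beta, and f(x). Shifting every reward by f(x) multiplies every unnormalized weight by exp(f(x) / beta), and the partition function cancels that factor.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum of the KL-constrained RL problem over a finite
    answer set: pi_r(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)."""
    unnormalized = [p * math.exp(rew / beta) for p, rew in zip(pi_ref, rewards)]
    z = sum(unnormalized)  # partition function Z(x)
    return [u / z for u in unnormalized]


# Hypothetical reference policy over four candidate answers for one prompt x.
pi_ref = [0.4, 0.3, 0.2, 0.1]
r = [1.0, -0.5, 0.25, 2.0]
beta = 0.5
f_x = -3.0  # arbitrary prompt-only shift f(x)
r_shifted = [v + f_x for v in r]

pi_r = optimal_policy(pi_ref, r, beta)
pi_r_shifted = optimal_policy(pi_ref, r_shifted, beta)

# The two equivalent rewards yield the same optimal policy.
assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(pi_r, pi_r_shifted))
print([round(p, 4) for p in pi_r])
```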

:::info This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

APA

Writings, Papers and Blogs on Text Models | Sciencx (2024-08-26T20:00:25+00:00) Deriving the Gradient of the DPO Objective. Retrieved from https://www.scien.cx/2024/08/26/deriving-the-gradient-of-the-dpo-objective/

Table of Links

A.4 Deriving the Gradient of the DPO Objective

A.5 Proof of Lemma 1 and 2
