Theoretical Analysis of Direct Preference Optimization

DPO offers theoretical advantages over the actor-critic algorithms traditionally used for RLHF. The key insight is that the language model itself implicitly parameterizes a reward model, and reward functions that differ only by a prompt-dependent term induce the same preference distribution and the same optimal policy. Because DPO fits this implicit reward directly from preference data, it sidesteps the instability often encountered when an explicit reward model is trained first and then optimized with an actor-critic method such as PPO.
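For concreteness, here is a minimal PyTorch-style sketch of the DPO loss; the function name, argument names, and the choice of β = 0.1 are illustrative assumptions rather than code from the paper. The implicit reward of a response is the scaled log-ratio β · (log π_θ(y|x) − log π_ref(y|x)).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed response log-probabilities (illustrative sketch).

    Each argument is a 1-D tensor holding log pi(y | x) summed over the tokens
    of the chosen or rejected response, under either the trained policy
    (policy_*) or the frozen reference policy (ref_*).
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative Bradley-Terry log-likelihood of preferring the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that adding any prompt-only term f(x) to both implicit rewards would cancel in the difference inside the sigmoid; this is exactly the equivalence analyzed in Section 5.1 below.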


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

:::info Authors:

(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);

(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);

(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

:::

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

5 Theoretical Analysis of DPO

In this section, we give further interpretation of the DPO method, provide theoretical backing, and relate the advantages of DPO to issues with actor-critic algorithms used for RLHF (such as PPO [37]).

5.1 Your Language Model Is Secretly a Reward Model

Definition 1. We say that two reward functions r(x, y) and r′(x, y) are equivalent iff r(x, y) − r′(x, y) = f(x) for some function f.

It is easy to see that this is indeed an equivalence relation, which partitions the set of reward functions into classes. We can state the following two lemmas:

Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution.
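As a quick sanity check (a sketch, not the paper's Appendix A.5 proof), the Bradley-Terry case of Lemma 1 follows in one line: a prompt-only shift f(x) cancels inside the preference probability, where σ denotes the logistic function.

```latex
\begin{align*}
p_{r'}(y_1 \succ y_2 \mid x)
  &= \sigma\big(r'(x, y_1) - r'(x, y_2)\big) \\
  &= \sigma\big([r(x, y_1) + f(x)] - [r(x, y_2) + f(x)]\big) \\
  &= \sigma\big(r(x, y_1) - r(x, y_2)\big)
   = p_{r}(y_1 \succ y_2 \mid x).
\end{align*}
```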

Lemma 2. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.
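Lemma 2 admits a similarly short check (again a sketch rather than the appendix proof), using the closed-form optimum of the KL-constrained objective derived in Appendix A.1, with β the KL coefficient and π_ref the reference policy: the factor exp(f(x)/β) appears in both the numerator and the partition function, so it cancels.

```latex
\begin{align*}
\pi_{r'}(y \mid x)
  &= \frac{\pi_{\text{ref}}(y \mid x)\,
           \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y) + f(x)\,]\big)}
          {\sum_{y'} \pi_{\text{ref}}(y' \mid x)\,
           \exp\!\big(\tfrac{1}{\beta}\,[\,r(x, y') + f(x)\,]\big)} \\
  &= \frac{\pi_{\text{ref}}(y \mid x)\,
           \exp\!\big(\tfrac{1}{\beta}\, r(x, y)\big)}
          {\sum_{y'} \pi_{\text{ref}}(y' \mid x)\,
           \exp\!\big(\tfrac{1}{\beta}\, r(x, y')\big)}
   = \pi_{r}(y \mid x).
\end{align*}
```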

The proofs are straightforward and we defer them to Appendix A.5. The first lemma is a well-known under-specification issue with the Plackett-Luce family of models [30]. Due to this under-specification, we usually have to impose additional identifiability constraints to achieve any guarantees on the MLE estimates from Eq. 2 [4]. The second lemma states that all reward functions from the same class yield the same optimal policy; hence, for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class. We prove the following Theorem in Appendix A.6:

Theorem 1. Under mild assumptions, all reward classes consistent with the Plackett-Luce (and Bradley-Terry in particular) models can be represented with the reparameterization r(x, y) = β log (π(y | x) / π_ref(y | x)) for some model π(y | x) and a given reference model π_ref(y | x).
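Combining this reparameterization with the Bradley-Terry model makes the practical payoff explicit: only reward differences for the same prompt enter the preference probability, so the prompt-dependent normalizer cancels and the preference likelihood can be written purely in terms of the policy. The rearrangement below is a sketch consistent with the derivation in Appendix A.2, with y_w and y_l denoting the preferred and dispreferred responses and D the preference dataset.

```latex
\[
p(y_w \succ y_l \mid x)
  = \sigma\!\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \Big),
\qquad
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \big[\log p(y_w \succ y_l \mid x)\big].
\]
```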

5.2 Instability of Actor-Critic Algorithms

Figure 2: Left. The frontier of expected reward vs. KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization. Right. TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO's best-case performance on summarization, while being more robust to changes in the sampling temperature.


:::info This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

:::
