Performance of Best of N Baseline for Various N and Sample Responses and GPT-4 Judgments

We find that Best of N is a strong baseline in our experiments, although it is computationally expensive because it requires drawing many samples for every prompt. We evaluate the Best of N baseline for several values of N on Anthropic-HH dialogue and TL;DR summarization; the results are shown in Figure 4.
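To make the cost of this baseline concrete, the following is a minimal sketch of a Best of N selection loop, not the implementation used in the experiments. It assumes two hypothetical callables that do not appear in the paper: `sample_fn`, which draws one response from a policy, and `reward_fn`, which scores a prompt-response pair. The toy sampler and toy reward are placeholders for illustration only.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],          # hypothetical: draws one response
    reward_fn: Callable[[str, str], float],   # hypothetical: scores (prompt, response)
    n: int,
) -> Tuple[str, float]:
    """Draw n candidate responses and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Cost grows linearly in n: one sampler call and one reward call per candidate.
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    scores: List[float] = [reward_fn(prompt, c) for c in candidates]
    # On ties, max() keeps the earliest candidate.
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def make_toy_sampler(seed: int) -> Callable[[str], str]:
    """Placeholder sampler (assumption): emits between 1 and 10 copies of 'word'."""
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        return " ".join(["word"] * rng.randint(1, 10))

    return sample


def toy_reward(prompt: str, response: str) -> float:
    """Placeholder reward (assumption): longer responses score higher."""
    return float(len(response.split()))


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 128):
        _, score = best_of_n("example prompt", make_toy_sampler(0), toy_reward, n)
        print(f"N={n:>3}  best toy reward = {score}")
```

With a bounded toy reward like this one, the best score stops improving once the maximum has been sampled, which loosely mirrors why returns diminish as N grows; the actual behavior depends on the policy and the reward function used.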

D.2 Sample Responses and GPT-4 Judgments

In this section, we present examples of comparisons between DPO and the baseline (PPO at temperature 0 for summarization, and the ground truth chosen response for dialogue). See Tables 4-6 for summarization examples, and Tables 7-10 for dialogue examples.

Figure 4: Best of N baseline for N = {1, 4, 16, 64, 128}. Performance plateaus after roughly 64-128 samples.

Table 7: GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 0.7; GT is the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation]. These annotations are not part of the model generations.

Table 8: GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 1.0; GT is the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation]. These annotations are not part of the model generations.

Table 9: GPT-4 chooses GT over DPO. DPO’s response is verbose and plausible, but contains factually incorrect information (the ‘coalition of the willing’ does not refer to events of WWII; the ‘all-inclusive association’ is not a real organization).

Table 10: GPT-4 chooses GT over DPO. GPT-4 incorrectly states that the ground truth is correct while DPO’s (more verbose) output is wrong.

:::info This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models


