This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models
:::info Authors:
(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;
(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;
(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.
:::
Table of Links
4 Direct Preference Optimization
7 Discussion, Acknowledgements, and References
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemmas 1 and 2
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
D.3 Human study details
In order to validate the usage of GPT-4 for computing win rates, our human study collects human preference data for several matchups in the TL;DR summarization setting. We select three different algorithmic matchups, evaluating DPO (temp. 0.25), SFT (temp. 0.25), and PPO (temp. 1.0) against the reference algorithm PPO (temp. 0). By selecting matchups for three unique algorithms, as well as algorithms with a wide range of win rates vs. the reference, we capture the similarity of human and GPT-4 win rates across the response quality spectrum. We sample 150 random comparisons of DPO vs. PPO-0 and 100 random comparisons of PPO-1 vs. PPO-0, assigning two humans to each comparison, producing 275 judgments for DPO-PPO[7] and 200 judgments for PPO-PPO. We sample 125 SFT comparisons, assigning a single human to each. We ignore judgments that humans labeled as ties (which amount to only about 1% of judgments), and measure the raw agreement percentage between human A and human B (for comparisons where we have two human annotators, i.e., not SFT), as well as between each human and GPT-4.
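To make the agreement metric concrete, below is a minimal sketch (not the authors' code) of computing the raw agreement percentage between two judges, dropping tie judgments as described above. The label values ("A", "B", "tie") and the example judgment lists are hypothetical illustrations, not data from the study.

```python
# Minimal sketch, assuming each judge's verdicts are stored as a list of labels
# ("A", "B", or "tie") aligned across the same set of pairwise comparisons.
from typing import List


def raw_agreement(labels_1: List[str], labels_2: List[str]) -> float:
    """Fraction of comparisons on which two judges pick the same winner.

    Comparisons where either judge labeled a tie are ignored, mirroring the
    paper's treatment of tie judgments (~1% of all judgments).
    """
    kept = [
        (a, b)
        for a, b in zip(labels_1, labels_2)
        if a != "tie" and b != "tie"
    ]
    if not kept:
        return float("nan")  # no non-tie comparisons to score
    return sum(a == b for a, b in kept) / len(kept)


# Hypothetical example: three comparisons judged by two humans and GPT-4.
human_a = ["A", "B", "A"]
human_b = ["A", "B", "tie"]
gpt4 = ["A", "A", "A"]

print(f"human A vs human B: {raw_agreement(human_a, human_b):.2f}")  # 1.00 (tie dropped)
print(f"human A vs GPT-4:   {raw_agreement(human_a, gpt4):.2f}")     # 0.67
```

In the study, this statistic is reported both between the two human annotators (where available) and between each human and GPT-4, which is what allows the human-human agreement to serve as a ceiling for human-GPT-4 agreement.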
Participants. We have 25 volunteer human raters in total, each comparing 25 summaries (one volunteer completed the survey late and was not included in the final analysis, but is listed here). The raters were Stanford students (from undergrad through Ph.D.), or recent Stanford graduates or visitors, with a STEM (mainly CS) focus. See Figure 5 for a screenshot of the survey interface. We gratefully acknowledge the contribution of each of our volunteers, listed in random order:
:::info This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
:::
[7] One volunteer did not respond for the DPO-PPO comparison.