How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarks

OPRO enhances LLM performance in tasks like GSM8K and BBH by optimizing instructions using models like PaLM 2-L and GPT-4. Experiments reveal improved transferability of optimized instructions across mathematical reasoning datasets.


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

:::info Authors:

(1) Chengrun Yang, Google DeepMind (equal contribution);

(2) Xuezhi Wang, Google DeepMind;

(3) Yifeng Lu, Google DeepMind;

(4) Hanxiao Liu, Google DeepMind;

(5) Quoc V. Le, Google DeepMind;

(6) Denny Zhou, Google DeepMind;

(7) Xinyun Chen, Google DeepMind (equal contribution).

:::

Abstract and 1. Introduction

2 OPRO: LLM as the Optimizer and 2.1 Desirables of Optimization by LLMs

2.2 Meta-Prompt Design

3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression

3.2 Traveling Salesman Problem (TSP)

4 Application: Prompt Optimization and 4.1 Problem Setup

4.2 Meta-Prompt Design

5 Prompt Optimization Experiments and 5.1 Evaluation Setup

5.2 Main Results

5.3 Ablation Studies

5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with EvoPrompt

6 Related Work

7 Conclusion, Acknowledgments and References

A Some Failure Cases

B Prompting Formats for Scorer LLM

C Meta-Prompts and C.1 Meta-Prompt for Math Optimization

C.2 Meta-Prompt for Prompt Optimization

D Prompt Optimization Curves on the Remaining BBH Tasks

E Prompt Optimization on BBH Tasks – Tabulated Accuracies and Found Instructions

5 PROMPT OPTIMIZATION EXPERIMENTS

We present the evaluation results for prompt optimization in this section. Our experiments demonstrate that OPRO brings a significant performance gain across the board, with different combinations of LLMs as the optimizer and the scorer.

5.1 EVALUATION SETUP

Models. The LLMs we use as the optimizer and the scorer are:

• Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4.

• Scorer LLM: Pre-trained PaLM 2-L and text-bison.

With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer.
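The three instruction positions can be illustrated with a small sketch. This is a hypothetical helper, not the paper's exact templates: A_begin places the instruction at the start of the scorer's answer (suiting a pre-trained, QA-style model), while Q_begin and Q_end place it before or after the question (suiting an instruction-tuned model).

```python
def build_scorer_prompt(instruction: str, question: str, position: str) -> str:
    """Assemble a scorer prompt with the instruction at one of three positions."""
    if position == "A_begin":
        # Pre-trained scorer: the instruction begins the answer, QA-style.
        return f"Q: {question}\nA: {instruction}"
    elif position == "Q_begin":
        # Instruction-tuned scorer: the instruction precedes the question.
        return f"{instruction}\n{question}"
    elif position == "Q_end":
        # Instruction-tuned scorer: the instruction follows the question.
        return f"{question}\n{instruction}"
    raise ValueError(f"unknown position: {position}")
```

For example, the classic zero-shot instruction at Q_end would yield a prompt ending in "Let's think step by step."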

Benchmarks. Our primary evaluation benchmarks are GSM8K (Cobbe et al., 2021) and Big-Bench Hard (BBH) (Suzgun et al., 2022). GSM8K is a benchmark of grade school math word problems with 7,473 training samples and 1,319 test samples, where chain-of-thought prompting (Wei et al., 2022) and the zero-shot instruction “Let’s think step by step.” (Kojima et al., 2022) have drastically improved the performance over the standard prompting. BBH is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) that covers a wide range of topics beyond arithmetic reasoning, including symbolic manipulation and commonsense reasoning. Each task contains up to 250 examples in total.

To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017).

Implementation details. We set the temperature to be 0 when evaluating the performance of generated instructions, in which case the scorer LLM greedily decodes. Unless otherwise specified, we set the default temperature to be 1.0 for optimizer LLMs to generate diverse and creative instructions. At each optimization step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions, then we add these instructions with their training scores to the optimization trajectory in the meta-prompt. Our meta-prompt at each step contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. We study the effect of different hyperparameters in ablation studies (Section 5.3). Appendix C.2 presents the full meta-prompts for different optimizer LLMs.
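One optimization step under these hyperparameters can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_instruction` stands in for the optimizer LLM sampling at temperature 1.0, and `score_instruction` for greedy (temperature 0) evaluation of a candidate's training accuracy; both are placeholders you would replace with real model calls, and the meta-prompt formatting here is simplified relative to Appendix C.2.

```python
import random

def opro_step(trajectory, train_set, sample_instruction, score_instruction,
              num_candidates=8, top_k=20, num_exemplars=3):
    """One OPRO step: build the meta-prompt, sample candidates, score them.

    trajectory: list of (instruction, training_score) pairs found so far.
    Returns the trajectory extended with the newly scored candidates.
    """
    # Meta-prompt carries the best 20 instructions so far (ascending by score)
    # and 3 exemplars randomly picked from the training set.
    top = sorted(trajectory, key=lambda pair: pair[1])[-top_k:]
    exemplars = random.sample(train_set, min(num_exemplars, len(train_set)))
    meta_prompt = "\n".join(f"text: {ins}\nscore: {s}" for ins, s in top)
    meta_prompt += "\n\nProblem examples:\n" + "\n".join(exemplars)

    # Prompt the optimizer LLM 8 times, then add each candidate with its
    # training score back into the trajectory for the next step.
    for _ in range(num_candidates):
        candidate = sample_instruction(meta_prompt)
        trajectory.append((candidate, score_instruction(candidate)))
    return trajectory
```

Iterating this step lets higher-scoring instructions progressively dominate the meta-prompt, which is what drives the optimization trajectory upward.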

:::info This paper is available on arxiv under CC0 1.0 DEED license.

:::
