How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarks

OPRO enhances LLM performance in tasks like GSM8K and BBH by optimizing instructions using models like PaLM 2-L and GPT-4. Experiments reveal improved transferability of optimized instructions across mathematical reasoning datasets.


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

:::info Authors:

(1) Chengrun Yang, Google DeepMind (equal contribution);

(2) Xuezhi Wang, Google DeepMind;

(3) Yifeng Lu, Google DeepMind;

(4) Hanxiao Liu, Google DeepMind;

(5) Quoc V. Le, Google DeepMind;

(6) Denny Zhou, Google DeepMind;

(7) Xinyun Chen, Google DeepMind (equal contribution).

:::

Abstract and 1. Introduction

2 OPRO: LLM as the Optimizer and 2.1 Desirables of Optimization by LLMs

2.2 Meta-Prompt Design

3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression

3.2 Traveling Salesman Problem (TSP)

4 Application: Prompt Optimization and 4.1 Problem Setup

4.2 Meta-Prompt Design

5 Prompt Optimization Experiments and 5.1 Evaluation Setup

5.2 Main Results

5.3 Ablation Studies

5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with EvoPrompt

6 Related Work

7 Conclusion, Acknowledgments and References

A Some Failure Cases

B Prompting Formats for Scorer LLM

C Meta-Prompts and C.1 Meta-Prompt for Math Optimization

C.2 Meta-Prompt for Prompt Optimization

D Prompt Optimization Curves on the Remaining BBH Tasks

E Prompt Optimization on BBH Tasks – Tabulated Accuracies and Found Instructions

5 PROMPT OPTIMIZATION EXPERIMENTS

We present the evaluation results for prompt optimization in this section. Our experiments demonstrate that OPRO brings a significant performance gain across the board, with different combinations of LLMs as the optimizer and the scorer.

5.1 EVALUATION SETUP

Models. The LLMs we use as the optimizer and the scorer are:

• Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4.

• Scorer LLM: Pre-trained PaLM 2-L and text-bison.

With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer.
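The three instruction positions can be illustrated with a small sketch. This is a hypothetical helper, not the paper's exact templates: A_begin places the instruction at the start of the scorer's answer (suiting a pre-trained, QA-style model), while Q_begin and Q_end place it before or after the question (suiting an instruction-tuned model).

```python
def build_scorer_prompt(instruction: str, question: str, position: str) -> str:
    """Assemble a scorer prompt with the instruction at one of three positions."""
    if position == "A_begin":
        # Pre-trained scorer: the instruction begins the answer, QA-style.
        return f"Q: {question}\nA: {instruction}"
    elif position == "Q_begin":
        # Instruction-tuned scorer: the instruction precedes the question.
        return f"{instruction}\n{question}"
    elif position == "Q_end":
        # Instruction-tuned scorer: the instruction follows the question.
        return f"{question}\n{instruction}"
    raise ValueError(f"unknown position: {position}")
```

For example, the classic zero-shot instruction at Q_end would yield a prompt ending in "Let's think step by step."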

Benchmarks. Our primary evaluation benchmarks are GSM8K (Cobbe et al., 2021) and Big-Bench Hard (BBH) (Suzgun et al., 2022). GSM8K is a benchmark of grade school math word problems with 7,473 training samples and 1,319 test samples, where chain-of-thought prompting (Wei et al., 2022) and the zero-shot instruction “Let’s think step by step.” (Kojima et al., 2022) have drastically improved the performance over the standard prompting. BBH is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) that covers a wide range of topics beyond arithmetic reasoning, including symbolic manipulation and commonsense reasoning. Each task contains up to 250 examples in total.

To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017).

Implementation details. We set the temperature to be 0 when evaluating the performance of generated instructions, in which case the scorer LLM greedily decodes. Unless otherwise specified, we set the default temperature to be 1.0 for optimizer LLMs to generate diverse and creative instructions. At each optimization step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions, then we add these instructions with their training scores to the optimization trajectory in the meta-prompt. Our meta-prompt at each step contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. We study the effect of different hyperparameters in ablation studies (Section 5.3). Appendix C.2 presents the full meta-prompts for different optimizer LLMs.
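One optimization step under these hyperparameters can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_instruction` stands in for the optimizer LLM sampling at temperature 1.0, and `score_instruction` for greedy (temperature 0) evaluation of a candidate's training accuracy; both are placeholders you would replace with real model calls, and the meta-prompt formatting here is simplified relative to Appendix C.2.

```python
import random

def opro_step(trajectory, train_set, sample_instruction, score_instruction,
              num_candidates=8, top_k=20, num_exemplars=3):
    """One OPRO step: build the meta-prompt, sample candidates, score them.

    trajectory: list of (instruction, training_score) pairs found so far.
    Returns the trajectory extended with the newly scored candidates.
    """
    # Meta-prompt carries the best 20 instructions so far (ascending by score)
    # and 3 exemplars randomly picked from the training set.
    top = sorted(trajectory, key=lambda pair: pair[1])[-top_k:]
    exemplars = random.sample(train_set, min(num_exemplars, len(train_set)))
    meta_prompt = "\n".join(f"text: {ins}\nscore: {s}" for ins, s in top)
    meta_prompt += "\n\nProblem examples:\n" + "\n".join(exemplars)

    # Prompt the optimizer LLM 8 times, then add each candidate with its
    # training score back into the trajectory for the next step.
    for _ in range(num_candidates):
        candidate = sample_instruction(meta_prompt)
        trajectory.append((candidate, score_instruction(candidate)))
    return trajectory
```

Iterating this step lets higher-scoring instructions progressively dominate the meta-prompt, which is what drives the optimization trajectory upward.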

:::info This paper is available on arxiv under CC0 1.0 DEED license.

:::
