How Effective is vLLM When a Prefix Is Thrown Into the Mix?

We explore how effective vLLM is when a prefix is shared among different input prompts.


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

Abstract and 1 Introduction

2 Background and 2.1 Transformer-Based Large Language Models

2.2 LLM Service & Autoregressive Generation

2.3 Batching Techniques for LLMs

3 Memory Challenges in LLM Serving

3.1 Memory Management in Existing Systems

4 Method and 4.1 PagedAttention

4.2 KV Cache Manager

4.3 Decoding with PagedAttention and vLLM

4.4 Application to Other Decoding Scenarios

4.5 Scheduling and Preemption

4.6 Distributed Execution

5 Implementation

6 Evaluation and 6.1 Experimental Setup

6.2 Basic Sampling

6.3 Parallel Sampling and Beam Search

6.4 Shared prefix

6.5 Chatbot

7 Ablation Studies

8 Discussion

9 Related Work

10 Conclusion, Acknowledgement and References

6.4 Shared prefix

We explore the effectiveness of vLLM for the case where a prefix is shared among different input prompts, as illustrated in Fig. 10. For the model, we use LLaMA-13B [52], which is multilingual. For the workload, we use the WMT16 [4] English-to-German translation dataset and synthesize two prefixes that include an instruction and a few translation examples. The first prefix includes a single example (i.e., one-shot), while the other prefix includes 5 examples (i.e., few-shot). As shown in Fig. 16 (a), vLLM achieves 1.67× higher throughput than Orca (Oracle) when the one-shot prefix is shared. Furthermore, when more examples are shared (Fig. 16 (b)), vLLM achieves 3.58× higher throughput than Orca (Oracle).

Figure 16. Translation workload where the input prompts share a common prefix. The prefix includes (a) 1 example with 80 tokens or (b) 5 examples with 341 tokens.

Figure 17. Performance on chatbot workload.
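To give a rough sense of why a longer shared prefix helps more, the following minimal sketch counts KV cache blocks for a batch of prompts with and without prefix sharing. It is an illustration under stated assumptions, not vLLM's implementation: the block size of 16 tokens, the function names, and the per-request input length of 40 tokens are all hypothetical choices, and only full prefix blocks are treated as shareable. The prefix lengths (80 and 341 tokens) come from the Figure 16 caption. The sketch accounts for memory only and does not reproduce the reported throughput gains.

```python
from math import ceil

BLOCK_SIZE = 16  # assumption: tokens per KV block, chosen for illustration


def blocks_without_sharing(prefix_len, input_lens, block_size=BLOCK_SIZE):
    """Every request stores its own copy of the prefix plus its own input."""
    return sum(ceil((prefix_len + n) / block_size) for n in input_lens)


def blocks_with_sharing(prefix_len, input_lens, block_size=BLOCK_SIZE):
    """Full prefix blocks are stored once; the remainder is private per request."""
    shared_blocks = prefix_len // block_size           # only full blocks are shared
    leftover = prefix_len - shared_blocks * block_size  # prefix tail kept per request
    private_blocks = sum(ceil((leftover + n) / block_size) for n in input_lens)
    return shared_blocks + private_blocks


if __name__ == "__main__":
    input_lens = [40] * 8  # hypothetical: 8 requests, 40 task-input tokens each
    for label, prefix_len in [("one-shot", 80), ("few-shot", 341)]:
        base = blocks_without_sharing(prefix_len, input_lens)
        shared = blocks_with_sharing(prefix_len, input_lens)
        print(f"{label}: {base} blocks without sharing, {shared} with sharing")
```

Under these assumptions the saving grows with the prefix length, which is in line with the larger gain the section reports for the few-shot prefix than for the one-shot prefix.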


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::

:::info Authors:

(1) Woosuk Kwon, UC Berkeley, with equal contribution;

(2) Zhuohan Li, UC Berkeley, with equal contribution;

(3) Siyuan Zhuang, UC Berkeley;

(4) Ying Sheng, UC Berkeley and Stanford University;

(5) Lianmin Zheng, UC Berkeley;

(6) Cody Hao Yu, Independent Researcher;

(7) Joseph E. Gonzalez, UC Berkeley;

(8) Hao Zhang, UC San Diego;

(9) Ion Stoica, UC Berkeley.

:::
