This content originally appeared on DEV Community and was authored by Ed Miller
Note: Banner image generated by AI.
Are you curious about how to supercharge your application with AI while cutting costs? Discover how running Large Language Models on AWS Graviton can offer you the necessary performance at a fraction of the price.
It has been less than two years since ChatGPT changed the virtual face of AI. Since then, large language models (LLMs) have been all the rage. Adding a chatbot in your application may dramatically increase user interaction, but LLMs require complicated and costly infrastructure. Or do they?
After watching the “Generative AI Inference using AWS Graviton Processors” session from the AWS AI Infrastructure Day, I was inspired to share how you can run an LLM using the same Graviton processors as the rest of your application.
In this post, we will:
- Set up a Graviton instance.
- Follow the steps (with some modifications) in "Deploy a Large Language Model (LLM) chatbot on Arm servers" from the Arm Developer Hub to:
  - Download and compile llama.cpp
  - Download a Llama 3.1 model using huggingface-cli
  - Re-quantize the model using llama-quantize to optimize it for the target Graviton platform
  - Run the model using llama-cli
- Evaluate performance
- Compare different Graviton-based instances and discuss the pros and cons of each
- Point to resources for getting started
Subsequent posts will dive deeper into application use cases, costs, and sustainability.
Set up a Graviton instance
First, let’s focus on the Graviton3-based r7g.16xlarge, a memory-optimized instance with 64 vCPUs and 512 GiB of memory. I’ll be running it in us-west-2. Using the console, navigate to EC2 Instances and select “Launch instances”. There are only a few fields necessary for a quick test:
- Name: this is up to you; I have called mine ed-blog-r7g-16xl
- Application and OS Images
  - AMI: I am using Ubuntu Server 24.04 LTS (the default if you select Ubuntu)
  - Architecture: Choose 64-bit (Arm)
- Instance type: r7g.16xlarge
- Key pair: Select an existing one or create a new one
- Configure storage: I’m bumping this up to 32 GiB to make sure I have room for the code and Llama models.
You can leave the defaults for the rest; just review the Summary and click “Launch instance”.
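If you prefer scripting the launch, roughly the same instance can be created with the AWS CLI. This is a minimal sketch, not a copy-paste recipe: the AMI ID and key pair name below are placeholders you would replace with your own, and the AMI must be an arm64 Ubuntu 24.04 image for your region.
# Sketch only: substitute your own arm64 Ubuntu AMI ID and key pair name
aws ec2 run-instances \
  --region us-west-2 \
  --instance-type r7g.16xlarge \
  --image-id ami-0123456789abcdef0 \
  --key-name my-key-pair \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=32}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ed-blog-r7g-16xl}]'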
Once the instance has started, you can connect using your favorite method. For simplicity, I will use the EC2 Instance Connect method, which provides a terminal in your browser window.
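If you’d rather connect from your own terminal, SSH works just as well. Assuming the default Ubuntu user and your instance’s public DNS name (both placeholders here), something like:
ssh -i my-key-pair.pem ubuntu@<instance-public-dns>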
Build and Run Llama 3.1
To build and run Llama 3.1, we will follow the steps (with some modifications) in "Deploy a Large Language Model (LLM) chatbot on Arm servers" from the Arm Developer Hub.
Download and compile llama.cpp
First, we install any prerequisites:
sudo apt update
sudo apt install make cmake -y
sudo apt install gcc g++ -y
sudo apt install build-essential -y
Then we clone llama.cpp and build it (the -j$(nproc) flag uses all available vCPU cores to speed up compilation):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)
Finally, we can test it using the help flag:
./llama-cli -h
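Assuming the Makefile build drops its binaries at the top of the repository, a quick sanity check that the two tools we need were built:
ls -l llama-cli llama-quantize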
Download Llama 3.1
Next, we’ll set up a virtual environment for Python packages:
sudo apt install python-is-python3 python3-pip python3-venv -y
python -m venv venv
source venv/bin/activate
Now install the Hugging Face Hub package and use its huggingface-cli tool to download a 4-bit quantized version of Llama 3.1:
pip install huggingface_hub
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
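The Q4_0 file is several gigabytes, so the download takes a moment; you can confirm it landed in the llama.cpp directory before moving on:
ls -lh dolphin-2.9.4-llama3.1-8b-Q4_0.gguf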
Re-quantize the model
The model we downloaded is already 4-bit quantized (half a byte per weight). This gives us roughly a 4x reduction in model size compared with the original bfloat16 weights (2 bytes per weight). However, the width of the Scalable Vector Extension (SVE) differs between Graviton3 (2x256-bit SVE) and Graviton4 (4x128-bit SVE2). Graviton2 does not have SVE but will use 2x128-bit Arm Neon technology. To maximize throughput on each generation, you should re-quantize the model with the following block layouts:
- Graviton2: 4x4 (Q4_0_4_4)
- Graviton3: 8x8 (Q4_0_8_8)
- Graviton4: 4x8 (Q4_0_4_8)
For the Graviton3 instance, we will re-quantize the model using llama-quantize as follows:
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8
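If you are scripting this across instance types, one option is to pick the block layout from the CPU features the kernel reports. This is only a sketch based on the mapping above; the Q4_0_4_4/Q4_0_8_8/Q4_0_4_8 types are specific to the llama.cpp version used here:
# Sketch: choose a block layout from /proc/cpuinfo features (sve2 -> Graviton4, sve -> Graviton3, else Neon)
if grep -qw sve2 /proc/cpuinfo; then
  QUANT=Q4_0_4_8
elif grep -qw sve /proc/cpuinfo; then
  QUANT=Q4_0_8_8
else
  QUANT=Q4_0_4_4
fi
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-${QUANT}.gguf ${QUANT}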
Run the model
Finally, we can run the model using llama-cli. There are a few arguments we will use:
- Model (-m): The optimized model for Graviton3, dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf
- Prompt (-p): As a test prompt, we’ll use “Building a visually appealing website can be done in ten simple steps”
- Response length (-n): We’ll ask for up to 512 tokens
- Thread count (-t): We want to use all 64 of the vCPUs
Here’s the command:
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
When you run the command, you should see several parameters print out, followed by the generated text (starting with the prompt), and finally the performance statistics.
Evaluate Performance
Two of the statistics reported at the end of the run are the prompt evaluation rate and the text generation rate, which are key metrics for user experience with LLMs. The prompt evaluation rate reflects how long it takes the LLM to process the prompt and begin responding. The text generation rate is how quickly it produces the output. Both are reported in tokens per second (T/s). For our run we see:
Evaluation: 278.2 T/s
Generation: 47.7 T/s
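To put those rates in user-experience terms: at roughly 278 T/s, a 100-token prompt would be processed in about 100 / 278 ≈ 0.36 seconds, and at 47.7 T/s a 512-token response would take about 512 / 47.7 ≈ 10.7 seconds to stream out.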
If you run the standard Q4_0 quantization with everything else the same, as with this command:
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64
You will see a decrease in performance:
Evaluation: 164.6 T/s
Generation: 28.1 T/s
Using the correct quantization format (Q4_0_8_8, in this case), you get close to a 70% improvement in both evaluation (278.2 / 164.6 ≈ 1.69x) and generation (47.7 / 28.1 ≈ 1.70x)!
When you are done with your tests, don’t forget to stop the instance!
Comparing Graviton-based instances
Using the process above, we can run the same model on similarly equipped Graviton2 and Graviton4-based instances. Using the optimum quantization format for each, we can see an increase in performance from generation to generation:
| Generation | Instance | Quant | Eval (T/s) | Gen (T/s) |
|---|---|---|---|---|
| Graviton2 | r6g.16xlarge | Q4_0_4_4 | 175.4 | 25.1 |
| Graviton3 | r7g.16xlarge | Q4_0_8_8 | 278.2 | 42.7 |
| Graviton4 | r8g.16xlarge | Q4_0_4_8 | 341.8 | 65.6 |
The performance differences are due to vectorization extensions, caching, clock speed, and memory bandwidth. You may see some variation at lower vCPU/thread counts and when using different instance types: general purpose (M), compute optimized (C), etc. Graviton4 also has more cores per chip, with instances available up to 192 vCPUs!
Determining which instances meet your needs depends on your application. For interactive applications, you may want low evaluation latency and a text generation speed of more than 10 T/s. Any of the 64-vCPU instances can easily meet the generation requirement, but you may need to consider the expected size of prompts to determine evaluation latency. Graviton2 performance suggests that serverless solutions using AWS Lambda may be possible, especially for non-time-critical applications.
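If you want to put the model behind an HTTP endpoint for your application rather than the interactive CLI, the same llama.cpp build also produces llama-server, which exposes an OpenAI-compatible API. A minimal sketch with assumed defaults (check the llama.cpp documentation for the current flags):
./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -t 64 --port 8080
Then, from another shell on the instance:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Suggest a name for a chatbot."}]}'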
Get Started!
As you can see, running Llama on Graviton is straightforward. This is an easy way to test out models for your own applications. In many cases, Graviton may be the most cost-effective way of integrating LLMs with your application. I’ll explore this further in the coming months.
In the meantime, here are some resources to help you get started:
- The "Deploy a Large Language Model (LLM) chatbot on Arm servers" learning path on the Arm Developer Hub
- The "Generative AI Inference using AWS Graviton Processors" session from AWS AI Infrastructure Day
- The llama.cpp project on GitHub (https://github.com/ggerganov/llama.cpp)
Have fun!