Hawk and Griffin Models: Superior Latency and Throughput in AI Inference

This research shows Hawk and Griffin outperform MQA Transformers in latency and throughput, excelling in long-sequence and large-batch inference.


This content originally appeared on HackerNoon and was authored by Gating

:::info Authors:

(1) Soham De, Google DeepMind, with equal contributions;

(2) Samuel L. Smith, Google DeepMind, with equal contributions;

(3) Anushan Fernando, Google DeepMind, with equal contributions;

(4) Aleksandar Botev, Google DeepMind, with equal contributions;

(5) George Cristian-Muraru, Google DeepMind, with equal contributions;

(6) Albert Gu, Work done while at Google DeepMind;

(7) Ruba Haroun, Google DeepMind;

(8) Leonard Berrada, Google DeepMind;

(9) Yutian Chen, Google DeepMind;

(10) Srivatsan Srinivasan, Google DeepMind;

(11) Guillaume Desjardins, Google DeepMind;

(12) Arnaud Doucet, Google DeepMind;

(13) David Budden, Google DeepMind;

(14) Yee Whye Teh, Google DeepMind;

(15) Razvan Pascanu, Google DeepMind;

(16) Nando De Freitas, Google DeepMind;

(17) Caglar Gulcehre, Google DeepMind.

:::

1 Introduction

2 Model Architecture

3 Recurrent Models Scale as Efficiently as Transformers

3.1. Scaling curves

3.2. Evaluation on downstream tasks

4 Training Recurrent Models Efficiently on Device and 4.1. Model parallelism for large scale training

4.2. Efficient linear recurrences on device

4.3. Training speed on longer sequences

5. Inference Speed

5.1. A simple model of the decode step

5.2. Results

6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts

6.2. Copy and retrieval capabilities

7. Related Works

8. Conclusion, Acknowledgements, and References

A. RG-LRU Recurrence Gate

B. Complex-Gated Linear Recurrent Unit (CG-LRU)

C. Model Scale Hyper-Parameters

D. Efficient Linear Recurrences on Device

E. The Local Attention Window Size of Griffin

F. Inference Speeds

G. Improving Next Token Prediction with Longer Contexts: Additional Results

H. Additional Details of the Copy and Retrieval Tasks

5.2. Results

Here we look at inference results for 1B-parameter models. As our baseline, we compare against an MQA Transformer, which is significantly faster during inference than the standard MHA Transformer often used in the literature. The models we compare are: i) MQA Transformer, ii) Hawk, and iii) Griffin. For each model we report both latency and throughput.

Latency: We compare the latency of the models at batch size 16, with an empty prefill as well as a prefill of 4096 tokens, as seen in Figure 4. Hawk and Griffin achieve lower sampling latency than the MQA Transformer for long sequences. This is particularly noticeable as the sequence length and the prefill length (which affect the size of the KV cache) increase. Griffin achieves latency similar to Hawk's, demonstrating the excellent compatibility of linear recurrences and local attention.
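To build intuition for why the KV cache dominates at long sequence lengths, here is a minimal back-of-the-envelope sketch. It is not from the paper: the layer count, head dimension, and recurrent state size below are illustrative assumptions for a roughly 1B-parameter model, not the actual configurations.

```python
# Back-of-the-envelope comparison of per-step decode state.
# All dimensions are illustrative assumptions, not the paper's configs.

BYTES = 2  # bfloat16

def mqa_kv_cache_bytes(seq_len, n_layers=24, head_dim=128, batch=16):
    # MQA stores a single shared K and V head per layer, so the cache
    # holds 2 * head_dim values per token per layer and grows with seq_len.
    return batch * n_layers * seq_len * 2 * head_dim * BYTES

def recurrent_state_bytes(n_layers=24, state_dim=1024, batch=16):
    # A linear-recurrence block carries a fixed-size state per layer,
    # independent of how many tokens have already been processed.
    return batch * n_layers * state_dim * BYTES

for seq_len in (1024, 4096, 16384):
    kv = mqa_kv_cache_bytes(seq_len) / 2**20
    rec = recurrent_state_bytes() / 2**20
    print(f"seq_len={seq_len:6d}: MQA KV cache {kv:8.1f} MiB, "
          f"recurrent state {rec:5.2f} MiB")
```

Since the decode step is memory-bound, every sampled token requires reading this state from memory: the MQA Transformer's read grows linearly with the number of tokens processed so far, while Hawk's stays constant. Griffin's local attention cache is bounded by the window size rather than the full sequence, which is consistent with its latency tracking Hawk's.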

Throughput: We compare the maximum throughput (tokens/s) for the same models when sampling 512, 1024, 2048 and 4096 tokens following an empty prompt in Figure 1(b). Both Griffin and Hawk achieve significantly higher throughput than the MQA Transformer baseline. This is partly because recurrent models have lower latency, but mainly because Griffin and Hawk can fit larger batch sizes than the MQA Transformer on a single device, since their cache sizes are smaller. Hawk achieves higher throughput than Griffin, since the size of the local attention cache eventually becomes comparable to the size of the parameters when the batch size is large.
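The batch-size effect can be sketched the same way. The snippet below uses the same illustrative dimensions plus a hypothetical 16 GiB memory budget to estimate the largest batch that fits once the parameters are resident; it ignores activation memory and is only meant to show the orders of magnitude involved.

```python
# Illustrative single-device batch-size estimate. The memory budget,
# parameter count, and model dimensions are assumptions, not measurements.

BYTES = 2                        # bfloat16
BUDGET = 16 * 2**30              # hypothetical 16 GiB of device memory
PARAMS = 1_000_000_000 * BYTES   # ~1B parameters resident on device

N_LAYERS, HEAD_DIM, STATE_DIM, SEQ_LEN = 24, 128, 1024, 4096

# Per-sequence decode state: MQA keeps K and V for every past token,
# while a recurrent block keeps one fixed-size state per layer.
mqa_per_seq = N_LAYERS * SEQ_LEN * 2 * HEAD_DIM * BYTES
rec_per_seq = N_LAYERS * STATE_DIM * BYTES

def max_batch(per_seq_bytes):
    # Memory left over after parameters, divided by per-sequence state.
    return (BUDGET - PARAMS) // per_seq_bytes

print("MQA Transformer max batch:", max_batch(mqa_per_seq))
print("Recurrent model max batch:", max_batch(rec_per_seq))
```

In practice activation memory, and for Griffin the growing local attention cache, cap the batch size well before this naive recurrent-model limit, which matches the observation above that Hawk eventually outpaces Griffin at large batch sizes.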


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
