Efficient Training: Scaling Griffin Models for Large-Scale AI on TPUs

We encountered two main engineering challenges when developing and scaling our models. First, how to efficiently shard our models across multiple devices. Second, how to efficiently implement linear recurrences to maximize training efficiency on TPUs. We address both of these challenges in this section, before providing an empirical comparison of the training speed of Griffin and our MQA baseline.

4.1. Model parallelism for large scale training

As our model increases in size, we cannot fit the model on a single device during training, even with a batch size of 1 per device. We therefore use model parallelism to shard our large models across devices during training. Since communication across training devices is expensive, efficiently sharding the model is critical for fast training at scale.

MLP and MQA block. For our gated-MLP block we use Megatron-style sharding (Shoeybi et al., 2019), which requires a single all-reduce operation in both the forward and the backward pass. Similarly, we apply the same strategy to the linear layers in the attention block, and additionally shard the attention mechanism over its heads (Narayanan et al., 2021).
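To make the single all-reduce concrete, the following is a minimal single-host sketch of Megatron-style sharding for a gated MLP, using NumPy rather than real TPU collectives. The function names, the tensor shapes, the GeLU activation, and the number of simulated devices are all assumptions chosen for illustration and are not taken from the text. The two input projections are split by columns and the output projection by rows, so each simulated device produces a partial output, and the sum over devices plays the role of the forward-pass all-reduce.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GeLU (the activation is an assumption for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, w_gate, w_up, w_down):
    # Unsharded reference: elementwise gate, then output projection.
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

def gated_mlp_sharded(x, w_gate, w_up, w_down, num_devices):
    # Column-parallel: split both input projections along the hidden axis.
    gate_shards = np.split(w_gate, num_devices, axis=1)
    up_shards = np.split(w_up, num_devices, axis=1)
    # Row-parallel: split the output projection along the same hidden axis.
    down_shards = np.split(w_down, num_devices, axis=0)
    # Each simulated device uses only its own shards; no communication is
    # needed here because the gate and activation are elementwise.
    partials = [
        (gelu(x @ g) * (x @ u)) @ d
        for g, u, d in zip(gate_shards, up_shards, down_shards)
    ]
    # Summing the partial outputs stands in for the single all-reduce.
    return sum(partials)

# Hypothetical sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
d_model, d_hidden, num_devices = 8, 32, 4
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_hidden))
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))

reference = gated_mlp(x, w_gate, w_up, w_down)
sharded = gated_mlp_sharded(x, w_gate, w_up, w_down, num_devices)
max_abs_diff = float(np.max(np.abs(reference - sharded)))
```

In this sketch `max_abs_diff` should be at the level of floating-point rounding, which illustrates why no communication is required between the two matrix multiplications: only the final partial sums need to be combined across devices.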

Other considerations. Optimizer states can consume significant memory, exceeding the size of the model parameters themselves. To address this, we employ ZeRO parallelism (Rajbhandari et al., 2020), distributing both optimizer states and model parameters across the batch shards. We also use bfloat16 representation for model parameters and activations, minimizing any data transfer overhead.
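The memory argument above can be illustrated with a rough back-of-the-envelope calculation. The sketch below assumes an Adam-style optimizer that keeps two float32 moments per parameter (8 bytes) alongside bfloat16 parameters (2 bytes); the optimizer choice, the parameter count, and the number of batch shards are hypothetical values that the text does not specify. It also ignores activations, gradients, and any additional model-parallel sharding.

```python
def per_device_bytes(num_params, num_batch_shards,
                     param_bytes=2, optimizer_bytes_per_param=8):
    """Estimate per-device memory for parameters and optimizer states.

    param_bytes=2 corresponds to bfloat16 parameters.
    optimizer_bytes_per_param=8 assumes two float32 moments per parameter
    (an Adam-style optimizer; this is an assumption of the sketch).
    Both quantities are divided evenly across the batch shards, as in
    ZeRO-style sharding of parameters and optimizer states.
    """
    params = num_params * param_bytes / num_batch_shards
    optimizer = num_params * optimizer_bytes_per_param / num_batch_shards
    return {"params": params, "optimizer": optimizer, "total": params + optimizer}

# Hypothetical model size and shard count, for illustration only.
num_params = 7_000_000_000
replicated = per_device_bytes(num_params, num_batch_shards=1)
zero_sharded = per_device_bytes(num_params, num_batch_shards=16)

gib = 1024 ** 3
for name, usage in [("replicated", replicated), ("ZeRO-sharded", zero_sharded)]:
    print(f"{name}: params {usage['params'] / gib:.2f} GiB, "
          f"optimizer {usage['optimizer'] / gib:.2f} GiB, "
          f"total {usage['total'] / gib:.2f} GiB")
```

Under these assumptions the optimizer states are four times larger than the bfloat16 parameters, and sharding both across the batch shards reduces the per-device footprint in proportion to the number of shards.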

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Gating

APA

Gating | Sciencx (2025-01-14T04:50:34+00:00) Efficient Training: Scaling Griffin Models for Large-Scale AI on TPUs. Retrieved from https://www.scien.cx/2025/01/14/efficient-training-scaling-griffin-models-for-large-scale-ai-on-tpus/
