Griffin Models: Outperforming Transformers with Scalable AI Innovation

We present our main scaling results in Figure 1(a). All three model families (the Transformer baseline, Hawk, and Griffin) are trained at a range of model scales from 100M to 7B parameters, with an additional Griffin model at 14B parameters. We increase the number of training tokens roughly in proportion to the number of model parameters, as prescribed by the Chinchilla scaling laws (Hoffmann et al., 2022). Models are trained on the MassiveText dataset (Hoffmann et al., 2022), previously used to train Gopher (Rae et al., 2021) and Chinchilla (Hoffmann et al., 2022), although we use a slightly different data subset distribution. We use a sequence length of 2048 tokens (see Section 6 for results with longer sequences). All experiments use the AdamW optimizer (Loshchilov and Hutter, 2017). We tune the learning rate, weight decay, and β₂ parameters for small models, and use these runs to identify scaling rules for these hyperparameters that predict their optimal values for the 7B and 14B models.
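To make the "tokens roughly proportional to parameters" rule concrete, the sketch below computes a token budget and a rough training-compute estimate for the model sizes named above. The constant of proportionality is not stated in this section: the value of 20 tokens per parameter is only a commonly quoted rule of thumb associated with Chinchilla, and the "6 × parameters × tokens" FLOPs estimate is a standard approximation for dense models rather than the authors' accounting. The function names are hypothetical.

```python
# Illustrative sketch only. TOKENS_PER_PARAM and the 6*N*D FLOPs estimate
# are assumptions, not values taken from this section.
TOKENS_PER_PARAM = 20  # assumed rule-of-thumb ratio


def token_budget(n_params: int, tokens_per_param: float = TOKENS_PER_PARAM) -> int:
    """Training tokens scaled in proportion to the parameter count."""
    return int(n_params * tokens_per_param)


def approx_train_flops(n_params: int, n_tokens: int) -> float:
    """Rough estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


# Smallest and largest scales mentioned in the text, plus the 14B Griffin model.
MODEL_SIZES = {"100M": 100_000_000, "7B": 7_000_000_000, "14B": 14_000_000_000}

for name, n_params in MODEL_SIZES.items():
    n_tokens = token_budget(n_params)
    flops = approx_train_flops(n_params, n_tokens)
    print(f"{name}: {n_tokens:.3e} tokens, ~{flops:.3e} training FLOPs")
```

Under a fixed ratio, doubling the parameter count doubles the token budget and therefore roughly quadruples the estimated training compute, which is why the scaling curves are plotted against FLOPs rather than model size alone.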

All three model families demonstrate a linear scaling relationship between the validation loss and training FLOPs when both axes are plotted in log scale (see Figure 1(a)), that is, a power law, as previously observed for Transformers by Brown et al. (2020). Notably, Griffin achieves lower validation loss than the Transformer baseline across all FLOPs budgets despite not using any global attention layers. Hawk, on the other hand, achieves slightly higher validation loss, but this gap appears to close as the training budget increases.
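A straight line on log-log axes corresponds to a power law of the form loss ≈ a · FLOPs^b, so such a trend can be estimated with an ordinary least-squares fit on the logarithms. The following is a minimal sketch under that assumption. The data points are synthetic, generated from a known power law purely for illustration; they are not measurements from the paper, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(flops, losses):
    """Least-squares line fit in log-log space.

    Returns (a, b) such that loss is approximately a * flops ** b.
    """
    xs = [math.log(f) for f in flops]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic points from a made-up power law (NOT data from the paper).
synthetic_flops = [1e18, 1e19, 1e20, 1e21]
synthetic_losses = [50.0 * f ** -0.05 for f in synthetic_flops]

a, b = fit_power_law(synthetic_flops, synthetic_losses)
print(f"fitted prefactor a = {a:.4f}, exponent b = {b:.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating prefactor and exponent up to floating-point error. With real training runs the points scatter around the line, and comparing fitted curves across model families is one way to read a gap that "appears to close" as compute grows.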

Table 1 | Character-normalized accuracy. Hawk is competitive with our Transformer baseline, and exceeds the reported performance of Mamba despite being trained on half as many tokens. Griffin outperforms our Transformer baseline, and matches the performance of Llama-2 despite being trained on roughly 7 times fewer tokens. We report unnormalized accuracy with partial scoring for WinoGrande.
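The caption does not spell out how character normalization is computed. One common reading for multiple-choice benchmarks is that each candidate answer's log-likelihood is divided by its length in characters before the highest-scoring candidate is selected, so that longer answers are not penalized simply for containing more tokens. The sketch below illustrates that reading only. It is an assumption about the metric rather than the authors' evaluation code, and the log-likelihood values are made up.

```python
def pick_unnormalized(log_likelihoods):
    """Index of the candidate with the highest raw log-likelihood."""
    return max(range(len(log_likelihoods)), key=log_likelihoods.__getitem__)


def pick_char_normalized(choices, log_likelihoods):
    """Index of the candidate with the highest log-likelihood per character."""
    scores = [ll / len(text) for text, ll in zip(choices, log_likelihoods)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical candidates and scores (not model outputs).
choices = ["yes", "a considerably longer answer"]
log_likelihoods = [-4.0, -9.0]

print("unnormalized pick:", pick_unnormalized(log_likelihoods))
print("char-normalized pick:", pick_char_normalized(choices, log_likelihoods))
```

In this toy case the raw score favors the short answer, while the per-character score favors the longer one, which shows why the choice of normalization can change reported accuracy.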

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Gating


APA

Gating | Sciencx (2025-01-13T16:23:50+00:00) Griffin Models: Outperforming Transformers with Scalable AI Innovation. Retrieved from https://www.scien.cx/2025/01/13/griffin-models-outperforming-transformers-with-scalable-ai-innovation/


