Textbooks are All You Need: Model Architecture and Training

In this study, researchers from Microsoft introduce phi-1, a new large language model for code that is significantly smaller than competing models.


This content originally appeared on HackerNoon and was authored by Knapsack

:::info Authors:

(1) Suriya Gunasekar, Microsoft Research;

(2) Yi Zhang, Microsoft Research;

(3) Jyoti Aneja, Microsoft Research;

(4) Caio César Teodoro Mendes, Microsoft Research;

(5) Allie Del Giorno, Microsoft Research;

(6) Sivakanth Gopi, Microsoft Research;

(7) Mojan Javaheripi, Microsoft Research;

(8) Piero Kauffmann, Microsoft Research;

(9) Gustavo de Rosa, Microsoft Research;

(10) Olli Saarikivi, Microsoft Research;

(11) Adil Salim, Microsoft Research;

(12) Shital Shah, Microsoft Research;

(13) Harkirat Singh Behl, Microsoft Research;

(14) Xin Wang, Microsoft Research;

(15) Sébastien Bubeck, Microsoft Research;

(16) Ronen Eldan, Microsoft Research;

(17) Adam Tauman Kalai, Microsoft Research;

(18) Yin Tat Lee, Microsoft Research;

(19) Yuanzhi Li, Microsoft Research.

:::

2.3 Model architecture and training

We use a decoder-only transformer [VSP+ 17] model with the FlashAttention implementation of multi-head attention (MHA) [DFE+ 22]. We also use MHA and MLP layers in a parallel configuration, following recent models such as CodeGen [NPH+ 22], PaLM [CND+ 22], and GPT-NeoX [BBH+ 22]. The architecture for our 1.3B parameter phi-1 model consists of 24 layers, hidden dimension of 2048, MLP-inner dimension of 8192, and 32 attention heads of dimension 64 each. The smaller 350M parameter phi-1-small model consists of 20 layers, hidden dimension of 1024, MLP-inner dimension of 4096, and 16 attention heads of dimension 64 each. We also use rotary position embeddings [SLP+ 21] with rotary dimension 32. These architectural choices were adopted from [NPH+ 22], and we use the same tokenizer as codegen-350M-mono [NPH+ 22]. Aside from FlashAttention, our models do not use other techniques such as Fill-In-the-Middle (FIM) [BJT+ 22] or Multi-Query-Attention (MQA) [RSR+ 20] that could further boost performance and efficiency [LAZ+ 23].
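To make the parallel attention/MLP configuration concrete, here is a minimal PyTorch-style sketch. It is illustrative only: class and field names such as `PhiConfig` and `ParallelBlock` are hypothetical and not from the paper's code, and FlashAttention, rotary embeddings, and causal masking are omitted in favor of a plain `nn.MultiheadAttention` for brevity.

```python
# Illustrative sketch only; the numbers mirror the phi-1 configuration above.
import torch
import torch.nn as nn
from dataclasses import dataclass

@dataclass
class PhiConfig:
    n_layers: int = 24      # phi-1; phi-1-small uses 20
    d_model: int = 2048     # hidden dimension (1024 for phi-1-small)
    d_mlp: int = 8192       # MLP inner dimension (4096 for phi-1-small)
    n_heads: int = 32       # heads of dimension 64 each (16 for phi-1-small)
    rotary_dim: int = 32    # rotary position embedding dimension (not used here)

class ParallelBlock(nn.Module):
    """Attention and MLP applied in parallel to the same normalized input,
    in the CodeGen/PaLM/GPT-NeoX style of parallel configuration."""
    def __init__(self, cfg: PhiConfig):
        super().__init__()
        self.ln = nn.LayerNorm(cfg.d_model)
        self.attn = nn.MultiheadAttention(cfg.d_model, cfg.n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.d_model, cfg.d_mlp),
            nn.GELU(),
            nn.Linear(cfg.d_mlp, cfg.d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # A single residual stream sums both branches instead of stacking them.
        return x + attn_out + self.mlp(h)

cfg = PhiConfig()
block = ParallelBlock(cfg)
print(block(torch.randn(1, 16, cfg.d_model)).shape)  # torch.Size([1, 16, 2048])
```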

For both pretraining and finetuning, we concatenate our respective datasets into a single one-dimensional array, with the “⟨∣endoftext∣⟩” token separating files. We train our models on sequences of length 2048 sliced from this array, with a next-token prediction loss. We use fp16 training with the AdamW optimizer, a linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1. We train on 8 NVIDIA A100 GPUs using DeepSpeed. Our pretrained base model, phi-1-base, was obtained in under 4 days of training; finetuning to obtain phi-1 used an additional 7 hours on the same hardware.
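A rough sketch of this packing step, assuming each file has already been tokenized; the `pack_dataset` helper and the separator token id are illustrative, not taken from the paper's pipeline.

```python
# Hedged sketch of the data packing described above; the token id for the
# "<|endoftext|>" separator is a placeholder, not the actual vocabulary value.
import numpy as np

SEQ_LEN = 2048
ENDOFTEXT_ID = 50256  # placeholder separator token id

def pack_dataset(tokenized_files: list[list[int]]) -> np.ndarray:
    """Concatenate all files into one 1-D token array, separated by the
    end-of-text token, and slice it into fixed-length training sequences."""
    flat: list[int] = []
    for toks in tokenized_files:
        flat.extend(toks)
        flat.append(ENDOFTEXT_ID)
    flat_arr = np.array(flat, dtype=np.int64)
    n_seqs = len(flat_arr) // SEQ_LEN            # drop the ragged tail
    return flat_arr[: n_seqs * SEQ_LEN].reshape(n_seqs, SEQ_LEN)

# Next-token prediction: inputs are tokens[:-1], targets are tokens[1:].
batch = pack_dataset([[1, 2, 3] * 1500, [4, 5, 6] * 1500])
inputs, targets = batch[:, :-1], batch[:, 1:]
print(batch.shape, inputs.shape, targets.shape)
```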

Pretraining. phi-1-base was trained on the CodeTextbook dataset (the filtered code-language corpus plus synthetic textbooks). We use an effective batch size of 1024 (including data parallelism and gradient accumulation), a maximum learning rate of 1e-3 with warmup over 750 steps, and weight decay of 0.1, for a total of 36,000 steps. We use the checkpoint at 24,000 steps as our phi-1-base; this is equivalent to ∼8 epochs on our CodeTextbook dataset, or a little over 50B training tokens. Despite its small size and modest compute, this model already achieves 29% accuracy on HumanEval.
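As a quick sanity check on the reported token count, assuming every optimization step consumes a full effective batch of 1024 sequences of 2048 tokens:

```python
# Back-of-the-envelope check of the ~50B token figure quoted above.
batch_size = 1024        # effective batch size (data parallel + grad accumulation)
seq_len = 2048
steps = 24_000           # checkpoint used as phi-1-base

tokens_per_step = batch_size * seq_len        # ~2.1M tokens per step
total_tokens = tokens_per_step * steps
print(f"{total_tokens / 1e9:.1f}B tokens")    # ≈ 50.3B, "a little over 50B"
```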

Finetuning. phi-1 is obtained by finetuning phi-1-base on the CodeExercises dataset. For finetuning, we use the same setup as pretraining but different hyperparameters: an effective batch size of 256, a maximum learning rate of 1e-4 with 50 steps of warmup, and weight decay of 0.01. We train for a total of 6,000 steps and pick the best checkpoint (saved every 1,000 steps).
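A hedged sketch of this finetuning optimizer setup, pairing AdamW with a linear-warmup, linear-decay schedule; betas and epsilon are left at PyTorch defaults since the paper does not report them, and `model` here is a stand-in module.

```python
# Illustrative optimizer/scheduler setup for the finetuning hyperparameters above.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_LR, WEIGHT_DECAY = 1e-4, 0.01
WARMUP_STEPS, TOTAL_STEPS = 50, 6_000

model = torch.nn.Linear(8, 8)  # stand-in for phi-1-base
optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP_STEPS, then linear decay toward 0."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
# In the training loop: optimizer.step(); scheduler.step()
```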


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
