This content originally appeared on HackerNoon and was authored by Knapsack
:::info Authors:
(1) Suriya Gunasekar, Microsoft Research;
(2) Yi Zhang, Microsoft Research;
(3) Jyoti Aneja, Microsoft Research;
(4) Caio César Teodoro Mendes, Microsoft Research;
(5) Allie Del Giorno, Microsoft Research;
(6) Sivakanth Gopi, Microsoft Research;
(7) Mojan Javaheripi, Microsoft Research;
(8) Piero Kauffmann, Microsoft Research;
(9) Gustavo de Rosa, Microsoft Research;
(10) Olli Saarikivi, Microsoft Research;
(11) Adil Salim, Microsoft Research;
(12) Shital Shah, Microsoft Research;
(13) Harkirat Singh Behl, Microsoft Research;
(14) Xin Wang, Microsoft Research;
(15) Sébastien Bubeck, Microsoft Research;
(16) Ronen Eldan, Microsoft Research;
(17) Adam Tauman Kalai, Microsoft Research;
(18) Yin Tat Lee, Microsoft Research;
(19) Yuanzhi Li, Microsoft Research.
:::
Table of Links
- Abstract and 1. Introduction
- 2 Training details and the importance of high-quality data
- 2.1 Filtering of existing code datasets using a transformer-based classifier
- 2.2 Creation of synthetic textbook-quality datasets
- 2.3 Model architecture and training
- 3 Spikes of model capability after finetuning on CodeExercises, 3.1 Finetuning improves the model’s understanding, and 3.2 Finetuning improves the model’s ability to use external libraries
- 4 Evaluation on unconventional problems with LLM grading
- 5 Data pruning for unbiased performance evaluation
- 5.1 N-gram overlap and 5.2 Embedding and syntax-based similarity analysis
- 6 Conclusion and References
- A Additional examples for Section 3
- B Limitation of phi-1
- C Examples for Section 5
2.3 Model architecture and training
We use a decoder-only transformer [VSP+ 17] model with the FlashAttention implementation of multi-head attention (MHA) [DFE+ 22]. We also use MHA and MLP layers in a parallel configuration, following recent models such as CodeGen [NPH+ 22], PaLM [CND+ 22], and GPT-NeoX [BBH+ 22]. The architecture of our 1.3B-parameter phi-1 model consists of 24 layers, a hidden dimension of 2048, an MLP inner dimension of 8192, and 32 attention heads of dimension 64 each. The smaller 350M-parameter phi-1-small model consists of 20 layers, a hidden dimension of 1024, an MLP inner dimension of 4096, and 16 attention heads of dimension 64 each. We also use rotary position embeddings [SLP+ 21] with a rotary dimension of 32. These architectural choices were adopted from [NPH+ 22], and we use the same tokenizer as codegen-350M-mono [NPH+ 22]. Aside from FlashAttention, our models do not use other techniques such as Fill-In-the-Middle (FIM) [BJT+ 22] or Multi-Query Attention (MQA) [RSR+ 20] that could further boost performance and efficiency [LAZ+ 23].
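\ To make the "parallel configuration" concrete, here is a minimal PyTorch-style sketch of one decoder block with the phi-1 dimensions quoted above. It is an illustrative assumption, not the authors' code: it omits rotary embeddings, FlashAttention, and the causal mask plumbing, and uses `nn.MultiheadAttention` purely as a stand-in.

```python
# Hypothetical sketch (not the paper's implementation) of a pre-norm decoder
# block where attention and MLP read the same normalized input and their
# outputs are summed, in the style of CodeGen/PaLM/GPT-NeoX.
import torch.nn as nn

class ParallelDecoderBlock(nn.Module):
    def __init__(self, d_model=2048, n_heads=32, d_mlp=8192, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)  # shared pre-norm input for both branches
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask,
                                need_weights=False)
        # residual dropout on both branches, summed in parallel
        return x + self.drop(attn_out) + self.drop(self.mlp(h))

# phi-1 (1.3B): 24 such layers with the defaults above.
# phi-1-small (350M): 20 layers, d_model=1024, d_mlp=4096, n_heads=16.
```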
For both pretraining and finetuning, we concatenate the respective datasets into a single one-dimensional array, using the "⟨∣endoftext∣⟩" token to separate files. We train our models on sequences of length 2048 sliced from this array with a next-token prediction loss. We use fp16 training with the AdamW optimizer, a linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1. We train on 8 NVIDIA A100 GPUs using DeepSpeed. Our pretrained base model, phi-1-base, was obtained in under 4 days of training; finetuning it to obtain phi-1 took an additional 7 hours on the same hardware.
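\ The data packing described above can be sketched in a few lines. This is an assumed illustration of the general technique (the function name and use of NumPy are ours, not the paper's):

```python
# Minimal sketch: concatenate tokenized files into one flat array with an
# end-of-text separator, then slice into fixed 2048-token training sequences.
import numpy as np

SEQ_LEN = 2048

def pack_dataset(tokenized_files, eot_token_id):
    flat = []
    for tokens in tokenized_files:
        flat.extend(tokens)
        flat.append(eot_token_id)      # marks the file boundary
    flat = np.asarray(flat, dtype=np.int32)
    n_seqs = len(flat) // SEQ_LEN      # drop the trailing remainder
    return flat[: n_seqs * SEQ_LEN].reshape(n_seqs, SEQ_LEN)
```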
Pretraining. phi-1-base was trained on the CodeTextbook dataset (the filtered code-language corpus plus the synthetic textbooks). We use an effective batch size of 1024 (including data parallelism and gradient accumulation), a maximum learning rate of 1e-3 with warmup over 750 steps, and weight decay 0.1, for a total of 36,000 steps. We use the checkpoint at 24,000 steps as phi-1-base; this corresponds to roughly 8 epochs over the CodeTextbook dataset, or a little over 50B training tokens in total. Despite its small size and modest compute, this model already achieves 29% accuracy on HumanEval.
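\ A quick back-of-the-envelope check (our own arithmetic, not from the paper) shows how the quoted numbers fit together:

```python
# Token budget implied by the pretraining hyperparameters above.
steps_used  = 24_000                 # checkpoint used as phi-1-base
batch_size  = 1_024                  # effective batch size
seq_len     = 2_048
tokens_seen = steps_used * batch_size * seq_len
print(f"{tokens_seen / 1e9:.1f}B tokens")   # ~50.3B, i.e. "a little over 50B"
# At ~8 epochs, this implies a CodeTextbook corpus of roughly
# tokens_seen / 8, on the order of 6-7B tokens.
```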
Finetuning. phi-1 is obtained by finetuning phi-1-base on the CodeExercises dataset. We use the same setup as for pretraining but different hyperparameters: an effective batch size of 256, a maximum learning rate of 1e-4 with 50 warmup steps, and weight decay 0.01. We train for a total of 6,000 steps and pick the best checkpoint (saved every 1,000 steps).
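\ For reference, the linear-warmup-linear-decay schedule mentioned earlier could look as follows with the finetuning hyperparameters (a hedged sketch under our own assumptions about the exact shape of the schedule):

```python
# Linear warmup to max_lr over warmup_steps, then linear decay to 0 by
# total_steps; defaults match the finetuning settings quoted above.
def lr_at_step(step, max_lr=1e-4, warmup_steps=50, total_steps=6_000):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    return max_lr * max(0.0, (total_steps - step) / remaining)
```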
:::info This paper is available on arXiv under the CC BY 4.0 DEED license.
:::