Chinchilla Explained: How to read DeepMind's paper on Compute-Optimal Scaling Laws

What are the Chinchilla scaling laws?

Hi everyone 👋

Today we'll dive into DeepMind's Chinchilla paper on scaling laws, what's on my reading list, and some fun Midjourney creations.

Let's dive in!

DALL-E on a Chinchilla with a pearl earring (more of a necklace really…)

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

🗞️ News

What's on my reading list?

  • What is RedPajama?

  • What about Falcon? (Link)

  • Magic Dev's 5M token context window! Anthropic's is 100k for reference (Link)

  • Chris Olah on AI safety methods (Link)

  • Stephen Wolfram on how ChatGPT works (Link)

  • Yoav Goldberg's post on Reinforcement Learning (very readable!) (Link)

  • OpenAI's Improving Mathematical Reasoning with Process Supervision research (Link)

📚 Concepts & Learning

Today we're going to dive into DeepMind's Chinchilla paper: Training Compute-Optimal Large Language Models.

As LLMs get more sophisticated, larger, and more capable, it's important to figure out the optimal mix between:

  • the size of your training dataset (number of training tokens)

  • the size of your model (number of parameters)

  • the compute budget (number of FLOPs)

Why?

Getting the right mix of these factors affects the behavior and performance of the LLM, including how well it learns what the desired output looks like.

It's also really expensive to train increasingly capable models, so getting these choices right can save you a lot of money and time.

Bigger model = better performance ?

Historically, researchers believed that a key to unlocking better model performance was training larger models. For example, researchers expected that a 540B parameter model (like Google's PaLM model) would outperform a 280B parameter model (like DeepMind's Gopher), all else being held the same (ceteris paribus).

🤔 Researchers: "I think…"

540B model > 280B model

Researchers also believed that the size of the training dataset should scale roughly 1:1 with the size of the model, meaning you need roughly 1 additional training token for every 1 additional parameter.

🤔 Gopher Researchers: "I think…"

280B model needs 300B training tokens

So, a 560B model probably needs… ~560B tokens? 🤷‍♀️

But DeepMind's Chinchilla paper, released in March 2022, demonstrated a compelling alternative. The paper posited that increasing model size was not the only way to improve performance. In fact, one main conclusion of the paper is that models at that time were massively oversized and massively undertrained.

The results from the research paper center around DeepMind's Gopher model, which was one of the best-in-class at the time. The researchers conclude that holding Gopher's compute budget constant, the optimal model would have been more than 4x smaller than Gopher and trained on 4x more data!

🤔 Chinchilla Researchers: "I think we should…"

Cut Gopher's size by 75%

Train on 4x more data

Another key finding from the paper is that the size of the training dataset (training tokens) should scale in proportion to the model size (number of parameters): if you scale the model up by some factor, you should scale the training data by roughly the same factor. The researchers also find that the ratios used at the time were way off: models were too big relative to their compute budgets and training datasets (both of which were too small).

  • Rate: If you double the size of your model (number of parameters), you should double the size of the training dataset (training tokens).

*not taking into account the optimal compute budget

  • Ratio: Instead of a ~1:1 token-to-parameter ratio, Chinchilla concludes the ratio of training tokens to parameters should be ~21:1 (given Gopher's compute budget). A quick sketch of this arithmetic follows these bullets.

*not taking into account the optimal compute budget
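To make the rate and ratio concrete, here's a minimal back-of-the-envelope sketch in Python. It assumes the ~21:1 tokens-per-parameter ratio and the "double the model, double the data" rule quoted above; it is not the paper's fitted formula, just illustrative arithmetic.

```python
# Back-of-the-envelope scaling heuristics (illustrative only, not the paper's fitted formulas).

TOKENS_PER_PARAM = 21  # the ~21:1 ratio quoted above (at Gopher's compute budget)

def chinchilla_style_tokens(n_params: float) -> float:
    """Rough training-token count suggested by the ~21:1 ratio."""
    return TOKENS_PER_PARAM * n_params

def old_heuristic_tokens(n_params: float) -> float:
    """The older ~1:1 token-to-parameter heuristic."""
    return n_params

for n_params in [70e9, 140e9, 280e9]:  # doubling the model size each step
    print(f"{n_params / 1e9:.0f}B params -> old heuristic ~{old_heuristic_tokens(n_params) / 1e9:.0f}B tokens, "
          f"Chinchilla-style ~{chinchilla_style_tokens(n_params) / 1e12:.2f}T tokens")
```

Notice that each time the parameter count doubles, the suggested token count doubles too: the data scales with the model.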

Okay… so how did the researchers come to these conclusions (and how do you read the paper's graphics)?

I gotchu. Let's dive in below.

Some definitions

Before we dive in, it's worth defining some key terms. Loss is used a lot throughout the research, so I thought I'd run through a quick explanation and example:

What is a loss function?

A simple explanation: there is a desired output and there is your model's actual output. The difference between the two is your loss. You're trying to build models that minimize that loss as much as possible.

What makes that so hard?

Models are highly parameterized, so in order to "fit your curve" to the points in your dataset, you'll need to do a lot of adjusting.

Here's a visual example:

Of course this is a toy example. Real models are highly parameterized, into the billions or trillions of parameters, while this example fits a single relationship between just two variables: x and y.

The basic intuition to get across here is:

  • the red lines represent the model's prediction of the relationship between the x- and y-axes

  • the pink dots represent individual points in a training dataset

  • the blue dots represent the expected output given the red-line relationship between the x and y variables

  • the green line segments represent the loss (the difference between the expected blue dots and the actual pink dots)

In order to improve your model, you want the difference between the expected (blue dot) and the actual (pink dot) to be minimized. So in the graphics in today's post, when we see loss represented as a number, we want that loss number to be as low as possible.
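If it helps to see "loss" as an actual number rather than green line segments, here's a tiny toy sketch: a straight-line model with one adjustable parameter (its slope), scored by mean squared error. The data points and slope values are made up for illustration.

```python
# Toy loss example: a straight-line "model" predicts y from x,
# and mean squared error measures the average squared gap between
# predictions (blue dots) and actual data (pink dots).

data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]  # (x, actual y) pairs -- the pink dots

def mse(slope: float) -> float:
    errors = [(slope * x - y) ** 2 for x, y in data]  # squared length of each "green line"
    return sum(errors) / len(errors)

for slope in [0.5, 1.0, 1.5]:  # adjusting the model's single parameter
    print(f"slope={slope}: loss={mse(slope):.3f}")

# The slope with the lowest printed loss fits the pink dots best.
```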

What are tokens?

To give you a sense of size, 100k tokens is roughly equivalent to 75k words, which is roughly the size of 1 book (a quick check of this arithmetic follows below).

- 5B tokens is roughly 50k books

- 400B tokens is roughly 4M books

Read OpenAI's token-to-word estimate here
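A quick sanity check of those conversions, using OpenAI's rough ~75 words per 100 tokens rule of thumb and an assumed ~75k words per book:

```python
WORDS_PER_TOKEN = 0.75    # rough rule of thumb: ~75 words per 100 tokens
WORDS_PER_BOOK = 75_000   # assumed length of one book

def tokens_to_books(n_tokens: float) -> float:
    return n_tokens * WORDS_PER_TOKEN / WORDS_PER_BOOK

for n_tokens in [100e3, 5e9, 400e9]:
    print(f"{n_tokens:.0e} tokens is roughly {tokens_to_books(n_tokens):,.0f} books")
```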

What is Gopher?

In short: Gopher was a SOTA model that DeepMind trained before the Chinchilla paper was released; it serves as the performance baseline for the Chinchilla model that came out of this paper. Its key stats (a rough compute sanity check follows the list below):

- 280B parameters
- 300B training tokens
- ~10^24 FLOPs
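As a rough sanity check on that FLOP count, a widely used rule of thumb (not stated in this post, but consistent with the paper's estimates) is that training compute is approximately C ≈ 6 × parameters × tokens:

```python
# Rough training-compute estimate using the common C ≈ 6 * N * D rule of thumb.
N = 280e9  # Gopher's parameter count
D = 300e9  # Gopher's training tokens
C = 6 * N * D
print(f"Estimated training compute: {C:.1e} FLOPs")  # ~5e23, i.e. on the order of 10^24
```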

Great! Now that we've defined that, let's jump into the Chinchilla methods.

Testing methods

There are three testing methods outlined in the paper:

  1. Fixed model size

  2. Fixed compute budget

  3. Parametric loss function (not discussed in today's post)

In this post, we'll touch upon method 1 and method 2, but leave method 3 for a future post.

There are also three key variables that the researchers adjust throughout in order to find the optimal mix. The three variables are:

  • compute (measured in FLOPs)

  • training dataset (measured in tokens)

  • model size (measured in parameters)

1️⃣ Method 1: Fixed model size, variable training dataset

Method 1 in two sentences:

The researchers started with a set of pre-determined model sizes and varied the training dataset sizes.

They then measured the compute used to train those models and plotted the relationships between model size, training data size, and compute used for each one.

In this method, the researchers decided to train models of pre-determined, fixed sizes. They tested a set of models ranging from 70M to 10B parameters. For each model size, they set up 4 different training runs, varying things like the size of the training dataset and the learning rate schedule. From the results of these 4 discrete training runs, the researchers were able to extrapolate a function representing the relationship between training loss and compute budget for each model size (see the graphic below).

Remember we want to minimize the loss (see definition above). The extrapolation allowed the researchers to create a graphic with continuous lines, rather than 4 discrete points per model size. Each line above corresponds to one model size (a fixed parameter count). The researchers were then able to project what combinations of training data and FLOP counts minimized the loss for each model size. They then graph those optimized combinations in grey (see the grey tips at the bottom of each curve). We'll walk through how to interpret this later.

It's worth noting that the researchers didn't decide in advance how much compute to use for training. Instead, they ran the training and measured the amount of compute used. Because the method isn't constrained to any particular compute budget, the researchers can estimate the best model size and training dataset size for a range of compute budgets.

Okay let's break down this graphic:

  • Training loss (which we want to minimize) is on the y-axis.

  • FLOPs (or compute budgets) are on the x-axis.

  • Yellow are larger model sizes, black and purple are smaller model sizes.

  • Each curve represents one model size. Each point represents an individual model.

  • Grey frontier denotes the final/optimized models.

  • Minimum loss for a small model (in black) at 75M parameters is ~3.2. It doesn't perform as well as larger models, but it uses fewer FLOPs (compute) at ~10^18.

  • Minimum loss for a large model (in yellow) at 10B parameters is ~2.1. It performs better than smaller models, but uses more compute at ~10^22 FLOPs.

  • Since we want loss to be minimized, the smaller the loss number the better. Smaller compute budgets are also nice (they reduce costs), so ideally we'd be in the bottom-left corner. In practice we usually care more about capability than cost, though, so the bottom-right corner is the second best place to be.

  • It makes sense that smaller models (in black) can only minimize to ~3.2 while larger models (in yellow) can minimize to ~2.1. Larger models are typically more capable than smaller models.

  • Note: Optimal sizes of the training datasets are not shown here.

All of this is just to motivate how we got these grey data points, which are the points that minimize the loss for each model size and training method.
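Here's a minimal sketch of how those grey points could be extracted, assuming you've recorded (FLOPs, loss) points for each model size during training. The numbers below are invented placeholders, not the paper's data.

```python
# Sketch: for each compute level, keep the (model size, loss) pair with the lowest loss.
# model size (params) -> list of (FLOPs used, training loss) -- invented placeholder values
curves = {
    75e6: [(1e17, 3.9), (1e18, 3.3), (1e19, 3.2)],
    1e9:  [(1e18, 3.4), (1e19, 2.8), (1e20, 2.6)],
    10e9: [(1e20, 2.5), (1e21, 2.3), (1e22, 2.1)],
}

frontier = {}  # compute level -> (best model size, lowest loss)
for n_params, points in curves.items():
    for flops, loss in points:
        if flops not in frontier or loss < frontier[flops][1]:
            frontier[flops] = (n_params, loss)

for flops in sorted(frontier):
    n_params, loss = frontier[flops]
    print(f"{flops:.0e} FLOPs -> best model ~{n_params:.0e} params, loss {loss}")
```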

Through this process, the researchers trained fixed model sizes on a variety of training dataset sizes, with the goal of finding the optimal match between model size and dataset size. The optimal models demonstrate the lowest loss (aka best performance) compared to their counterparts. Having reached the final/optimized models in grey, the researchers can measure how much compute was used for each of those models. Now there is another variable we can use for comparison: the compute budget.

This allows us to create and interpret these graphics:

Okay let's break this one down:

  • We can now graph and evaluate the varying final/optimized models by the compute used to train each one. We split that comparison into two graphs: one to help us determine the optimal parameter size and one to help us determine the optimal training dataset size.

  • The grey data points come from the previous graphic, which denote the final/optimized models. These are the models that have the lowest loss (aka best performance) given a set of fixed model sizes and a variety of training datasets.

  • Graph 1:

    • Parameters (aka model size) are on the y-axis.

    • FLOPs (or compute budgets) are on the x-axis.

  • Graph 2:

    • Tokens (aka size of the training dataset) are on the y-axis.

    • FLOPs (or compute budgets) are on the x-axis.

  • The dotted red line is the best-fit line through the given data (a straight line on these log-log axes, i.e., a power law). This allows the researchers to predict the best compute-to-model-size or compute-to-dataset-size pairs for compute budgets not yet tested, i.e., to project beyond the measured points.

  • The researchers can then estimate the optimal model size and training dataset for a new model trained on the same amount of compute used to create Gopher (~10^24 FLOPs).

    • Optimal model size projection: 67B instead of 280B (actual)

    • Optimal training dataset size: 1.5T instead of 300B (actual)

  • We can now see a clear relationship between the optimal model size or size of the training dataset based on a given compute budget.

  • Note: Optimal sizes of the training datasets are not shown in the previous graphic.

These graphs help us answer questions like the following (a small sketch of this kind of projection follows the list):

  • What is the optimal compute budget for a 500M parameter model (aka model size)?

  • What is the optimal parameter size (aka model size) for 1e19 FLOPs (aka compute)? Working backwards into this question: which pairs of datasets and parameters used this much compute?

  • What is the optimal parameter size (aka model size) for Gopher's compute budget of ~10^24 FLOPs?
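Here's one way you could answer those questions programmatically: fit a straight line to the grey points in log-log space (equivalent to a power law) and read off the projection at any compute budget. The grey points below are invented placeholders, chosen so the fit lands near the numbers reported above (an exponent of roughly 0.5 and ~67B parameters at Gopher's ~10^24 FLOPs).

```python
import numpy as np

# Placeholder (FLOPs, optimal params) pairs standing in for the grey frontier points.
flops  = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
params = np.array([7e7, 2e8, 7e8, 2e9, 7e9])

# Fit a line in log-log space: log10(params) = a * log10(FLOPs) + b,
# i.e. a power law params = 10^b * FLOPs^a (the dotted red line).
a, b = np.polyfit(np.log10(flops), np.log10(params), deg=1)

def optimal_params(compute_flops: float) -> float:
    return 10 ** (a * np.log10(compute_flops) + b)

print(f"fitted exponent a = {a:.2f}")                                       # ~0.5
print(f"projected optimal size at 1e24 FLOPs: {optimal_params(1e24):.1e}")  # ~6.7e10, i.e. ~67B params
```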

2️⃣ Method 2: Fixed compute budget, variable model sizes

In this test, researchers pre-determined 9 fixed compute budgets (see below) and tested on a different range of model sizes to determine the optimal model size for a given compute budget. In this approach, the researchers only considered the final training loss for each point, rather than losses along the way like they did in Method 1.

Here are the 9 pre-determined compute budgets the researchers tested on (in FLOPs):

Small → Big

These compute budgets increase in size from smaller (6e18) to larger (3e21).

Compute costs are a function of model size and training dataset size. Compute costs go up if the model is larger, the training dataset is larger, or if both are larger. Since the compute budget is fixed in advance and the model sizes are pre-determined, researchers can back into the corresponding training dataset sizes as well. In this paper, the researchers tested a variety of model sizes for each compute budget to ensure they could find a clean minimum in the loss function (finding the lowest point on the y-axis for each parabola below).

Okay let's break this graphic down:

  • Training loss (which we want to minimize) is on the y-axis.

  • Parameters (or model sizes) are on the x-axis.

  • As noted above, the different colors represent the pre-determined compute budgets. Lighter ones are the smaller compute budgets and darker ones are the larger compute budgets.

  • Each colored curve represents an individual compute budget.

  • Yellow circles denote the models with the lowest loss for a given compute budget.

  • Minimum loss for a small compute budget (in light green) at 6e18 FLOPs is ~3.0. It doesn't perform as well as models with larger compute budgets, but it can be achieved on a smaller model (fewer parameters).

  • Minimum loss for a large compute budget (in black) at 3e21 FLOPs is ~2.2. It performs better than models with smaller compute budgets, but the optimum occurs at a larger model size (more parameters).

  • Note: Optimal sizes of the training datasets are not shown here.

What's interesting about this graphic is that for a given compute budget, increasing the model size doesn't always reduce the loss. Each of these colored curves has a minimum: past that point, increasing the parameter count while keeping the compute budget constant actually produces worse results.
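Here's a minimal sketch of how you might locate that lowest point numerically for one compute budget: fit a parabola to loss as a function of log(model size) and take its vertex. The measured losses below are invented for illustration.

```python
import numpy as np

# Final training losses at ONE fixed compute budget, for several model sizes (invented numbers).
params = np.array([4e8, 1e9, 3e9, 1e10, 3e10])
losses = np.array([2.85, 2.62, 2.55, 2.60, 2.78])

# Fit a parabola in log10(params) and take its vertex: that's the compute-optimal
# model size for this budget (one yellow circle in the figure).
x = np.log10(params)
a, b, c = np.polyfit(x, losses, deg=2)
best_log10_params = -b / (2 * a)  # vertex of a*x^2 + b*x + c
print(f"estimated optimal model size: {10 ** best_log10_params:.1e} params")
```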

Okay so what's next?

We can now graph the relationship between the compute budget (FLOPs) and the optimal model size (graph on the left) or size of the training dataset (graph on the right). These data points in black come from the yellow circles in the last graphic:

  • The black data points come from the previous graphic, which denote the models with the lowest loss for a given compute budget.

  • Graph 1:

    • Parameters (aka model size) are on the y-axis.

    • FLOPs (or compute budgets) are on the x-axis.

  • Graph 2:

    • Tokens (aka size of the training dataset) are on the y-axis.

    • FLOPs (or compute budgets) are on the x-axis.

  • The dotted red line is the best-fit line through the given data (a straight line on these log-log axes, i.e., a power law). This allows the researchers to predict the best compute-to-model-size or compute-to-dataset-size pairs for compute budgets not yet tested, i.e., to project beyond the measured points.

  • The researchers can then estimate the optimal model size and training dataset for a new model trained on the same amount of compute used to create Gopher (~10^24 FLOPs).

    • Optimal model size projection: 63B instead of 280B (actual)

    • Optimal training dataset size: 1.4T instead of 300B (actual)

  • We can now see a clear relationship between the optimal model size or size of the training dataset based on a given compute budget.

  • Note: Optimal sizes of the training datasets (tokens) are not shown in the previous graphic.

Now we can compare what the projected optimal model size and training dataset size would have been for Gopher, given the compute actually used in its training (~10^24 FLOPs):

So what's next?

The researchers used these results to build a compute-optimized model called Chinchilla. They used the same compute budget as Gopher (10^24 FLOPs) and used these projections to define the optimal model size and dataset size.

The results

From the paper:

"All three approaches suggest that as compute budget increases, model size and the amount of training data should be increased in approximately equal proportions."

Given these projections, the DeepMind researchers set out to train a compute-optimal LLM, and compare the results to Gopher. Since the goal is to do a head-to-head comparison on benchmark tests, the researchers use the Gopher compute budget as a guide to determine the right model size and training dataset to produce the competing model. The resulting model is called the Chinchilla model.

Chinchilla model

  • Model size: 70B parameters

  • Training size: 1.4T tokens (~14M books)

  • Compute budget: ~10^24 FLOPs (same as Gopher)

  • Overall: Chinchilla is 4x smaller than Gopher (at 280B parameters) and trained on ~4x more tokens (Gopher trained on 300B tokens), while the compute budget is held roughly constant for both (a quick arithmetic check follows this list)

  • Results: Chinchilla outperformed Gopher and achieved state-of-the-art (SOTA) results on several benchmarks (more details below). Since Chinchilla is smaller, it can perform inference (generating an output) faster and more cheaply than Gopher.
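A quick arithmetic check that the compute budget really is roughly held constant, again using the rough C ≈ 6 × parameters × tokens rule of thumb (an approximation, not a figure from this post):

```python
def approx_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens  # rough C ≈ 6ND rule of thumb

gopher     = approx_flops(280e9, 300e9)   # 280B params, 300B tokens
chinchilla = approx_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens
print(f"Gopher:     ~{gopher:.1e} FLOPs")      # ~5.0e23
print(f"Chinchilla: ~{chinchilla:.1e} FLOPs")  # ~5.9e23 -- roughly the same order, ~10^24
```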

The orange denotes where Gopher outperformed Chinchilla, and the blue denotes where Chinchilla outperformed Gopher.

This result is significant because it demonstrated an alternative path to SOTA results in LLMs besides increasing model size. As an additional benefit, smaller models are cheaper to run inference on (i.e., to get an output from), so when customers query the model, serving a smaller model costs less than serving a larger one. In other words, smaller models may be cheaper to run and still achieve best-in-class results.

Some limitations

  • The researchers were only able to compare two models on the benchmark performance tests: Chinchilla and Gopher (because it's very expensive to optimally train LLMs).

  • The rest of the models included in the paper are trained for only one epoch (a single pass over the training dataset).

    • DeepMind notes that future research could explore multiple-epoch regimes to improve understanding. There also aren't additional tests at intermediate scales.

  • The researchers assume the relationship between compute budget, model size, and training dataset size follows a power-law pattern.

  • There is some curvature in the compute-optimal frontier at high compute budgets, which may mean the researchers are overestimating the optimal size of large models.

Overall, this is fascinating research about the optimal mix of compute budget, model size, and training data.

🧠 Dig deeper

  • Prefer a visual walk-through of the paper? Check out this video by Edan Meyer:

  • Check out a short explainer here.

  • Or check out the full DeepMind paper here.

  • Check out the paper on Gopher here.

What do you think of Chinchilla? I know there are some newer scaling papers/blogs, so if you have a favorite, I'd love to keep reading. Hit reply and drop me a line!

🎁 Miscellaneous

Just for fun

The future of QR codes and I'm here for it! (Link)

Worth checking out: Midjourney meets Runway ML for storytime (Link)

That's it! Have a great day and see you next week! 👋

What did you think about today's newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!