Llama 2 Explained: Training, Performance and Results

Diving into Meta's Llama 2 model and how it compares to SOTA open- and closed-source LLMs.

Hi everyone 👋

Today we’ll dive into Meta’s Llama 2 training process and results. We’ll also run through some posts to keep on your reading list and some fun visual analogies for how CNNs work.

Thanks for your patience with posts in July! We’ll get back into the regular swing of things in August.

Let’s dive in!

DALL-E: a llama running in a field of flowers

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

📚 Concepts & Learning

Today we’re going to dive into Meta’s Llama 2.

This LLM made waves in the news as the best open-source language model released to date. In today’s post, we’re going to dig into that claim and look at how Llama 2 actually stacks up against various open- and closed-source models. Today’s post might be interesting to you if you’re forming a product opinion about whether open- or closed-source models will win out in future products, and where you should be building/investing/tinkering/focusing.

Let’s dive in.

Earlier this year (February 2023), Meta released a family of large language models called LLaMA at 7B, 13B, 33B, and 65B parameters (the sizes of the models). Just a few weeks ago (July 2023), Llama 2 was released in 7B, 13B, and 70B parameter sizes.

While Meta didn’t share much about the public data used to pre-train Llama 2, they did share details about the proprietary data they collected for fine-tuning, RLHF, and human evaluations of this set of models. They also shared that the pre-training dataset grew by 40% compared to LLaMA 1.

Some other notable takeaways:

  • They doubled the context window length.

  • They used Grouped-Query Attention in pre-training for the larger models (paper here); there’s a quick sketch of the idea after this list.

  • They introduce GAtt (Ghost Attention) in fine-tuning to improve attention across multiple turns of conversation (more on that later today).

  • They compare the performance of Llama 2 (a plain, pre-trained LLM) and Llama 2-Chat (the pre-trained LLM with SFT and RLHF fine-tuning… more on that later) to other open- and closed-source models.
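Grouped-Query Attention, in a nutshell: instead of every query head getting its own key/value head, groups of query heads share one, which shrinks the key/value cache at inference time. Here’s a minimal PyTorch sketch of the idea (my own toy illustration, no causal mask, not Meta’s implementation):

```python
import torch

def grouped_query_attention(x, w_q, w_k, w_v, n_q_heads, n_kv_heads):
    """Toy grouped-query attention: several query heads share one K/V head.
    x: (batch, seq_len, dim); w_q/w_k/w_v are hypothetical projection matrices."""
    B, T, D = x.shape
    head_dim = D // n_q_heads
    group = n_q_heads // n_kv_heads             # query heads per shared K/V head

    q = (x @ w_q).view(B, T, n_q_heads, head_dim).transpose(1, 2)    # (B, Hq, T, d)
    k = (x @ w_k).view(B, T, n_kv_heads, head_dim).transpose(1, 2)   # (B, Hkv, T, d)
    v = (x @ w_v).view(B, T, n_kv_heads, head_dim).transpose(1, 2)

    # Each K/V head is reused by its whole group of query heads.
    k = k.repeat_interleave(group, dim=1)       # (B, Hq, T, d)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
    out = scores.softmax(dim=-1) @ v            # (B, Hq, T, d)
    return out.transpose(1, 2).reshape(B, T, D)

# e.g. 8 query heads sharing 2 K/V heads -> 4 query heads per K/V head
x = torch.randn(1, 16, 64)
out = grouped_query_attention(x, torch.randn(64, 64), torch.randn(64, 16),
                              torch.randn(64, 16), n_q_heads=8, n_kv_heads=2)
```

The win is that only the (smaller) K/V tensors need to be cached per token during generation, which matters most for the big 34B and 70B models.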

First, let’s take a quick look at the performance of the Llama 2 models. In the graph below, you can see a plot of the loss curves, where loss is the quantity we want to minimize. The loss is the gap between how the model performs and how you want it to perform on a labeled dataset where you already know the answers. Think of it as the difference between 100% and your score on a math quiz (one your teacher already knows all the answers to): the lower the ‘loss’ or difference, the better.
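As a toy illustration (not the actual training code), here’s roughly how that per-token loss works: the model assigns a probability to the correct next token, and the loss is the negative log of that probability, averaged over the dataset.

```python
import math

# Hypothetical probabilities the model assigned to the "correct" next token.
probs_for_true_token = [0.9, 0.6, 0.2]

losses = [-math.log(p) for p in probs_for_true_token]
print(losses)                     # ~[0.11, 0.51, 1.61]: lower probability -> higher loss
print(sum(losses) / len(losses))  # ~0.74, the average loss over this tiny "dataset"
```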

In this graph, you can see the different model sizes: 7B, 13B, 34B, and 70B parameters. (They created and tested the 34B model but aren’t releasing it with this batch.) The x-axis maps the amount of data each model is trained on (tokens) and the y-axis plots the model performance (loss). As expected, the larger the model, the lower the loss at most points, and there’s a clear trend that loss keeps dropping as models are trained on more data. You could predict that the trend would continue if the graph extended further.

Cool, but maybe a more informative way of thinking about Llama 2 performance is in comparison to other state-of-the-art (SOTA) LLMs.

The first headline here is: Llama 2 doesn’t perform as well as other SOTA LLMs. Here’s a comparison on closed LLMs:

Llama 2 loses to other LLMs in every major benchmark, with GPT-4 as a leader in all the benchmarks it’s tested in. (Winners in each category are bolded.)

That being said, the largest model in the Llama 2 family is 70B parameters, while PaLM is 540B and GPT-4 is rumored to be 1.76 trillion parameters. I did a fuller deep dive on GPT-4’s rumored model architecture (a Mixture of Experts) if you’d like to dig in further here 🙂 

Alright, but one big benefit Llama 2 has over these other LLMs is that it’s open-source, meaning developers can build on top of Llama 2 directly and fine-tune it with their own proprietary data.

So, maybe the real comparison is how Llama 2 performs compared to other open-source models:

The second headline here: compared to other open-source LLMs, Llama 2 wins across all categories tested. (Winners are bolded.) Nice stuff!

Well… actually, not so fast. Only a few benchmarks are shown here, which raises the question: how did Llama 2 perform on benchmarks not shown? And Llama 2 is only compared to two other open-source models, MPT and Falcon. Noticeably missing are Microsoft Research’s Orca (13B) and Phi-1 (1.3B) models, which have performed comparatively well on these benchmarks. Phi-1 notably achieves 50.6% accuracy on the HumanEval benchmark on the first try (pass@1) and 55.5% on MBPP (Mostly Basic Programming Problems, paper here). Orca significantly surpasses Vicuna-13B on BIG-Bench Hard (a suite of challenging reasoning tasks) and AGIEval. Vicuna-13B is another notable model missing from this evaluation.

It might be because Orca and Phi-1 are not open-source yet, but it raises the question of why much smaller models (13B and 1.3B) are able to surpass Llama 2 on benchmark evaluations.

Now with a few early evaluations out of the way, let’s dive into Llama 2’s high-level training, performance and evaluations.

So, what’s new?

GAtt: Ghost Attention

The researchers use GAtt (Ghost Attention) in fine-tuning to improve model attention across multiple turns of conversation with the user. I won’t do a full dive on GAtt today, but in short, GAtt helps the LLM stay focused on an initial instruction across the whole conversation, rather than losing track of it after a few turns.

Here’s an example, with the attention on ‘answering with a haiku’:

This is in contrast to models that might respond in haiku at first, and then break character and start responding in full length sentences.
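To make that a bit more concrete, here’s a rough sketch of how GAtt constructs its fine-tuning data, based on my reading of the paper (the `chat_model` object and its `generate` method are hypothetical placeholders, not a real API):

```python
def build_gatt_example(instruction, user_turns, chat_model):
    # 1) Concatenate the instruction to EVERY user message and sample replies,
    #    so the assistant actually respects the instruction on every turn.
    dialogue = []
    for user_msg in user_turns:
        augmented = f"{instruction}\n{user_msg}"
        reply = chat_model.generate(dialogue + [("user", augmented)])
        dialogue += [("user", augmented), ("assistant", reply)]

    # 2) For fine-tuning, keep the instruction only in the first user turn and
    #    drop it from the later ones: the model has to "remember" it unaided.
    training_dialogue = []
    for i, (role, text) in enumerate(dialogue):
        if role == "user" and i > 0:
            text = text.replace(f"{instruction}\n", "", 1)
        training_dialogue.append((role, text))

    # (The paper also zeroes out the loss on tokens from earlier turns, so
    # training focuses on the final assistant response.)
    return training_dialogue
```

So the sampled conversations behave as if the instruction were repeated every turn, but the model is fine-tuned on data where it appears only once, which is what teaches its attention to carry the instruction forward.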

(If you’d like me to do a deep dive on the GAtt paper, reply to this email or drop me a line! I’d love to hear from ya.)

More on Llama 2’s performance across multi-turn conversations

Interestingly, when human evaluators compare results from Llama 2 vs ChatGPT, they prefer Llama 2, even across many turns (up to 17 turns shown below).

Here’s a graph that shows that; dark blue is Llama 2 and light blue is ChatGPT:

Human evaluators trend towards preferring Llama 2 over ChatGPT responses (with the exception of 9 turns and 15 turns) even during longer conversations (17 turns). This is surprising because you might expect one model to perform better in short conversations (fewer turns) while the other LLM performs better in longer conversations (more turns). But we don’t see that trend. Instead, there does seem to be a bit more volatility between which model human evaluators prefer when the conversation gets longer (there are fewer ties).

What else is new?

Context distillation (after SFT and RLHF)

For Llama 2, researchers conducted three kinds of safety fine-tuning to ensure the models produce safe and aligned outputs (avoiding toxic content, bigotry, text that promotes illegal activities, etc.). These included:

  • Supervised Safety Fine-Tuning: Train the LLM on a labeled dataset of input-output/question-answer pairs that demonstrate safe responses to toxic questions. This is similar to red-teaming work.

  • Safety RLHF: Train a safety-specific reward model on human preference data about safe vs unsafe responses, then use it to guide and further train the fine-tuned Llama 2 model (to nudge it in the right direction). See this post here on how RLHF works.

  • Safety Context Distillation: Prefix the prompt with a set of safety guidelines so the model produces output that aligns with your safety goals, then fine-tune on those safer outputs without the guidelines attached (see below).

Context distillation is a nice addition to include in the safety protocols for Llama 2 and improved the performance of the model.
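Here’s a minimal sketch of the safety context distillation loop as I understand it from the paper (the preamble wording and the `chat_model.generate` call are hypothetical placeholders):

```python
SAFETY_PREAMBLE = "You are a responsible and safe assistant. Refuse harmful requests politely."

def distill_safety_context(adversarial_prompts, chat_model):
    distilled_examples = []
    for prompt in adversarial_prompts:
        # Generate WITH the safety preamble prepended, which nudges the model
        # toward a safer answer...
        safe_answer = chat_model.generate(f"{SAFETY_PREAMBLE}\n\n{prompt}")
        # ...then store the example WITHOUT the preamble, so fine-tuning on it
        # bakes the safer behavior in even when no preamble is present.
        distilled_examples.append({"prompt": prompt, "response": safe_answer})
    return distilled_examples
```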

Let’s talk about data

  • Meta collected 27,540 SFT annotations from a proprietary vendor. (If you’d like to read a short post on SFT vs RLHF data, check this out.)

  • They stopped collection there because they found the SFT model was able to produce outputs that performed competitively with data produced by human annotators. This test was done on 180 examples.

  • They removed millions of low-quality annotations from benchmark training datasets (notably different from benchmark evaluation datasets, which test how well the model performs). But still, perhaps they didn’t remove enough low-quality data. We’ll get into how that affects LLM toxicity and robustness against adversarial attacks later today. In short: it’s a tricky balance. How much ‘bad’ data should you remove from pre-training?

So assuming the researchers have done pre-training (with open source benchmark datasets) and supervised fine-tuning (with high quality data) on the Llama 2 models, next comes RLHF.

Reinforcement Learning with Human Feedback (RLHF)

Dataset

Here you can see the dataset used in RLHF training (see post here on how RLHF works).

  • There are over 1.4M proprietary data points that Meta collected (from a data labeling vendor).

  • The proprietary dataset contains 4k prompts (each with unique outputs), which is large by academic and research standards. They do note that this won’t cover all real-world use cases.

  • 1.5M data points are also used from open-source datasets.

  • For safety-specific RLHF, they collected 2k adversarial prompts (prompts that are bad for the model to respond to).

    • 2/3 of those adversarial prompts were multi-turn, and 1/3 were single-turn.

    • For a fuller dive on red-teaming and adversarial prompts, check this out.

The RLHF model

  • The reward model (the model that does the ‘nudging’ during RLHF) has the same knowledge as the main model it will nudge, because it’s initialized from the same checkpoint. That main model is also known as the final model.

  • It’s important that the reward model and main model have the same knowledge base. Why? It reduces hallucinations caused by an information mismatch between the two models. If I tell a kid not to do or say something based on information I have but she doesn’t, she might make up (incorrect) reasons as to why her action was punished. This is similar to a hallucination.

  • The reward model’s architecture and hyper-parameters (step size, learning rate, etc.) are identical to the main model’s, except that the next-token prediction head is swapped for a regression head that outputs a scalar reward score (see the sketch after this list).
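Here’s a minimal sketch of what a reward model like this looks like, plus the binary ranking loss the paper describes (`TransformerBackbone` would stand in for the pretrained Llama 2 checkpoint; this is my simplification, not Meta’s code):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # same architecture as the chat model
        self.reward_head = nn.Linear(hidden_size, 1)  # scalar score instead of next-token logits

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)             # assumed output: (batch, seq_len, hidden_size)
        return self.reward_head(hidden[:, -1]).squeeze(-1)  # one score per sequence

def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    # Push the preferred response's score above the rejected one's by at least `margin`.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected - margin).mean()
```

The chosen/rejected pairs come from the human preference data described above; the margin term lets strongly held preferences count for more than marginal ones.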

Training details

  • The reward model is trained for one epoch (one pass over the data) to avoid overfitting, which the researchers observed when training for longer.

How do Meta’s reward models perform?

  • On safety and helpfulness, Meta’s reward models perform better than Open Assistant’s and SteamSHP-XL across a number of safety benchmark tests.

Human annotators decide which LLM they prefer

Fine-tuning with these safety and helpfulness reward models (RMs) results in Llama 2-Chat’s superior performance on safety/helpfulness tasks compared to other open-source models. But it’s also interesting to note that Llama 2-Chat is still only on par with ChatGPT:

Safety violations: lower is better

Safety and Helpfulness: higher is better and 5 is the maximum

These are human ratings, however. How does Llama 2-Chat perform on safety/helpfulness benchmark tests?

Safety and helpfulness benchmark tests

Llama 2 vs open-source models:

Interestingly, Llama 2 doesn’t outperform other open-source LLMs on toxicity metrics. Why not? Here’s a note from the researchers:

Llama 2 does not outperform other models on toxicity metrics, and we speculate that this may be because we refrained from aggressively filtering the pretraining data.

Recall that leaving pretraining data unfiltered may enable base models tuned to perform well on more downstream tasks (including hate speech detection), and it carries less risk of accidentally filtering out some demographic groups.

We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment.

We reiterate that this motivated choice does imply that additional safety mitigations should be applied before deployment of base Llama 2 models.

Llama 2 vs closed-source models:

Okay so Llama 2 wins out in truthfulness but not toxicity compared to other open source models. Let’s zoom in on truthfulness metrics and see how Llama 2 compares to closed-source models like ChatGPT:

  • ChatGPT performs much better than Llama 2 with a ~14% gap on TruthfulQA (on Llama’s largest 70B model).

  • But, there’s only ~1.5% gap in % (info) performance alone (not shown, but I thought I’d share because it’s interesting!).

It’s important when interpreting this table to take the different model sizes (parameter counts) into account. It compares a 70B model (Llama 2) with GPT-4, rumored to be 1.76T parameters, and GPT-3.5 at 175B. If performance is comparable between models (see the % info scores), smaller models are probably preferable because they reduce compute costs for fine-tuning and for inference (generating an output from the model).
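As a rough back-of-the-envelope comparison (using the common ~2 × parameters FLOPs-per-token rule of thumb, not an official number from either paper, and ignoring any Mixture-of-Experts sparsity that would lower GPT-4’s per-token cost):

```python
llama2_params = 70e9           # Llama 2's largest model
gpt4_rumored_params = 1.76e12  # rumored GPT-4 total parameter count

flops_per_token = lambda n: 2 * n             # rule-of-thumb forward-pass cost
print(f"{flops_per_token(llama2_params):.1e}")        # ~1.4e+11 FLOPs per generated token
print(f"{flops_per_token(gpt4_rumored_params):.1e}")  # ~3.5e+12 FLOPs per generated token
print(gpt4_rumored_params / llama2_params)            # ~25x more compute per token, all else equal
```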

Is there a tension between Safety and Helpfulness?

If you want a deeper dive on the tension between Safety and Helpfulness, check out this post. Here’s a quick overview (you can skip this if you like):

Meta researchers empirically studied this tension with Llama 2 and shared some findings of their research. Here’s an example from the paper appendix:

Notice how the text changes as the balance between safety and helpfulness shifts. The safer the model gets (see 50%-100% safety data in column 1), the lower the helpfulness score (from 0.65 down to 0.38).

Insights from RLHF

The researchers share a particularly interesting insight on RLHF vs SFT alone. RLHF (perhaps surprisingly) produces significantly better final model performance, even though RLHF-trained models are comparatively more susceptible to hallucinations and their performance is generally less stable than with SFT alone.

Check out this graph comparing SFT vs RLHF trained models. A distribution closer to the right (near 1.0) is preferable.

And check out this blurb from the paper. My favorite line is (emphasis mine):

Drawing a parallel, while we may not all be accomplished artists, our ability to appreciate and critique art remains intact.

Overall Takeaways

Llama 2 is a powerful LLM and provides a compelling alternative to other open-source models. It outperforms other open-source models on a number of metrics including general intelligence as well as alignment metrics.

However, if you’re looking for the most powerful LLM and don’t need/want an open-source one, GPT-4 still outperforms Llama 2 on several benchmarks.

Wanna check out the full model?

  • Llama 2 is available for research and commercial use here.

  • More of a visual learner? Check out this YT video breakdown.

  • If you’d like to check out Hugging Face’s LLM leaderboard, which compares open-source LLMs, find that here.

    • Note: Llama 2 is currently the front-runner.

  • More Llama 2 resources (Link)

That’s it! Let me know what you thought of this week’s post by dropping me a line or replying here. Now onto the news and some fun AI tidbits.

🗞️ News

  • Anthropic announces Claude-2 (I know I’m late on this news, but it’s big!) (Link)

  • Very interesting overview on the training details of GPT-4 (Link)

    • If you’re interested in Mixture of Experts, check out this post.

  • On my reading list: Google’s red-teaming process (Link)

  • Open problems in RLHF? This is on my reading list:

🎁 Miscellaneous

GPUs as debt collateral…

  • $2.3B of debt is collateralized by NVIDIA H100s. (Link)

AI-generated food QR codes that work (Link)

Visual CNNs

  • Check out this video and consider Alex Wang’s post on how it relates to CNNs (link below)

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!