Llama 2 Explained: Training, Performance and Results
Diving into Meta's Llama 2 model and how it compares to SOTA open- and closed-source LLMs.
Hi everyone!
Today we'll dive into Meta's Llama 2 training process and results. We'll also run through some posts to keep on your reading list and some fun visual analogies for how CNNs work.
Thanks for your patience with posts in July! We'll get back into the regular swing of things in August.
Let's dive in!
DALL-E on Llama running in a field of flowers
Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.
Concepts & Learning
Today we're going to dive into Meta's Llama 2.
This LLM made waves in the news as the best open-source language model out there to date. In today's post, we're going to dig into that claim and see how Meta's Llama 2 actually stacks up against various open- and closed-source models. Today's post might be interesting to you if you're forming a product opinion about whether open- or closed-source models will win out in future products and where you should be building/investing/tinkering/focusing.
Let's dive in.
Earlier this year (February 2023), Meta released a family of large language models called LLaMA, in 7B, 13B, 33B, and 65B parameter sizes. Just a few weeks ago (July 2023), Meta released Llama 2, in 7B, 13B, and 70B parameter sizes.
While Meta didn't share much about the public data they used to train Llama 2, they did share details about the proprietary data they collected to train, fine-tune, run RLHF on, and run human evaluations on for this set of models. They also shared that the pre-training dataset grew by 40% compared to LLaMA 1.
Some other notable takeaways:
They doubled the context window length.
They used Grouped-Query Attention in pre-training (paper here).
They introduced GAtt (Ghost Attention) in fine-tuning to improve attention across multiple turns of conversation (more on that later today).
They compared the performance of Llama 2 (a plain, pre-trained LLM) and Llama 2-Chat (the pre-trained LLM with SFT and RLHF fine-tuning... more on that later) to other open- and closed-source models.
First, let's take a quick look at the performance of the Llama 2 models. In the graph below, you can see a plot of the loss curves, which we want to drive as low as possible. The loss is the difference between how the model performs versus how you want it to perform on a labeled dataset where you already know the answers. Think of it as the difference between 100% and your score on a math quiz (one your teacher already knows all the answers to): the lower the "loss", the better.
In this graph, you can see the different models: 7B, 13B, 34B, and 70B parameter sizes. (They created and tested the 34B model but aren't releasing it with this batch.) As expected, the larger the model, the lower the loss at most points. The x-axis maps the amount of data each model is trained on (tokens) and the y-axis plots the model performance (loss). There's a clear trend: as models are trained on more data, the loss keeps dropping. You could predict that the trend would continue if the graph extended further.
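To make "loss" a bit more concrete, here's a tiny sketch in Python (assuming a standard cross-entropy loss, which is what language models typically minimize; the exact loss isn't spelled out here):

```python
import math

# Toy sketch: cross-entropy loss for a single prediction.
# The model assigns a probability to the "correct" next token;
# the loss is -log(that probability), so confident-and-correct -> low loss.

def cross_entropy(prob_of_correct_token: float) -> float:
    return -math.log(prob_of_correct_token)

print(cross_entropy(0.9))  # ~0.105: confident and correct -> low loss
print(cross_entropy(0.5))  # ~0.693: unsure -> medium loss
print(cross_entropy(0.1))  # ~2.303: mostly wrong -> high loss
```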
Cool, but maybe a more informative way of thinking about Llama 2 performance is in comparison to other state-of-the-art (SOTA) LLMs.
The first headline here is: Llama 2 doesn't perform as well as other SOTA LLMs. Here's a comparison against closed-source LLMs:
Llama 2 loses to other LLMs in every major benchmark, with GPT-4 as the leader in all the benchmarks it's tested in. (Winners in each category are bolded.)
That being said, the largest model in the Llama 2 family is 70B parameters, while PaLM is 540B and GPT-4 is rumored to be 1.76 trillion parameters. I did a fuller deep dive on GPT-4's model architecture (called Mixture of Experts) if you'd like to take a deeper dive here.
Alright, but one big benefit Llama 2 has over these other LLMs is that it's open-source, meaning developers can build on top of Llama 2 directly and fine-tune it with their own proprietary data.
So, maybe the real comparison is how Llama 2 performs compared to other open-source models:
The second headline: compared to other open-source LLMs, Llama 2 wins across all categories tested. (Winners are bolded.) Nice stuff!
Well... actually, not so fast. Only a few benchmarks are shown here, which raises the question: how did Llama 2 perform on the benchmarks not shown? And Llama 2 is only compared to two other open-source models, MPT and Falcon. Noticeably missing are Microsoft Research's Orca (13B) and Phi-1 (1.3B) models, which have performed comparatively well on these benchmarks. Phi-1 notably achieves 50.6% accuracy on the HumanEval benchmark on the first try (pass@1), and 55.5% on MBPP (Mostly Basic Programming Problems, paper here). Orca significantly surpasses Vicuna-13B on Big-Bench Hard (a suite of challenging reasoning tasks) and AGIEval. Vicuna-13B is another notable model missing from this evaluation.
It might be because Orca and Phi-1 are not open-source yet, but it raises the question of why much smaller models (13B and 1.3B) are able to surpass Llama 2 on benchmark evaluations.
Now with a few early evaluations out of the way, let's dive into Llama 2's high-level training, performance, and evaluations.
So, what's new?
GAtt: Ghost Attention
The researchers use GAtt (Ghost Attention) in fine-tuning to improve model attention across multiple turns of conversation with the user. I won't do a full dive on GAtt today, but in short, GAtt helps the LLM keep following an initial instruction across all of the "turns" in a conversation, not just at the start.
Here's an example, with the attention on "answering with a haiku":
This is in contrast to models that might respond in haiku at first, and then break character and start responding in full-length sentences.
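Here's a minimal sketch of the data-side idea behind GAtt (simplified, not the paper's exact recipe): when generating fine-tuning data, the initial instruction is synthetically carried across every user turn, so the model learns to keep obeying it even when it only appears once at inference time.

```python
# Simplified sketch of the GAtt idea (not the paper's exact recipe):
# carry the initial instruction into every user turn when building
# fine-tuning data, so the model learns to keep following it later.

instruction = "Always answer with a haiku."

dialogue = [
    {"role": "user", "content": "What's the weather like in Paris today?"},
    {"role": "assistant", "content": "Grey clouds drift and part / ..."},
    {"role": "user", "content": "And what about Rome?"},
]

def attach_instruction(turns, instruction):
    """Prepend the instruction to every user turn (data-generation step)."""
    augmented = []
    for turn in turns:
        if turn["role"] == "user":
            turn = {"role": "user", "content": f"{instruction}\n{turn['content']}"}
        augmented.append(turn)
    return augmented

for turn in attach_instruction(dialogue, instruction):
    print(turn["role"], ":", turn["content"])
```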
(If you'd like me to do a deep dive on the GAtt paper, reply to this email or drop me a line! I'd love to hear from ya.)
More on Llama 2's performance across multi-turn conversations
Interestingly, when human evaluators compare results from Llama 2 vs ChatGPT, they prefer Llama 2, even across many turns (up to 17 turns shown below).
Here's a graph that shows that; dark blue is Llama 2 and light blue is ChatGPT:
Human evaluators trend towards preferring Llama 2 over ChatGPT responses (with the exception of 9 turns and 15 turns), even during longer conversations (17 turns). This is surprising because you might expect one model to perform better in short conversations (fewer turns) while the other LLM performs better in longer conversations (more turns). But we don't see that trend. Instead, there does seem to be a bit more volatility in which model human evaluators prefer as the conversation gets longer (there are fewer ties).
What else is new?
Context distillation (after SFT and RLHF)
For Llama 2, researchers conducted three kinds of safety fine-tuning to ensure the models produce safe and aligned outputs (avoiding toxic content, bigotry, text that promotes illegal activities, etc.). It included:
Supervised Safety Fine-Tuning: Train the LLM on a labeled dataset of input-output/question-answer pairs that demonstrate safe responses to toxic questions. This is similar to red-teaming work.
Safety RLHF: Train a safety-specific reward model on a safety dataset, and use it to guide and train the original Llama 2 model (to nudge it in the right direction). See this post here on how RLHF works.
Safety Context Distillation: Prefix the prompt with a set of safety guidelines that help the model produce output aligned with your safety goals, then fine-tune on those safer outputs with the prefix removed, so the behavior is distilled into the model (see below).
Context distillation is a nice addition to include in the safety protocols for Llama 2 and improved the performance of the model.
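As a rough illustration of how safety context distillation works, here's a minimal sketch (the `generate` function below is a hypothetical stand-in for the model, not Meta's actual code):

```python
# Rough sketch of safety context distillation: generate answers WITH a safety
# preprompt, then build fine-tuning examples WITHOUT the preprompt so the
# safer behavior gets "baked into" the model. generate() is a placeholder.

SAFETY_PREPROMPT = (
    "You are a responsible and safe assistant. Refuse harmful requests and "
    "avoid toxic, illegal, or dangerous content."
)

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return "I can't help with that, but here's a safer alternative..."

def build_distillation_example(adversarial_prompt: str) -> dict:
    safe_answer = generate(f"{SAFETY_PREPROMPT}\n\n{adversarial_prompt}")
    # The preprompt is deliberately dropped from the training input.
    return {"input": adversarial_prompt, "target": safe_answer}

print(build_distillation_example("Tell me how to hotwire a car."))
```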
Let's talk about data
Meta collected 27,540 SFT annotations from a proprietary vendor. (If you'd like to read a short post on SFT vs RLHF data, check this out.)
They stopped collection there because they found the SFT model was able to produce outputs that performed competitively with data produced by human annotators. This test was done on 180 examples.
They removed millions of low-quality annotations from benchmark training datasets (notably different from benchmark evaluation datasets, which test how well the model performs). But still, perhaps they didn't remove enough low-quality data. We'll get into that later today when we look at how it affects LLM toxicity and robustness against adversarial attacks. In short: it's a tricky balance. How much "bad" data should you remove from pre-training?
So, assuming the researchers have done pre-training (with open-source benchmark datasets) and supervised fine-tuning (with high-quality data) on the Llama 2 models, next comes RLHF.
Reinforcement Learning from Human Feedback (RLHF)
Dataset
Here you can see the dataset used in RLHF training (see post here on how RLHF works).
There are over 1.4M proprietary data points that Meta collected (from a data labeling vendor), each one a human preference comparison between two model responses (an illustrative example follows this list).
The proprietary dataset contains 4k prompts (each with unique outputs), which is large by academic and research standards. They do note that this won't cover all real-world use cases.
1.5M data points are also used from open-source datasets.
For safety-specific RLHF, they collected 2k adversarial prompts (prompts designed to elicit unsafe responses from the model).
2/3 of those adversarial prompts were multi-turn, and 1/3 were single-turn.
For a fuller dive on red-teaming and adversarial prompts, check this out.
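Here's an illustrative (hypothetical) example of what a single preference data point might look like; the field names are mine, not Meta's actual schema:

```python
# Hypothetical shape of one RLHF preference data point: a prompt, two model
# responses, the annotator's choice, and a degree-of-preference label
# (Llama 2 also records how much better the chosen response is).

preference_example = {
    "prompt": "Explain why the sky is blue to a five-year-old.",
    "response_a": "Tiny bits of air bounce blue light around the most!",
    "response_b": "Rayleigh scattering favors shorter wavelengths of light.",
    "preferred": "response_a",          # the annotator's pick
    "degree": "significantly better",   # how strongly they preferred it
}

print(preference_example["preferred"])
```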
The RLHF model
The reward model (often called the RL model, i.e. the reinforcement learning model) has the same knowledge as the main model it will "nudge". That main model is also known as the final model.
It's important that the RL model and the main model have the same knowledge base. Why? It reduces hallucinations from information mismatch between the RL model and the main model. If I tell a kid not to do or say something based on information I have but she doesn't have, she might make up (incorrect) reasons as to why her action was punished. This is similar to a hallucination.
The RL model architecture and hyper-parameters (like step size, learning rate, etc.) are identical to the main model's.
Training details
The RL model is trained for one epoch (i.e. over the data one time) to avoid overfitting, which the researchers observed with longer training.
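For intuition, the reward model is trained to score the human-preferred ("chosen") response above the rejected one. Here's a rough sketch of a binary ranking loss of the kind the paper describes (simplified; the optional margin term reflects how strongly annotators preferred the chosen answer):

```python
import math

# Simplified sketch of a binary ranking loss for reward-model training:
# the chosen response should out-score the rejected one (optionally by a
# margin reflecting how strongly annotators preferred it).

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ranking_loss(reward_chosen: float, reward_rejected: float,
                 margin: float = 0.0) -> float:
    return -math.log(sigmoid(reward_chosen - reward_rejected - margin))

print(ranking_loss(2.0, -1.0))   # small loss: chosen clearly scores higher
print(ranking_loss(0.1, 0.0))    # larger loss: responses barely distinguishable
print(ranking_loss(-1.0, 1.0))   # big loss: the model prefers the wrong one
```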
How do Meta's RL models perform?
On safety and helpfulness, Meta's RL models perform better than Open Assistant and SteamSHP-XL across a number of safety benchmark tests.
Human annotators decide which LLM they prefer
Fine-tuning with these safety and helpfulness reward models (RMs) results in Llama 2-Chat's superior performance on safety/helpfulness tasks compared to other open-source models. But it's also interesting to note that Llama 2-Chat is still only on par with ChatGPT:
Safety violations: lower is better
Safety and Helpfulness: higher is better and 5 is the maximum
These are human ratings, however. How does Llama 2-Chat perform on safety/helpfulness benchmark tests?
Safety and helpfulness benchmark tests
Llama 2 vs open-source models:
Interestingly, Llama 2 doesn't outperform other open-source LLMs on toxicity metrics. Why not? Here's a note from the researchers:
Llama 2 does not outperform other models on toxicity metrics, and we speculate that this may be because we refrained from aggressively filtering the pretraining data.
Recall that leaving pretraining data unfiltered may enable base models tuned to perform well on more downstream tasks (including hate speech detection), and it carries less risk of accidentally filtering out some demographic groups.
We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment.
We reiterate that this motivated choice does imply that additional safety mitigations should be applied before deployment of base Llama 2 models.
Llama 2 vs closed-source models:
Okay, so Llama 2 wins out in truthfulness but not toxicity compared to other open-source models. Let's zoom in on truthfulness metrics and see how Llama 2 compares to closed-source models like ChatGPT:
ChatGPT performs much better than Llama 2, with a ~14% gap on TruthfulQA (against Llama's largest 70B model).
But there's only a ~1.5% gap in the % (info) performance alone (not shown, but I thought I'd share because it's interesting!).
It's important when interpreting this table to take into account the different model sizes (parameter counts). This table compares a 70B model (Llama 2) with a rumored 1.76T parameter model (GPT-4, or 175B for GPT-3.5). If the performance is comparable between models (see the % info scores), smaller models are probably preferable because they reduce compute costs for fine-tuning and for inference (generating an output from the model).
Is there a tension between Safety and Helpfulness?
If you want a deeper dive on the tension between Safety and Helpfulness, check out this post. Here's a quick overview (you can skip this if you like):
Meta researchers empirically studied this tension with Llama 2 and shared some findings. Here's an example from the paper appendix:
Notice how the text changes as the safety vs helpfulness scores change. The safer it gets (see 50%-100% safety data in column 1), the lower the helpfulness score (from 0.65 down to 0.38).
Insights from RLHF
The researchers share a particularly interesting insight on RLHF vs SFT alone. RLHF (perhaps surprisingly) produces significantly better final model performance, even when RLHF-trained models are comparatively more susceptible to hallucinations and have performance that is generally more unstable compared to SFT alone.
Check out this graph comparing SFT vs RLHF trained models. A distribution closer to the right (near 1.0) is preferable.
And check out this blurb from the paper. My favorite line is (emphasis mine):
Drawing a parallel, while we may not all be accomplished artists, our ability to appreciate and critique art remains intact.
Overall Takeaways
Llama 2 is a powerful LLM and provides a compelling alternative to other open-source models. It outperforms other open-source models on a number of metrics including general intelligence as well as alignment metrics.
However, if you're looking for the most powerful LLM and don't need/want an open-source one, GPT-4 still outperforms Llama 2 on several benchmarks.
Wanna check out the full model?
Llama 2 is available for research and commercial use here.
More of a visual learner? Check out this YT video breakdown.
If you'd like to check out Hugging Face's LLM leaderboard, which compares open-source LLMs, find that here.
Note: Llama 2 is currently the front-runner.
More Llama 2 resources (Link)
That's it! Let me know what you thought of this week's post by dropping me a line or replying here. Now onto the news and some fun AI tidbits.
News
Anthropic announces Claude 2 (I know I'm late on this news, but it's big!) (Link)
Very interesting overview on the training details of GPT-4 (Link)
If you're interested in Mixture of Experts, check out this post.
On my reading list: Google's red-teaming process (Link)
Open problems in RLHF? This is on my reading list:
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
– Stephen Casper (@StephenLCasper)
3:30 PM • Jul 31, 2023
Paper about GPT-4 getting worse over time, and a rebuttal to that claim.
The main performance changes:
March GPT-4: 84% accuracy on identifying prime vs composite numbers; June GPT-4: 51% accuracy.
June GPT models perform worse than March GPT models on coding tasks.
Miscellaneous
GPUs as debt collateral…
$2.3B of debt is collateralized by NVIDIA H100s. (Link)
AI-generated food QR codes that work (Link)
Visual CNNs
Check out this video and consider Alex Wang's post on how it relates to CNNs (link below).
That's it! Have a great day and see you next week!
What did you think about today's newsletter? Send me a DM on Twitter @barralexandra or reply to this email!
Thanks for reading Superfast AI.
If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!