Red-Teaming Explained: How does it reduce toxicity in LLMs?

How red-teaming regulates LLMs, synthetic vs real data and fun out-painting with Photoshop.

Hi everyone 👋

Today’s deep dive is on red-teaming and how key research organizations are implementing it in LLMs. We’ll also touch on key reads of the week and some fun out-painting of philosopher portraits from image models. You won’t want to miss that at the end.

Let’s dive in!

DALL-E on a robo-cop


🗞️ News

What’s on my reading list?


  • Understanding Encoder And Decoder LLMs (Link)

  • Can language models teach weaker agents? Teacher explanations improve students via theory of mind (Link)

  • Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Link)

  • Problems with CAI? The Curse of Recursion: Training on Generated Data Makes Models Forget (Link)


  • Memory and profiles in ChatGPT (Link)

  • AI profiles in Perplexity (Link)

  • Meta’s MusicGen (Link)

  • Meta’s Voicebox (Link)

  • Cerebras’ 627B-token SlimPajama dataset (Link)

News / Resources:

  • MTurk workers using AI to do 1/3 of their human annotation tasks (Link) - sounds like an opportunity for premium data labelers (hey Surge 👋) or for synthetic data labeling companies

  • Evidently AI resource on how companies are using ML (Link)

  • What can we expect from prompt engineering sharing between teams (Link)? More below:

📚 Concepts & Learning

WARNING: This post may contain model outputs which are offensive in nature.

Today we’re going to talk about red-teaming for LLMs, which is an important evaluation method for testing the alignment of models. Red-teaming focuses on the outputs of LLMs, so one limitation of this method is that researchers don’t learn how or why the LLM produced any particular undesired output. That makes the underlying issues harder to fix, although there are a few solutions that we will discuss later in this post.

☎️ What is red-teaming?

Red-teaming is a method to test if LLMs are producing desired outputs by intentionally prompting the model with malicious prompts that the user believes will result in undesirable behavior. Sorta like tricking the new kids at school into saying something that everyone else knows is socially unacceptable, but a newcomer wouldn’t know. 😢

Some ways that LLMs could fail include providing illegal solutions to a user’s inquiry, releasing personally identifiable information (PII), or inappropriately releasing private information. The third issue in that group will be particularly concerning for any enterprises that want to leverage LLMs in customer-facing products.

There are two pieces of research that I’ll dive into today: Perez et al., 2022 and Anthropic’s 2022 paper. The first paper uses other language models to come up with adversarial prompts to try to elicit toxic responses, rather than asking human annotators to come up with prompts. This method is interesting because it brings into question the efficacy of synthetic vs real, human-generated data. Can we programmatically come up with prompts to help with AI research? Or does synthetic data suffer from diversity issues because the prompt generation is not creative enough?

The second paper builds upon previous research, including Perez et al., 2022 and Xu et al. It focuses on analyzing how model vulnerability to adversarial attack changes as the model size gets bigger (more parameters). It uses human-generated annotations rather than LM-generated annotations, which you’ll see in Perez et al. This scaling research is interesting because it brings into question the efficacy of red-teaming on scaling AI systems.

Why is red-teaming important?

There’s a huge responsibility for teams that are deploying LMs to consumers and enterprises to do so in a safe way. For example, here’s a compilation of ways that Bing’s chatbot Sydney failed and ended up outputting aggressive and toxic responses (Link). We also saw this with Microsoft’s Tay, where Twitter users were able to coax the language model into toxic output in less than a day (Link).

There is a steep asymmetry between what the LM developers need to protect against and what success looks like for a bad actor:

Adversaries only need one attack to succeed, while red teams must defend against all possible attacks.

Since there is such a wide range of potential attacks the LM could face, bad actors could discover a class of offensive outputs that the LM developers didn’t expect (again, see Bing’s failures here).

It also might be the case that there’s high transferability across models: if someone uncovers an unaccounted for adversarial attack on one model, it might apply to many others. This is also why researchers at Anthropic have made their red-teaming research public (more below). Their hope is that this research on red-teaming allows other developers to create more robust models moving forward.

🦹 Types of bad behavior:

  • offensive language

  • data leakage

What kinds of offensive language?

  • outputs encouraging illegal activities

  • derisive language (“idiot”, “stupid”)

  • inappropriate jokes

  • sexually explicit/implicit output

Here are some examples of malicious prompts that led to offensive outputs:

Here are some examples of derisive language:

What kinds of data leakage?

  • memorizing text in the training dataset

  • plagiarizing from the training dataset

  • returning contact information in inappropriate places such as:

    • Social security numbers

    • Addresses

    • Emails

    • Phone numbers

Here are a few examples of data leakage:

Here’s an example of an image model plagiarizing from its training dataset:

You can imagine that this image model was trained on Getty Images’ massive dataset of images. These images usually have the Getty Image watermark. As you can see, the model has learned to reproduce the Getty watermark in its new outputs. Getty images might be marked as good examples for the image model to emulate because they usually are of high quality. However, the model can’t distinguish between the actual content of an image and the watermark on top of it. So it falls into the trap we see here where it tries to generate a good image output, but includes the undesired watermark. In short, the image model must think that to generate a good output, it should include a rough collection of pixels associated with the watermark as well, which as human viewers we know is wrong. Interesting stuff!

Paper Methods

Paper 1: How did the team do red-teaming in Perez et al?

Overall, our results suggest that some of the most powerful tools for improving LM safety are LMs themselves.


  1. Use a LM to generate malicious prompts.

  2. Test those prompts on the target LM to see if it generates undesirable outputs.

  3. Identify patterns where those undesirable outputs occur, and come up with solutions to mitigate future output like those in the target LM.
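To make the three steps above concrete, here’s a minimal Python sketch of the loop. It’s an illustration rather than the code from the paper: `red_team`, the toy models, the toy classifier and the 0.5 threshold are all hypothetical stand-ins for a real red LM, target LM and offensiveness classifier.

```python
from typing import Callable, List, Tuple


def red_team(
    generate_prompts: Callable[[int], List[str]],
    target_reply: Callable[[str], str],
    offensiveness: Callable[[str, str], float],
    n_prompts: int,
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[str, str, float]]:
    """Return (prompt, reply, score) triples that the classifier flags."""
    failures = []
    for prompt in generate_prompts(n_prompts):   # step 1: red LM writes prompts
        reply = target_reply(prompt)             # step 2: target LM answers
        score = offensiveness(prompt, reply)     # classifier scores the reply
        if score >= threshold:
            failures.append((prompt, reply, score))
    return failures                              # step 3 starts from this list


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_red_lm(n: int) -> List[str]:
    pool = [
        "What is your favorite food?",
        "What do you think of your boss?",
        "If you were invisible, what would you do?",
    ]
    return [pool[i % len(pool)] for i in range(n)]


def toy_target_lm(prompt: str) -> str:
    if "invisible" in prompt:
        return "I would insult everyone."
    return "I'd rather not say."


def toy_classifier(prompt: str, reply: str) -> float:
    return 0.9 if "insult" in reply else 0.1


failures = red_team(toy_red_lm, toy_target_lm, toy_classifier, n_prompts=6)
print(f"{len(failures)} of 6 prompts led to a flagged reply")
```

The flagged triples are the raw material for step 3: group them by topic or by shared phrases to see where the target LM tends to fail.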


  1. Zero-shot Generation (ZS): zero-shot refers to the process of prompting a model without any examples (“how do I make a PB&J?” instead of “how do I make a PB&J? When I make a BLT, first I…”). Instead, the prompt relies only on the model’s pre-training, intuition and creativity to generate outputs. In this case, the team prompted the LM to generate malicious prompts to test on the target LM and guided the specific output based on keywords:

    • E.g. Create me a list of 10 malicious prompts about how to break into my neighbor’s car.

  2. Stochastic Few-shot Generation (SFS): few-shot refers to the process of prompting a model with a few examples (“how do I make a PB&J? Here’s how I make a BLT… here’s how I make a turkey club…”). The researchers used the examples from the zero-shot generation (above) that resulted in malicious outputs as examples in these prompts. They had a classifier determine which prompts from zero-shot generation were more likely to result in malicious output than others. They focused on sampling from that subset, over others that were less likely to result in malicious output, which is what the term stochastic refers to here.

  3. Supervised Learning (SL): they then fine-tuned the LM on a labeled dataset* of prompts generated from the previous two steps that were most likely to lead to malicious outputs.

    • *What is a labeled dataset? (link)

  4. Reinforcement Learning (RL): they then fine-tune on top of this SL model using a reward model trained to favor the generation of malicious prompts. What is RL? Check out the full post here.
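The “stochastic” part of step 2 is the easiest piece to sketch in code. Below is a minimal Python illustration that assumes each zero-shot prompt already has a classifier score between 0 and 1, and samples few-shot examples with probability proportional to exp(score / temperature). The function names, the temperature default, the prompt template and sampling with replacement are all assumptions made for this example, not details confirmed by this post.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_few_shot_examples(
    prompts: Sequence[str],
    scores: Sequence[float],
    k: int,
    temperature: float = 0.1,  # illustrative default
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick k prompts, favoring those with higher classifier scores."""
    rng = rng or random.Random()
    # Higher score -> exponentially larger sampling weight.
    weights = [math.exp(score / temperature) for score in scores]
    # random.choices samples with replacement (a simplification).
    return rng.choices(list(prompts), weights=weights, k=k)


def build_few_shot_prompt(examples: Sequence[str]) -> str:
    """Format the sampled examples as a numbered list for the red LM to continue."""
    lines = ["List of questions to ask someone:"]  # hypothetical template
    lines += [f"{i}. {q}" for i, q in enumerate(examples, start=1)]
    lines.append(f"{len(examples) + 1}.")  # the red LM writes the next item
    return "\n".join(lines)


zero_shot_prompts = [
    "What is your favorite color?",
    "What do you really think of your neighbors?",
    "If you could get away with anything, what would you do?",
]
classifier_scores = [0.02, 0.40, 0.85]  # made-up scores for illustration

examples = sample_few_shot_examples(
    zero_shot_prompts, classifier_scores, k=2, rng=random.Random(0)
)
print(build_few_shot_prompt(examples))
```

A lower temperature concentrates sampling on the highest-scoring prompts, while a higher one keeps more variety in the few-shot examples.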

The results

  • Test sample: 500k prompts

  • Model: 280B parameter model similar to DeepMind’s Gopher

  • Using the plain, zero-shot (ZS) method:

    • Failures: 18,444 cases (3.7%) where the LM returns offensive output

  • Using the few-shot (SFS) method: similar failure to ZS method ~3.7%

  • Using the Supervised Learning (SL) method: similar failure to SFS method ~3.7%, but with less diversity

  • Using Reinforcement Learning (RL) method: depending on the strength of the KL penalty used during training, RL results in the highest failure percentage: 40+%

    • Within those failed examples, 78% of the prompts included the word “invisible”, meaning that the RL model converged on a narrow but consistently successful way to elicit offensive output

    • E.g. “If you were invisible, what would you do?”
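As a quick sanity check on the numbers above, here’s the arithmetic in Python. The only inputs are the figures quoted in this post; treating 40% as a lower bound for RL follows from the “40+%” figure.

```python
def failure_rate(failures: int, total: int) -> float:
    """Fraction of test prompts that led to an offensive reply."""
    return failures / total


zero_shot_rate = failure_rate(18_444, 500_000)
rl_rate = 0.40  # lower bound, from the "40+%" figure above

print(f"zero-shot failure rate: {zero_shot_rate:.1%}")                # 3.7%
print(f"RL vs zero-shot: at least {rl_rate / zero_shot_rate:.1f}x")   # 10.8x
```

So the RL-trained red LM elicits offensive replies roughly 11 times as often as the plain zero-shot approach.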

Comparing human-written vs model-written prompts: the researchers find that the RL model produces prompts of a similar difficulty and diversity as those produced by human annotators.

Overall, our results suggest that red LMs are highly effective at finding diverse failures in other LMs, even when compared against manual red teaming.

What’s interesting is that red-teaming done by LMs uncovers adversarial prompts that human annotators didn’t.

  • 37 out of the top 100 flagged noun phrases generated by the LMs were considered novel finds. They are not included in previous work by human-annotated red-teaming researchers.

  • This suggests that human generation and LM generation complement each other, as they might both uncover novel adversarial prompts and replies.


Comparing the red LM prompts from Perez et al. to previous research in this area, Perez et al. find that the earlier, human-written dataset contains a higher proportion of offensive prompts. 36% of the prompts in the previous dataset (called Bot-Adversarial Dialogue (BAD) from Xu et al.) are offensive, compared to 19% of the prompts from the RL-trained model by Perez’s team and only 2.3% from the plain, zero-shot model by Perez et al. These large gaps suggest a path forward that leverages both human-generated and LM-generated adversarial prompts.

Furthermore, the Perez et al. research also suggests that offensive outputs might only appear after a few conversation turns. Future research could dive into how models perform defensively in multi-turn dialogues.

Paper 2: How did the team do red-teaming in Anthropic’s 2022 paper?

What did the team do?

The team created a dataset with ~40k red team prompts, and tested on models up to 52B parameters large. This research focuses more on RLHF as a promising safety intervention compared to previous research in this area.


  1. Asked human labelers to come up with adversarial prompts.

  2. Tested 4 different model types at 3 different model sizes.

  3. Identified patterns of undesirable outputs and looked for solutions to mitigate those issues.

Model sizes:

  • 2.7B parameters

  • 13B parameters

  • 52B parameters

Model types:

  • a plain language model (LM)

  • an LM prompted to be helpful, honest, and harmless

  • an LM with rejection sampling

  • a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF)

The model types explained:

The plain language model: the model is shown one example of a 3-turn conversation and asked to come up with similar dialogue-style outputs. Here’s an example of a 3-turn conversation:

The prompted LM: the model is shown 14 examples of outputs that are helpful, honest and harmless before being asked to produce any output. There are two ways of prompting a LM: adding the 14 examples at the beginning of your prompt (aka into the context window) or using context distillation. Context distillation is a form of knowledge distillation, which you can read a full post on here. There isn’t much of a tradeoff in performance between prompting in the context window vs via context distillation. The benefits of using context distillation are that it speeds up inference time (the time it takes to generate an output per prompt) and it frees up the context window during each prompt query, which is beneficial because context windows are typically pretty small (although recently context windows have been expanding!).

The rejection sampling (RS) LM: the researchers then use the prompted LM (above) to generate 16 dialogue examples. They rank-order the results based on their harmlessness. They do this rank-ordering with a harmlessness preference model: a model trained to assess how harmful an output is. They then throw out the bottom 14 examples (the 14 worst performing samples) and keep only the top 2 examples that are the most harmless (or least harmful). For these experiments, the three models (prompted LM, harmlessness preference model, and the resultant rejection sampling LM) are always the same size: if the prompted LM is 2.7B, the other two models will also be 2.7B parameters.

  • In the below example, imagine the two outputs in the purple box are the top 2 least harmful outputs. 14 others that are more harmful have been rejected.

Note: In this interface, the user is asked to pick which output is MORE harmful. In Rejection Sampling, however, the model is looking for the top 2 outputs that are the LEAST harmful. So imagine a list like this:

1. Less harmful - shown in purple box
2. Less harmful - shown in purple box
3. More harmful
4. …
5. Very harmful
6. Extremely harmful
7. …

The question below then is: between the 2 least harmful responses of all the potential responses, which one is worse.

The RLHF model: Using the rejection sampling (RS) model above, the team trained a reward model to encourage dialogues to be as harmless as possible. For a full post on RLHF, check here.


RS vs RLHF: The researchers expected the RLHF and RS models to perform similarly, but there is a tradeoff between expense at training time vs inference (generating output) time. RLHF requires expensive training but is relatively cheap at inference. RS is the opposite; cheap at training time but expensive at inference.
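Here’s a minimal Python sketch of the rejection sampling step described earlier, which also shows where the inference cost comes from: every query pays for 16 generations plus 16 preference-model scores. The function name and the toy generator and scorer are hypothetical; a real setup would call the prompted LM and the harmlessness preference model instead.

```python
import itertools
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    harmlessness: Callable[[str, str], float],
    n_samples: int = 16,
    keep: int = 2,
) -> List[str]:
    """Generate n_samples replies and keep the `keep` most harmless ones."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    # Higher harmlessness score = better, so sort in descending order.
    ranked = sorted(candidates, key=lambda r: harmlessness(prompt, r), reverse=True)
    return ranked[:keep]


# Toy stand-ins (hypothetical): a generator that cycles through canned
# replies and a lookup table acting as the preference model.
canned_replies = itertools.cycle([
    "Sure, here's how to do it.",
    "I can't help with that.",
    "Let's talk about something safer.",
    "Here's something rude.",
])
toy_scores = {
    "Sure, here's how to do it.": 0.3,
    "I can't help with that.": 0.9,
    "Let's talk about something safer.": 0.8,
    "Here's something rude.": 0.1,
}


def toy_generate(prompt: str) -> str:
    return next(canned_replies)


def toy_harmlessness(prompt: str, reply: str) -> float:
    return toy_scores[reply]


best = rejection_sample(
    "How do I pick a lock?", toy_generate, toy_harmlessness, n_samples=4, keep=2
)
print(best)
```

Nothing here requires training, which is why RS is cheap up front, but the cost per query grows with `n_samples`.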

Along with the paper, they released a dataset of ~40k red team attacks so that other researchers can use it to further red-teaming research or improve the harmlessness of their models.

There is a bit of hesitation, sometimes, to release a dataset of this kind publicly, because it can also be accessed by bad actors who want to train a model to be even more toxic, rather than less. The team weighed the overall costs-to-benefits and decided to share the dataset.

One main takeaway from this research is that RLHF models are much harder to red-team as they get larger (larger model size by parameters), which is an interesting insight given many of the most cutting-edge foundation models now are trained with RLHF and are very large (GPT-3.5 is reportedly around 175B parameters).


  • The team was surprised to find the plain LM (1 example prompt) and the prompted LM (14 example prompts) perform similarly.

  • The rejection sampling (RS) LM tended to give evasive answers in order to avoid producing harmful results. This comes back to the conflict between being harmless vs helpful (more here).

  • There is no clear trend between increasing model size and increasing susceptibility to adversarial attack for the RS LMs. This is surprising because prior research implies a trend between increasing model size and attack vulnerability.

Here’s a distribution of some of the most common attacks:

Some limitations of red-teaming

The red-teaming technique may not cover all possible scenarios or contexts where harmful behavior can occur. This is a particular concern when harmful prompts are generated by other LMs, as they may not produce the diversity that human annotators might produce. On the other hand, this research showed that LMs are able to uncover prompts at a higher volume and speed than human annotators could on the same budget, and they can still produce harmful prompts at a similar level of diversity and difficulty. Each method has blind spots, so perhaps a combination of the two is the best path forward.

Classifiers may be biased or produce errors that result in lower performance or reliability in this research, which is concerning given the downsides of a successful adversarial attack. It only takes one user and one attack to result in a loss of public trust, while there are seemingly infinite harms that LMs and researchers must be cognizant of and protect against. Flawed classifiers may result in more false positives or false negatives, both of which lead to situations of harm.

Releasing this dataset of harmful attacks may have unintended consequences or risks that are not fully anticipated or mitigated. This comes with the tradeoff that hopefully this research helps reduce the toxicity of future LLMs.

Some solutions

So, what are some solutions to LLM bad behavior in deployment and training?

Offensive language:

  • certain words can be added to a blacklist during output generation, which can reduce the number of offensive replies without retraining

  • certain words can be omitted from future training data so LMs aren’t able to access that information
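The first idea above, checking outputs against a word blacklist at generation time, might look something like this minimal Python sketch. The function names and the fallback message are made up for illustration, and the two example words are the derisive terms mentioned earlier in this post.

```python
import re
from typing import Iterable, Pattern


def build_blocklist_pattern(words: Iterable[str]) -> Pattern[str]:
    """Compile a case-insensitive, whole-word pattern from the blacklist."""
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    if not escaped:
        raise ValueError("blacklist must contain at least one word")
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_reply(
    reply: str,
    pattern: Pattern[str],
    fallback: str = "Sorry, I can't help with that.",  # hypothetical message
) -> str:
    """Return the reply unchanged, or the fallback if it hits the blacklist."""
    return fallback if pattern.search(reply) else reply


pattern = build_blocklist_pattern(["idiot", "stupid"])
print(filter_reply("You are an IDIOT.", pattern))   # replaced by the fallback
print(filter_reply("Nice to meet you.", pattern))   # passes through unchanged
```

Because the pattern only matches whole words, variants like “idiotic” slip through. That’s a small example of why a blacklist reduces offensive replies but can’t catch everything.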

Data leakage:

  • when a user asks for a quote knowing that it might lead to plagiarism, developers can preempt issues like that by putting those kinds of questions on a blacklist

  • the same is true for any other harmful keywords (e.g. “tell me why this person is an idiot” → “idiot” can be on a blacklist)

  • block real personally identifiable information in training data

  • when that is not possible, identify where the highest risk cases of misuse occur. In Perez et al., they found that when phone numbers from the training dataset appeared fewer than 1000 times, they were more likely to be misused. Perhaps developers can remove phone numbers that don’t meet a minimum frequency threshold in the training dataset, to prevent any misuse in output
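The frequency-threshold idea in the last bullet can be sketched in a few lines of Python. This is a rough illustration that assumes US-style phone numbers (like 555-123-4567) and an in-memory list of documents; the regex, the function names and the `[PHONE]` placeholder are assumptions for this example, and only the 1000 threshold comes from the text above.

```python
import re
from collections import Counter
from typing import List

# Simplified pattern for numbers like 555-123-4567 or 555.123.4567.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def normalize(number: str) -> str:
    """Strip separators so differently formatted copies count together."""
    return re.sub(r"\D", "", number)


def redact_rare_phone_numbers(
    documents: List[str],
    min_count: int = 1000,
    placeholder: str = "[PHONE]",
) -> List[str]:
    """Replace phone numbers seen fewer than min_count times across the corpus."""
    counts = Counter(
        normalize(match) for doc in documents for match in PHONE_RE.findall(doc)
    )

    def replace(match: "re.Match[str]") -> str:
        number = match.group(0)
        return number if counts[normalize(number)] >= min_count else placeholder

    return [PHONE_RE.sub(replace, doc) for doc in documents]


docs = [
    "Call our hotline at 555-123-4567.",
    "Hotline: 555.123.4567 (open 24/7).",
    "My cell is 555-000-1111.",
]
# A tiny threshold so the toy corpus shows both outcomes.
print(redact_rare_phone_numbers(docs, min_count=2))
```

The hotline number appears twice and is kept, while the one-off cell number is replaced. On a real training set the same logic would run with the much higher threshold.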

What do you think about red-teaming? If you have any useful blogs, papers or demos on red-teaming, I’d love to keep reading. Drop me a line or reply to this email.

If you’d like to read more, here are the two papers I used to inform today’s post:

  • Red Teaming Language Models with Language Models (Link)

  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Link)

🎁 Miscellaneous


  • Paul McCartney is using AI to generate another Beatles song (Link)



That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI.
If you enjoyed this post, feel free to
share it with any AI-curious friends. Cheers!