Superfast AI 5/22/23
Constitutional AI, Dromedary and biased human-labeled datasets.
Hi everyone!
Today we'll break down Constitutional AI, Dromedary and biased human-labeled datasets.
Let's dive in!
DALL-E on a sun-drenched campus
News
What kinds of bias does human labeling introduce?
Researchers at MIT and the University of Toronto published a paper on how human-labeled data can compound the effects of bias in models trained to judge policy violations. The short version:
The experiment: build a model that can accurately classify whether an apartment policy has been violated.
The apartment policy: No aggressive dogs are allowed.
The methodology: train the same model on two different kinds of datasets and compare the performance (a rough sketch of this setup in code follows the list below).
The two datasets:
Descriptive dataset: human annotators marked whether an image showed an aggressive dog or not.
Normative dataset: human annotators marked whether an image showed a dog that violated the apartment policy against aggressive dogs.
The conclusion: the model trained on the normative dataset performed better than the one trained on the descriptive dataset. The disagreement between descriptive and normative judgements for aggressive dog images was as high as 20%.
Descriptive dataset: human annotators were much more likely to label a dog image as aggressive when they evaluated the description alone, which led to an implied policy violation in the final model task.
The model trained on the descriptive dataset was more likely to predict an incorrect policy violation, meaning it was more likely to flag a dog as violating the apartment policy when it actually had not.
The model was also more likely to make mistakes on images where human annotators disagreed.
Normative dataset: the researchers hypothesize that normative judgements are likely more lenient when evaluating policy violations.
"One hypothesis is that maybe how people think about rule violations is different from how they think about descriptive data. Generally, normative decisions are more lenient."
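If you like seeing ideas in code, here is a rough sketch of that setup in Python. It is not the researchers' actual code: the data, labels, and classifier are invented stand-ins, just to show the shape of the comparison between descriptively and normatively labeled training data.

```python
# A rough sketch (made-up data, stand-in classifier, not the paper's code) of
# the experimental setup: train the same text classifier on two label sets for
# the same descriptions, then compare how often each one flags a violation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_violation_classifier(descriptions, labels):
    """Fit a simple text classifier: 1 = policy violation, 0 = no violation."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(descriptions, labels)
    return model

# Toy, invented examples: the same descriptions annotated two different ways.
descriptions = [
    "large dog barking and lunging at strangers",
    "dog growling at a delivery worker through the fence",
    "small dog sleeping on the couch",
    "golden retriever calmly greeting a neighbor",
]
descriptive_labels = [1, 1, 0, 0]  # annotators asked: is this dog aggressive?
normative_labels = [1, 0, 0, 0]    # annotators asked: is the policy violated?

descriptive_model = train_violation_classifier(descriptions, descriptive_labels)
normative_model = train_violation_classifier(descriptions, normative_labels)

# With real data you would compare predicted violation rates on a held-out set;
# the paper found the descriptively trained model flags violations more often.
held_out = ["dog barking loudly at the mail carrier"]
print("descriptive model:", descriptive_model.predict(held_out))
print("normative model:", normative_model.predict(held_out))
```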
What do you think? Apartment policy violator or best friend?
For a deeper dive, read the full blog post here.
Dromedary: Fine-tune a model in under 300 lines of human annotations
Researchers have built a new model called Dromedary on top of Meta's LLaMA-65B model (65 billion parameters) with fewer than 300 pieces of human direction or annotation.
Jack Clark, a co-founder of Anthropic, covered Dromedary in his weekly newsletter and likened the method to Anthropic's Constitutional AI approach, mainly due to the principle-driven self-alignment step (more on that in step 2 below). Here's a quick breakdown of the model and some of the evaluations.
There are 4 main steps to the process (a rough code sketch follows the list below):
Generate synthetic prompts.
The model starts with 195 example prompts.
It is also given 7 instruction rules to guide the model output and encourage diversity.
This resulted in 360k synthetic prompts.
Write a set of principles (similar to a constitution) that the model should abide by. Prompt the model with these principles to produce responses to the synthetic prompts. Filter out the responses that fail to align with the given principles.
In this case, the model was given 16 principles to follow.
The model was also provided 5 in-context demonstrations of those principles for the model to learn from.
This process resulted in 260k prompts after filtering.
Fine-tune the original LLM on the 260k prompt-response pairs that result from step 2.
This fine-tuning uses only the prompt-response pairs, so the final model no longer needs the 16 principles or 5 in-context examples from step 2 at inference time.
Refine the model to address remaining issues with responses that are too brief or indirect.
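For the code-curious, here is a rough pseudo-Python sketch of those four stages. Everything in it is a placeholder: `generate`, `finetune` and `follows_principles` stand in for the paper's actual prompting, filtering and training code, and the structure is a paraphrase of the write-up above rather than the authors' implementation.

```python
# A rough pseudo-Python sketch of the four stages above. `generate`, `finetune`
# and `follows_principles` are hypothetical placeholders, not the paper's code.

def self_align(base_lm, seed_prompts, topic_rules, principles, demos,
               generate, finetune, follows_principles):
    # Step 1: expand the ~195 seed prompts into a large pool of synthetic
    # prompts (360k in the paper), guided by a handful of rules that
    # encourage diversity.
    synthetic_prompts = generate(base_lm, task="write new instructions",
                                 examples=seed_prompts, guidance=topic_rules)

    # Step 2: answer each prompt with the 16 principles and 5 in-context
    # demonstrations, and keep only responses that follow the principles
    # (260k prompt-response pairs survive filtering in the paper).
    pairs = []
    for prompt in synthetic_prompts:
        response = generate(base_lm, task=prompt,
                            guidance=principles, examples=demos)
        if follows_principles(response, principles):
            pairs.append((prompt, response))

    # Step 3: fine-tune the original model on the surviving pairs, so the
    # principles and demos no longer need to sit in the context window.
    aligned_lm = finetune(base_lm, pairs)

    # Step 4: a final refinement pass that regenerates fuller answers and
    # fine-tunes on them, to fix overly brief or indirect responses.
    verbose_pairs = [(prompt, generate(aligned_lm, task=prompt,
                                       guidance="answer in detail"))
                     for prompt, _ in pairs]
    return finetune(aligned_lm, verbose_pairs)
```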
Results: Overall, Dromedary-65B outperforms some LLMs, but not others.
HHH Evaluation:
ChatGPT is the winner among the models evaluated (Anthropic's LM is noticeably missing, alongside some other foundation models) on Harmlessness, Honesty, Other and Overall.
Dromedary is a close second, at least for the models evaluated.
Dromedary wins by a narrow margin on Helpfulness. It would be interesting to see other researchers replicate this research.
Vicuna benchmark comparison:
Dromedary wins against Databricks' Dolly-V2, Meta's baseline model LLaMA (on which Dromedary is based), OpenAI's Text-Davinci-003 and Stanford's Alpaca.
Dromedary loses to Vicuna (on the Vicuna benchmark) and ChatGPT. It should be noted that GPT-4 judged the winners and losers, which might affect the validity of the results.
If you'd like to dive deeper, read the full paper here.
Concepts & Learning
Today we're going to dive into Constitutional AI.
Last year, Anthropic published a paper on their work on Constitutional AI. We recently covered it in a previous Superfast AI. Anthropic also covered their methodologies in their own blog here, which is worth a read.
In this paper, Anthropic compares models trained via Constitutional AI to models trained with RLHF. Their findings show that human evaluators rate the output of Constitutional AI (CAI) trained models as more harmless than the output of models trained on human feedback, which is a surprising result! Let's break it down below.
What is Constitutional AI?
A training method that uses a constitution (or a list of rules) to help define good and bad outputs. In this paper, Anthropic focuses on producing helpful and harmless outputs, with the goal of building models that align with human values.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a popular training method that uses a separate reward model (RM) to fine-tune a final LLM. The reward model acts as a kind of "teacher" that corrects the outputs of the "student" model, which learns the teacher's preferences. This can be particularly helpful when trying to train a model to avoid certain behavior (like harmfulness) or engage in more of another kind of behavior (like helpfulness). If you'd like a deeper dive on the process, take a quick read here.
What is RLHF vs RLAIF?
RLHF is Reinforcement Learning with Human Feedback. Human feedback is used to train the reward model, which then gives feedback to the final model on what outputs are most preferred by humans.
RLAIF is Reinforcement Learning with AI Feedback. Constitutional AI is a kind of AI feedback: it is used to train the reward model, which then gives feedback to the final model on which outputs are most preferred according to the principles in the constitution.
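Here is a minimal sketch of where the two recipes differ: only the source of the preference label changes, while the reward model training and RL fine-tuning stay the same. The helper names (`ask_human`, `ask_ai_critic`, `train_reward_model`, `rl_finetune`) are hypothetical placeholders, not a real library API.

```python
# A minimal sketch of RLHF vs RLAIF: only the labeler changes. All helper
# functions here are hypothetical placeholders, not a real library API.

def collect_preferences(prompts, model, labeler):
    """For each prompt, sample two responses and ask the labeler which is better."""
    data = []
    for prompt in prompts:
        a, b = model.sample(prompt), model.sample(prompt)
        preferred = labeler(prompt, a, b)   # returns "a" or "b"
        data.append((prompt, a, b, preferred))
    return data

def rlhf(prompts, model, ask_human, train_reward_model, rl_finetune):
    prefs = collect_preferences(prompts, model, ask_human)   # human feedback
    reward_model = train_reward_model(prefs)
    return rl_finetune(model, reward_model)

def rlaif(prompts, model, constitution, ask_ai_critic,
          train_reward_model, rl_finetune):
    # The labeler is another model that judges responses against the constitution.
    ai_labeler = lambda prompt, a, b: ask_ai_critic(constitution, prompt, a, b)
    prefs = collect_preferences(prompts, model, ai_labeler)  # AI feedback
    reward_model = train_reward_model(prefs)
    return rl_finetune(model, reward_model)
```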
Why is this important?
Constitutional AI is an important step in alignment research to help find ways to encourage models to perform in helpful, harmless and honest ways. For a dive into why AI alignment is important, I wrote a piece on it last week that you can check out.
Human feedback is costly, time-consuming, difficult to audit, and potentially biased (see the section above on the kinds of bias human labeling introduces). Reward models trained on synthetic data generated from a constitution are cheaper, faster, potentially easier to audit, and potentially less biased (because the source of the training data is constrained and traceable).
How does it work?
At a high level, here is what the steps of Constitutional AI look like (a rough code sketch follows at the end of this walkthrough):
First you create synthetic dataset #1:
Choose a baseline model.
Generate responses to a set of prompts.
Give the baseline model a set of rules to follow (or a Constitution) and ask it to evaluate and self-critique its responses from step 2.
Some example rules can include the UN Declaration of Human Rights, Apple's ToS, or something else.
Have the baseline model generate a revision to the responses from step 2.
Repeat steps 3 and 4 until the responses sufficiently follow the rules of the Constitution.
Great, now you have synthetic dataset #1. We'll use this to do Supervised Learning on the baseline model.
Next, you'll create synthetic dataset #2:
Fine-tune the baseline model on the synthetic dataset you just created to learn the preferred responses that align with the Constitution. This is called Supervised Learning. Let's call this model the SFT model.
Generate multiple candidate responses per prompt from this SFT model and frame them as a multiple-choice question:
Question: X. Which is the best response below?
Answer choices: A, B, C or D?
Ask another model to choose which response is best according to the Constitution.
Great, now you have synthetic dataset #2. You'll then train a reward model to act as a specialist or teacher to the SFT model (see step 1). This is where the Reinforcement Learning comes in:
Train a reward model (which is different from the baseline or SFT models) on synthetic dataset #2. This reward model becomes a teacher to the larger SFT model.
Have the reward model teach/nudge the SFT model towards outputs like those in synthetic dataset #2. This implicitly teaches the SFT model the Constitution. This creates a final model.
Test your final model on unseen, novel questions.
And voilà! You have your Constitutional AI-trained model.
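Here is the promised rough sketch of the two stages in pseudo-Python. As with the Dromedary sketch above, `generate`, `finetune`, `collect_ai_preferences`, `train_reward_model` and `rl_finetune` are hypothetical placeholders rather than Anthropic's actual code.

```python
# A rough pseudo-Python sketch of the two CAI stages above. All helper
# functions are hypothetical placeholders, not Anthropic's actual code.

def critique_and_revise(model, prompt, constitution, generate, n_rounds=2):
    """Stage 1 inner loop: draft a response, then critique and revise it
    against a few principles from the constitution."""
    response = generate(model, prompt)
    for principle in constitution[:n_rounds]:
        critique = generate(model, f"Critique this response using the principle "
                                   f"'{principle}':\n{response}")
        response = generate(model, f"Revise the response to address this "
                                   f"critique:\n{critique}\n{response}")
    return response

def constitutional_ai(base_model, prompts, constitution, generate, finetune,
                      collect_ai_preferences, train_reward_model, rl_finetune):
    # Stage 1 (synthetic dataset #1): supervised fine-tuning on the revised
    # responses produces the SFT model.
    dataset_1 = [(p, critique_and_revise(base_model, p, constitution, generate))
                 for p in prompts]
    sft_model = finetune(base_model, dataset_1)

    # Stage 2 (synthetic dataset #2): AI preference labels over pairs of SFT
    # responses, a reward model trained on those labels, then RL against the
    # reward model. This is the RLAIF recipe sketched earlier.
    dataset_2 = collect_ai_preferences(sft_model, prompts, constitution)
    reward_model = train_reward_model(dataset_2)
    return rl_finetune(sft_model, reward_model)
```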
Results:
In the left graph, you can see that models trained with RLHF (blue) produced the most helpful responses, even with more training over time (see the x-axis).
In the right graph, you can see that models trained with RL-CAI aka Constitutional AI (light grey) and RL-CAI with Chain-of-thought (dark grey) produced the most harmless responses.
Chain-of-thought refers to models' explanations for their outputs. For an example, see the bottom of this section.
These evaluations are made by human annotators and scored via Elo ratings. Elo is a relative ranking system: a 1st-place player loses far more rating points for losing to the 10th-place player than for losing to the 2nd-place player, and, in reverse, the 10th-place player gains far more for beating the 1st-place player than the 2nd-place player would.
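Here is a quick worked example using the standard Elo update rule (the general formula, not necessarily the exact scoring setup used in the paper). It shows why an upset moves ratings much more than an expected win does.

```python
# A quick worked example of the standard Elo update rule.

def elo_points_transferred(winner_rating, loser_rating, k=32):
    """Rating points the winner gains (and the loser loses) after one game."""
    expected_win = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    return k * (1 - expected_win)

# 1st place (rated 1800) beats 2nd place (1750): small transfer, ~13.7 points.
print(elo_points_transferred(1800, 1750))
# 10th place (1400) upsets 1st place (1800): large transfer, ~29.1 points.
print(elo_points_transferred(1400, 1800))
```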
What are some key open questions about Constitutional AI?
How to design and evaluate the principles or rules that the AI system should follow?
How to ensure that the AI system does not evade or manipulate the principles or rules?
How to scale up the method to more complex and diverse domains and tasks?
How to balance the trade-off between performance and harmlessness?
How to integrate Constitutional AI with other methods for ensuring AI safety and ethics?
If you'd like to read more, the full Constitutional AI paper is here and the blog post by Anthropic is here.
Chain-of-thought (CoT):
Chain-of-thought research focuses on how well models are able to give explanations for their outputs. Chain-of-thought refers to a series of intermediate reasoning steps that lead to a final answer or conclusion. The idea is that chain-of-thought prompting can improve the performance and transparency of LLMs on complex reasoning tasks (a small prompting sketch follows the examples below).
Take these two examples:
Example 1:
Question: How many minutes are in a day?
Regular answer: 1,440 minutes
Example 2:
A chain-of-thought response might look like this:
Question: How many minutes are in a day?
Chain-of-thought answer:
- To find the answer, we need to multiply the number of days by the number of hours in a day, and then by the number of minutes in an hour.
- The number of hours in 1 day is 24.
- The number of minutes in 1 hour is 60.
- Using the multiplication rule, we get 1 x 24 = 24.
- 24 is the input for the next step.
- Using the multiplication rule, we get 24 x 60 = 1,440.
CoT Answer: 1,440 minutes
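Finally, here is a tiny sketch of what chain-of-thought prompting looks like in practice: the only thing that changes is the prompt you send to the model. `query_llm` is a hypothetical stand-in for whatever model API you are using.

```python
# A tiny sketch of chain-of-thought prompting: the only change is the prompt.
# `query_llm` below is a hypothetical stand-in for a model API.

def direct_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # A worked example with intermediate steps (plus a "think step by step"
    # nudge) encourages the model to show its reasoning before answering.
    worked_example = (
        "Q: How many seconds are in an hour?\n"
        "A: There are 60 minutes in an hour and 60 seconds in a minute, "
        "so 60 x 60 = 3,600 seconds in an hour.\n\n"
    )
    return worked_example + f"Q: {question}\nA: Let's think step by step."

question = "How many minutes are in a day?"
print(cot_prompt(question))
# answer = query_llm(cot_prompt(question))  # hypothetical model call
```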
Miscellaneous
Bard vs ChatGPT
Looking for a lightweight comparison of Bard vs ChatGPT? Check out this blog that dives into a few examples and decide for yourself (Link)
Generative AI ads
Check out this amazing Coca-Cola ad made with Generative AI:
Which staircase to softness looks the best?
Source: IG @ulises.ai
Who doesn't love a fluffy friend?
Source: IG @chubby.animol
That's it! Have a great day and see you next week!
What did you think about today's newsletter? Send me a DM on Twitter @barralexandra or reply to this email!
Thanks for reading Superfast AI.
If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!