Superfast AI 5/8/23

Inflection AI’s chatbot Pi, RLHF, and blending fashion and fine art.

Hi everyone 👋

Today we’ll dive into Inflection AI’s chatbot Pi, a breakdown of RLHF, and Midjourney’s blend of fashion and fine art.

Let’s dive in!

DALL-E’s take on modern outfits inspired by Dalí, fashion photoshoot

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

🗞️ News

Inflection AI: Hey Pi

  • This week, Inflection AI dropped their chatbot, Pi, which aims to be your conversational AI partner

  • Check out the Inflection press release here

The Pi experience is intended to prioritize conversation with people, whereas other AIs focus on productivity, search, or answering questions. Pi is meant to be a coach, confidante, creative partner, or sounding board.

  • Check out a review of Pi here:

  • Overall, it seems that Pi isn’t living up to user expectations

  • Inflection has raised $225M to date, and many expected more from their debut product

Replit: ReplitLM

  • Replit released ReplitLM, an open-source coding LM available on Hugging Face for commercial use. The model was trained on 525B tokens across 20 programming languages. Check out the announcement here and a demo page here!

ChatGPT: Code interpreter

  • Researchers at UPenn have used ChatGPT’s code interpreter to turn their data into visualizations. Check it out here. While it’s not perfect or especially sophisticated, it’s an interesting start for the data-analytics use case. Hopefully, we’ll see strong improvements over time.

Stability AI: DeepFloyd IF

  • Stability AI released an upgrade to their text-to-image model with DeepFloyd IF, a cascaded pixel diffusion model.

    • Historically, image models have struggled to output legible or coherent text.

    • DeepFloyd IF makes significant improvements in this direction with its latest release.

    • Check out the full announcement here.

📚 Concepts & Learning

Today we’re going to dive into what Reinforcement Learning from Human Feedback (RLHF) is and why it’s a popular method for training language models.

Let’s start with what Large Language Models (LLMs) are and how they work. Language models are a class of models that generate useful text outputs given a particular text input.

In the early days, LLMs would simply predict the most plausible completion of some input text. These days, language models are more sophisticated and generate instruction-following outputs, like answers to queries, summaries, classifications, and more.

Perhaps the most powerful LLM to date is OpenAI’s GPT-4. One of the keys to GPT-4’s linguistic ability is Reinforcement Learning from Human Feedback (RLHF). RLHF is just one component of GPT-4’s training process, which has three main stages:

  1. Pre-training: Train on a massive, unfiltered, unlabeled corpus of data, e.g. internet text from Reddit, arXiv, forums, books, and more.

  2. Supervised Fine-Tuning (SFT): Train on a large, filtered, labeled corpus of data, e.g. instruction-following question-and-answer pairs, full-text-and-summary pairs, domain-specific data (medical, legal, or other expert domains), and more.

  3. Reinforcement Learning from Human Feedback (RLHF): Train a separate reward model to predict which LLM answer(s) to a given prompt are best, based on surveys of human preferences. Then use that reward model to reinforce desired behavior in the final LLM.
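As a bird’s-eye view, here’s a tiny runnable Python sketch of how these three stages hand off to one another. The functions are trivial stubs (standing in for what is, in reality, an enormous amount of compute), so this shows only the data flow, not a real training implementation:

```python
# Bird's-eye view of the three training stages as a pipeline.
# These functions are stubs that only illustrate the data flow.

def pretrain(corpus):
    # Stage 1: next-token prediction over a huge unlabeled corpus.
    return {"stage": "pretrained", "seen_tokens": sum(len(doc.split()) for doc in corpus)}

def supervised_finetune(model, labeled_pairs):
    # Stage 2: learn from curated (instruction, response) pairs.
    return {**model, "stage": "sft", "sft_examples": len(labeled_pairs)}

def train_reward_model(preference_votes):
    # Stage 3a: fit a separate model that scores answers the way humans voted.
    return {"stage": "reward_model", "comparisons": len(preference_votes)}

def rlhf_finetune(model, reward_model):
    # Stage 3b: reinforce answers the reward model scores highly.
    return {**model, "stage": "rlhf", "guided_by": reward_model["stage"]}

base = pretrain(["a reddit thread ...", "an arxiv abstract ..."])
sft = supervised_finetune(base, [("Summarize this article.", "In short, ...")])
rm = train_reward_model([("Flying", "Driving")])  # (preferred, rejected) pairs
final = rlhf_finetune(sft, rm)
print(final)  # the final model dict, tagged 'rlhf'
```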

So, at a high level, what does RLHF look like?

First, a user prompts a baseline or foundation model. The model will have a number of candidate outputs to choose from. Let’s say it looks like this:

User Question: What’s the best way to get from NYC to Boston?

LLM Answers:
A. By foot: Walking or Running
B. By plane: Flying
C. By car: Driving
D. By boat: Sailing

While all of these are legitimate answers (in some sense), some are more preferable than others. Humans evaluating these answers would vote on which answer they think is best. Let’s say 10 people vote on this Q&A and this is the distribution of answers:

Which is the best answer to this question?

Question: What’s the best way to get from NYC to Boston?

Answers:
A. By foot: Walking or Running (0%)
B. By plane: Flying (60%)
C. By car: Driving (30%)
D. By boat: Sailing (10%)

Looks like flying is the winner in this scenario, although there is ambiguity in the question. (What does “best” mean? A sailor might argue that the best way to get from NYC → Boston is via the gentle, high seas.)
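To make that concrete, here’s a small, runnable Python sketch of one common way to package votes like these for reward-model training: every pair of answers becomes a “chosen vs. rejected” comparison. (The exact data formats labs use vary; this is just an illustration using the toy numbers above.)

```python
# Turn the toy vote percentages into pairwise "chosen vs. rejected" comparisons,
# the format reward models are typically trained on.

prompt = "What's the best way to get from NYC to Boston?"
votes = {
    "By foot: Walking or Running": 0.0,
    "By plane: Flying": 0.6,
    "By car: Driving": 0.3,
    "By boat: Sailing": 0.1,
}

comparisons = []
answers = list(votes)
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        a, b = answers[i], answers[j]
        if votes[a] == votes[b]:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if votes[a] > votes[b] else (b, a)
        comparisons.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

for c in comparisons:
    print(f'chosen: "{c["chosen"]}"  >  rejected: "{c["rejected"]}"')
```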

Now a separate reward model is trained on this labeled dataset of human preferences. The reward model learns that, for NYC → Boston, humans prefer flying over the other options (it received 60% of the vote). It can even draw general inferences about what humans prefer on questions it hasn’t seen.
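How is that reward model actually trained? A common recipe (used in OpenAI’s InstructGPT work; GPT-4’s exact setup isn’t public) is a pairwise ranking loss: the model should assign a higher scalar score to the chosen answer than to the rejected one. Here’s a runnable toy sketch in PyTorch, where a hashed bag-of-words plus a linear layer stands in for a real language-model backbone:

```python
import torch
import torch.nn.functional as F

# Toy reward model: a linear layer over hashed bag-of-words features.
# The loss is the standard pairwise ranking objective: -log sigmoid(r_chosen - r_rejected).

VOCAB = 1000

def featurize(text: str) -> torch.Tensor:
    vec = torch.zeros(VOCAB)
    for token in text.lower().split():
        vec[hash(token) % VOCAB] += 1.0
    return vec

reward_model = torch.nn.Linear(VOCAB, 1)  # maps features -> scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# One comparison from the toy dataset above: "Flying" was preferred to "Driving".
prompt = "What's the best way to get from NYC to Boston?"
chosen = prompt + " By plane: Flying"
rejected = prompt + " By car: Driving"

for step in range(100):
    r_chosen = reward_model(featurize(chosen))
    r_rejected = reward_model(featurize(rejected))
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("reward(Flying)  =", reward_model(featurize(chosen)).item())
print("reward(Driving) =", reward_model(featurize(rejected)).item())
```

After training, the reward model scores the preferred answer higher, and, with enough comparisons across many prompts, that preference signal starts to generalize to prompts it has never seen.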

Now that this reward model knows (approximately) what humans prefer, it steps in as a regulator when training the production LLM. When the LLM is given the question “What’s the best way to get from NYC to Boston?”, it may believe, on the basis of its pre-training corpus, that people tend to drive 70% of the time when traveling NYC → Boston. It will then select “Driving” as the most probable output and pass that answer to the reward model. The reward model, however, knows that the preferred answer is “Flying”, so it pushes the LLM to update toward that answer by responding negatively to “Driving” and positively to “Flying”. Over time, through this training mechanism, the LLM learns to output answers that align with the reward model, and therefore roughly align with human preferences, even for questions that neither model has seen before.
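Here’s a deliberately simplified, runnable PyTorch sketch of that feedback loop. The “policy” is just a probability distribution over the four toy answers, initialized to favor Driving (as the pre-training data might suggest); the reward-model scores are made up; and a KL penalty, which production systems use to keep the tuned model close to the original, pulls against drifting too far. Real RLHF (e.g., PPO in the InstructGPT paper) operates over token sequences, not whole answers.

```python
import torch
import torch.nn.functional as F

# A toy RL step: nudge a categorical "policy" over four candidate answers toward
# the answer the reward model scores highest, while a KL penalty keeps it close
# to the reference (pre-RLHF) distribution.

answers = ["Walking", "Flying", "Driving", "Sailing"]
rewards = torch.tensor([0.0, 1.0, 0.5, 0.1])                    # assumed reward-model scores
ref_logits = torch.log(torch.tensor([0.05, 0.20, 0.70, 0.05]))  # "pre-trained" preference for Driving
logits = ref_logits.clone().requires_grad_(True)                # policy starts at the reference
optimizer = torch.optim.Adam([logits], lr=0.1)
kl_coef = 0.05

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    expected_reward = (probs * rewards).sum()
    kl = (probs * (F.log_softmax(logits, dim=-1) - F.log_softmax(ref_logits, dim=-1))).sum()
    loss = -(expected_reward - kl_coef * kl)  # maximize reward, penalize drift from the reference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for answer, p in zip(answers, F.softmax(logits, dim=-1)):
    print(f"{answer}: {p.item():.2f}")  # probability mass shifts from Driving toward Flying
```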

So, why does RLHF work, and why is RLHF used in addition to SFT?

  • RLHF provides negative feedback, letting the model know when people prefer less of this, rather than just more of that. Training datasets usually contain a lot of positive examples to emulate, but lack negative examples to steer away from.

  • RLHF provides an LLM more data about human values, which often aren’t captured by SFT alone.

  • RLHF can be used to instill specific values in an LLM, like helpfulness or non-toxicity. For example, if a human prefers some LLM outputs over others because of some intuitive value, that value can be learned by the LLM using RLHF.

If you want to dive deeper, here’s a good primer on RLHF and another here.

🎁 Miscellaneous

A worthy read

A worthy listen

20VC: In AI, who wins? Startups or incumbents? Sarah Guo on the future of AI here.

Midjourney: AI fashion meets fine-art

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!