Superfast AI 5/29/23

Knowledge Distillation, Neuralink’s Human Trials, and QLoRA.

Hi everyone 👋

Today we’ll break down Knowledge Distillation, Neuralink’s human trials, and QLoRA.

Let’s dive in!

DALL-E on Neuralink’s human trials

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

🗞️ News

Last week, Elon Musk’s company, Neuralink, announced it was granted FDA approval to run human trials for its brain-chip implant!

The main use cases in these initial trials are helping bridge the connection between the brain and body, focusing on physical movements. A few cases include:

  • a paralyzed man fist-bumping President Barack Obama with a robotic hand

  • a patient with ALS typing by thinking about keystrokes

  • a tetraplegic patient managing to walk with a slow but natural stride

Check out the WashPo post here.

QLoRA

Recent research dives into how to conduct more efficient fine-tuning of models. (Link)

What is it?

QLoRA is a training method for fine-tuning on top of a baseline LLM. The researchers in this paper used QLoRA to create a 65B-parameter LLM called Guanaco, a fine-tuned version of Meta’s LLaMA. They trained this model on the OASST1 dataset, which is surprisingly small: roughly 9,000 samples versus the 450,000 samples in the (subsampled) FLAN v2.
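For a sense of what this looks like in practice, here’s a rough sketch of QLoRA-style fine-tuning using Hugging Face’s transformers, peft, and bitsandbytes libraries. The base model name and LoRA hyperparameters below are illustrative, not the exact Guanaco recipe, so check the paper and repo for those details.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit (NF4) precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NormalFloat4 data type from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; only these get updated during fine-tuning
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which layers get adapters (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # a tiny fraction of the 4-bit base model
```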

Why is this important for researchers?

This research demonstrates that it’s possible to shrink (quantize) a model down to 4 bits per parameter without reducing the quality of the output. What does "4-bit" refer to? LLMs are typically stored in 32-bit or 16-bit precision, which determines how much information each parameter can hold. By squeezing (quantizing) the model down to 4 bits per parameter, you reduce its size so that it’s cheaper and faster to run inference (i.e., get an output).
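To make "4-bit" concrete, here’s a toy sketch that rounds weights to 16 levels using simple absmax quantization. This is not the NF4 scheme QLoRA actually uses, just an illustration of the precision trade-off; the numbers are made up.

```python
import numpy as np

# Pretend these are a handful of model weights stored as 32-bit floats
weights = np.random.randn(8).astype(np.float32)

# 4 bits = 2**4 = 16 possible values per weight (signed ints -8..7 here)
scale = np.abs(weights).max() / 7
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# At inference time the 4-bit codes are mapped back to approximate floats
dequantized = quantized * scale

print(weights)
print(dequantized)  # close to the originals, but only 16 distinct levels survive
```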

Outside of research, why is it important?

QLoRA demonstrates an easy way to fine-tune models, which is a boon for developers interested in accelerating AI development and capabilities. There are pros and cons to this, depending on your perspective. Previously, there were several barriers to creating and running capable models: time, compute, data, and resources (aka $money$). Now it’s easier to fine-tune and run models with fewer resources.

  • Pros: more researchers and developers can participate in the development of AI

  • Cons: AI alignment researchers and ethicists worry about the safety practices of these practitioners, since the ability to create very capable models could end up in the hands of bad or negligent actors

The results:

Researchers conducted an evaluation of the Guanaco model vs. GPT-4 vs. ChatGPT-3.5 Turbo to see which outputs annotators most preferred. In these evaluations, researchers used Elo ranking scores to determine the winners:

  • 65B Guanaco: 1,023

  • GPT-4: 1,176

  • ChatGPT-3.5 Turbo: 916

In this case, GPT-4 is still the winner, but Guanaco is a close second and beats out ChatGPT-3.5 Turbo. An interesting result and an indication that it’s worth paying attention to this research!

Guanaco models set a new state-of-the-art standard in a comparative evaluation versus GPT-4, coming closer than other systems (e.g., Alpaca, FLANv2, Open Assistant) to approximating its performance.

Jack Clark

What do you think of QLoRA?

A quick aside on Elo ranking scores:

The Elo score is calculated by asking human evaluators to choose which of two outputs they prefer. A key feature to note: if an underdog beats a high-ranking candidate, the underdog gains a large number of points and the high-ranking candidate loses a large number, much more than if each had been competing against candidates ranked similarly to themselves. A quick example (see the short sketch after these bullets):
- If a 1st place candidate loses to a 10th place candidate, their scores are adjusted by a large margin: the 1st place candidate loses many points overall, and the 10th place candidate gains many points overall.

- In contrast, if a 1st place candidate loses to a 2nd place candidate, their scores are adjusted by a small margin: the 1st place candidate loses only a few points overall, and the 2nd place candidate gains only a few points overall.
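Here’s a minimal sketch of a single Elo update, assuming the standard 400-point logistic curve and a K-factor of 32 (the K-factor and ratings below are made up for illustration):

```python
def elo_update(r_winner, r_loser, k=32):
    # Probability the eventual winner was expected to win, given the ratings
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)  # small for an expected win, large for an upset
    return r_winner + delta, r_loser - delta

# Upset: a much lower-rated candidate beats a much higher-rated one -> big swing
print(elo_update(1000, 1400))  # ~ (1029.1, 1370.9)

# Near-equal candidates -> a smaller swing
print(elo_update(1210, 1200))  # ~ (1225.5, 1184.5)
```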

📚 Concepts & Learning

Knowledge Distillation (KD)

Today we’re going to dive into Knowledge Distillation. At a high level, Knowledge Distillation is the process of transferring knowledge from a larger model to a smaller model. A few reasons you might want to do this: it can speed up inference, reduce memory footprint, and allow the smaller model to run on smaller devices (like mobile phones).

Teacher and student models

A simple way to think about knowledge distillation is to think about the larger, main model as a teacher model, and the smaller, faster model as the student. The teacher model is intelligent, pre-trained, and should perform well on the task you’ve set out. The student model is faster but doesn’t come with any training in advance. This teacher model will guide the student model to learn the right patterns between input and output so that the student can quickly and accurately execute on tasks once deployed.

Why do we care about this?

As we move to smaller devices, like mobile phones, running AI models becomes a challenge due to limited compute and memory. Think of running a computer game on your laptop: we’ll often close browser tabs and other desktop programs to free up memory and capacity to run the game. Running large models on smaller devices faces similar problems.

What’s the goal?

The goal is to compress a larger model (by parameter size) into a smaller model with minimal drop in accuracy.

How does it work?

The three components are: the knowledge, the distillation algorithm, and the teacher-student architecture.

There are a few kinds of distillation. Here are two main ones:

  1. Responses - the student model tries to mimic the outputs of the teacher model. This is done by comparing each model’s output for the same input; the goal is then to minimize the difference between those outputs (see the sketch after this list).

  2. Features - the student model tries to mimic the “thought process” as well as the outputs of the teacher model. This is done by comparing which parameters “light up” (or activate) in the intermediate layers, as well as comparing the final output of each model for the same input. The goal is then to minimize the difference between the activated parameters in the student’s and teacher’s intermediate layers, as well as the difference between their outputs.
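Here’s a minimal sketch of the response-based version (option 1), assuming a classification setting. It follows the classic soft-target recipe: the student matches the teacher’s softened output distribution while still learning from the ground-truth labels; the temperature and weighting below are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn from the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

A feature-based version (option 2) would add a term penalizing the distance between chosen intermediate activations of the two models.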

Training methods:

There are a few methods of transferring knowledge from the teacher model to the student model:

  • Offline - A large teacher model comes pre-trained (with knowledge baked in) and teaches a student model. This method is often cited as the easiest to implement.

  • Online - A large teacher model is updated in parallel with the student model. The teacher model is typically already pre-trained (with relevant knowledge baked in). However, if you’d like to update both models (with fresh data, for example), this method allows you to run their fine-tuning in parallel.

  • Self-Distillation - The same model acts as both the teacher and the student. In this case, deeper neural layers may be used to train shallower layers, or knowledge gained earlier in training (earlier epochs) may be transferred to the model later in training (later epochs). This method is a subset of the online training methods.

What are some typical architectures of teacher-student models?

Typically, knowledge flows from deeper, wider neural networks to shallower, thinner ones. Three of the most common types of student models include:

  • a simplified, shallower version of the teacher model with fewer neural layers and fewer neurons (or nodes) per layer

  • a quantized version of the teacher model (squeezing more information through fewer nodes). This typically reduces the amount of information that can be passed on; check out the example below for a more intuitive look at what this might mean.

  • a student model that is the same as the teacher model (we won’t dive into this in this post)

What is a quantized model?

One simple analogy I like to use to describe the transfer of knowledge through a quantized model: imagine transforming one large, high-resolution image into a small, low-resolution image. Typically, you’re squeezing more color pixels through fewer spots on the page. Take these examples below:

  • Each high-resolution image (left-side) has 9 pixels, while each low-resolution image (right-side) has 1 pixel

  • In order to transform from high-rez to low-rez, you have to make a choice about which color you want to move on to the low-rez image: green or grey?

  • In all four cases, the winning color is the one that makes up more than half of the high-rez image

Information is lost in all four examples below, but the dominant color is always represented (even for slightly dominant colors).

Source: Information is lost going from high-rez to low-rez
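If it helps, here’s a toy version of that majority vote in code; the block values are made up, with 1 standing in for green and 0 for grey.

```python
import numpy as np

# A 3x3 "high-rez" block: 6 green pixels (1) vs. 3 grey pixels (0)
block = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 0]])

# Collapse the 9 pixels into a single "low-rez" pixel by majority vote
counts = np.bincount(block.ravel(), minlength=2)
low_res_pixel = counts.argmax()

print(low_res_pixel)  # 1 -> green wins; the grey detail is lost
```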

How does the student model learn from the teacher model?

There are a few ways this is done, including Adversarial Distillation and Multi-Teacher Distillation.

Adversarial Distillation

The student model learns the correct policy by seeing bad examples. These bad examples might mislead the student model into thinking an input-output pair is correct when it is not.

One way to deploy this method is with Generative Adversarial Networks (GANs):

  1. Start with a ground-truth dataset: the input-output pairs are labeled (if you want to learn more about labeled datasets, check out the section below).

  2. Ask a generative model to create fake data that looks like the ground-truth dataset.

  3. Combine both datasets: now you have a dataset with three columns: input, output, and real/fake.

  4. Show the student model an input-output pair and ask whether it is a real or fake example.

  5. Depending on how many of those input-output questions the student model correctly classifies, the student model may go back and do additional training to improve its classification in the future (see the rough sketch just below).
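Here’s a minimal, hypothetical sketch of steps 2-5 in PyTorch: a small generator fakes input-output pairs, and the student plays the real/fake classifier. The toy “ground truth” (output = 3 × input) and the model sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# Student acts as the real/fake classifier over (input, output) pairs (steps 4-5)
student = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
# Generator produces a fake "output" for a given input (step 2)
generator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

bce = nn.BCEWithLogitsLoss()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

def real_batch(n=64):
    # Toy ground-truth dataset (step 1): the "correct" output is 3 * input
    x = torch.rand(n, 1)
    return torch.cat([x, 3 * x], dim=1)

for step in range(1000):
    # Student step: classify real pairs as real (1) and generated pairs as fake (0)
    real = real_batch()
    x = torch.rand(64, 1)
    fake = torch.cat([x, generator(x)], dim=1).detach()
    loss_s = bce(student(real), torch.ones(64, 1)) + bce(student(fake), torch.zeros(64, 1))
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # Generator step: produce pairs the student mistakes for real ones
    x = torch.rand(64, 1)
    fake = torch.cat([x, generator(x)], dim=1)
    loss_g = bce(student(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```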

A quick aside about labeled datasets:

What is a labeled dataset? Let’s say you have a bunch of arithmetic questions:
❓ What is 2+2?
❓ What is 10-4?

A labeled dataset provides an answer for every question:
❓ Input: 2+2 | 💡 Output: 4
❓ Input: 10-4 | 💡 Output: 6

Great, now you have a labeled dataset!

Multi-Teacher Distillation

Another way for the student model to learn is to gather information from multiple teachers, as well as the data itself. The student can combine knowledge from multiple teachers by averaging their responses. Here is a quick graphic of what that might look like:
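In code, a minimal version of that averaging might look like the following, assuming a classification setting with several pre-trained teacher models (the temperature and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_teacher_targets(teachers, x, T=2.0):
    # Average the softened output distributions of all teachers
    with torch.no_grad():
        probs = [F.softmax(teacher(x) / T, dim=-1) for teacher in teachers]
    return torch.stack(probs).mean(dim=0)

def multi_teacher_loss(student_logits, teachers, x, T=2.0):
    # Train the student to match the averaged teacher distribution
    target = multi_teacher_targets(teachers, x, T)
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1), target, reduction="batchmean"
    ) * (T * T)
```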

Applications

There are many ways that knowledge distillation can be used, including:

  • Vision - image classification, facial recognition, autonomous driving, video captioning and more

  • Natural Language Processing (NLP) - text generation, classification, content moderation, Q&As and more

  • Speech - speech recognition and synthesis, speech enhancement, speech classification and more

Overall, Knowledge Distillation is a powerful way to (hopefully) reduce inference time and cost. It’s an interesting field of research as it will help us answer questions that are relevant to discussions on open- vs. closed-source models, as well as questions around application layer fine-tuning. Interesting stuff!

If you’d like to read more, you can dive into this blog post or do a deeper dive on this survey of KD on Arxiv.

🎁 Miscellaneous

An artificial nose!

AI generated art installation

Check out this AI generated art installation: (Link)

Adobe Photoshop

Check out this example of out-painting and masking in Adobe:

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!