Embedding Models Explained: Performance and Usage

What are embeddings and how do they work? Let's dive in.

Hi everyone 👋

To start off — sorry for the lack of posts in August and September! In August I was traveling (pictured below 🏔️) and in September I was exploring something new in the AI space (more to come soon). Time can fly in the blink of an eye!

Anyway, today I’m sharing a post I’ve been meaning to cover for a while now: Embedding Models. It’ll be a fun one with a TLDR and rough-drawn Figma sketches. Let’s dive in!

But first…

My version of a summer escape vs DALL-E’s version:

Mine

DALL-E’s

Truly heavenly 🤗


📚 Concepts & Learning

Want the TLDR? Skip to the bottom of this section.

Want the longer version? Read on :) 👇

What are embedding models?

First of all, why do we need embedding models? The short answer is: machine learning models know how to interpret numbers, not words. So you need a way of translating natural language text (aka words) into numbers or vectors.

You can turn a single word, or even a part of a word (a token), into a number/vector, but you can also turn whole sentences, whole paragraphs or blocks of text into vectors. The more information-rich your vector is, the better it will perform on ML tasks.
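As a toy sketch (the vectors and numbers below are made up, not from a real model), you can picture an embedding model as a lookup table from tokens to vectors, with a longer text embedded crudely by averaging its token vectors:

```python
# A toy "embedding model": a lookup table from tokens to vectors.
# Real models learn these numbers; these are invented for illustration.
toy_embeddings = {
    "sea":   [0.9, 0.1, 0.8],
    "ocean": [0.8, 0.2, 0.9],
    "dog":   [0.1, 0.9, 0.3],
}

def embed_token(token):
    """Look up the vector for a single token."""
    return toy_embeddings[token]

def embed_text(tokens):
    """Average the token vectors to get one vector for the whole text."""
    vectors = [embed_token(t) for t in tokens]
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

print(embed_token("sea"))          # → [0.9, 0.1, 0.8]
print(embed_text(["sea", "dog"]))  # one 3-number vector for two tokens
```

Averaging is the simplest possible way to pool token vectors into a text vector; real models use much smarter pooling, but the principle (text in, numbers out) is the same.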

How can you turn whole paragraphs or images into a single vector?

Let’s try out this visual example:

Starting with the first image on the left, you can assign a color number/label to each pixel, as seen in the middle image. Once every pixel is labeled, the whole image becomes a grid of color labels, as seen in the second and third images. You can flatten that grid of numbers into a single ordered list, which can now be called a vector. Voilà! You now have a vector for this image.

If you do the same pixel-by-pixel color labeling for thousands of images, you’ll have thousands of vectors you can plot in space. You can then search that space for vectors that cluster near the original image’s vector. In theory, images that cluster together should also look visually similar, or be semantically similar along some other dimension.
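The pixel-labeling idea above can be sketched in a few lines (with tiny made-up 2×2 "images" standing in for real ones): flatten each grid of color labels into a vector, then measure how far apart two images sit in that space.

```python
import math

# Toy 2x2 "images": each pixel already has a color label (a small integer).
image_a = [[1, 1], [2, 3]]
image_b = [[1, 1], [2, 4]]   # differs from image_a in one pixel
image_c = [[9, 8], [7, 7]]   # very different from image_a

def to_vector(image):
    """Flatten the grid of color labels, row by row, into one vector."""
    return [pixel for row in image for pixel in row]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Images that look alike end up close together in this space.
print(distance(to_vector(image_a), to_vector(image_b)))  # small
print(distance(to_vector(image_a), to_vector(image_c)))  # large
```

Raw pixel vectors like these are a crude starting point; learned embeddings capture far more, but the "closer means more similar" intuition carries over.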

Here’s another way of thinking about vector space with words:

Imagine you have a way of encoding words into vectors that can be plotted in vector space.

You could imagine the vectors looking like this for different words (visualized with colors):

One way of interpreting the above image is that for particular blocks, some are closer to “on” (red) than to “off” (blue). You can think of each block as representing a dimension in vector space. Each block might represent a different concept (color, size, mobility, flammability, etc.).

Let’s say block 1 represents fire-resistance. Sea and ocean are pretty fire-resistant, so they turn this dimension “on” (red). Footballs and soccer balls are more susceptible to fire, so their fire-resistance is low and they report “off” (blue). And so on with each new block, or dimension, introduced.

As FT says: “A pair of words like sea and ocean, for example, may not be used in identical contexts (‘all at ocean’ isn’t a direct substitute for ‘all at sea’), but their meanings are close to each other, and embeddings allow us to quantify that closeness.”

As you can see, sea and ocean look pretty similar (they’re red in similar blocks and blue in similar blocks), but they’re not identical.

FT continues:

A word embedding can have hundreds of values, each representing a different aspect of a word’s meaning. Just as you might describe a house by its characteristics — type, location, bedrooms, bathrooms, storeys — the values in an embedding quantify a word’s linguistic features.

The way these characteristics are derived means we don’t know exactly what each value represents, but words we expect to be used in comparable ways often have similar-looking embeddings.

FT, Source

Here’s an example of how we could view embeddings if they were plotted in a 2-D space:

As FT says: “We might spot clusters of pronouns [in dark red], or modes of transportation [in blue]… being able to quantify words in this way is the first step in a model generating text.”
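The “blocks” intuition above can be made concrete with a quick sketch. Below, each position in a vector is one dimension, values near 1.0 are “on” (red) and values near 0.0 are “off” (blue), and cosine similarity quantifies how alike two words are (all numbers are invented for illustration):

```python
import math

# Made-up "block" vectors: each position is one dimension/concept.
sea      = [0.9, 0.8, 0.1, 0.2]
ocean    = [0.8, 0.9, 0.2, 0.1]
football = [0.1, 0.2, 0.9, 0.8]

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, 0.0 unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

print(cosine(sea, ocean))     # close to 1.0: "on"/"off" in similar blocks
print(cosine(sea, football))  # much lower: different blocks are "on"
```

This is exactly the quantified closeness FT describes: sea and ocean aren’t identical, but their vectors point in nearly the same direction.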

What can you do with that?

As you saw above, you can plot these words in vector space. Once you do that, you can search for similarity between words/concepts. Let’s run through another example: imagine you have plotted the vectors for ‘Man’ and ‘Woman’ in vector space. Now add the vector for ‘English Royalty’ (displayed in purple) to each of them, giving ‘King’ and ‘Queen’. The difference between ‘King’ and ‘Queen’ should be similar to the difference between ‘Man’ and ‘Woman’, because the shared ‘English Royalty’ offset cancels out. In other words, (Man + English Royalty) - (Woman + English Royalty) = Man - Woman.

This is a simplification of vector space, of course.
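That arithmetic can be checked directly with toy vectors (2-D and entirely made up, purely to show the offsets cancel):

```python
# Made-up 2-D vectors, just to demonstrate the arithmetic.
man   = [1.0, 0.0]
woman = [0.0, 1.0]
english_royalty = [0.5, 0.5]  # a hypothetical "royalty" direction

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

king  = add(man, english_royalty)    # Man + English Royalty
queen = add(woman, english_royalty)  # Woman + English Royalty

# King - Queen equals Man - Woman: the royalty offset cancels out.
print(subtract(king, queen) == subtract(man, woman))  # → True
```

In real embedding spaces the relationship is approximate rather than exact, but famous word2vec-style analogies (king - man + woman ≈ queen) rest on this same idea.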

Okay why is that useful?

Here are a few use cases where information-rich embeddings are helpful:

  • Classification for product recommendations and filtering

  • Spam and content moderation

  • Sentiment analysis to determine customer satisfaction

  • Customer support such as ticket management

(Many of these examples will just be extensions and applications of an accurate and fast classifier.)

Classification:

Vector embeddings are dense vectors — they contain a lot of information. That means we can do things like:

  • Assess the similarity between vectors (which can represent words, sentences or full paragraphs, etc.)

  • Calculate the distance between vectors

As noted above, classification can help you do things like monitor content for anomalies or service violations. Maybe you’ve identified a no-go list of products for your marketplace (such as illegal items). You can classify how close new posts are to those no-go items and block those posts from ever reaching your customers (i.e. you can calculate the distance between two items in vector space).
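Here’s a minimal sketch of that no-go check (the item names, vectors, and threshold below are all hypothetical; in practice the vectors would come from an embedding model and the threshold would be tuned on labeled data):

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings for banned listings on the marketplace.
no_go_items = {
    "counterfeit watch": [0.9, 0.1, 0.8],
    "stolen phone":      [0.8, 0.2, 0.9],
}
THRESHOLD = 0.5  # tune on labeled examples

def should_block(post_vector):
    """Block a new post if it lands too close to any banned item."""
    return any(distance(post_vector, v) < THRESHOLD
               for v in no_go_items.values())

print(should_block([0.85, 0.15, 0.85]))  # near the no-go cluster → True
print(should_block([0.1, 0.9, 0.1]))     # far away → False
```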

Maybe you’re curating a list of recommended books, and you want to see which top 10 books are most similar to other books the user has read (see more below in the recommendation section).

Maybe you need a ticket management system to direct customer tickets between the team that fixes bugs, the team that builds new features, and the team that conducts upsells.

Classification will help in all of these settings.

Recommendation:

Imagine you’re building the recommendation algorithm at YouTube. You’ll want your future recommendations to learn from previous videos users have enjoyed. Maybe they watched the whole video. Maybe they’ve watched some videos multiple times. Once you’ve identified which videos User A enjoys, you can place those videos in vector space and search for other videos that are clustered in a similar area. If User A likes funny cat videos, maybe they’ll enjoy funny kitten videos too.
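A stripped-down version of that lookup (titles and vectors invented for illustration) is just “rank the rest of the catalog by similarity to something the user liked”:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Hypothetical video embeddings.
catalog = {
    "funny cats":    [0.9, 0.1],
    "funny kittens": [0.8, 0.2],
    "chess opening": [0.1, 0.9],
}
liked = catalog["funny cats"]  # a video User A enjoyed

# Rank every other video by how close it sits to the liked one.
recommendations = sorted(
    (title for title in catalog if title != "funny cats"),
    key=lambda title: cosine(liked, catalog[title]),
    reverse=True,
)
print(recommendations)  # → ['funny kittens', 'chess opening']
```

At YouTube scale you’d use an approximate nearest-neighbor index instead of sorting the whole catalog, but the ranking principle is the same.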

There are many other use cases too, and we’ve just named a few here.

What are some common embedding models?

  • Google Research’s word2vec

  • GloVe (global vectors for word representations)

  • BERT (bidirectional encoder representations from transformers)

We won’t get into these papers in today’s post, but if you want me to take a deeper dive on any of these papers, drop me a line and let me know!

What are some ways you can transform words into numbers?

  • One-hot encoding

    • Each object represents a single dimension. So if you have 100 words, you have 100 dimensions.

    • Downside: there is no way to relate concepts to one another. bag and dog are as similar as cat and dog in this vector space.

    • Ideally, you want a vector space that lets you draw similarities between concepts, which unlocks classification and the other applications we’ve discussed above.

  • Count-based approach (bag-of-words, n-gram approach, TF-IDF)

    • Downside: these methods don’t capture semantic meaning as well as vector embeddings.

    • To learn more about these options I would check out this video [timestamp: 1:30-2:53].

  • Vector embeddings

    • The topic of today’s post! :)

    • The quality of embeddings will differ depending on how they were produced. Finding one that understands your semantic space well matters.

    • Ideally, you’d find an information-rich set of embeddings. It’s said GPT-3’s embeddings have 12,288 dimensions, while GPT-1 only had 768 (source). That’s a 16x difference, meaning GPT-3’s embeddings can capture a whole lot more implicit information about words/concepts than GPT-1’s.

      • For example, “If a model learns something about the relationship between Paris and France (for example, they share a language), there’s a good chance that the same will be true for Berlin and Germany…”

      • Only information-rich embeddings might capture more nuanced similarities like these.
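The first two schemes in the list above are simple enough to sketch directly, and the sketch also shows the one-hot downside (every pair of words is equally dissimilar):

```python
# Sketch of two simple word-to-number schemes with a tiny vocabulary.
vocab = ["cat", "dog", "bag", "sea"]

def one_hot(word):
    """One dimension per vocabulary word; no notion of similarity."""
    return [1 if w == word else 0 for w in vocab]

def bag_of_words(tokens):
    """Count how often each vocabulary word appears in a text."""
    return [tokens.count(w) for w in vocab]

print(one_hot("dog"))                       # → [0, 1, 0, 0]
print(bag_of_words(["cat", "dog", "cat"]))  # → [2, 1, 0, 0]

# The one-hot downside: "cat" is exactly as far from "dog" as from
# "bag" -- every pair of distinct words differs in the same two slots.
```

Dense vector embeddings fix exactly this: related words end up near each other instead of all being mutually orthogonal.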

Want the TLDR? 

Here’s one that focuses on OpenAI’s embedding model:

What’s a use case?

Further reading:

  • Google’s word2vec paper (Link)

  • More reading on optimal embedding size (Link)

  • A quick and accessible YouTube video on embeddings from AssemblyAI (Link)

  • Quick read from Feature Form (Link)

  • Great visual explainer from FT (Link)

🗞️ News

Some slow news, but some goodies:

  • Aug 2023: State of AI by Nathan Benaich (Link)

🎁 Miscellaneous

DALL-E 3 vs Midjourney: side-by-side comparisons (Link)

Create your own AI hip-hop album, from Sprite (Link). Here’s mine (lol):

  • Very excited about the DALL-E 3 release (Link)

  • Books by vibe (Link)

  • Midjourney showcase (Link) - I subscribe to the MJ mag btw! Very good 👌

  • AI town by a16z (Link)

  • Food for thought on prompt engineering libraries (Link)

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI.
If you enjoyed this post, feel free to
share it with any AI-curious friends. Cheers!