Embedding Models Explained: Performance and Usage
What are embeddings and how do they work? Let's dive in.
Hi everyone!
To start off: sorry for the lack of posts in August and September! In August I was traveling (pictured below) and in September I was exploring something new in the AI space (more to come soon). Time flies in the blink of an eye!
Anyway, today I'm sharing a post I've been meaning to cover for a while now: embedding models. It'll be a fun one with a TLDR and rough-drawn Figma sketches. Let's dive in!
But first...
My version of a summer escape vs DALL-E's version:
Mine
DALL-E's
Truly heavenly
Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.
Concepts & Learning
Want the TLDR? Skip to the bottom of this section.
Want the longer version? Read on :)
What are embedding models?
First of all, why do we need embedding models? The short answer is: machine learning models know how to interpret numbers, not words. So you need a way of translating natural language text (aka words) into numbers or vectors.
You can turn a single word, or even a part of a word (a token), into a number/vector, but you can also turn whole sentences, whole paragraphs, or blocks of text into vectors. The more information-rich your vector is, the better it will perform on ML tasks.
How can you turn whole paragraphs or images into a single vector?
Letâs try out this visual example:
Starting with the first image on the left, you can assign a color number/label to each pixel, as seen in the middle image. Once each pixel is labeled, the whole image becomes a grid of color labels, as seen in the second and third images. You can flatten that grid into a single list of numbers, which can now be called a vector. Voilà! You now have a vector for this image.
If you do the same pixel-by-pixel color labeling for thousands of images, you'll have thousands of vectors you can plot in space. You can then search that space for vectors that cluster near the original image's vector. In theory, images that cluster together should also look visually similar, or be semantically similar along some other dimension.
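To make this concrete, here is a toy Python sketch of the idea: flatten a tiny grid of pixel color labels into a vector, then compare images by distance. The 2x2 "images" and their labels are invented for illustration; real embeddings are learned by a model, not hand-assigned.

```python
import math

# Tiny 2x2 "images": each cell holds an invented color label.
image_a = [[1, 1], [2, 3]]   # original image
image_b = [[1, 1], [2, 4]]   # almost identical image
image_c = [[9, 8], [7, 6]]   # very different image

def to_vector(image):
    """Flatten the grid of pixel labels into one flat list: our 'vector'."""
    return [label for row in image for label in row]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

va, vb, vc = (to_vector(img) for img in (image_a, image_b, image_c))

# Similar-looking images end up closer together in vector space:
assert distance(va, vb) < distance(va, vc)
```

The same flatten-then-compare pattern works for any grid of numbers; the hard part, which models solve, is producing labels that actually capture meaning.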
Hereâs another way of thinking about vector space with words:
Imagine you have a way of encoding words into vectors that can be plotted in vector space.
You could imagine the vectors looking like this for different words (visualized with colors):
One way of interpreting the above image is that for particular blocks, some are closer to "on" (red) than to "off" (blue). You can assume that each block represents a dimension in vector space, and each block might represent a different concept (color, size, mobility, flammability, etc.).
Let's say block 1 represents fire-resistance. Sea and ocean are pretty fire-resistant, so they turn this dimension "on", or red. Footballs and soccer balls may be more susceptible to fire, so their fire-resistance is low: they report "off" for fire-resistance, or blue. And so on and so forth with each block, or new dimension, introduced.
As FT says: "A pair of words like sea and ocean, for example, may not be used in identical contexts ('all at ocean' isn't a direct substitute for 'all at sea'), but their meanings are close to each other, and embeddings allow us to quantify that closeness."
As you can see, sea and ocean look pretty similar (they're red in similar blocks and blue in similar blocks), but they're not identical.
FT continues:
A word embedding can have hundreds of values, each representing a different aspect of a word's meaning. Just as you might describe a house by its characteristics (type, location, bedrooms, bathrooms, storeys), the values in an embedding quantify a word's linguistic features.
The way these characteristics are derived means we don't know exactly what each value represents, but words we expect to be used in comparable ways often have similar-looking embeddings.
Here's an example of how we could view embeddings if they were plotted in a 2-D space:
As FT says: "We might spot clusters of pronouns [in dark red], or modes of transportation [in blue]... being able to quantify words in this way is the first step in a model generating text."
What can you do with that?
As you saw above, you can plot these words in vector space. Once you do that, you can search for similarity between words/concepts. Let's run through another example: imagine you have plotted the words "Man" and "Woman" in vector space, and you add the vector for "English Royalty" (displayed in purple) to each of them. The difference between "Man" and "Woman" should be similar to the difference between "King" and "Queen", and the difference between "Man" and "King" should be similar to the difference between "Woman" and "Queen". In other words, King - Queen = (Man + English Royalty) - (Woman + English Royalty) = Man - Woman.
This is a simplification of vector space, of course.
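The arithmetic above can be sketched with toy vectors. The three dimensions and their values here are made up for illustration; real embeddings have hundreds of learned dimensions with no human-readable labels.

```python
def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Element-wise vector subtraction."""
    return [a - b for a, b in zip(u, v)]

# Invented 3-D embeddings; imagine dims [masculine, feminine, royal].
man     = [0.9, 0.1, 0.0]
woman   = [0.1, 0.9, 0.0]
royalty = [0.0, 0.0, 1.0]   # the "English Royalty" offset

king  = add(man, royalty)    # Man + English Royalty
queen = add(woman, royalty)  # Woman + English Royalty

# The offsets line up: King - Man and Queen - Woman are both the royalty vector,
assert sub(king, man) == sub(queen, woman) == royalty
# ...and King - Queen equals Man - Woman.
assert sub(king, queen) == sub(man, woman)
```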
Okay, why is that useful?
Here are a few use cases where information-rich embeddings are helpful:
Classification for product recommendations and filtering
Spam and content moderation
Sentiment analysis to determine customer satisfaction
Customer support such as ticket management
(Many of these examples will just be extensions and applications of an accurate and fast classifier.)
Classification:
Vector embeddings are dense vectors: they contain a lot of information. That means we can do things like:
Assess the similarity between vectors (which can represent words, sentences or full paragraphs, etc.)
Calculate the distance between vectors
As noted above, classification can help you do things like monitor content for anomalies or service violations. Maybe you've identified a no-go list of products for your marketplace (such as illegal items). You can calculate how close new posts are to those no-go items and block them from ever reaching your customers (i.e. you can compute the distance between two items in vector space).
Maybe you're curating a list of recommended books, and you want to see which top 10 books are most similar to other books the user has read (see more below in the recommendation section).
Maybe you need a ticket management system to direct customer tickets between the team that fixes bugs, the team that builds new features, and the team that conducts upsells.
Classification will help in all of these settings.
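As a sketch of the no-go-list idea, here is a hypothetical distance-threshold classifier. The item embeddings and the threshold are invented for illustration; in practice you would use embeddings from a real model and tune the threshold on labeled data.

```python
import math

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 2-D embeddings for items on the marketplace's no-go list.
no_go = {
    "counterfeit watch": [0.9, 0.8],
    "stolen phone": [0.85, 0.9],
}

def is_blocked(post_vec, threshold=0.3):
    """Block a post if its embedding sits within `threshold` of any no-go item."""
    return any(distance(post_vec, v) <= threshold for v in no_go.values())

# A listing embedded near "counterfeit watch" gets blocked; a distant one passes.
assert is_blocked([0.88, 0.82])
assert not is_blocked([0.1, 0.05])
```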
Recommendation:
Imagine you're building the recommendation algorithm at YouTube. You'll want your future recommendations to learn from the videos users have previously enjoyed. Maybe they watched the whole video. Maybe they've watched some videos multiple times. Once you've identified which videos User A enjoys, you can place those videos in vector space and search for other videos clustered in a similar area. If User A likes funny cat videos, maybe they'll enjoy funny kitten videos too.
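A minimal sketch of that nearest-neighbor idea, using cosine similarity and invented video embeddings (real systems rank millions of candidates with approximate-nearest-neighbor search, not a full sort):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embedding summarizing the videos User A has enjoyed so far.
user_taste = [0.9, 0.1, 0.2]

# Hypothetical embeddings for candidate videos.
catalog = {
    "funny kitten compilation": [0.85, 0.15, 0.25],
    "lecture on tax law":       [0.05, 0.9, 0.1],
    "cat fails 2023":           [0.8, 0.2, 0.3],
}

# Rank candidates by similarity to the user's taste vector.
ranked = sorted(catalog, key=lambda t: cosine(user_taste, catalog[t]), reverse=True)
```

With these toy numbers, the cat-themed videos rank ahead of the tax-law lecture.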
There are many other use cases too; we've just named a few here.
What are some common embedding models?
We won't get into these papers in today's post, but if you want me to take a deeper dive on any of them, drop me a line and let me know!
What are some ways you can transform words into numbers?
One-hot encoding
Each object represents a single dimension. So if you have 100 words, you have 100 dimensions.
Downside: there is no way to relate concepts to one another. bag and dog are as similar as cat and dog in this vector space.
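A quick sketch of why one-hot vectors can't express similarity, using an invented three-word vocabulary:

```python
import math

vocab = ["bag", "cat", "dog"]   # invented 3-word vocabulary

def one_hot(word):
    """Each word gets its own dimension: a 1 in its slot, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Every pair of distinct words is exactly the same distance apart,
# so "bag" is as close to "dog" as "cat" is: no notion of similarity.
assert distance(one_hot("bag"), one_hot("dog")) == distance(one_hot("cat"), one_hot("dog"))
```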
Ideally, you want a vector space that allows you to draw similarities between concepts, unlocking things like classification and the other applications we've discussed above.
Count-based approach (bag-of-words, n-gram approach, TF-IDF)
Downside: these methods don't capture semantic meaning as well as vector embeddings.
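For reference, a minimal bag-of-words encoder looks like this (toy sentences; real pipelines would also normalize text and weight terms, e.g. with TF-IDF):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Shared vocabulary across all documents, in a fixed alphabetical order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Represent a document by how often each vocabulary word appears."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
# Word order and meaning are gone; only frequencies remain.
```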
To learn more about these options I would check out this video [timestamp: 1:30-2:53].
Vector embeddings
The topic of today's post! :)
The quality of embeddings will differ depending on how they were produced. Finding one that understands your semantic space well matters.
Ideally, you'd find an information-rich set of embeddings. It's said GPT-3's embeddings have 12,288 dimensions, while GPT-1's only had 768 (source). That's a 16x difference, meaning GPT-3's embeddings can capture a whole lot more implicit information about words/concepts than GPT-1's.
For example, "If a model learns something about the relationship between Paris and France (for example, they share a language), there's a good chance that the same will be true for Berlin and Germany..."
Information-rich embeddings are more likely to capture nuanced similarities like these.
Want the TLDR?
Here's one that focuses on OpenAI's embedding models:
What's a use case?
Further reading:
News
Some slow news, but some goodies:
Aug 2023: State of AI by Nathan Benaich (Link)
Miscellaneous
DALL-E 3 vs Midjourney: side-by-side comparisons (Link)
Create your own AI hip-hop album, from Sprite (Link). Here's mine (lol):
That's it! Have a great day and see you next week!
What did you think about today's newsletter? Send me a DM on Twitter @barralexandra or reply to this email!
Thanks for reading Superfast AI.
If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!