
What is a Convolutional Neural Network? Google's StyleDrop and Poe's Prompt Engineering

Today we’re going to break down what CNNs are, how they work, and what applications we’re seeing them in so far.

Hi everyone 👋

Today we’ll dive into Convolutional Neural Networks, Google's StyleDrop and Poe's image prompt engineering. Let me know if you’re liking the new deep-dive format; I’m happy to experiment and would love to hear your feedback. Reply here or drop me a line to let me know.

Okay, we have a lot to cover in this one — let’s dive in.

DALL-E on a portal to another dimension

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

🗞️ News

There’s so much AI news these days it’s hard to keep up! Jim Fan compiled a great list of news this week, which I highly recommend checking out below:

I’m hoping to do a deeper dive on these research projects in a future post, but as a start, I wanted to share some links to their work:

  • DeepMind’s work on novel AI risks (Link)

  • Meta: Less is more for alignment (LIMA) (Link)

  • OpenAI: Mathematical reasoning in LMs (Link)

    • Previous research by OpenAI on Mathematical Reasoning (Link)

  • Anthropic: A preview of their Interpretability research (Link)

📚 Concepts & Learning

Convolutional Neural Networks (CNNs) are a popular neural network architecture for machine learning models. Today we’re going to break down what CNNs are, how they work, and what applications we’re seeing them in so far. The most intuitive way to think about CNNs is how they apply to image recognition, so that will drive a lot of the examples in today’s section. Let’s dive in.

What is a Convolutional Neural Network?

Imagine you're trying to understand an image. You look at different parts of the image and notice specific patterns, like edges or textures, that help you recognize objects. CNNs work in a similar way by analyzing images and finding important features. A CNN consists of an input layer, one or more hidden layers, and an output layer. The hidden layers are where most of the model’s work happens, so we’ll dive into how they work. A CNN typically has three types of layers: convolutional layers, pooling layers, and fully connected layers.

The Convolutional Layer

The convolutional layer is the core building block of a CNN. It contains a set of filters (also known as kernels) whose parameters are learned throughout training. The filters are usually smaller than the actual image; in the examples below, the filter is 3×3. The goal of this layer is to create an activation map, which results from sliding the filter across the input and, at each position, performing point-wise multiplication between the filter and the patch of input it covers, then summing the results (simple version: 1:1 multiplication between matrices, added up into a single number). For example, if the filter is designed to detect horizontal edges, it will have high activations where there are horizontal edges in the image, and low activations elsewhere.

❓ Why is the activation map important?

The activation map shows how much the filter matches with the image, and where those matches occur.

Imagine you have a stencil of a turtle and an image that contains multiple turtles, but some areas are just plain sand.

To determine if there are turtles in the image, you place the turtle stencil in different positions and slide it across the image. As you move the stencil, you focus on a small portion of the image at a time.

When you position the stencil over a turtle in the image, the pattern on the stencil matches the shape and color of the turtle perfectly. This creates a strong activation or response, indicating that a turtle is present in that area of the image.

However, when you place the stencil over a section of plain sand in the image, there is no match between the stencil and the sand. The activation or response in that area is weak or low because it doesn't resemble a turtle.

By repeating this process with different stencils, each with its own unique pattern, you can detect various features in the image. These features may include the edges, textures, or colors that are significant for recognizing turtles.

In a convolutional neural network (CNN), the stencil represents a kernel or filter. The CNN applies multiple kernels to different parts of the image, just like you slide the turtle stencil across the image. By analyzing the responses or activations from these kernels, the CNN can learn to identify objects, detect patterns, and make predictions based on the features it extracts from the image.

In the example below, the activation map corresponds to the grey squares. The filter corresponds to the dark blue squares.

To visualize this transformation in numbers, let’s go through this example:

As you can see, a filter is applied to the matrix; this filter is conventionally called a kernel. In this example, the kernel looks like this:

Note the ×1 or ×0 in the bottom-right corner of each cell of the moving matrix

The resulting point-wise multiplication will look like this (I only included arrows to the four corners for simplicity, but the multiplication would apply to all boxes):

Yellow is the kernel, Green is the part of the image covered by the kernel at its first position (top left in the GIF above)

So the equation for this kernel would look like this (from top left to top right, then middle row, then bottom row):

Which you can see in the top left corner of the pink matrix below:

This is repeated for all sections of the image. The resulting convolved features are known as an activation or feature map. For a visual demonstration of the point-wise multiplication, check out the following:
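And if you’d rather trace the arithmetic in code, here’s a minimal NumPy sketch of the whole sliding-window operation. The 5×5 image and 3×3 kernel values are illustrative examples I chose, not necessarily the exact numbers in the figures above:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    the overlapping values point-wise and sum them into one number."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

# Illustrative 5x5 binary image and 3x3 kernel.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))
# The 3x3 activation map, with higher values where the kernel
# pattern lines up well with the underlying image patch:
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```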

The convolutional layer can have multiple filters, each one detecting a different feature. The output of the convolutional layer is a stack of feature maps, one for each filter. The feature maps are also called channels, because they can be seen as different views of the same image.

The convolutional layer has two main advantages over a fully connected layer. First, it reduces the number of parameters to be learned, because each filter is shared across the whole input. This makes the network more efficient and less prone to overfitting. Second, it preserves the spatial structure of the input, because each filter only operates on a local region of the input. This makes the network more sensitive to local features and more robust to where those features appear in the image.
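To make the parameter savings concrete, here’s a rough back-of-the-envelope comparison for a hypothetical 28×28 grayscale input (the sizes here are my own illustrative choices):

```python
# Fully connected: every one of the 28*28 input pixels connects
# to every one of the 28*28 output units.
fc_weights = (28 * 28) * (28 * 28)   # 614,656 weights

# Convolutional: a single 3x3 filter is shared across the image,
# no matter where it is applied.
conv_weights = 3 * 3                 # 9 weights

print(fc_weights, conv_weights)      # 614656 vs 9
```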

The Pooling Layer

The pooling layer is another type of layer that is often used after a convolutional layer. It helps reduce the size of the feature maps while keeping important information, which it does by performing a downsampling operation. Two common downsampling operations are taking the average and taking the maximum value of each region. The pooling layer is key to making the network faster and more robust to noise and small variations in the input.

One common type of pooling is called max pooling. It divides the feature map into smaller regions and keeps only the maximum value from each region. Check out this visual example:

The pooling layer has two main benefits. First, it reduces the computational cost and memory usage of the network, because it reduces the number of features to be processed by subsequent layers. Second, it introduces some translation invariance, because it makes the network less sensitive to the exact location of features in the input.
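Here’s the same idea as a minimal NumPy sketch: 2×2 max pooling with stride 2 over a small feature map (the values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Split the map into size x size regions and keep each region's maximum."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]  # drop ragged edges
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 0],
                        [7, 2, 9, 8],
                        [3, 1, 4, 2]])

print(max_pool(feature_map))
# [[6 4]
#  [7 9]]  -- a 4x4 map reduced to 2x2, keeping the strongest activations
```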

Repeating Convolutional and Pooling Layers

You may have noticed in visualizations of CNNs that there are several convolutional and pooling layers before getting to the flattened and fully connected layer. There are a few reasons why you’d want to repeat those two processes before moving on to the final classification.

  1. A bird’s eye view: As you move deeper into the network, each neuron in the later convolutional layers can capture information from a larger portion of the input image. For example, imagine the first layer is only looking for rounded curves, edges or points. The next layer might be looking for circles, squares or triangles… and so on. Combine this with other image features (like color or texture) and you can see how more robust features begin appearing once you combine all of this knowledge.

  2. Recognize objects anywhere in the image: since this process starts in small regions and grows to larger areas over time, the network is particularly good at recognizing objects no matter where they appear in the image. With varied enough training data, the learned features can also generalize to objects seen at different angles or orientations.

  3. Efficiency: Pooling layers help reduce the complexity of feature maps while preserving important information. This process reduces the computational complexity of the network. This process also tends to reduce the likelihood of overfitting.

Overall, the repetition of convolutional and pooling layers in a CNN enables the network to gradually learn and extract complex and abstract features from the input data.
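To see how these repeated blocks fit together end to end, here’s a minimal PyTorch sketch of the conv → pool → conv → pool → flatten → fully connected pattern. The layer sizes are illustrative choices, not a recommendation:

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 3-channel 32x32 image in, 10 classes out.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early features: edges, curves
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper features: shapes, textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),                                 # 32 channels * 8 * 8 = 2048 numbers
    nn.Linear(32 * 8 * 8, 10),                    # fully connected classifier
)

logits = model(torch.randn(1, 3, 32, 32))  # a batch of one random "image"
print(logits.shape)                        # torch.Size([1, 10])
```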

The Fully Connected Layer

While the convolutional and pooling layers usually map to smaller features of the image, the fully connected (FC) layer aims to compile all of the information from these individual layers into a final output.

Each neuron in the fully connected layer receives inputs from all the neurons in the previous layer. It combines these inputs using a weighted sum, where each input is multiplied by a corresponding weight. These weights represent the learned importance of each feature in making accurate predictions. The neuron also includes a bias term, which allows for additional flexibility in the decision-making process.

For example, let’s say you are trying to classify images of dogs. There might be nodes that correspond to the features of animals: nose, tail, fur, etc. For dogs, a “pointed nose”, “floppy ears” and “shaggy fur” are likely strong indicators that the image represents a dog. Compare that to the “round nose”, “pointed ears” and “striped fur” of a cat. Therefore, the nodes that correspond to dog-like features will have stronger weights in the final classification than the nodes that correspond to cat-like features (see the image below).

The purple connections are contributing the most information to the Dog classification.

After the weighted sum, an activation function is applied to introduce non-linearity into the network. The activation function determines the output of the neuron based on the weighted sum. It can help the network learn complex relationships between the features and the target output.

In the animal classification example, the fully connected layer neurons might learn to recognize specific patterns or combinations of features that are indicative of certain animal classes. For instance, certain combinations of fur texture, eye shape, and ear position might be strong indicators of a dog, while a different combination might indicate a cat.

After this, the outputs are passed through a softmax function. We’ll get into softmax in a future newsletter, but the short version is: softmax normalizes the outputs of the final layer into probabilities between 0 and 1 that sum to 1, so they are easy to interpret and to compare across models and networks.
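As a quick taste, here’s softmax in a few lines of NumPy, applied to some made-up raw scores:

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([3.2, 1.1, 0.4])  # hypothetical raw scores for [dog, cat, bird]
print(softmax(scores))              # ~[0.85, 0.10, 0.05], summing to 1
```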

The FC layer plays a crucial role in the decision-making process of a neural network. It combines the extracted features and transforms them into meaningful predictions or classifications. By learning the appropriate weights and biases during the training process, the fully connected layer enables the network to make accurate predictions on unseen data.

It's important to note that fully connected layers are not always present in all neural network architectures. For certain tasks, such as object detection or segmentation, fully connected layers may be replaced by other specialized layers that are better suited to handle spatial information.

So what are the components of the FC layer?

There is a flattening step and a connection step. First, the pooled feature maps are flattened into a single column of numbers:

Flattening step

The connection step follows flattening, and involves a matrix multiplication. The matrix in the example below will be a 4×5 (4 pink circles by 5 blue circles).

Connection step: an example for the first blue node only

The numbers in the pink circles are the activations of each node. The numbers above the purple connections to the blue circles correspond to weights — how important is the pink circle to the first blue node? The numbers in the pink circles will not change for each blue node, but the connections (represented in purple, red, yellow, green and black) will be different numbers for each blue circle.
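Here’s that connection step as a minimal NumPy sketch, using the 4-pink-by-5-blue shape described above with made-up activations and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.6, 0.1, 0.9, 0.3])  # made-up activations of the 4 pink nodes
W = rng.standard_normal((4, 5))     # 4x5 weight matrix: one column per blue node
b = np.zeros(5)                     # one bias per blue node

# Each blue node is a weighted sum of all four pink activations, plus a bias.
blue = x @ W + b
print(blue.shape)  # (5,) -- one value per blue node
```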

The fully connected layer is just a vanilla neural network layer, except that it benefits from the features learned by the convolutional layers.

Applications

The significance of CNNs lies in their ability to automatically learn and extract meaningful features from images. They are especially good at capturing spatial relationships, which makes them powerful for tasks like image classification, object detection, and image generation. Some applications of CNNs include:

  1. Image Classification: self-driving cars, facial recognition, and medical image analysis.

  2. Object Detection: autonomous vehicles and AR/VR applications.

  3. Medical Applications: classification of x-rays, MRIs, CT scans, and more for medical research and diagnosis.

  4. Robotics: object recognition, navigation and more.

  5. Generative AI: video and image captioning.

If you’re more of a visual learner and want to learn more about neural networks in general, I highly recommend 3Blue1Brown’s explanation here:

🎁 Miscellaneous

Stylized text-to-image generation from a single image with StyleDrop (Link)

Powered by Google Research and Muse:

Take your image prompts from broke 😔 to woke 🧐 with Poe (Link)

A great 101 crash course on AI in 2023

From an Applied Researcher at OpenAI:

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!