In the “*Deep Learning bits*” series, we don’t see how to use deep learning to solve complex problems end-to-end as we do in *A.I. Odyssey*. We rather look at different techniques, along with some **examples and applications**. Don’t forget to check out *Deep Learning bits #1*!

If you like Artificial Intelligence, make sure to subscribe to the newsletter to receive updates on articles and much more!

### Introduction

Last time, we saw what autoencoders are and how they work. Today, we will see how they can help us **visualize the data** in some *very* cool ways. For that, we will work on images, using the Convolutional Autoencoder architecture (*CAE*).

#### What’s the latent space again?

An autoencoder is made of two components; here’s a quick reminder. The **encoder** brings the data from a high dimensional input down to a **bottleneck** layer, where the number of neurons is the smallest. Then, the **decoder** takes this encoded input and converts it back to the original input shape, in our case an image. The **latent space** is the space in which the data lies in the bottleneck layer.

The latent space contains a **compressed** representation of the image, which is **the only information** the decoder is allowed to use to try to reconstruct the input **as faithfully as possible**. To perform well, the network has to learn to extract the **most relevant** features in the bottleneck.
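To make this concrete, here’s a minimal sketch of such a convolutional autoencoder in Keras. The only detail taken from this post is the **32-unit** bottleneck; every layer size and hyperparameter is an illustrative assumption, not the exact architecture behind the figures.

```python
# A minimal convolutional autoencoder sketch in Keras. Only the 32-unit
# bottleneck comes from the post; all other layer sizes are assumptions.
from tensorflow.keras import layers, Model

# Encoder: 28x28x1 image -> 32-dim latent vector (the bottleneck)
enc_in = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(enc_in)  # 14x14
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 7x7
x = layers.Flatten()(x)
enc_out = layers.Dense(32, name="bottleneck")(x)
encoder = Model(enc_in, enc_out, name="encoder")

# Decoder: 32-dim latent vector -> reconstructed 28x28x1 image
dec_in = layers.Input(shape=(32,))
x = layers.Dense(7 * 7 * 32, activation="relu")(dec_in)
x = layers.Reshape((7, 7, 32))(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
dec_out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
decoder = Model(dec_in, dec_out, name="decoder")

# The full autoencoder chains the two; keeping them separate also lets us
# encode and decode independently later in this post.
autoencoder = Model(enc_in, decoder(encoder(enc_in)))
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```

Building the encoder and decoder as separate models pays off below, when we want to manipulate latent vectors directly.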

*Let’s see what we can do!*

### The dataset

We’ll change from the datasets of last time. Instead of looking at my eyes or blue squares, we will work on probably the *most famous dataset in computer vision*: MNIST, a collection of *handwritten digits*. I usually prefer to work with **less conventional** datasets just for diversity, but MNIST is **really convenient** for what we will do today.

**Note:** Although MNIST visualizations are *pretty common* on the internet, the images in this post are 100% generated **from the code**, so you can use these techniques with your own models.

### Baseline — Performance of the autoencoder

To understand what kind of features the encoder is capable of extracting from the inputs, we can first look at **reconstructed images**. If this **sounds familiar**, it’s normal: we already did that last time. However, this step is **necessary** because it sets the baseline for our *expectations* of the model.

**Note:** For this post, the bottleneck layer has only **32 units**, which is some *really, really* brutal dimensionality reduction. If it were an image, it **wouldn’t even be 6×6** pixels.

We can see that the autoencoder **successfully** reconstructs the digits. The **reconstruction is blurry** because the input is **compressed** at the bottleneck layer. The reason we need to take a look at *validation samples* is to be sure we are not *overfitting* the training set.
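In code, training the model and plotting such side-by-side comparisons could look like the following sketch, reusing the `autoencoder` defined above (epochs and batch size are arbitrary choices):

```python
# Sketch: train on MNIST and plot validation inputs above their
# reconstructions; epochs and batch size are arbitrary choices.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

(x_train, _), (x_val, y_val) = mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_val = x_val[..., None].astype("float32") / 255.0

autoencoder.fit(x_train, x_train, epochs=20, batch_size=128,
                validation_data=(x_val, x_val))

recon = autoencoder.predict(x_val[:8])
fig, axes = plt.subplots(2, 8, figsize=(12, 3))
for i in range(8):
    axes[0, i].imshow(x_val[i].squeeze(), cmap="gray")   # original
    axes[1, i].imshow(recon[i].squeeze(), cmap="gray")   # reconstruction
    axes[0, i].axis("off"); axes[1, i].axis("off")
plt.show()
```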

**Bonus:** here’s the training process animation, with **training** (left) and **validation** (right) samples at each step.

### t-SNE visualization

#### What’s t-SNE?

The first thing we want to do when working with a dataset is to **visualize** the data in a *meaningful* way. In our case, the **image** *(or pixel)* **space** has 784 dimensions (28×28×1), and we clearly *cannot* plot that. The challenge is to squeeze all this dimensionality into something we can grasp, in *2D* or *3D*.

Here comes t-SNE, an algorithm that maps a **high dimensional space** to a **2D or 3D space** while trying to **preserve the neighborhood structure**: points that are close in the original space should stay close in the projection. We will use this technique to plot embeddings of our dataset, *first* directly from the **image space**, and *then* from the **smaller latent space**.

*Note: t-SNE is better suited for visualization than its cousins **PCA** and **ICA**.*

#### Projecting the pixel space

Let’s start by plotting the t-SNE embedding of our dataset (from image space) and see what it looks like.
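Here’s a sketch of how such an embedding can be computed with scikit-learn’s `TSNE`, reusing the `x_val` and `y_val` arrays loaded earlier (the subset size is an arbitrary choice, since t-SNE is slow on large sets):

```python
# Sketch: 2D t-SNE embedding of raw 784-dim pixel vectors (scikit-learn),
# on a 5,000-sample subset since t-SNE scales poorly with dataset size.
from sklearn.manifold import TSNE

n = 5000
pixels = x_val[:n].reshape(n, -1)                  # (n, 784) pixel vectors
emb = TSNE(n_components=2, init="pca").fit_transform(pixels)

plt.figure(figsize=(8, 8))
plt.scatter(emb[:, 0], emb[:, 1], c=y_val[:n], cmap="tab10", s=4)
plt.colorbar(label="digit")
plt.show()
```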

*t-SNE embedding of **image space** representations from the validation set*

We can already see that some numbers are *roughly* **clustered** together. That’s because the dataset is really *simple*, and we can use simple *heuristics* on pixels to classify the samples. Look how there’s no clean cluster for the digits **8, 5, 7 and 3**: that’s because they are all made of the **same pixels**, and only minor changes differentiate them.

*On more complex data, such as **RGB images**, the only **clusters** would be of images of the **same general color**.*

#### Projecting the latent space

We know that the *latent space* contains a **simpler representation** of our images than the pixel space, so we can hope that t-SNE will give us an interesting **2D projection of the latent space**.
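The sketch is the same as for the pixel space; the only change is that we embed the **32-dimensional** encoder outputs instead of the 784-dimensional pixel vectors:

```python
# Sketch: same t-SNE, but on the 32-dim latent vectors from the encoder
latents = encoder.predict(x_val[:n])               # (n, 32) latent vectors
emb_latent = TSNE(n_components=2, init="pca").fit_transform(latents)

plt.figure(figsize=(8, 8))
plt.scatter(emb_latent[:, 0], emb_latent[:, 1], c=y_val[:n], cmap="tab10", s=4)
plt.colorbar(label="digit")
plt.show()
```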

*t-SNE embedding of **latent space** representations from the validation set*

Although *not perfect*, the projection shows **denser** clusters. This shows that in the latent space, the same digits are close to one another. We can see that the digits **8, 7, 5 and 3** are now easier to distinguish, and appear in *small* clusters.

### Interpolation

Now that we know what **level of detail** the model is capable of extracting, we can *probe* the structure of the latent space. To do that, we will compare how **interpolation** looks in the *image space*, versus *latent space*.

#### Linear interpolation in image space

We start off by taking **two images from the dataset**, and linearly interpolating between them. Effectively, this *blends* the images in a kind of **ghostly** way.
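A sketch of this blend, reusing the validation images from earlier (the two digit indices are arbitrary picks):

```python
# Sketch: blend two digits directly in pixel space; the image indices
# are arbitrary picks from the validation set.
a, b = x_val[0], x_val[1]
alphas = np.linspace(0.0, 1.0, 10)
blends = np.stack([(1 - t) * a + t * b for t in alphas])

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, img in zip(axes, blends):
    ax.imshow(img.squeeze(), cmap="gray")
    ax.axis("off")
plt.show()
```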

*Linear interpolation in **pixel space***

The reason for this messy transition is the **structure of the pixel space itself**. It’s simply not possible to go smoothly from one image to another in the image space. This is the reason why blending the image of *an empty glass* and the image of a *full glass* will not give the image of a *half-full glass*.

#### Linear interpolation in latent space

Now, let’s do the same in the latent space. We take the same start and end images and **feed them to the encoder** to obtain their *latent space representations*. We then interpolate between the two latent vectors, and feed the intermediate vectors to the **decoder**.
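The corresponding sketch, reusing the same two digits and interpolation weights, but going through the `encoder` and `decoder` models from earlier:

```python
# Sketch: encode both digits, interpolate between their latent vectors,
# and decode each intermediate vector back into an image.
za, zb = encoder.predict(np.stack([a, b]))
z_path = np.stack([(1 - t) * za + t * zb for t in alphas])
decoded = decoder.predict(z_path)

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, img in zip(axes, decoded):
    ax.imshow(img.squeeze(), cmap="gray")
    ax.axis("off")
plt.show()
```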

*Linear interpolation in **latent space***

The result is much **more convincing**. Instead of having a *fading overlay* of the two digits, we clearly see the shape slowly *transform* from one to the other. This shows how well the latent space **understands the structure** of the images.

**Bonus:** here are a few animations of the interpolation in both spaces.

### More techniques & examples

#### Interpolation examples

On **richer** datasets, and with a **better** model, we can get *incredible* visuals.

*Latent space interpolation for **faces***

*Latent space interpolation for **3D shapes***

#### Latent space arithmetic

We can also do **arithmetic** in the latent space. This means that instead of **interpolating**, we can **add or subtract** latent space representations.

*For example with faces, man with glasses – man without glasses + woman without glasses = woman with glasses.* This technique gives mind-blowing results.

*Latent space arithmetic for **3D shapes***

**Note:** I’ve put a function for that in the code, but it looks terrible on MNIST.
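For reference, here’s a hypothetical sketch of what such a function could look like, built on the `encoder` and `decoder` from earlier (the helper name and the input indices are my own illustrative choices, not the function from the post’s code):

```python
# Hypothetical sketch of latent space arithmetic: decode
# encode(a) - encode(b) + encode(c). As noted above, the result is
# underwhelming on MNIST; the glasses example comes from face models.
def latent_arithmetic(img_a, img_b, img_c):
    za, zb, zc = encoder.predict(np.stack([img_a, img_b, img_c]))
    return decoder.predict((za - zb + zc)[None])[0]

result = latent_arithmetic(x_val[0], x_val[1], x_val[2])
plt.imshow(result.squeeze(), cmap="gray")
plt.axis("off")
plt.show()
```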

### Conclusions

In this post, we have seen several techniques to visualize the **learned** features *embedded* in the latent space of an autoencoder neural network. These visualizations help understand *what* the network is learning. From there, we can exploit the latent space for **clustering**, *compression*, and many other applications.

If you like Artificial Intelligence, make sure to subscribe to the newsletter to receive updates on articles and much more!

You can play with the code over there:

Thanks for reading this post, stay tuned for more!
