The due date for this codelet is Wednesday, Mar 26 at 11:59PM.
The aim of this codelet is to build on your PyTorch skills by implementing a sparse autoencoder on MNIST data. You will also gain hands-on practice building a custom loss function that mixes multiple factors. You should draw on your prior codelets and the class lecture on autoencoders.
Important: Only use PyTorch, numpy, pandas, matplotlib, PIL, and in-built Python for this codelet. The use of any other libraries, including the books d2l library results in an automatic unsatisfactory grade for this assignment. A core goal of this class is to build your competencies as a machine learning engineer. I want to minimize abstractions from other libraries so that you build these skills.
Your task is to:
codelet7.zip
from the course website and open
it. You will find these instructions, utils.py
, which
includes a basic plotting function, a folder figures
, which
includes sample visualizations prior to training, and
codelet7.py
, which has scaffolding for you.codelet7.py
and include a file called codelet7.pdf
with your answers to
the written questions.When assessing your work, satisfactory achievement is demonstrated, in part, by:
You will implement a sparse autoencoder using Pytorch. By the end, you will visualize your model’s reconstructed images and conduct some initial analysis of what the sparse features represent. You have three tasks, of varying difficulty:
To orient you to your problem, and to scaffold the later
visualization task, a plotting function, show
has been
provided with in utils.py
and imported into
codelet7.py
. When first running codelet7.py
,
the MNIST will be loaded and put in a PyTorch dataloader
(train_dataloader
). A transformation is applied to the
image to normalize the pixel values around 0 and to make them PyTorch
tensors. Below, I’ve provided a small bit of code that will plot 64
images, in 4 rows of 16 images, so that you can see the data. It uses
make_grid
from torchvision
which takes a list
of images and a desired number of images per row, and returns a grid
which we can pass to show
. You should add this code to
codelet7.py
and confirm you can generate plots.
= next(iter(train_dataloader))
batch = batch
images, _ = [image for image in images]
g = make_grid(g, 16)
grid show(grid)
In this codelet, the aim is to build a simple sparse autoencoder and
apply it to MNIST data. Some scaffolding is provided in
codelet7.py
. Your task is to complete the
__init__
, forward
,
batched_sparsity_penalty
, and loss_function
methods of the SparseAutoencoder
class. The sparse
autoencoder we are building is a simple encoder-decoder structure. The
encoder should be a torch Linear
layer mapping
inputDims
to hDims
which a
Sigmoid
activation function is applied to. The decoder
should be a torch Linear
layer mapping hDims
to inputDims
and applying a Tanh
activation
function. After creating the linear layers, you should use
init_weights
, which takes a linear layer, to initialize the
weights. We are using Xavier Initialization, which helps our model
converge faster and more consistently. A brief overview of this
initialization strategy, if you are curious, can be found here.
The sparseness of the sparse autoencoder is enforced by the use of
regularization, discussed in class. In
batched_sparsity_penalty
, you should calculate the
following penalty (which uses KL divergence):
\[\begin{equation} \sum_{j=0}^{|h|} \rho \textrm{log} \frac{\rho}{\hat{\rho_j}} + (1 - \rho) \textrm{log} \frac{1 - \rho}{1 - \hat{\rho_j}} \end{equation}\]
where \(\rho\) is \(0.05\), \(j\) ranges over the number of hidden nodes,
and \(\hat{\rho_j}\) is the average
activation of hidden node \(j\) for
input. That is, we want the average activation of a node to approach
\(0.05\) (i.e., it should selectively
“fire” for a subset of the input). Note that we have two edge cases that
pose a problem for this penalty, when \(\hat{\rho_j}\) is \(0\) and when it is \(1\) (because log is not defined for \(0\)). To avoid this, you should use
torch.clamp
to clamp the values between \(1e^{-8}\) and \(1
- 1e^{-8}\).
For example, given the following hidden representations of 5 inputs
(where the hidden representation has 3 nodes), we should get an output
of 1.1490
.
\[\begin{equation} \begin{pmatrix} 0.3 & 0. & 0.1 \\ 0.2 & 0. & 0.2 \\ 0.5 & 0. & 0.1 \\ 0.5 & 0. & 0.3 \\ 0.5 & 0. & 0.3 \\ \end{pmatrix} \end{equation}\]
Finally, the loss should combine mean squared error and the sparsity
penalty, in loss_function
. In particular you should add to
the MSE between the true and predicted pixel values to the sparsity
penalty, weighted by \(1e^{-4}\).
You evidence completion of this task with successfully training a model.
You should write code to generate two visualizations. One plots 16
rows of input images and their reconstructed output using your model.
Another visualizes the relationship between a given hidden node and
input. In particular, you should write code that will plot the 100 input
images that have the highest hidden activation for a given node. Each
row of your plot should have 10 images. Examples of both plots for a
model prior to training with 10 hidden units is provided in
figures
.
You evidence completion of this task with successfully training a model.
Finally, you should train a model for 20 epochs using the Adam
optimizer with a learning rate of 0.0001
. Your model should
have 10 hidden dimensions. For comparison purposes, below is my initial
and final loss after 20 epochs.
Epoch: 1/20 - Train Loss: 0.8029
Epoch: 20/20 - Train Loss: 0.5329
To evidence completion of this task, you should add to the pdf accompanying your code your initial and final loss values and the two visualization figures for your trained model. Additionally, you should look at a few nodes and comment on, at a high level, what you think is being learned as the features in the hidden representation.