These are notes for Stanford’s CS231N, taught by Fei-Fei Li, Ehsan Adeli, Justin Johnson, and Zane Durante in Fall 2025. I mainly used this to revise my knowledge.
_Course description:_ Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into the details of deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement and train their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. Additionally, the final assignment will give them the opportunity to train and apply multi-million parameter networks on real-world vision problems of their choice. Through multiple hands-on assignments and the final course project, students will acquire the toolset for setting up deep learning tasks and practical engineering tricks for training and fine-tuning deep neural networks.
Link: https://cs231n.stanford.edu/
## Lecture 1: Introduction
The first part was a brief history of computer vision and deep learning.
Evolution's Big Bang: the Cambrian Explosion, ~530-540 million years ago, when we see an explosion of animal species. A compelling theory: the onset of eyes (simple pinholes). Without sensors, life is just metabolism, very passive; with sensors, an organism becomes an agent in an environment it might want to change.
Where did we come from (better to watch the lecture: https://www.youtube.com/watch?v=2fq9wYslV0A&list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16&t=1s for the full picture)
![[Pasted image 20251015131037.png]]
2012 - Present: Deep Learning is everywhere (image classification, image retrieval, object detection, image segmentation, video classification, ... see slides: https://cs231n.stanford.edu/slides/2025/lecture_1_part_1.pdf#page=48.00)
Computer Vision: enabling machines to see and understand images.
CS231n overview
- Deep Learning basics (lectures 2-4)
- Perceiving and Understanding the Visual World (lectures 5-12): classification, semantic segmentation, object detection, instance segmentation, video classification, multimodal video understanding, visualization and understanding. Also models beyond MLPs (CNNs, attention) and large-scale distributed training (data parallelism, model parallelism, sync vs async gradient updates)
- Generative and Interactive Visual Intelligence (lectures 13-17): style transfer, text-to-image, diffusion models, VLMs, self-supervised learning
- Human-Centered Applications and Implications (lecture 18)
![[Pasted image 20251017150400.png]]
### Objectives
Formalize computer vision applications into tasks
- Formalize inputs and outputs for vision-related problems
- Understand what data and computational requirements you need to train a model
Develop and train vision models
- Learn to code, debug, and train CNNs
- Learn how to use PyTorch
Gain an understanding of where the field is and where it is headed
- What new research has come out in the last 0-5 years?
- What are the open research challenges?
## Lecture 2: Image classification with Linear Classifiers
Challenges with the semantic gap between how we perceive an image and how a machine perceives it (a tensor of integers): viewpoint variation (all pixels change), illumination, background clutter, occlusion, deformation, intraclass variation, context.
![[Pasted image 20251018171040.png]]
Nearest neighbor classifier: training is O(1), prediction is O(N), which is backwards from what we want (we can afford slow training but need fast prediction), even with fast approximate-nearest-neighbor libraries such as FAISS (https://github.com/facebookresearch/faiss).
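A minimal sketch of that classifier using the L1 pixel distance (array names and shapes are just for illustration):
```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing the data: O(1)
        self.X_train, self.y_train = X, y

    def predict(self, X):
        # Prediction compares each test example against every training example: O(N)
        y_pred = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            dists = np.abs(self.X_train - X[i]).sum(axis=1)  # L1 distance to all training rows
            y_pred[i] = self.y_train[np.argmin(dists)]       # label of the closest one
        return y_pred
```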
### Interpreting a linear classifier
Visual viewpoint
![[Pasted image 20251017234113.png]]
Geometric viewpoint
![[Pasted image 20251017234121.png]]
#### Different lenses for looking at the softmax function. It can also be thought of as cross-entropy, since H(p) = 0 for a one-hot encoding
![[Pasted image 20251017234220.png]]
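A small numeric sketch of that cross-entropy view: with a one-hot target, the softmax loss is just the negative log-probability of the correct class (the scores below are made up):
```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # unnormalized class scores for one example
y = 0                                  # index of the correct class (one-hot target)

probs = np.exp(scores - scores.max())  # shift by max for numerical stability
probs /= probs.sum()                   # softmax: normalize to probabilities

loss = -np.log(probs[y])               # cross-entropy with a one-hot p collapses to this
print(probs.round(3), round(loss, 3))
```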
## Lecture 3: Regularization and Optimization
### Regularization
$
L(W)=\underbrace{\frac{1}{N} \sum_{i=1}^N L_i\left(f\left(x_i, W\right), y_i\right)}_{\begin{array}{l}\text { Data loss: Model predictions } \\ \text { should match training data }\end{array}}+\underbrace{\lambda R(W)}_{\begin{array}{l}\text { Regularization: Prevent the model } \\ \text { from doing too well on training data }\end{array}}
$
L2 regularization: $R(W)=\sum_k \sum_l W_{k, l}^2$
L1 regularization: $R(W)=\sum_k \sum_l\left|W_{k, l}\right|$
Elastic net (L1 + L2): $R(W)=\sum_k \sum_l \beta W_{k, l}^2+\left|W_{k, l}\right|$
L1 gradient: $\operatorname{sign}(w)$ --> a constant pull towards 0; L2 gradient: $2w$ --> a pull proportional to the weight's magnitude.
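A quick numeric check of that difference (toy weight vector):
```python
import numpy as np

w = np.array([3.0, 0.5, -0.01])
grad_l2 = 2 * w        # pull proportional to the weight's size: [6.0, 1.0, -0.02]
grad_l1 = np.sign(w)   # constant-magnitude pull regardless of size: [1.0, 1.0, -1.0]
print(grad_l2, grad_l1)
```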
More complex regularizers: dropout, batch normalization, stochastic depth, etc.
Gradient (1D case)
$
\frac{d f(x)}{d x}=\lim _{h \rightarrow 0} \frac{f(x+h)-f(x)}{h}
$
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of that direction with the gradient; the direction of steepest descent is the negative gradient.
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update
```
However, the full sum is expensive when N is large --> Stochastic Gradient Descent: approximate the sum with a minibatch of examples.
$
\begin{aligned}
L(W) & =\frac{1}{N} \sum_{i=1}^N L_i\left(x_i, y_i, W\right)+\lambda R(W) \\
\nabla_W L(W) & =\frac{1}{N} \sum_{i=1}^N \nabla_W L_i\left(x_i, y_i, W\right)+\lambda \nabla_W R(W)
\end{aligned}
$
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update
```
### Problems with gradient descent
1. What if the loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow direction, jittering along the steep one.
2. Saddle points or local minima
3. Gradients come from minibatches, so they can be noisy
![[Pasted image 20251019223647.png]]
#### RMSProp
Intuition: when the current gradient and past gradients are large, `grad_squared` becomes large, so the effective step size in that direction shrinks (and vice versa).
```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```
#### Adam
![[Pasted image 20251019225914.png]]
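A minimal, self-contained sketch of the standard Adam update (first and second moment estimates plus bias correction), run here on a toy quadratic so it executes end to end; the objective and hyperparameter values are just for illustration:
```python
import numpy as np

def compute_gradient(x):
    return 2 * x  # toy objective f(x) = sum(x**2), gradient 2x

x = np.array([3.0, -2.0])
first_moment, second_moment = 0.0, 0.0
beta1, beta2, learning_rate = 0.9, 0.999, 0.1
for t in range(1, 201):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # momentum-like term
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp-like term
    first_unbias = first_moment / (1 - beta1 ** t)                 # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
print(x)  # moves toward the minimum at [0, 0]
```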
#### Learning rate decay
![[Pasted image 20251019230533.png]]
![[Pasted image 20251019230632.png]]
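As one concrete example of a decay schedule, a cosine schedule (one common choice; `base_lr` and `total_steps` are hypothetical):
```python
import numpy as np

def cosine_lr(step, total_steps, base_lr):
    # Decays smoothly from base_lr at step 0 to 0 at total_steps
    return 0.5 * base_lr * (1 + np.cos(np.pi * step / total_steps))

print([round(cosine_lr(s, 100, 0.1), 4) for s in (0, 25, 50, 75, 100)])
# [0.1, 0.0854, 0.05, 0.0146, 0.0]
```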
In practice:
- Adam (or AdamW) is a good default choice in many cases; it often works OK even with a constant learning rate
- SGD+Momentum can outperform Adam but may require more tuning of LR and schedule
- If you can afford to do full batch updates then look beyond $1^{\text {st }}$ order optimization ( $2^{\text {nd }}$ order and beyond)
## Lecture 4: Neural Networks and Backpropagation
(Before) Linear score function: $f=Wx$, with $x \in \mathbb{R}^D$, $W \in \mathbb{R}^{C \times D}$ ($D$ is the number of features of input $x$, $C$ is the number of classes)
(Now) 2-layer neural network: $f=W_2 \max \left(0, W_1 x\right)$, $x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}$
($H$ is the number of neurons in the hidden layer)
$\max$ is a non-linear operator
![[Pasted image 20251022111136.png]]
Full implementation of training a 2-layer neural network needs ~20 lines
```python
import numpy as np
from numpy.random import randn
# define the network
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)
w1, w2 = randn(D_in, H), randn(H, D_out)
# training loop
for t in range(2000):
    # forward pass (sigmoid hidden layer)
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    # backward pass: analytic gradients of the loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))
    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```
![[Pasted image 20251022114734.png]]
![[Pasted image 20251022114843.png]]
### Computational graphs and Backpropagation
A computational graph is a structured way to represent a mathematical computation as a directed graph, where:
- Nodes represent operations (like addition, multiplication, ReLU) or variables/constants (like inputs, weights, biases).
- Edges represent the flow of data — specifically, the output of one node becoming the input to another.
![[Pasted image 20251022120703.png]]
*==TODO:==* Add pseudocode for computation graph and backprop here!
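In the meantime, a minimal sketch of what that pseudocode could look like: each node caches what it needs in `forward` and returns local-gradient-times-upstream in `backward`, and the graph is walked in reverse order. This reproduces the standard $(x+y)z$ example with $x=-2, y=5, z=-4$:
```python
class Add:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        return dout, dout              # addition just routes the gradient to both inputs

class Multiply:
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        return dout * self.y, dout * self.x   # chain rule: local grad * upstream grad

add, mul = Add(), Multiply()
q = add.forward(-2.0, 5.0)             # q = x + y = 3
f = mul.forward(q, -4.0)               # f = q * z = -12

df_dq, df_dz = mul.backward(1.0)       # seed with df/df = 1
df_dx, df_dy = add.backward(df_dq)
print(df_dx, df_dy, df_dz)             # -4.0 -4.0 3.0
```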
### Backpropagation
A simple example
![[Pasted image 20251022122305.png]]
Recap: Vector derivatives
![[Pasted image 20251022124057.png]]
Backprop with Matrices
![[Pasted image 20251022142730.png]]
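For the matrix case, the key identities for $Y = XW$ are $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T$ and $\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y}$; a quick numerical sanity check (shapes picked arbitrarily):
```python
import numpy as np

N, D, M = 4, 3, 5
X, W = np.random.randn(N, D), np.random.randn(D, M)
Y = X @ W
dY = np.random.randn(N, M)             # pretend this is the upstream gradient dL/dY

dX = dY @ W.T                          # dL/dX, shape (N, D) -- matches X
dW = X.T @ dY                          # dL/dW, shape (D, M) -- matches W

# Finite-difference check on one entry of W, treating L = sum(Y * dY)
eps, (i, j) = 1e-6, (1, 2)
W_pert = W.copy(); W_pert[i, j] += eps
numeric = ((X @ W_pert - Y) * dY).sum() / eps
print(np.isclose(numeric, dW[i, j], atol=1e-4))  # True
```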
## Lecture 5: Image Classification with CNNs
### Convolution Layer
Every point in the activation map corresponds to how much the corresponding input patch aligns with the filter.
![[Pasted image 20251024153327.png]]
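A naive single-filter, single-channel convolution sketch that makes that statement concrete, together with the usual output-size formula (names and shapes are illustrative; real layers vectorize this over channels and filters):
```python
import numpy as np

def conv2d_single(x, w, stride=1, pad=0):
    # x: (H, W) input plane, w: (k, k) filter -> one activation map
    x = np.pad(x, pad)
    H, W = x.shape                        # H, W already include the padding here
    k = w.shape[0]
    out_h = (H - k) // stride + 1         # i.e. (N + 2P - F) / S + 1 on the unpadded size
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = (patch * w).sum()  # how strongly this patch aligns with the filter
    return out

print(conv2d_single(np.random.randn(7, 7), np.random.randn(3, 3)).shape)  # (5, 5)
```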
### ConvNet
![[Pasted image 20251024154257.png]]
### Pooling layers
Another way to downsample
![[Pasted image 20251024161316.png]]
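A tiny sketch of the most common variant, 2x2 max pooling with stride 2:
```python
import numpy as np

def maxpool2x2(x):
    # x: (H, W) with even H, W; keep the max of each non-overlapping 2x2 window
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(maxpool2x2(x))
# [[ 5  7]
#  [13 15]]
```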
### Translation equivariance
Features of images don't depend on their location in the image: if the input shifts, the convolution output shifts by the same amount.
![[Pasted image 20251024161925.png]]
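A toy 1D demo of that property, using circular convolution so the equality is exact at the boundaries (a simplifying assumption; real conv layers only satisfy this approximately near the edges):
```python
import numpy as np

def circ_corr(x, k):
    # Circular cross-correlation: y[i] = sum_j x[(i + j) % N] * k[j]
    N = len(x)
    return np.array([sum(x[(i + j) % N] * k[j] for j in range(len(k))) for i in range(N)])

x = np.random.randn(8)
k = np.array([1.0, -2.0, 1.0])
shift_then_filter = circ_corr(np.roll(x, 3), k)
filter_then_shift = np.roll(circ_corr(x, k), 3)
print(np.allclose(shift_then_filter, filter_then_shift))  # True: shifting commutes with filtering
```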
## Lecture 6: CNN Architectures
### Components of CNNs
![[Pasted image 20251025174912.png]]
### Normalization layer
High-level idea: Learn parameters that let us scale / shift the input data
1. Normalize the input data
2. Scale / shift using learned parameters
![[Pasted image 20251025181841.png]]
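A minimal sketch of those two steps for batch normalization at training time (`gamma`/`beta` are the learned scale and shift; at test time, running averages of the batch statistics are used instead):
```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch; normalize each feature over the batch dimension
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # step 1: normalize
    return gamma * x_hat + beta             # step 2: scale / shift with learned parameters

x = np.random.randn(32, 4) * 5 + 3          # features with arbitrary mean and scale
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # per-feature ~0 mean, ~1 std
```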
*==TODO:==* write down thoughts on normalization layers.
Resources

Papers:
- EDM2 https://arxiv.org/abs/2312.02696,
- Transformers without Normalization https://arxiv.org/abs/2503.10622
### Dropout
In each forward pass, randomly set some neurons to zero
The probability of dropping is a hyperparameter; equivalently, the code below works with the keep probability p.
![[Pasted image 20251025185458.png]]
```python
p = 0.5  # probability of keeping a unit active, matching the * p scaling below

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # NOTE: scale the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
    out = np.dot(W3, H2) + b3
    return out
```
At test time all neurons are always active --> we must scale the activations so that, for each neuron, the output at test time equals the expected output at training time.
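For contrast with the `predict` fragment above, a self-contained sketch of the matching training-time forward pass, where units are dropped with probability 1 - p (layer sizes are arbitrary):
```python
import numpy as np

p = 0.5  # probability of keeping a unit active
W1, b1 = np.random.randn(20, 10) * 0.01, np.zeros((20, 1))
W2, b2 = np.random.randn(20, 20) * 0.01, np.zeros((20, 1))
W3, b3 = np.random.randn(5, 20) * 0.01, np.zeros((5, 1))

def train_forward(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    return np.dot(W3, H2) + b3

out = train_forward(np.random.randn(10, 1))
print(out.shape)  # (5, 1)
```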
### Activation functions
Sigmoid
![[Pasted image 20251026135647.png]]
ReLU
![[Pasted image 20251026135700.png]]
GELU
![[Pasted image 20251026135715.png]]
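For reference, NumPy versions of the three activations above (GELU via its common tanh approximation):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)         # zero for negative inputs, identity otherwise

def gelu(x):
    # tanh approximation of x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 7)
print(sigmoid(x).round(2), relu(x), gelu(x).round(2), sep="\n")
```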
### CNN Architectures
VGGNet (why use smaller filters? A stack of small filters has the same effective receptive field as one larger filter, but with fewer parameters and more non-linearities; see the check after the figures)
![[Pasted image 20251026141009.png]]
![[Pasted image 20251026141019.png]]
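A quick check of the parameter-count part of that argument: three stacked 3x3 conv layers see the same 7x7 effective receptive field as a single 7x7 layer, but with fewer parameters (assuming C input and C output channels throughout, biases ignored):
```python
C = 64
params_three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers
params_one_7x7 = 7 * 7 * C * C           # one 7x7 conv layer, same receptive field
print(params_three_3x3, params_one_7x7)  # 110592 vs 200704
```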
ResNet
What happens when we continue to stack deeper layers on a "plain" CNN?
![[Pasted image 20251026141314.png]]
![[Pasted image 20251026141805.png]]
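A minimal sketch of the residual idea, using fully-connected layers instead of convolutions to keep it short (real ResNet blocks use conv + batch norm, and handle shape mismatches with 1x1 convs or striding):
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    out = relu(x @ W1)          # the "residual" transform F(x)
    out = out @ W2
    return relu(out + x)        # identity shortcut: easy to fall back to F(x) ~ 0

D = 8
x = np.random.randn(4, D)
W1, W2 = np.random.randn(D, D) * 0.1, np.random.randn(D, D) * 0.1
print(residual_block(x, W1, W2).shape)  # (4, 8)
```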
### Weight Initialization
Values too small --> all activations tend to go to zero for deeper network layers
![[Pasted image 20251026144800.png]]
Values too large --> activations blow up quickly
![[Pasted image 20251026144848.png]]
One solution: Kaiming initialization
![[Pasted image 20251026144907.png]]
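A small experiment in the spirit of those plots: with Kaiming initialization (std = sqrt(2 / fan_in), derived for ReLU), activation magnitudes stay roughly constant with depth instead of collapsing to zero or blowing up (layer sizes are arbitrary):
```python
import numpy as np

D, depth = 512, 10
h = np.random.randn(1000, D)                      # batch of unit-variance inputs
for layer in range(depth):
    W = np.random.randn(D, D) * np.sqrt(2.0 / D)  # Kaiming: std = sqrt(2 / fan_in)
    h = np.maximum(0, h @ W)                      # linear layer + ReLU
    print(layer, round(float(h.std()), 3))        # stays roughly constant across layers
```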
*==TODO:==* Add how EDM2 initializes the weights here, and how they derive the expected output distribution
## Lecture 7: Recurrent Neural Networks
### Content
- Discuss sequence modeling (we have assumed fixed-length inputs so far)
- Simple models commonly used before the era of transformers
- Relation to modern state-space models
![[Pasted image 20251027092049.png]]
one to one: vanilla neural network
one to many: image captioning
many to one: sequence of video frames -> class label
many to many: video captioning, video classification on every frame
### RNN update
hidden state
![[Pasted image 20251027092950.png]]
output generation
![[Pasted image 20251027092959.png]]
### Vanilla RNN
![[Pasted image 20251027094457.png]]
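For reference, the standard vanilla RNN recurrence (the concrete example below swaps $\tanh$ for ReLU):
$
h_t=\tanh \left(W_{h h} h_{t-1}+W_{x h} x_t\right), \quad y_t=W_{h y} h_t
$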
Concrete example
![[Pasted image 20251027094516.png]]
![[Pasted image 20251027095020.png]]
```python
import numpy as np
h_0 = np.array([[0], [0], [1]])
w_xh = np.array([[1], [0], [0]])
w_hh = np.array([[0, 0, 0], [1, 0, 0], [0, 0, 1]])
w_yh = np.array([1, 1, -1])
x_in = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
x_seq = np.array([[x] for x in x_in]) # Format properly for matmul ops
h_t_prev = h_0
def relu(x):
    return np.maximum(0, x)

y_seq = []
for t, x in enumerate(x_seq):
    h_t = relu(w_hh @ h_t_prev + (w_xh @ x).reshape(3, 1))
    y_t = relu(w_yh @ h_t)
    y_seq.append(y_t)
    h_t_prev = h_t
print("Inputs: ", [int(x) for x in x_seq.flatten()])
print("Outputs:", [int(y[0]) for y in y_seq])
```
### RNN: Computational graph
![[Pasted image 20251027101609.png]]
![[Pasted image 20251027101619.png]]
![[Pasted image 20251027101629.png]]
*==TODO==* Figure out backprop through time properly (preferably through code - see: https://gist.github.com/karpathy/d4dee566867f8291f086, blog post: https://karpathy.github.io/2015/05/21/rnn-effectiveness/, https://machinelearningmastery.com/gentle-introduction-backpropagation-time/)
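Until that TODO is done, here is a minimal sketch of backprop through time for a vanilla RNN with an MSE readout, following the structure of Karpathy's min-char-rnn (shapes, data, and the choice of loss are made up; biases omitted):
```python
import numpy as np

H, D, T = 4, 3, 5                       # hidden size, input size, sequence length
Wxh, Whh, Why = [np.random.randn(*s) * 0.1 for s in [(H, D), (H, H), (1, H)]]
xs = np.random.randn(T, D, 1)
targets = np.random.randn(T, 1, 1)

# Forward: unroll the recurrence, caching hidden states for the backward pass
hs, loss = {-1: np.zeros((H, 1))}, 0.0
for t in range(T):
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
    loss += 0.5 * float(((Why @ hs[t] - targets[t]) ** 2).sum())

# Backward through time: gradient reaches h_t both from y_t and from h_{t+1}
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros((H, 1))
for t in reversed(range(T)):
    dy = Why @ hs[t] - targets[t]       # dL/dy_t for the MSE loss
    dWhy += dy @ hs[t].T
    dh = Why.T @ dy + dh_next           # readout path + recurrent path
    draw = (1 - hs[t] ** 2) * dh        # backprop through tanh
    dWxh += draw @ xs[t].T
    dWhh += draw @ hs[t - 1].T
    dh_next = Whh.T @ draw              # pass gradient to the previous timestep
print(round(loss, 4), dWxh.shape, dWhh.shape, dWhy.shape)
```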
### RNN tradeoffs
RNN Advantages:
- Can process inputs of any length (no fixed context length)
- Computation for step $t$ can (in theory) use information from many steps back
- Model size does not increase for longer input
- The same weights are applied on every timestep, so there is symmetry in how inputs are processed.
RNN Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
*==TODO:==* Reimplement LSTM https://colah.github.io/posts/2015-08-Understanding-LSTMs/, and maybe read https://www.deeplearningbook.org/contents/rnn.html
## Lecture 8: Attention and Transformers
### Sequence to sequence with RNNs
![[Pasted image 20251028110752.png]]
### Sequence to sequence with RNNs and Attention
![[Pasted image 20251028112928.png]]
### Attention layer
![[Pasted image 20251028121622.png]]
### Self Attention layer
![[Pasted image 20251028122251.png]]
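A minimal NumPy sketch of scaled dot-product self-attention for a single head (the projection matrices, shapes, and the absence of masking are all simplifications):
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, D) sequence; every position attends over all positions
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) similarities, scaled by sqrt(d_k)
    attn = softmax(scores, axis=-1)           # each row is a distribution over positions
    return attn @ V                           # weighted sum of value vectors

T, D, d = 6, 16, 8
X = np.random.randn(T, D)
Wq, Wk, Wv = (np.random.randn(D, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 8)
```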
### Three ways of processing sequences
![[Pasted image 20251028155012.png]]
### The transformer
![[Pasted image 20251028160419.png]]
The annotated transformers: [link](https://nlp.seas.harvard.edu/annotated-transformer/)
## Lecture 11: Large Scale Distributed Training
### Inside Nvidia H100
![[Pasted image 20251028164500.png]]
![[Pasted image 20251028165852.png]]
![[Pasted image 20251028170812.png]]
![[Pasted image 20251028172032.png]]
### How to train on lots of GPUs
Data parallelism
![[Pasted image 20251028174731.png]]
![[Pasted image 20251028175608.png]]
![[Pasted image 20251028221307.png]]
### Activation checkpointing
![[Pasted image 20251028222750.png]]
### How to train on lots of GPUs
![[Pasted image 20251028223225.png]]
### Pipeline parallelism
![[Pasted image 20251028231109.png]]
### Tensor parallelism
![[Pasted image 20251028231139.png]]
### Summary
![[Pasted image 20251028231317.png]]