Deep Learning Cheat Sheet
Visual Overview: Neural Network Architecture
Forward pass: Input flows left→right through weighted connections and activations
Neural Network Architectures
| Architecture | Input | Use Case | Key Feature |
|---|---|---|---|
| MLP (Fully Connected) | Vectors | Tabular data | Dense layers |
| CNN | Images | Computer Vision | Convolution, pooling |
| RNN | Sequences | Text, time-series | Recurrent connections |
| LSTM | Sequences | Long dependencies | Memory cells, gates |
| GRU | Sequences | Faster LSTM alternative | Simplified gates |
| Transformer | Sequences | NLP, machine translation | Self-attention, parallel |
| Vision Transformer | Images | Image classification | Attention on patches |
| Autoencoder | Any | Dimensionality reduction | Encoder-decoder |
| GAN | Noise | Image generation | Generator-Discriminator |
Activation Functions
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Hidden layers (default) |
| Leaky ReLU | x if x>0, else 0.01x | (-∞, ∞) | Avoid dying ReLU |
| ELU | x if x>0, else α(e^x-1) | (-α, ∞) | Smooth negative |
| Sigmoid | 1/(1+e^-x) | (0, 1) | Binary classification output |
| Tanh | (e^(2x) - 1)/(e^(2x) + 1) | (-1, 1) | Zero-centered hidden activations |
| Softmax | e^xi / Σe^xj | (0, 1) | Multi-class output (probabilities) |
| Swish | x × sigmoid(βx) | (-∞, ∞) | Often outperforms ReLU in deep nets |
| GELU | x × Φ(x) | (-∞, ∞) | Transformers (BERT, GPT) |
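A few of the formulas above written out in plain Python — scalar versions for illustration (frameworks provide vectorized equivalents; the GELU here is the tanh approximation used in BERT/GPT implementations):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # tanh approximation of x * Φ(x)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(relu(-2.0), sigmoid(0.0))  # 0.0 0.5
```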
Loss Functions
// Classification
Cross-Entropy (Binary): -[y*log(ŷ) + (1-y)*log(1-ŷ)]
Cross-Entropy (Multi-class): -Σ yi*log(ŷi)
Focal Loss: Handles class imbalance, focuses on hard examples
// Regression
Mean Squared Error (MSE): (1/n)Σ(y - ŷ)²
Mean Absolute Error (MAE): (1/n)Σ|y - ŷ|
Huber Loss: Combines MSE & MAE, robust to outliers
// Distance/Ranking
Contrastive Loss: Bring similar samples close, push apart different
Triplet Loss: Anchor, positive, negative samples
ArcFace, CosFace: Face recognition losses
// Regularization
L1: Σ|w| (sparse)
L2: Σw² (weight decay; encourages small weights)
Combined (Elastic Net): L1 + L2
Code example (PyTorch):
import torch.nn as nn
criterion = nn.CrossEntropyLoss()  # expects raw logits, not softmax outputs
loss = criterion(predictions, targets)
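The binary cross-entropy formula above can also be checked in plain Python (a minimal sketch; frameworks clip ŷ internally for numerical stability, as done here with `eps`):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # -[y*log(ŷ) + (1-y)*log(1-ŷ)], with clipping for stability
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(1.0, 0.9))  # ≈ 0.105
```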
Optimizers
| Optimizer | Update Rule / Idea | When to Use | Notes |
|---|---|---|---|
| SGD | w = w - lr×∇L | Simple baseline | Can oscillate |
| SGD+Momentum | v = βv - lr×∇L; w = w + v | Faster convergence | Popular choice |
| Nesterov | Look-ahead gradient | Faster than momentum | Better convergence |
| AdaGrad | Per-parameter learning rate | Sparse data | Learning rate decreases over time |
| RMSprop | Adaptive learning rate | RNNs, good default | Divides by root of running average of squared gradients |
| Adam | Momentum + RMSprop | Most popular | Works well, default choice |
| AdamW | Adam + weight decay | Better than Adam+L2 | Recommended for modern DL |
| LAMB | Layer-wise adaptive learning | Large batch training | Scales better to large batches |
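The SGD+Momentum update rule from the table, sketched in plain Python for a single parameter (frameworks apply the same rule element-wise to every weight tensor):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # v = momentum*v - lr*grad;  w = w + v
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# minimize L(w) = w^2, whose gradient is 2w
w, v = 1.0, 0.0
for _ in range(3):
    grad = 2 * w
    w, v = sgd_momentum_step(w, grad, v)
# w has moved from 1.0 toward the minimum at 0
```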
Regularization Techniques
// Dropout
Randomly deactivate neurons during training
Prevents co-adaptation, reduces overfitting
p = 0.5 is common (50% dropout)
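Inverted dropout (the variant most frameworks implement) can be sketched in plain Python: each unit is zeroed with probability p and the survivors are scaled by 1/(1-p), so expected activations match at inference time with no rescaling needed:

```python
import random

def dropout(activations, p=0.5, training=True):
    if not training or p == 0:
        return list(activations)
    keep = 1.0 - p
    # zero each unit with probability p, rescale survivors by 1/keep
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)  # survivors are doubled, rest are 0
```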
// Batch Normalization
Normalize layer inputs during training
Reduces internal covariate shift
Allows higher learning rates
Usually before or after activation
// Layer Normalization
Normalize across features (not batch)
Better for RNNs, Transformers
Batch-size independent
// Weight Decay (L2 Regularization)
Penalizes large weights
Encourages simpler models
In PyTorch: optimizer with weight_decay parameter
// Early Stopping
Monitor validation loss
Stop when it stops improving
Prevents overfitting
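A minimal early-stopping loop with a patience counter (the `val_losses` list is a hypothetical validation-loss history for illustration):

```python
def early_stop_index(val_losses, patience=2):
    # stop after `patience` epochs without improvement; return epoch of best loss
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

print(early_stop_index([0.9, 0.7, 0.6, 0.65, 0.66, 0.64]))  # 2
```

In practice you would also checkpoint the model weights at each new best epoch and restore them when stopping.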
// Data Augmentation
Image: Rotation, flip, crop, color jitter
Text: Paraphrase, back-translation
Time-series: Scaling, noise
// Mix-Up
Create virtual samples: x' = λx_i + (1-λ)x_j
Smooths decision boundary
λ ~ Beta(α, α)
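The Mix-Up formula above in plain Python, drawing λ from Beta(α, α) (a pure-Python sketch on flat feature lists; in practice this runs on batched tensors and mixes the one-hot labels with the same λ):

```python
import random

def mixup(x_i, x_j, alpha=0.2):
    # λ ~ Beta(α, α);  x' = λ·x_i + (1-λ)·x_j
    lam = random.betavariate(alpha, alpha)
    return [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)], lam

mixed, lam = mixup([1.0, 0.0], [0.0, 1.0])
```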
// CutMix
Cut and paste regions between images
Improves robustness
Convolutional Neural Networks (CNN)
// Convolution Operation
Filter size: 3x3, 5x5, 7x7 common
Stride: How much filter moves (1 or 2)
Padding: Add zeros around input
Output size = ⌊(input - kernel + 2×padding) / stride⌋ + 1
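The output-size formula can be checked with a small helper (integer division mirrors the floor that frameworks apply):

```python
def conv_out_size(input_size, kernel, padding=0, stride=1):
    # (input - kernel + 2*padding) // stride + 1
    return (input_size - kernel + 2 * padding) // stride + 1

print(conv_out_size(224, 3, padding=1))   # 224  ("same" padding preserves size)
print(conv_out_size(224, 2, stride=2))    # 112  (2x2/stride-2 pooling halves it)
```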
// Pooling
Max Pooling: Take maximum value
Average Pooling: Take average
Stride usually 2, pooling size 2x2
// Common Architectures
AlexNet: Deep CNN, ImageNet breakthrough (2012)
VGG: Deeper, simpler (3x3 filters)
ResNet: Skip connections, very deep (50-152 layers)
Inception: Multi-scale features
MobileNet: Efficient, mobile-friendly
EfficientNet: Scales depth/width/resolution
// Typical CNN Structure
Input → Conv → ReLU → Pool → ... (repeat) → FC → Softmax
// Example (PyTorch)
import torch.nn as nn
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64*56*56, 10)  # assumes 224x224 input: 224/2/2 = 56
)
Recurrent & Attention
// LSTM (Long Short-Term Memory)
Cell state: Carries info across time
Gates: Input, output, forget
Mitigates vanishing gradient problem
Good for sequences with long dependencies
// GRU (Gated Recurrent Unit)
Simpler than LSTM (2 gates vs 3)
Slightly faster
Similar performance
// Transformer (Self-Attention)
Query (Q), Key (K), Value (V) vectors
Attention(Q, K, V) = softmax(Q×K^T / √d_k)×V
Parallel processing (vs sequential RNN)
Basis for BERT, GPT
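The attention formula above as a minimal single-head NumPy sketch (no masking, batching, or learned projections; Q, K, V here are arbitrary random matrices for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))  # 2 queries, d_k = 4
K = rng.standard_normal((3, 4))  # 3 keys
V = rng.standard_normal((3, 4))  # 3 values
out, w = scaled_dot_product_attention(Q, K, V)
# each output row is a weighted average of the value rows
```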
// Positional Encoding
Sin/cos functions based on position
Helps model learn position information
Absolute or relative positions
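The sinusoidal (absolute) variant from the original Transformer paper, sketched in NumPy (assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)  # added element-wise to token embeddings
```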
// Multi-Head Attention
Multiple attention heads in parallel
Captures different types of relationships
Concatenate and project results
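A toy NumPy sketch of the three steps above — split the model dimension across heads, attend per head, concatenate, and project (random weight matrices stand in for the learned projections; no masking or batching):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # random stand-ins for the learned Q/K/V/output projections
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of d_model
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    # concatenate heads and apply the output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 tokens, d_model = 8
out = multi_head_attention(X, n_heads=2, rng=rng)
```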
// Sequence to Sequence (Seq2Seq)
Encoder: Process input sequence
Decoder: Generate output sequence
With attention: Decoder attends to encoder
// Example use cases
LSTM: Time-series forecasting, speech recognition
Transformer: Machine translation, text generation, classification