Deep Learning Cheat Sheet
Visual Overview: Neural Network Architecture
Forward pass: Input flows left→right through weighted connections and activations
Neural Network Architectures
| Architecture | Input | Use Case | Key Feature |
|---|---|---|---|
| MLP (Fully Connected) | Vectors | Tabular data | Dense layers |
| CNN | Images | Computer Vision | Convolution, pooling |
| RNN | Sequences | Text, time-series | Recurrent connections |
| LSTM | Sequences | Long dependencies | Memory cells, gates |
| GRU | Sequences | Faster LSTM alternative | Simplified gates |
| Transformer | Sequences | NLP, machine translation | Self-attention, parallel |
| Vision Transformer | Images | Image classification | Attention on patches |
| Autoencoder | Any | Dimensionality reduction | Encoder-decoder |
| GAN | Noise | Image generation | Generator-Discriminator |
Activation Functions
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Hidden layers (default) |
| Leaky ReLU | x if x>0, else 0.01x | (-∞, ∞) | Avoid dying ReLU |
| ELU | x if x>0, else α(e^x-1) | (-α, ∞) | Smooth negative |
| Sigmoid | 1/(1+e^-x) | (0, 1) | Binary classification output |
| Tanh | (e^(2x) - 1)/(e^(2x) + 1) | (-1, 1) | Zero-centered hidden activations |
| Softmax | e^xi / Σe^xj | (0, 1) | Multi-class output (probabilities) |
| Swish | x × sigmoid(βx) | (-∞, ∞) | Often outperforms ReLU in deep nets |
| GELU | x × Φ(x) | (-∞, ∞) | Transformers (BERT, GPT) |
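A few of the formulas above written out in plain Python — scalar versions for illustration (frameworks provide vectorized equivalents; the GELU here is the tanh approximation used in BERT/GPT implementations):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # tanh approximation of x * Φ(x)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(relu(-2.0), sigmoid(0.0))  # 0.0 0.5
```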
Loss Functions
// Classification
Cross-Entropy (Binary): -[y*log(ŷ) + (1-y)*log(1-ŷ)]
Cross-Entropy (Multi-class): -Σ yi*log(ŷi)
Focal Loss: Handles class imbalance, focuses on hard examples
// Regression
Mean Squared Error (MSE): (1/n)Σ(y - ŷ)²
Mean Absolute Error (MAE): (1/n)Σ|y - ŷ|
Huber Loss: Combines MSE & MAE, robust to outliers
// Distance/Ranking
Contrastive Loss: Bring similar samples close, push apart different
Triplet Loss: Anchor, positive, negative samples
ArcFace, CosFace: Face recognition losses
// Regularization
L1: Σ|w| (sparse)
L2: Σw² (weight decay; encourages small weights)
Combined (Elastic Net): L1 + L2
Code example (PyTorch):
import torch.nn as nn
criterion = nn.CrossEntropyLoss()  # expects raw logits, not softmax outputs
loss = criterion(predictions, targets)
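The binary cross-entropy formula above can also be checked in plain Python (a minimal sketch; frameworks clip ŷ internally for numerical stability, as done here with `eps`):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # -[y*log(ŷ) + (1-y)*log(1-ŷ)], with clipping for stability
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(1.0, 0.9))  # ≈ 0.105
```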
Optimizers
| Optimizer | Update Rule / Idea | When to Use | Notes |
|---|---|---|---|
| SGD | w = w - lr×∇L | Simple baseline | Can oscillate |
| SGD+Momentum | v = βv - lr×∇L; w = w + v | Faster convergence | Popular choice |
| Nesterov | Look-ahead gradient | Faster than momentum | Better convergence |
| AdaGrad | Per-parameter learning rate | Sparse data | Learning rate decreases over time |
| RMSprop | Adaptive learning rate | RNNs, good default | Divides by root of running average of squared gradients |
| Adam | Momentum + RMSprop | Most popular | Works well, default choice |
| AdamW | Adam + weight decay | Better than Adam+L2 | Recommended for modern DL |
| LAMB | Layer-wise adaptive learning | Large batch training | Scales better to large batches |
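The SGD+Momentum update rule from the table, sketched in plain Python for a single parameter (frameworks apply the same rule element-wise to every weight tensor):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # v = momentum*v - lr*grad;  w = w + v
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# minimize L(w) = w^2, whose gradient is 2w
w, v = 1.0, 0.0
for _ in range(3):
    grad = 2 * w
    w, v = sgd_momentum_step(w, grad, v)
# w has moved from 1.0 toward the minimum at 0
```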
Regularization Techniques
// Dropout
Randomly deactivate neurons during training
Prevents co-adaptation, reduces overfitting
p = 0.5 is common (50% dropout)
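Inverted dropout (the variant most frameworks implement) can be sketched in plain Python: each unit is zeroed with probability p and the survivors are scaled by 1/(1-p), so expected activations match at inference time with no rescaling needed:

```python
import random

def dropout(activations, p=0.5, training=True):
    if not training or p == 0:
        return list(activations)
    keep = 1.0 - p
    # zero each unit with probability p, rescale survivors by 1/keep
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)  # survivors are doubled, rest are 0
```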
// Batch Normalization
Normalize layer inputs during training
Reduces internal covariate shift
Allows higher learning rates
Usually before or after activation
// Layer Normalization
Normalize across features (not batch)
Better for RNNs, Transformers
Batch-size independent
// Weight Decay (L2 Regularization)
Penalizes large weights
Encourages simpler models
In PyTorch: optimizer with weight_decay parameter
// Early Stopping
Monitor validation loss
Stop when it stops improving
Prevents overfitting
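A minimal early-stopping loop with a patience counter (the `val_losses` list is a hypothetical validation-loss history for illustration):

```python
def early_stop_index(val_losses, patience=2):
    # stop after `patience` epochs without improvement; return epoch of best loss
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

print(early_stop_index([0.9, 0.7, 0.6, 0.65, 0.66, 0.64]))  # 2
```

In practice you would also checkpoint the model weights at each new best epoch and restore them when stopping.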
// Data Augmentation
Image: Rotation, flip, crop, color jitter
Text: Paraphrase, back-translation
Time-series: Scaling, noise
// Mix-Up
Create virtual samples: x' = λx_i + (1-λ)x_j
Smooths decision boundary
λ ~ Beta(α, α)
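The Mix-Up formula above in plain Python, drawing λ from Beta(α, α) (a pure-Python sketch on flat feature lists; in practice this runs on batched tensors and mixes the one-hot labels with the same λ):

```python
import random

def mixup(x_i, x_j, alpha=0.2):
    # λ ~ Beta(α, α);  x' = λ·x_i + (1-λ)·x_j
    lam = random.betavariate(alpha, alpha)
    return [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)], lam

mixed, lam = mixup([1.0, 0.0], [0.0, 1.0])
```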
// CutMix
Cut and paste regions between images
Improves robustness
Convolutional Neural Networks (CNN)
// Convolution Operation
Filter size: 3x3, 5x5, 7x7 common
Stride: How much filter moves (1 or 2)
Padding: Add zeros around input
Output size = ⌊(input - kernel + 2×padding) / stride⌋ + 1
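The output-size formula can be checked with a small helper (integer division mirrors the floor that frameworks apply):

```python
def conv_out_size(input_size, kernel, padding=0, stride=1):
    # (input - kernel + 2*padding) // stride + 1
    return (input_size - kernel + 2 * padding) // stride + 1

print(conv_out_size(224, 3, padding=1))   # 224  ("same" padding preserves size)
print(conv_out_size(224, 2, stride=2))    # 112  (2x2/stride-2 pooling halves it)
```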
// Pooling
Max Pooling: Take maximum value
Average Pooling: Take average
Stride usually 2, pooling size 2x2
// Common Architectures
AlexNet: Deep CNN, ImageNet breakthrough (2012)
VGG: Deeper, simpler (3x3 filters)
ResNet: Skip connections, very deep (50-152 layers)
Inception: Multi-scale features
MobileNet: Efficient, mobile-friendly
EfficientNet: Scales depth/width/resolution
// Typical CNN Structure
Input → Conv → ReLU → Pool → ... (repeat) → FC → Softmax
// Example (PyTorch)
import torch.nn as nn
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64*56*56, 10)  # assumes 224x224 input: 224/2/2 = 56
)
Recurrent & Attention
// LSTM (Long Short-Term Memory)
Cell state: Carries info across time
Gates: Input, output, forget
Mitigates vanishing gradient problem
Good for sequences with long dependencies
// GRU (Gated Recurrent Unit)
Simpler than LSTM (2 gates vs 3)
Slightly faster
Similar performance
// Transformer (Self-Attention)
Query (Q), Key (K), Value (V) vectors
Attention(Q, K, V) = softmax(Q×K^T / √d_k)×V
Parallel processing (vs sequential RNN)
Basis for BERT, GPT
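The attention formula above as a minimal single-head NumPy sketch (no masking, batching, or learned projections; Q, K, V here are arbitrary random matrices for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))  # 2 queries, d_k = 4
K = rng.standard_normal((3, 4))  # 3 keys
V = rng.standard_normal((3, 4))  # 3 values
out, w = scaled_dot_product_attention(Q, K, V)
# each output row is a weighted average of the value rows
```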
// Positional Encoding
Sin/cos functions based on position
Helps model learn position information
Absolute or relative positions
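The sinusoidal (absolute) variant from the original Transformer paper, sketched in NumPy (assumes an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)  # added element-wise to token embeddings
```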
// Multi-Head Attention
Multiple attention heads in parallel
Captures different types of relationships
Concatenate and project results
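A toy NumPy sketch of the three steps above — split the model dimension across heads, attend per head, concatenate, and project (random weight matrices stand in for the learned projections; no masking or batching):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # random stand-ins for the learned Q/K/V/output projections
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of d_model
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    # concatenate heads and apply the output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 tokens, d_model = 8
out = multi_head_attention(X, n_heads=2, rng=rng)
```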
// Sequence to Sequence (Seq2Seq)
Encoder: Process input sequence
Decoder: Generate output sequence
With attention: Decoder attends to encoder
// Example use cases
LSTM: Time-series forecasting, speech recognition
Transformer: Machine translation, text generation, classification