Computer Vision Cheat Sheet


Image Fundamentals

// Images as arrays
Grayscale: 2D array (height × width)
RGB: 3D array (height × width × 3)
Pixel values: 0-255 (uint8) or 0-1 (float)

// Convolution operation
Filter (kernel): Small matrix (3×3, 5×5)
Slide filter over image
At each position: sum(element-wise product)
Output: Feature map

Example (3×3 filter on 5×5 image):
Without padding: Output 3×3
With padding=1: Output 5×5
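
A minimal pure-Python sketch of the sliding-filter operation described above. It assumes a square kernel and images stored as lists of rows; the function name `conv2d` is illustrative, not a library API. Like most deep learning frameworks, it does not flip the kernel (strictly, cross-correlation).

```python
def conv2d(image, kernel, padding=0, stride=1):
    """Slide `kernel` over `image` (lists of rows) and return the feature map."""
    k = len(kernel)  # assumes a square k x k kernel
    # Zero-pad the image on all four sides.
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * width for _ in range(padding)]

    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Sum of the element-wise product at this position.
            row.append(sum(
                padded[top + a][left + b] * kernel[a][b]
                for a in range(k) for b in range(k)
            ))
        output.append(row)
    return output


image = [[1] * 5 for _ in range(5)]   # 5x5 image of ones
box = [[1] * 3 for _ in range(3)]     # 3x3 filter of ones
valid = conv2d(image, box)            # 3x3 output, every value 9
same = conv2d(image, box, padding=1)  # 5x5 output, corners are 4
```

With padding=1 the corner values drop to 4 because five of the nine filter taps land on padded zeros.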

// Stride & Padding
Stride=1: Move filter 1 pixel at a time
Stride=2: Move 2 pixels (downsampling)
Padding: Add zeros around edges
Same padding: Output size equals input size (at stride 1)
Valid padding: No padding

Output size = floor((input - kernel + 2×padding) / stride) + 1
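
The output-size formula as a small helper, with the division rounded down so it also works when the stride does not divide evenly. The function name `conv_output_size` is illustrative.

```python
def conv_output_size(input_size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (input_size - kernel + 2 * padding) // stride + 1


no_pad = conv_output_size(5, 3)                        # 3
with_pad = conv_output_size(5, 3, padding=1)           # 5
stem = conv_output_size(224, 7, padding=3, stride=2)   # 112
```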

// Pooling
Max pooling: Take maximum
Average pooling: Take average
Typical: 2×2 with stride 2
Reduces size, keeps important features
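
A sketch of 2×2 pooling with stride 2 on a small feature map, assuming the map is a list of rows. `pool2d` and its `mode` argument are illustrative names.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values under the current window.
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size)
            ]
            row.append(reduce(window))
        output.append(row)
    return output


fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled_max = pool2d(fmap)              # [[6, 2], [2, 9]]
pooled_avg = pool2d(fmap, mode="avg")  # [[3.5, 1.0], [1.0, 6.0]]
```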

Object Detection (YOLO, SSD, Faster R-CNN)

// YOLO (You Only Look Once)
Single pass detection
Divide image into grid
Each cell: bounding box + class confidence
Fast (real-time); typically less accurate than two-stage detectors such as Faster R-CNN

Outputs per cell:
- x, y, w, h: bounding box
- confidence: P(object) × IoU with ground truth (objectness score)
- class probabilities: P(class | object)
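
A sketch of how the per-cell outputs can be combined at inference: multiplying the box confidence by each conditional class probability gives a class-specific score per box. The values and the three-class setup are made up for illustration.

```python
def class_scores(confidence, class_probs):
    """Combine box confidence with P(class | object) into per-class scores."""
    return [confidence * p for p in class_probs]


# One predicted box from one grid cell (hypothetical values).
x, y, w, h, confidence = 0.5, 0.5, 0.2, 0.4, 0.5
probs = [0.5, 0.25, 0.25]                  # P(class | object) for 3 classes
scores = class_scores(confidence, probs)   # [0.25, 0.125, 0.125]
best_class = max(range(len(scores)), key=scores.__getitem__)  # 0
```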

Loss: Localization + confidence + classification

// Faster R-CNN
Region-based (2-stage)
1. Region proposal network (RPN)
2. Classification & bbox refinement
More accurate but slower

// SSD (Single Shot MultiBox Detector)
Multi-scale feature maps
Balance between speed & accuracy
Detects at different scales

// IoU (Intersection over Union)
IoU = Area(intersection) / Area(union)
Common threshold: IoU ≥ 0.5 counts as a detection match
mAP: Mean of per-class average precision (AP); COCO also averages over IoU thresholds 0.5-0.95
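
A minimal IoU sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates with x2 > x1 and y2 > y1. Other box formats, such as center plus width and height, need converting first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))    # 1/7, about 0.143
is_match = overlap >= 0.5                    # False at the 0.5 threshold
```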

// NMS (Non-Maximum Suppression)
Remove overlapping boxes
Sort by confidence
Iteratively remove lower-confidence overlaps
Reduces duplicate detections
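
The steps above can be sketched as a greedy loop. This version assumes (x1, y1, x2, y2) boxes, a single class, and an IoU threshold of 0.5; it returns the indices of the boxes that survive.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    # Sort indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # [0, 2]: box 1 overlaps box 0 with IoU of about 0.68
```

In multi-class detectors this is usually run per class, so boxes of different classes do not suppress each other.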

Semantic Segmentation (FCN, U-Net, DeepLab)

// Semantic Segmentation
Classify each pixel
Output: Same size as input
Each pixel: Class label
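
A sketch of the final step: turning per-class score maps into a label map by taking the highest-scoring class at each pixel. The `scores[class][y][x]` layout is an assumption for illustration.

```python
def argmax_labels(scores):
    """scores[c][y][x] -> labels[y][x], the best class index per pixel."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Two classes on a 2x2 image (hypothetical scores).
scores = [
    [[0.9, 0.2], [0.1, 0.3]],   # class 0
    [[0.1, 0.8], [0.9, 0.7]],   # class 1
]
labels = argmax_labels(scores)  # [[0, 1], [1, 1]]
```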

// U-Net
Encoder-decoder with skip connections
Encoder: Downsampling (features)
Decoder: Upsampling + skip connections
Popular: Medical imaging, dense prediction

Architecture:
Contracting path: Conv → ReLU → Pool
Expanding path: Upconv → Concat skip → Conv

// Fully Convolutional Networks (FCN)
Replace fully connected with convolutions
Preserve spatial information
Upsampling: Transposed convolution

// DeepLab
Atrous (dilated) convolution
Captures multi-scale context
Atrous spatial pyramid pooling (ASPP)
State-of-the-art segmentation results when released
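
A 1D sketch of atrous convolution: the same three weights are applied with gaps between taps, so the receptive field grows without extra parameters. Function names are illustrative, and DeepLab itself applies this in 2D.

```python
def effective_kernel_size(kernel, rate):
    """Span covered by a kernel whose taps are `rate` positions apart."""
    return kernel + (kernel - 1) * (rate - 1)


def dilated_conv1d(signal, weights, rate=1):
    """Valid 1D convolution with taps spaced `rate` samples apart."""
    span = effective_kernel_size(len(weights), rate)
    return [
        sum(w * signal[start + t * rate] for t, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


signal = [0, 1, 2, 3, 4, 5, 6]
dense = dilated_conv1d(signal, [1, 1, 1], rate=1)    # [3, 6, 9, 12, 15]
dilated = dilated_conv1d(signal, [1, 1, 1], rate=2)  # [6, 9, 12]
span_rate_6 = effective_kernel_size(3, 6)            # 13
```

ASPP runs several such convolutions with different rates in parallel and merges their outputs.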

// Loss functions
Dice loss: 1 - Dice, where Dice = 2×TP/(2×TP + FP + FN)
Focal loss: Down-weights easy examples to handle class imbalance
Cross-entropy + Dice: Often summed into a single loss
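
A sketch of both losses on flat lists, taking Dice loss as 1 minus the Dice coefficient. The soft form used here reduces to 2×TP/(2×TP + FP + FN) for binary masks. The `gamma=2.0` and `alpha=0.25` defaults are commonly used values, assumed here rather than taken from this sheet.

```python
import math


def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient for flat masks (values in [0, 1])."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction; prob = P(class 1)."""
    p_t = prob if target == 1 else 1.0 - prob       # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss for well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
loss = dice_loss(pred, target)   # TP=1, FP=1, FN=0 -> Dice 2/3, loss about 0.333
easy = focal_loss(0.9, 1)        # confident and correct: small loss
hard = focal_loss(0.6, 1)        # less confident: larger loss
```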

// Instance Segmentation (Mask R-CNN)
Combines detection + segmentation
Faster R-CNN + Mask head
Outputs: Bounding box + pixel mask

Pre-trained Models & Transfer Learning

// Architectures
ResNet: Skip connections, 50/101/152 layers
VGG: Simple, uniform (3×3 convs)
Inception/InceptionV4: Multi-scale features
MobileNet: Lightweight, mobile-friendly
EfficientNet: Scales depth/width/resolution
Vision Transformer: Pure attention-based

// Transfer Learning
Task: Use pre-trained model on new task
1. Load pretrained weights (ImageNet usually)
2. Remove classification head
3. Add task-specific head
4. Fine-tune on new data

Advantages:
- Faster convergence
- Better generalization
- Works with small datasets

// Fine-tuning strategies
Feature extraction: Freeze backbone, train head only
Fine-tune: Lower learning rate for backbone
Progressive unfreezing: Unfreeze layers gradually

// Layer-wise learning rates
Early layers: Low LR (keep ImageNet features)
Deep layers: Higher LR (task-specific)
Typical: about 10× difference between adjacent layer groups
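
A sketch of one way to assign those rates: start from the head's learning rate and divide by a fixed factor for each group closer to the input. The helper, the group names, and the `head_lr=1e-3` default are hypothetical. In PyTorch the result would typically be turned into optimizer parameter groups, one `{"params": ..., "lr": ...}` dict per group.

```python
def layerwise_lrs(group_names, head_lr=1e-3, factor=10.0):
    """Map each layer group (ordered input -> output) to a learning rate.

    The last group gets `head_lr`; each earlier group gets `factor` times less.
    """
    n = len(group_names)
    return {
        name: head_lr / factor ** (n - 1 - i)
        for i, name in enumerate(group_names)
    }


groups = ["early_layers", "deep_layers", "head"]   # hypothetical group names
lrs = layerwise_lrs(groups)
# early_layers: about 1e-5, deep_layers: about 1e-4, head: 1e-3
```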

// Data augmentation (critical!)
Random crop, flip, rotate, color jitter
Cutout, mixup: Stronger regularization
AutoAugment, RandAugment: Automated augmentation policies
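
A sketch of two basic augmentations (random crop and horizontal flip) on an image stored as a list of rows. The function names and the `flip_prob` parameter are illustrative; real pipelines usually use a library such as torchvision's transforms.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng, flip_prob=0.5):
    """Random crop followed by a random horizontal flip."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)   # seeded so runs are repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 test image
sample = augment(image, 3, 3, rng)   # a 3x3 crop, possibly mirrored
```

Apply augmentation only to training data, and draw a fresh random sample each epoch.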

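Code (PyTorch):
import torch.nn as nn
import torchvision

# Load ImageNet weights (torchvision >= 0.13; older versions use pretrained=True)
weights = torchvision.models.ResNet50_Weights.DEFAULT
model = torchvision.models.resnet50(weights=weights)
# Freeze backbone
for param in model.parameters():
    param.requires_grad = False
# New head, added after freezing so its parameters stay trainable
model.fc = nn.Linear(model.fc.in_features, num_classes)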

Common CV Tasks

Task	Input	Output	Metrics	Models
Classification	Image	Class	Accuracy, F1	ResNet, EfficientNet
Detection	Image	Boxes + classes	mAP, IoU	YOLO, Faster R-CNN
Segmentation	Image	Pixel labels	IoU, Dice	U-Net, DeepLab
Pose estimation	Image	Joint positions	PCK, OKS	OpenPose, HRNet
Face recognition	Face	Identity	Accuracy, AUC	FaceNet, ArcFace
Optical flow	2 frames	Motion vectors	EPE, Acc	FlowNet, RAFT
3D reconstruction	Multiple views	3D model	Chamfer, F-score	NeRF, MVSNet
Depth estimation	Image/stereo	Depth map	RMSE, δ	MiDaS, Monodepth
