// Images as arrays
Grayscale: 2D array (height × width)
RGB: 3D array (height × width × 3)
Pixel values: 0-255 (uint8) or 0-1 (float)
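A minimal sketch of these layouts, assuming plain nested lists stand in for arrays (in practice a NumPy array or tensor would normally be used). The helper names `shape` and `to_float` are hypothetical.

```python
def shape(arr):
    """Return the dimensions of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(arr, list):
        dims.append(len(arr))
        arr = arr[0]
    return tuple(dims)


def to_float(arr):
    """Rescale uint8-style values (0-255) to floats in [0, 1], at any nesting depth."""
    if isinstance(arr, list):
        return [to_float(a) for a in arr]
    return arr / 255.0


gray = [[0, 128, 255],
        [64, 192, 32]]                             # height=2, width=3
rgb = [[[v, v, v] for v in row] for row in gray]   # replicate gray into 3 channels

gray_shape = shape(gray)      # (height, width)
rgb_shape = shape(rgb)        # (height, width, 3)
gray_float = to_float(gray)   # same layout, values in [0, 1]
```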
// Convolution operation
Filter (kernel): Small matrix (3×3, 5×5)
Slide filter over image
At each position: sum(element-wise product)
Output: Feature map
Example (3×3 filter on 5×5 image):
Without padding: Output 3×3
With padding=1: Output 5×5
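The sliding-window sum can be sketched in plain Python as below. This is an illustration, not an optimized implementation: it assumes a square kernel, zero padding, and 2D nested lists, and the function name `conv2d` is hypothetical. Like most deep learning libraries it does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
def conv2d(image, kernel, padding=0, stride=1):
    """Slide a square `kernel` over `image` (2D lists) and return the feature map."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    # Zero-pad the image on all four sides.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sum of the element-wise product at this filter position.
            row.append(sum(
                padded[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k) for b in range(k)
            ))
        out.append(row)
    return out


image = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5×5 image, values 0..24
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]                      # 3×3 summing filter

valid = conv2d(image, box)             # no padding: 3×3 feature map
same = conv2d(image, box, padding=1)   # padding=1: 5×5 feature map
```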
// Stride & Padding
Stride=1: Move filter 1 pixel at a time
Stride=2: Move 2 pixels (downsampling)
Padding: Add zeros around edges
Same padding: Output size equals input size (at stride 1)
Valid padding: No padding (output shrinks)
Output size = floor((input - kernel + 2×padding) / stride) + 1
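The output-size formula can be checked with a small helper. This sketch uses floor division, which is what frameworks typically do when the stride does not divide evenly; the function names and the example sizes are illustrative assumptions.

```python
def conv_output_size(input_size, kernel, padding=0, stride=1):
    """Output length along one spatial dimension."""
    return (input_size - kernel + 2 * padding) // stride + 1


def same_padding(kernel):
    """Padding that keeps the size at stride 1, assuming an odd kernel size."""
    return (kernel - 1) // 2


sizes = {
    "3x3 on 5, no padding": conv_output_size(5, 3),
    "3x3 on 5, padding=1": conv_output_size(5, 3, padding=1),
    "3x3 on 5, padding=1, stride=2": conv_output_size(5, 3, padding=1, stride=2),
    "7x7 on 224, padding=3, stride=2": conv_output_size(224, 7, padding=3, stride=2),
}
```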
// Pooling
Max pooling: Take maximum
Average pooling: Take average
Typical: 2×2 with stride 2
Reduces size, keeps important features
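A small sketch of 2×2 pooling with stride 2 on a 4×4 feature map, again assuming nested lists. The function name `pool2d` and the sample values are hypothetical.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool a 2D list with a size×size window; mode is "max" or "avg"."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]

max_pooled = pool2d(fmap, mode="max")   # [[4, 2], [2, 8]]
avg_pooled = pool2d(fmap, mode="avg")   # [[2.5, 1.0], [1.25, 6.5]]
```

Each output value summarizes one non-overlapping 2×2 window, so height and width are both halved.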
Computer Vision Cheat Sheet
Image Fundamentals
Object Detection (YOLO, SSD, Faster R-CNN)
// YOLO (You Only Look Once)
Single pass detection
Divide image into grid
Each cell: bounding box + class confidence
Fast (real-time), typically less accurate than two-stage detectors such as Faster R-CNN
Outputs per cell:
- x, y, w, h: bounding box
- confidence: P(object) × IoU (in the original YOLO)
- class probabilities: P(class | object)
Loss: Localization + confidence + classification
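To make the per-cell outputs concrete, the sketch below computes the size of the prediction tensor and a class-specific score. The defaults (7×7 grid, 2 boxes per cell, 20 classes) are taken from the original YOLO configuration and are an assumption here, since this sheet does not state them; the function names and sample values are hypothetical.

```python
def yolo_output_shape(grid=7, boxes_per_cell=2, num_classes=20):
    """Prediction tensor shape: each box carries x, y, w, h and a confidence."""
    return (grid, grid, boxes_per_cell * 5 + num_classes)


def class_scores(confidence, class_probs):
    """Class-specific score for one box: confidence × P(class | object)."""
    return [confidence * p for p in class_probs]


shape = yolo_output_shape()                    # (7, 7, 30)
scores = class_scores(0.8, [0.1, 0.7, 0.2])    # made-up values for one cell
best_class = max(range(len(scores)), key=scores.__getitem__)
```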
// Faster R-CNN
Region-based (2-stage)
1. Region proposal network (RPN)
2. Classification & bbox refinement
More accurate but slower
// SSD (Single Shot MultiBox Detector)
Multi-scale feature maps
Balance between speed & accuracy
Detects at different scales
// IoU (Intersection over Union)
IoU = Area(intersection) / Area(union)
Common threshold: IoU ≥ 0.5 counts as a match
mAP: Mean of per-class average precision (AP) at a fixed IoU threshold; COCO also averages over IoU thresholds 0.50-0.95
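A minimal IoU sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format (other formats such as center plus width and height would need converting first). The box values are made up.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
overlap = iou(pred, truth)    # intersection 50, union 150
is_match = overlap >= 0.5     # below the 0.5 threshold, so not a match
```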
// NMS (Non-Maximum Suppression)
Remove overlapping boxes
Sort by confidence
Iteratively remove lower-confidence overlaps
Reduces duplicate detections
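The steps above can be sketched as a greedy loop. This is a simple version for illustration, assuming `(x1, y1, x2, y2)` boxes and a single class; production code would usually run it per class and use a vectorized library routine.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is suppressed
```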
Semantic Segmentation (FCN, U-Net, DeepLab)
// Semantic Segmentation
Classify each pixel
Output: Same size as input
Each pixel: Class label
// U-Net
Encoder-decoder with skip connections
Encoder: Downsampling (features)
Decoder: Upsampling + skip connections
Popular: Medical imaging, dense prediction
Architecture:
Contracting path: Conv → ReLU → Pool
Expanding path: Upconv → Concat skip → Conv
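The two paths can be traced as shape bookkeeping without building a network. The sketch below assumes size-preserving convolutions, 2×2 pooling, 2× up-convolutions, and channels that double at each level; these are illustrative choices, and the function name `unet_shapes` is hypothetical.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Trace channel counts and spatial size through a U-Net-like network."""
    encoder, channels = [], base_channels
    for _ in range(depth):
        encoder.append((channels, size))   # after Conv → ReLU, saved as the skip
        size //= 2                         # Pool halves the resolution
        channels *= 2
    bottleneck = (channels, size)
    decoder = []
    for skip_channels, skip_size in reversed(encoder):
        size *= 2                          # Upconv doubles the resolution
        channels //= 2                     # and halves the channel count
        assert (channels, size) == (skip_channels, skip_size)
        concat_channels = channels + skip_channels          # Concat skip
        decoder.append((concat_channels, channels, size))   # Conv reduces channels
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
```

Each decoder entry is `(channels after concat, channels after conv, spatial size)`, which shows why the convolution after each concatenation has to take twice as many input channels.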
// Fully Convolutional Networks (FCN)
Replace fully connected with convolutions
Preserve spatial information
Upsampling: Transposed convolution
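For upsampling, the output size of a transposed convolution follows the inverse of the usual convolution formula. The sketch below assumes dilation 1 and uses hypothetical function names; the kernel 4, stride 2, padding 1 setting is just a common example of exact 2× upsampling.

```python
def transposed_conv_output_size(input_size, kernel, stride=1, padding=0,
                                output_padding=0):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (input_size - 1) * stride - 2 * padding + kernel + output_padding


def conv_output_size(input_size, kernel, padding=0, stride=1):
    """Output length of a regular convolution, for the round-trip check."""
    return (input_size - kernel + 2 * padding) // stride + 1


upsampled = transposed_conv_output_size(16, kernel=4, stride=2, padding=1)   # 16 → 32
round_trip = conv_output_size(upsampled, kernel=4, padding=1, stride=2)      # 32 → 16
```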
// DeepLab
Atrous (dilated) convolution
Captures multi-scale context
Atrous spatial pyramid pooling (ASPP)
State-of-the-art segmentation results at release
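Dilation widens the area a filter covers without adding weights, which is how parallel branches with different rates see context at several scales. The sketch below computes the effective kernel size per rate; the rates listed are illustrative values, and the function names are hypothetical.

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel along one dimension."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding_for_dilated(kernel, dilation):
    """Padding that keeps the size at stride 1, assuming an odd kernel size."""
    return dilation * (kernel - 1) // 2


rates = [1, 6, 12, 18]   # illustrative rates for parallel ASPP-style branches
spans = {rate: effective_kernel_size(3, rate) for rate in rates}
pads = {rate: same_padding_for_dilated(3, rate) for rate in rates}
```

A 3×3 kernel at rate 6 spans 13 pixels per side while still using only 9 weights.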
// Loss functions
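Dice coefficient: 2×TP / (2×TP + FP + FN); Dice loss = 1 - Dice coefficient
Focal loss: Down-weights easy examples to handle class imbalance
Cross-entropy + Dice: Common combined loss for segmentation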
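A minimal sketch of these losses for binary segmentation, assuming flattened lists of predicted probabilities and 0/1 targets. It takes the Dice loss as one minus the soft Dice coefficient, uses the commonly quoted focal defaults gamma=2 and alpha=0.25, and weights the combined terms equally; all of these are assumptions rather than fixed conventions.

```python
import math


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 - 2·sum(p·t) / (sum(p) + sum(t))."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean binary focal loss: -alpha_t · (1 - p_t)^gamma · log(p_t)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(probs)


def bce_dice_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy plus Dice loss, equally weighted."""
    bce = -sum(
        t * math.log(max(p, 1e-12)) + (1 - t) * math.log(max(1.0 - p, 1e-12))
        for p, t in zip(probs, targets)
    ) / len(probs)
    return bce + dice_loss(probs, targets, eps)


probs = [0.9, 0.2, 0.7, 0.1]
targets = [1, 0, 1, 0]
losses = (dice_loss(probs, targets), focal_loss(probs, targets),
          bce_dice_loss(probs, targets))
```

With hard 0/1 predictions the soft Dice coefficient reduces to 2×TP / (2×TP + FP + FN).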
// Instance Segmentation (Mask R-CNN)
Combines detection + segmentation
Faster R-CNN + Mask head
Outputs: Bounding box + pixel mask
Pre-trained Models & Transfer Learning
// Architectures
ResNet: Skip connections, 50/101/152 layers
VGG: Simple, uniform (3×3 convs)
Inception/InceptionV4: Multi-scale features
MobileNet: Lightweight, mobile-friendly
EfficientNet: Scales depth/width/resolution
Vision Transformer: Pure attention-based
// Transfer Learning
Goal: Reuse a pre-trained model for a new task
1. Load pretrained weights (ImageNet usually)
2. Remove classification head
3. Add task-specific head
4. Fine-tune on new data
Advantages:
- Faster convergence
- Better generalization
- Works with small datasets
// Fine-tuning strategies
Feature extraction: Freeze backbone, train head only
Fine-tune: Lower learning rate for backbone
Progressive unfreezing: Unfreeze layers gradually
// Layer-wise learning rates
Early layers: Low LR (keep ImageNet features)
Deep layers: Higher LR (task-specific)
Typical: ~10× lower LR for the earliest layers than for the head
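One way to assign layer-wise learning rates is to space them geometrically between the earliest group and the head. The sketch below reads the 10× figure as the total gap between those two ends, which is an assumption; the group names follow ResNet-style naming and the function name is hypothetical.

```python
def layerwise_lrs(group_names, head_lr=1e-3, ratio=10.0):
    """Geometrically spaced learning rates from head_lr / ratio up to head_lr."""
    n = len(group_names)
    lrs = {}
    for i, name in enumerate(group_names):
        # i = 0 is the earliest group, i = n - 1 is the head.
        frac = i / (n - 1) if n > 1 else 1.0
        lrs[name] = head_lr / ratio ** (1.0 - frac)
    return lrs


groups = ["stem", "layer1", "layer2", "layer3", "layer4", "fc"]
lrs = layerwise_lrs(groups)   # smallest LR for "stem", largest for "fc"
```

In PyTorch the resulting mapping could feed optimizer parameter groups, i.e. a list of dicts each holding `params` and `lr`.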
// Data augmentation (critical!)
Random crop, flip, rotate, color jitter
Cutout, mixup, augmentation policies
AutoAugment, RandAugment: Automated augmentation policies
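Three of these augmentations are simple enough to sketch on nested lists. This is an illustration only: real pipelines apply them to tensors through a library, the function names are hypothetical, and the Beta(0.2, 0.2) draw for the mixup weight is an assumed setting.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image."""
    return [row[::-1] for row in image]


def cutout(image, top, left, size, fill=0):
    """Return a copy with a size×size square replaced by `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = fill
    return out


def mixup(image_a, image_b, label_a, label_b, lam):
    """Blend two images and their one-hot labels with weight `lam`."""
    image = [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


rng = random.Random(0)                  # seeded for repeatability
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flipped = horizontal_flip(img) if rng.random() < 0.5 else img
erased = cutout(img, top=0, left=0, size=2)
lam = rng.betavariate(0.2, 0.2)         # mixup weight in [0, 1]
mixed, mixed_label = mixup(img, horizontal_flip(img), [1, 0], [0, 1], lam)
```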
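Code (PyTorch):
import torch.nn as nn
import torchvision
num_classes = 10 # Example value; use your dataset's class count
# torchvision >= 0.13: `weights=` replaces the deprecated `pretrained=True`
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
# Freeze backbone
for param in model.parameters():
    param.requires_grad = False
# New head (created after freezing, so its parameters stay trainable)
model.fc = nn.Linear(model.fc.in_features, num_classes)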
Common CV Tasks
| Task | Input | Output | Metrics | Models |
|---|---|---|---|---|
| Classification | Image | Class | Accuracy, F1 | ResNet, EfficientNet |
| Detection | Image | Boxes + classes | mAP, IoU | YOLO, Faster R-CNN |
| Segmentation | Image | Pixel labels | IoU, Dice | U-Net, DeepLab |
| Pose estimation | Image | Joint positions | PCK, OKS | OpenPose, HRNet |
| Face recognition | Face | Identity | Accuracy, AUC | FaceNet, ArcFace |
| Optical flow | 2 frames | Motion vectors | EPE, Acc | FlowNet, RAFT |
| 3D reconstruction | Multiple views | 3D model | Chamfer, F-score | NeRF, MVSNet |
| Depth estimation | Image/stereo | Depth map | RMSE, δ < 1.25 | MiDaS, Monodepth |