NLP Essentials Cheat Sheet
Text Processing
// Tokenization
Word tokenization: Split on spaces/punctuation
Subword tokenization: BPE, WordPiece, SentencePiece
Character-level: Split into characters
Sentence tokenization: Split on periods, etc.
Popular tokenizers:
spaCy: Rule-based, fast
NLTK: Classic, many options
Hugging Face tokenizers: Fast (Rust-backed), matches each pretrained model's vocabulary
Code example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Hello world!")
# Output: ['hello', 'world', '!']
// Special tokens
[CLS]: Classification token (start)
[SEP]: Separator between sentences
[PAD]: Padding token
[UNK]: Unknown token
[MASK]: Masked token (for pretraining)
// Padding & Truncation
Pad shorter sequences to max length
Truncate longer sequences
Attention mask: 1 for real tokens, 0 for padding
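The padding rules above can be sketched in plain Python (a minimal illustration, not the Hugging Face implementation; `pad_id=0` is an assumption, real tokenizers expose their own pad token id):

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id sequences to the batch max length; return ids and attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)            # pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad) # 1 = real token, 0 = padding
    return input_ids, attention_mask
```

In practice `tokenizer(texts, padding=True, truncation=True)` does this for you.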
Word Embeddings
// Embeddings: Dense vector representation
Word2Vec: Skip-gram, CBOW
GloVe: Factorizes a global co-occurrence matrix built from context windows
FastText: Subword embeddings, handles OOV
ELMo: Contextual from biLSTM
BERT/GPT: Contextual from Transformer
// Word2Vec (Skip-gram)
Input: Center word → Output: Context words
Uses negative sampling for efficiency
Learns: P(context | center_word)
Result: 300-dim vectors capturing semantics
// Properties
Similar words → close vectors
Vector arithmetic:
king - man + woman ≈ queen
paris - france + germany ≈ berlin
// Dimensionality
100-300 for most tasks
Larger: More capacity, slower
Smaller: Faster, less overfitting
// Similarity metrics
Cosine similarity: cos(θ) = A·B / (|A||B|)
Euclidean distance: √(Σ(aᵢ - bᵢ)²)
Manhattan distance: Σ|aᵢ - bᵢ|
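The three metrics above, written out in plain Python (a stdlib-only sketch; libraries like scipy or sklearn provide optimized versions):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = A.B / (|A||B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """sqrt(sum((a_i - b_i)^2))"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """sum(|a_i - b_i|)"""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine similarity is the usual choice for embeddings because it ignores vector magnitude.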
Code example:
from gensim.models import Word2Vec
model = Word2Vec(sentences, vector_size=300, window=5)
vector = model.wv['king']
similar = model.wv.most_similar('king', topn=5)
Transformer Architecture
// Components
Multi-Head Attention: Learn different relationships
Feed-Forward: Dense layers for transformation
LayerNorm: Normalize around each sublayer (post-norm in the original Transformer; pre-norm in most modern LLMs)
Residual connections: x + f(x)
// Self-Attention
Q = W_Q × X [queries]
K = W_K × X [keys]
V = W_V × X [values]
Attention = softmax(Q×K^T / √d_k) × V
Output = linear(Attention)
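The attention formula above as a NumPy sketch (assumes Q, K, V are already projected, shape `(seq_len, d_k)`; omits masking and the output projection):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weighted sum of values
```

Each output row is a convex combination of the value vectors, weighted by query-key similarity.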
// Multi-Head
Divide into h heads, apply attention separately
Concatenate and project
Allows learning different relationships
// Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Added to token embeddings; injects word order, since attention alone is permutation-invariant
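The two PE formulas above, computed directly (a stdlib-only sketch; here `i` indexes the even embedding dimension, i.e. the `2i` of the formula):

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal PE: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)         # PE(pos, 2i)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle) # PE(pos, 2i+1)
    return pe
```

Different dimensions oscillate at different frequencies, so each position gets a unique pattern.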
// Encoder (BERT, RoBERTa)
Stack of encoder blocks
Each: MultiHead Attention → Add & Norm → FFN → Add & Norm
Bidirectional: Can attend to left AND right
// Decoder (GPT, LLaMA)
Stack of decoder blocks
Causal attention: Can only attend to previous tokens
Left-to-right: Generate one token at a time
Unidirectional
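Causal attention is implemented with a lower-triangular mask; a minimal sketch (1 = may attend, 0 = blocked; real implementations add -inf to blocked scores before softmax):

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
```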
// Encoder-Decoder (T5, BART)
Encoder: Process input
Decoder: Generate output, attends to encoder
Cross-attention: Decoder attends to encoder
BERT, GPT & Fine-tuning
// BERT (Bidirectional Encoder Representations)
Pretraining objectives:
1. Masked Language Model: Predict [MASK] tokens
2. Next Sentence Prediction: Is sentence B after A?
Architecture:
BERT-base: 12 layers, 768 hidden, 12 heads; BERT-large: 24 layers, 1024 hidden, 16 heads
Usage:
[CLS] → Classification pooling
[CLS] sent1 [SEP] sent2 → Sentence pair tasks
Fine-tuning:
Add task-specific layer on top
Train on labeled data (small datasets OK)
// GPT (Generative Pre-trained Transformer)
Pretraining: Language modeling (predict next token)
Unidirectional: Left-to-right only
Larger models → better performance
GPT-2: 1.5B params
GPT-3: 175B params (few-shot learning)
GPT-4: Even larger, multimodal
Usage:
Text generation, few-shot learning, instruction-following
// Fine-tuning Best Practices
Learning rate: Lower (1e-5 to 5e-5)
Batch size: Small (8, 16)
Epochs: 3-5 (avoid overfitting)
Warmup: Linear warmup helpful
Early stopping: Monitor validation loss
Code example (Hugging Face; assumes train_dataset / eval_dataset are prepared):
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="out", num_train_epochs=3, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
Seq2Seq & Attention
// Sequence-to-Sequence
Encoder: RNN/Transformer reads input
Context vector: Encoder's final hidden state
Decoder: RNN/Transformer generates output
Uses context vector and previous tokens
// Attention Mechanism
Problem: Context vector bottleneck
Solution: Decoder attends to all encoder states
Bahdanau attention (additive):
score = v^T tanh(W[h_dec; h_enc])
Luong attention (multiplicative):
score = h_dec^T W h_enc (general) or h_dec^T h_enc (dot)
// Beam Search (decoding)
Don't greedily pick top-1 token
Keep k candidates (beams) at each step
Final: Return best by log-probability
// Temperature sampling
Softmax(logits / T)
T < 1: Sharp (argmax-like)
T = 1: Normal
T > 1: Flat (random-like)
Higher T → More diverse but less coherent
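Temperature scaling in a few lines (stdlib-only sketch; subtracting the max is a standard numerical-stability trick):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """softmax(logits / T): T < 1 sharpens, T > 1 flattens the distribution."""
    scaled = [l / T for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from the flattened (high-T) distribution yields more diverse, less predictable text.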
// Applications
Machine translation: English → Spanish
Summarization: Long → Short
Question answering: Question → Answer
Image captioning: Image → Description
Evaluation Metrics
// BLEU (Bilingual Evaluation Understudy)
Compares n-grams with reference translation
BLEU = BP × exp(Σ w_n × log(p_n))
Score 0-1, higher is better
Common in machine translation
Issues: Doesn't capture semantics
Code:
from nltk.translate.bleu_score import sentence_bleu
reference = [["the", "cat"]]
hypothesis = ["the", "cat"]
# weight only uni/bigrams; the default 4-gram weights score 0 on such short sequences
score = sentence_bleu(reference, hypothesis, weights=(0.5, 0.5))  # 1.0
// ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE-1: Unigram overlap
ROUGE-2: Bigram overlap
ROUGE-L: Longest common subsequence
Common in summarization
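ROUGE-L can be sketched directly from its LCS definition (a minimal stdlib version; packages like `rouge-score` handle stemming and sentence splitting):

```python
def rouge_l_f1(reference, hypothesis, beta=1.0):
    """ROUGE-L F-score from the longest common subsequence of token lists."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = LCS length of reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if reference[i] == hypothesis[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall = lcs / m        # recall against the reference
    precision = lcs / n     # precision of the hypothesis
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```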
// METEOR (Metric for Evaluation of Translation with Explicit ORdering)
Considers synonyms, stemming
Correlates better with human judgment than BLEU
// Perplexity
PP = exp(-1/N × Σ log(P(y_i)))
Lower is better
Measures how well probability distribution predicts text
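The perplexity formula above, computed from per-token probabilities (a minimal sketch; assumes the model's probability for each observed token is given):

```python
import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum(log p_i)) over the model's per-token probabilities."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

Intuition: uniform probability 1/k on every token gives perplexity exactly k.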
// Exact Match & F1
Used in QA tasks
Exact Match: Did we predict exact answer?
F1: Token-level precision/recall
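Both QA metrics in a few lines (a simplified sketch; SQuAD-style evaluation also strips articles and punctuation during normalization):

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if prediction equals the gold answer after simple normalization."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```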
// Human Evaluation
Best but expensive
Fluency, adequacy, relevance ratings
Always valuable for final assessment