Understanding Large Language Models

A deep dive into the mathematics and engineering behind modern AI. From raw text to transformers, from training to advanced techniques like RLHF and multimodal models. The goal is not just to use these models, but to understand how and why they work.

Part 1: Introduction

From Symbols to Numbers: Fundamentals

Everything starts with a simple idea: computers only see numbers. Characters, words, images, and even "meaning" all need to be represented as vectors and matrices before a model can work with them. In this section, you'll master the linear algebra and data representations that power modern AI.

$$\begin{bmatrix} a & b & c \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = ax + by + cz$$
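
As a concrete illustration, here is a minimal NumPy sketch (the string, the one-hot size, and the random layer weights are placeholders chosen for illustration) of how text becomes integers and how a layer applies the row-times-column product above to every output dimension at once:

```python
import numpy as np

# A string is just bytes, and bytes are just integers (0-255).
text = "Hello"
ids = list(text.encode("utf-8"))       # [72, 101, 108, 108, 111]

# One-hot encoding: each integer becomes a 256-dimensional vector.
one_hot = np.zeros((len(ids), 256))
one_hot[np.arange(len(ids)), ids] = 1.0

# A "layer" is a matrix; applying it is the row-times-column product
# shown above, repeated for every output dimension.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 8))          # 256 inputs -> 8 outputs
features = one_hot @ W                 # shape: (5, 8)
print(features.shape)
```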

Key Concepts

  • Vectors, matrices, and matrix multiplication
  • Text as bytes, integers, and one-hot vectors
  • Neural network layers as matrix transformations
  • The geometry of high-dimensional spaces

Questions to Explore

  • How does "Hello" become computable?
  • Why high-dimensional spaces for language?
  • What is mathematical "understanding"?
  • How do embeddings capture meaning?

Part 2: The Transformer Architecture

Transformers are built from a small set of powerful components: tokenization, embeddings, attention, and feed-forward layers. In this part, you'll dissect each piece and see how they work together to process sequences of text (and later, other modalities).

Tokenization: How Words Become Numbers

Before a model can "read" anything, text must be broken into tokens and mapped to integer IDs. Tokenization choices affect everything: model performance, multilingual behavior, and even what the model struggles with.

🇬🇧 English: 87 characters, 14 tokens
"You will learn why asking the same question in different languages yields different results"
Token IDs: 1639, 481, 2193, 1521, 4737, 262, 976, 1808, 287, 1180, 8950, 19299, 1180, 2482

🇿🇦 Xhosa: 103 characters, 50 tokens
"Uya kufunda ukuba kutheni ukubuza umbuzo ofanayo iilwimi ezahlukeneyo kunika iziphumo ezahlukeneyo" (the same sentence in Xhosa)
Token IDs: 52, 3972, 479, 3046, 46535, 334, 74, 22013, 479, 315, 831, 72, 334, 74, 549, 84, 4496, 20810, 10277, 78, 286, 272, 323, 78, 1312, 346, 86, 25236, 304, 89, 993, 2290, 74, 1734, 8226, 479, 403, 9232, 220, 528, 13323, 43712, 304, 89, 993, 2290, 74, 1734, 8226
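
You can reproduce counts like these yourself. A minimal sketch using the open-source tiktoken library with its GPT-2 encoding (an assumption about which tokenizer produced the IDs above; other tokenizers give different splits and counts):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "You will learn why asking the same question in different languages yields different results"
xhosa = "Uya kufunda ukuba kutheni ukubuza umbuzo ofanayo iilwimi ezahlukeneyo kunika iziphumo ezahlukeneyo"

for label, text in [("English", english), ("Xhosa", xhosa)]:
    ids = enc.encode(text)
    print(f"{label}: {len(text)} characters -> {len(ids)} tokens")
    # enc.decode([i]) for each id shows which characters every token covers
```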

Key Insight

Xhosa needs roughly 3.5× more tokens than English (50 vs. 14) for the same sentence. This tokenization bias directly impacts model performance, cost, and speed for underrepresented languages.

Core Concepts

  • Subword tokenization (BPE, WordPiece)
  • Speed vs. vocabulary size trade-offs
  • Multilingual tokenization challenges
  • Character vs. word-level approaches

Critical Questions

  • Why not just split on spaces?
  • How does tokenization affect fairness?
  • What happens with unseen languages?
  • Can we build universal tokenizers?

Embeddings: From Tokens to Meaning

Once we have token IDs, we map each token to a vector in continuous space. These embedding vectors capture semantic relationships that let the model reason about similarity and analogy.

$$\text{King} - \text{Man} + \text{Woman} \approx \text{Queen}$$ $$\text{Paris} - \text{France} + \text{Japan} \approx \text{Tokyo}$$
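
A minimal PyTorch sketch of the mechanics (the vocabulary size, token IDs, and dimensions are placeholders, and the embedding table is randomly initialized; analogies like the ones above only emerge after training):

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 768          # typical orders of magnitude
embedding = nn.Embedding(vocab_size, dim)

# Look up vectors for (hypothetical) token IDs.
king, man, woman, queen = embedding(torch.tensor([10, 11, 12, 13]))

# The analogy is plain vector arithmetic plus cosine similarity.
candidate = king - man + woman
similarity = torch.cosine_similarity(candidate, queen, dim=0)
print(similarity.item())               # meaningful only after training
```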

Core Concepts

  • Token → Vector mapping matrices
  • 768-4096 dimensional spaces
  • Positional encodings for word order
  • Semantic geometry in vector space

Deep Questions

  • What is geometric "similarity"?
  • Why do analogies work as vector math?
  • How do embeddings capture meaning?
  • Can embeddings be universal?

Attention: The Breakthrough That Changed Everything

Attention lets the model dynamically focus on relevant parts of the input. Every position can look at every other position with learned weights, which is the key innovation that made modern LLMs possible.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
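
The formula maps almost line-for-line to code. A minimal single-head sketch in PyTorch (sequence length and d_k are illustrative; real implementations add masking, multiple heads, and dropout):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) for a single head
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq, seq)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # (seq, d_k)

seq_len, d_k = 6, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([6, 64])
```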

Core Mechanisms

  • Queries, Keys, Values (Q, K, V)
  • Self vs cross-attention
  • Multi-head parallel processing
  • O(n²) scaling challenges

Key Insights

  • How attention tracks relationships
  • Why heads specialize differently
  • Attention patterns as interpretability
  • Long-context limitations

Feed-Forward Networks: The Hidden Brain

After attention mixes information across positions, each token is passed through a feed-forward network. These layers add non-linearity and act as the model's "memory store."

$$\text{ReLU: } f(x) = \max(0, x) \quad \text{GELU: } f(x) = x \cdot \Phi(x) \quad \text{SiLU: } f(x) = x \cdot \sigma(x)$$
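
A minimal PyTorch sketch of one feed-forward block, following the common 4× expansion and GELU convention described in this section (the 768-dimensional width is illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # expand: 768 -> 3072
            nn.GELU(),                                # the non-linearity
            nn.Linear(expansion * d_model, d_model),  # project back: 3072 -> 768
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently to every token position.
        return self.net(x)

ffn = FeedForward()
tokens = torch.randn(6, 768)   # 6 tokens, 768 dimensions each
print(ffn(tokens).shape)       # torch.Size([6, 768])
```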

Architecture

  • Why non-linearity is essential
  • 4× width expansion pattern
  • Activation function choices
  • FFN as key-value memory

Key Questions

  • What happens without non-linearity?
  • Do individual neurons correspond to concepts?
  • Why always 4× wider?
  • Is the FFN memory or computation?

Part 3: Training

Now that we have the architecture, how do we make it learn? This part covers the complete training pipeline: data preparation, optimization algorithms, regularization techniques, and the practical tricks that make the difference between success and failure.

Datasets and Evaluation

Data quality determines model capability. You'll master data curation, evaluation metrics, and how to avoid the pitfalls that cause real-world failures.

Data Engineering

  • Quality filtering techniques
  • Train/val/test splitting
  • Perplexity, BLEU, accuracy
  • Preventing data leakage
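
Perplexity, listed above, is simply the exponential of the average cross-entropy loss. A minimal sketch (the logits and labels are random stand-ins for real model outputs and ground-truth next tokens):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs and ground-truth next tokens.
vocab_size, num_tokens = 50_000, 128
logits = torch.randn(num_tokens, vocab_size)
labels = torch.randint(0, vocab_size, (num_tokens,))

cross_entropy = F.cross_entropy(logits, labels)  # average negative log-likelihood
perplexity = torch.exp(cross_entropy)
print(perplexity.item())  # on the order of vocab_size for random logits; far lower for a trained model
```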

Critical Issues

  • Domain-specific data needs
  • Overfitting vs generalization
  • Hidden dataset biases
  • Evaluation metric gaming

Optimizers and Learning Rates

Plain gradient descent is rarely enough for modern deep learning. You'll master advanced optimizers that adapt learning rates per-parameter, momentum techniques that accelerate convergence, and scheduling strategies that help navigate complex loss landscapes.

Optimization Algorithms:

  • SGD, Momentum, and Nesterov acceleration
  • Adam, AdamW, and why weight decay matters
  • Learning rate schedules: warmup, cosine annealing, exponential decay
  • Gradient clipping and dealing with exploding gradients
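
A minimal sketch of a common recipe, AdamW with linear warmup into cosine decay, in PyTorch (the stand-in model, step counts, learning rate, and weight decay are illustrative placeholders):

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(768, 768)   # stand-in for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:   # linear warmup from 0 to the peak learning rate
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after each optimizer.step():
#     scheduler.step()
# Gradient clipping guards against exploding gradients:
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```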

Practical Insights:

  • What happens if your learning rate is 10× too high? 10× too low?
  • Why do we use warmup at the start of training large transformers?
  • How do you diagnose and fix training instabilities?

Regularization & Preventing Overfitting

Large models can easily memorize training data instead of learning generalizable patterns. You'll learn the techniques that encourage models to learn robust representations that transfer to new data.

Regularization Techniques:

  • Dropout and its variants (DropConnect, DropBlock)
  • Weight decay and L2 regularization
  • Data augmentation strategies
  • Early stopping and checkpoint selection
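
A minimal PyTorch sketch of how dropout, weight decay, and early stopping typically fit together (the network, the placeholder validation function, and the patience value are all illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(                 # stand-in for a real network
    nn.Linear(768, 3072), nn.GELU(),
    nn.Dropout(p=0.1),                 # randomly zero 10% of activations (training only)
    nn.Linear(3072, 768),
)
# Weight decay (decoupled L2) lives in the optimizer, not the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def validation_loss(model: nn.Module) -> float:
    return torch.rand(1).item()        # placeholder for a real validation pass

# Early stopping: keep the checkpoint with the best validation loss.
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.train()                      # dropout active
    # ... one epoch of training would go here ...
    model.eval()                       # dropout disabled for evaluation
    val_loss = validation_loss(model)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_checkpoint.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # validation stopped improving
            break
```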

Diagnosing Overfitting:

  • Reading loss curves: what do they tell you?
  • The generalization gap: train vs validation performance
  • When is memorization actually desirable?

Part 4: Advanced Techniques

Once you can build and train base models, the real power comes from adaptation and alignment. This part covers the frontier techniques used in production systems: fine-tuning, RLHF, multimodal models, and massive scale.

Fine-Tuning & Adaptation

  • LoRA & parameter-efficient methods
  • Instruction tuning strategies
  • Avoiding catastrophic forgetting
  • When to fine-tune vs train
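
A minimal sketch of the LoRA idea, a frozen pretrained weight plus a small trainable low-rank update (rank, scaling, and dimensions are illustrative; production code would typically use an existing library rather than a hand-rolled layer):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full output plus the low-rank correction B @ A, scaled.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)             # torch.Size([4, 768])
```

Only A and B are trained, so the number of trainable parameters drops from d² to roughly 2·d·rank per adapted layer.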

RLHF & Alignment

  • Human preference learning
  • Reward model training
  • PPO policy optimization
  • Helpfulness vs safety balance
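
At the heart of reward model training is a pairwise preference loss. A minimal sketch of the Bradley-Terry-style objective (the reward scores are random stand-ins for a reward model's scores of chosen and rejected responses):

```python
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores of human-preferred ("chosen")
# and dispreferred ("rejected") responses to the same prompts.
reward_chosen = torch.randn(32, requires_grad=True)
reward_rejected = torch.randn(32)

# Train the model to score chosen responses above rejected ones:
# loss = -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # in real training, gradients flow back into the reward model
print(loss.item())
```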

Multimodal Models

  • Vision transformers & CLIP
  • Cross-modal attention
  • Text + image + audio fusion
  • Unified representation learning
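
A minimal sketch of a CLIP-style contrastive objective that aligns image and text embeddings (the embeddings are random placeholders for encoder outputs, and the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

batch = 8
image_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # from an image encoder
text_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # from a text encoder

temperature = 0.07
logits = image_emb @ text_emb.T / temperature              # pairwise similarities

# Matching image/text pairs sit on the diagonal; train in both directions.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```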

Mixture of Experts

  • Sparse routing mechanisms
  • Trillion parameters, lower cost
  • Emergent capabilities at scale
  • Distributed training strategies
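
A minimal sketch of top-k sparse routing, the core idea behind mixture-of-experts layers (expert count, k, and sizes are illustrative; real systems add load-balancing losses and spread experts across devices):

```python
import torch
import torch.nn as nn

num_experts, d_model, top_k = 8, 768, 2
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
router = nn.Linear(d_model, num_experts)   # scores each token for each expert

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    # x: (num_tokens, d_model)
    gate_logits = router(x)                               # (tokens, experts)
    weights, indices = gate_logits.topk(top_k, dim=-1)    # pick 2 experts per token
    weights = torch.softmax(weights, dim=-1)              # normalize the chosen gates
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = indices[:, slot] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(16, d_model)
print(moe_layer(tokens).shape)   # torch.Size([16, 768]); only 2 of 8 experts run per token
```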