Understanding Large Language Models

A deep dive into the mathematics and engineering behind modern AI. From raw text to transformers, from training to advanced techniques like RLHF and multimodal models. The goal is not just to use these models, but to understand how and why they work.

Part 1: Introduction

From Symbols to Numbers: Fundamentals

Everything starts with a simple idea: computers only see numbers. Characters, words, images, and even "meaning" all need to be represented as vectors and matrices before a model can work with them. In this section, you'll master the linear algebra and data representations that power modern AI.

$$\begin{bmatrix} a & b & c \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = ax + by + cz$$
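
As a concrete illustration, here is a minimal NumPy sketch (the string, the one-hot size, and the random layer weights are placeholders chosen for illustration) of how text becomes integers and how a layer applies the row-times-column product above to every output dimension at once:

```python
import numpy as np

# A string is just bytes, and bytes are just integers (0-255).
text = "Hello"
ids = list(text.encode("utf-8"))       # [72, 101, 108, 108, 111]

# One-hot encoding: each integer becomes a 256-dimensional vector.
one_hot = np.zeros((len(ids), 256))
one_hot[np.arange(len(ids)), ids] = 1.0

# A "layer" is a matrix; applying it is the row-times-column product
# shown above, repeated for every output dimension.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 8))          # 256 inputs -> 8 outputs
features = one_hot @ W                 # shape: (5, 8)
print(features.shape)
```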

Key Concepts

  • Vectors, matrices, and matrix multiplication
  • Text as bytes, integers, and one-hot vectors
  • Neural network layers as matrix transformations
  • The geometry of high-dimensional spaces

Questions to Explore

  • How does "Hello" become computable?
  • Why high-dimensional spaces for language?
  • What is mathematical "understanding"?
  • How do embeddings capture meaning?

Part 2: The Transformer Architecture

Transformers are built from a small set of powerful components: tokenization, embeddings, attention, and feed-forward layers. In this part, you'll dissect each piece and see how they work together to process sequences of text (and later, other modalities).

Tokenization: How Words Become Numbers

Before a model can "read" anything, text must be broken into tokens and mapped to integer IDs. Tokenization choices affect everything: model performance, multilingual behavior, and even what the model struggles with.

🇬🇧 English: 87 characters, 14 tokens
"You will learn why asking the same question in different languages yields different results"
Token IDs: 1639, 481, 2193, 1521, 4737, 262, 976, 1808, 287, 1180, 8950, 19299, 1180, 2482

🇿🇦 Xhosa: 103 characters, 50 tokens
"Uya kufunda ukuba kutheni ukubuza umbuzo ofanayo iilwimi ezahlukeneyo kunika iziphumo ezahlukeneyo" (the same sentence in Xhosa)
Token IDs: 52, 3972, 479, 3046, 46535, 334, 74, 22013, 479, 315, 831, 72, 334, 74, 549, 84, 4496, 20810, 10277, 78, 286, 272, 323, 78, 1312, 346, 86, 25236, 304, 89, 993, 2290, 74, 1734, 8226, 479, 403, 9232, 220, 528, 13323, 43712, 304, 89, 993, 2290, 74, 1734, 8226
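
You can reproduce counts like these yourself. A minimal sketch using the open-source tiktoken library with its GPT-2 encoding (an assumption about which tokenizer produced the IDs above; other tokenizers give different splits and counts):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

english = "You will learn why asking the same question in different languages yields different results"
xhosa = "Uya kufunda ukuba kutheni ukubuza umbuzo ofanayo iilwimi ezahlukeneyo kunika iziphumo ezahlukeneyo"

for label, text in [("English", english), ("Xhosa", xhosa)]:
    ids = enc.encode(text)
    print(f"{label}: {len(text)} characters -> {len(ids)} tokens")
    # enc.decode([i]) for each id shows which characters every token covers
```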

Key Insight

Xhosa needs roughly 3.5× more tokens than English (50 vs. 14) for the same sentence. This tokenization bias directly impacts model performance, cost, and speed for underrepresented languages.

Core Concepts

  • Subword tokenization (BPE, WordPiece)
  • Speed vs. vocabulary size trade-offs
  • Multilingual tokenization challenges
  • Character vs. word-level approaches

Critical Questions

  • Why not just split on spaces?
  • How does tokenization affect fairness?
  • What happens with unseen languages?
  • Can we build universal tokenizers?

Embeddings: From Tokens to Meaning

Once we have token IDs, we map each token to a vector in continuous space. These embedding vectors capture semantic relationships that let the model reason about similarity and analogy.

$$\text{King} - \text{Man} + \text{Woman} \approx \text{Queen}$$ $$\text{Paris} - \text{France} + \text{Japan} \approx \text{Tokyo}$$
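
A minimal PyTorch sketch of the mechanics (the vocabulary size, token IDs, and dimensions are placeholders, and the embedding table is randomly initialized; analogies like the ones above only emerge after training):

```python
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 768          # typical orders of magnitude
embedding = nn.Embedding(vocab_size, dim)

# Look up vectors for (hypothetical) token IDs.
king, man, woman, queen = embedding(torch.tensor([10, 11, 12, 13]))

# The analogy is plain vector arithmetic plus cosine similarity.
candidate = king - man + woman
similarity = torch.cosine_similarity(candidate, queen, dim=0)
print(similarity.item())               # meaningful only after training
```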

Core Concepts

  • Token → Vector mapping matrices
  • 768-4096 dimensional spaces
  • Positional encodings for word order
  • Semantic geometry in vector space

Deep Questions

  • What is geometric "similarity"?
  • Why do analogies work as vector math?
  • How do embeddings capture meaning?
  • Can embeddings be universal?

Attention: The Breakthrough That Changed Everything

Attention lets the model dynamically focus on relevant parts of the input. Every position can look at every other position with learned weights, which is the key innovation that made modern LLMs possible.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
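
The formula maps almost line-for-line to code. A minimal single-head sketch in PyTorch (sequence length and d_k are illustrative; real implementations add masking, multiple heads, and dropout):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) for a single head
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq, seq)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # (seq, d_k)

seq_len, d_k = 6, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([6, 64])
```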

Core Mechanisms

  • Queries, Keys, Values (Q, K, V)
  • Self vs cross-attention
  • Multi-head parallel processing
  • O(n²) scaling challenges

Key Insights

  • How attention tracks relationships
  • Why heads specialize differently
  • Attention patterns as interpretability
  • Long-context limitations

Feed-Forward Networks: The Hidden Brain

After attention mixes information across positions, each token is passed through a feed-forward network. These layers add non-linearity and act as the model's "memory store."

$$\text{ReLU: } f(x) = \max(0, x) \quad \text{GELU: } f(x) = x \cdot \Phi(x) \quad \text{SiLU: } f(x) = x \cdot \sigma(x)$$
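
A minimal PyTorch sketch of one feed-forward block, following the common 4× expansion and GELU convention described in this section (the 768-dimensional width is illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # expand: 768 -> 3072
            nn.GELU(),                                # the non-linearity
            nn.Linear(expansion * d_model, d_model),  # project back: 3072 -> 768
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently to every token position.
        return self.net(x)

ffn = FeedForward()
tokens = torch.randn(6, 768)   # 6 tokens, 768 dimensions each
print(ffn(tokens).shape)       # torch.Size([6, 768])
```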

Architecture

  • Why non-linearity is essential
  • 4× width expansion pattern
  • Activation function choices
  • FFN as key-value memory

Key Questions

  • What happens without non-linearity?
  • Do individual neurons correspond to concepts?
  • Why always 4× wider?
  • Is the FFN memory or computation?

Part 3: Training

Now that we have the architecture, how do we make it learn? This part covers the complete training pipeline: data preparation, optimization algorithms, regularization techniques, and the practical tricks that make the difference between success and failure.

Datasets and Evaluation

Data quality determines model capability. You'll master data curation, evaluation metrics, and how to avoid the pitfalls that cause real-world failures.

Data Engineering

  • Quality filtering techniques
  • Train/val/test splitting
  • Perplexity, BLEU, accuracy
  • Preventing data leakage
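
Perplexity, listed above, is simply the exponential of the average cross-entropy loss. A minimal sketch (the logits and labels are random stand-ins for real model outputs and ground-truth next tokens):

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs and ground-truth next tokens.
vocab_size, num_tokens = 50_000, 128
logits = torch.randn(num_tokens, vocab_size)
labels = torch.randint(0, vocab_size, (num_tokens,))

cross_entropy = F.cross_entropy(logits, labels)  # average negative log-likelihood
perplexity = torch.exp(cross_entropy)
print(perplexity.item())  # on the order of vocab_size for random logits; far lower for a trained model
```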

Critical Issues

  • Domain-specific data needs
  • Overfitting vs generalization
  • Hidden dataset biases
  • Evaluation metric gaming

Optimizers and Learning Rates

Plain gradient descent is rarely enough for modern deep learning. You'll master advanced optimizers that adapt learning rates per-parameter, momentum techniques that accelerate convergence, and scheduling strategies that help navigate complex loss landscapes.

Optimization Algorithms:

  • SGD, Momentum, and Nesterov acceleration
  • Adam, AdamW, and why weight decay matters
  • Learning rate schedules: warmup, cosine annealing, exponential decay
  • Gradient clipping and dealing with exploding gradients
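
A minimal sketch of a common recipe, AdamW with linear warmup into cosine decay, in PyTorch (the stand-in model, step counts, learning rate, and weight decay are illustrative placeholders):

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(768, 768)   # stand-in for a real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:   # linear warmup from 0 to the peak learning rate
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after each optimizer.step():
#     scheduler.step()
# Gradient clipping guards against exploding gradients:
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```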

Practical Insights:

  • What happens if your learning rate is 10× too high? 10× too low?
  • Why do we use warmup at the start of training large transformers?
  • How do you diagnose and fix training instabilities?

Regularization & Preventing Overfitting

Large models can easily memorize training data instead of learning generalizable patterns. You'll learn the techniques that encourage models to learn robust representations that transfer to new data.

Regularization Techniques:

  • Dropout and its variants (DropConnect, DropBlock)
  • Weight decay and L2 regularization
  • Data augmentation strategies
  • Early stopping and checkpoint selection
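
A minimal PyTorch sketch of how dropout, weight decay, and early stopping typically fit together (the network, the placeholder validation function, and the patience value are all illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(                 # stand-in for a real network
    nn.Linear(768, 3072), nn.GELU(),
    nn.Dropout(p=0.1),                 # randomly zero 10% of activations (training only)
    nn.Linear(3072, 768),
)
# Weight decay (decoupled L2) lives in the optimizer, not the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def validation_loss(model: nn.Module) -> float:
    return torch.rand(1).item()        # placeholder for a real validation pass

# Early stopping: keep the checkpoint with the best validation loss.
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.train()                      # dropout active
    # ... one epoch of training would go here ...
    model.eval()                       # dropout disabled for evaluation
    val_loss = validation_loss(model)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_checkpoint.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # validation stopped improving
            break
```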

Diagnosing Overfitting:

  • Reading loss curves: what do they tell you?
  • The generalization gap: train vs validation performance
  • When is memorization actually desirable?

Part 4: Advanced Techniques

Once you can build and train base models, the real power comes from adaptation and alignment. This part covers the frontier techniques used in production systems: fine-tuning, RLHF, multimodal models, and massive scale.

Fine-Tuning & Adaptation

  • LoRA & parameter-efficient methods
  • Instruction tuning strategies
  • Avoiding catastrophic forgetting
  • When to fine-tune vs train
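
A minimal sketch of the LoRA idea, a frozen pretrained weight plus a small trainable low-rank update (rank, scaling, and dimensions are illustrative; production code would typically use an existing library rather than a hand-rolled layer):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full output plus the low-rank correction B @ A, scaled.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)             # torch.Size([4, 768])
```

Only A and B are trained, so the number of trainable parameters drops from d² to roughly 2·d·rank per adapted layer.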

RLHF & Alignment

  • Human preference learning
  • Reward model training
  • PPO policy optimization
  • Helpfulness vs safety balance
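
At the heart of reward model training is a pairwise preference loss. A minimal sketch of the Bradley-Terry-style objective (the reward scores are random stand-ins for a reward model's scores of chosen and rejected responses):

```python
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores of human-preferred ("chosen")
# and dispreferred ("rejected") responses to the same prompts.
reward_chosen = torch.randn(32, requires_grad=True)
reward_rejected = torch.randn(32)

# Train the model to score chosen responses above rejected ones:
# loss = -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # in real training, gradients flow back into the reward model
print(loss.item())
```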

Multimodal Models

  • Vision transformers & CLIP
  • Cross-modal attention
  • Text + image + audio fusion
  • Unified representation learning
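
A minimal sketch of a CLIP-style contrastive objective that aligns image and text embeddings (the embeddings are random placeholders for encoder outputs, and the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

batch = 8
image_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # from an image encoder
text_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # from a text encoder

temperature = 0.07
logits = image_emb @ text_emb.T / temperature              # pairwise similarities

# Matching image/text pairs sit on the diagonal; train in both directions.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```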

Mixture of Experts

  • Sparse routing mechanisms
  • Trillion parameters, lower cost
  • Emergent capabilities at scale
  • Distributed training strategies
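
A minimal sketch of top-k sparse routing, the core idea behind mixture-of-experts layers (expert count, k, and sizes are illustrative; real systems add load-balancing losses and spread experts across devices):

```python
import torch
import torch.nn as nn

num_experts, d_model, top_k = 8, 768, 2
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
router = nn.Linear(d_model, num_experts)   # scores each token for each expert

def moe_layer(x: torch.Tensor) -> torch.Tensor:
    # x: (num_tokens, d_model)
    gate_logits = router(x)                               # (tokens, experts)
    weights, indices = gate_logits.topk(top_k, dim=-1)    # pick 2 experts per token
    weights = torch.softmax(weights, dim=-1)              # normalize the chosen gates
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = indices[:, slot] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(16, d_model)
print(moe_layer(tokens).shape)   # torch.Size([16, 768]); only 2 of 8 experts run per token
```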