A deep dive into the mathematics and engineering behind modern AI. From raw text to transformers, from training to advanced techniques like RLHF and multimodal models. The goal is not just to use these models, but to understand how and why they work.
Everything starts with a simple idea: computers only see numbers. Characters, words, images, and even "meaning" all need to be represented as vectors and matrices before a model can work with them. In this section, you'll master the linear algebra and data representations that power modern AI.
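As a minimal sketch of that idea (using numpy, with a made-up character vocabulary and random vectors standing in for learned ones), here is how a short string becomes integer IDs and then a matrix the model can multiply:

```python
import numpy as np

# A toy character-level vocabulary: every character we might see gets an integer ID.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

text = "hello world"
ids = np.array([vocab[ch] for ch in text])   # text -> integer IDs

# A (vocab_size x d_model) lookup table of random vectors stands in for learned embeddings.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

X = embedding_table[ids]                      # (sequence_length, 8) matrix of vectors
print(ids.shape, X.shape)                     # (11,) (11, 8)
```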
Transformers are built from a small set of powerful components: tokenization, embeddings, attention, and feed-forward layers. In this part, you'll dissect each piece and see how they work together to process sequences of text (and later, other modalities).
Before a model can "read" anything, text must be broken into tokens and mapped to integer IDs. Tokenization choices affect everything: model performance, multilingual behavior, and even what the model struggles with.
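To make the interface concrete, here is a deliberately simplified sketch of subword-style tokenization. Real tokenizers (BPE, WordPiece) learn their vocabularies from data; the hand-written vocabulary below is invented purely for illustration, but the contract is the same: text in, integer IDs out.

```python
# Toy "subword" tokenizer: longest-match against a fixed vocabulary.
# Real BPE/WordPiece tokenizers learn their vocabularies; this one is hand-written.
vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "ization": 4, " ": 5}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Greedily take the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("unbelievable tokenization"))  # [0, 1, 2, 5, 3, 4]
```

Notice that the same greedy-matching machinery produces very different token counts depending on how well the vocabulary covers a language, which is exactly where the bias discussed next comes from.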
Xhosa needs 3.5× more tokens than English for the same meaning. This tokenization bias directly impacts model performance, cost, and speed for underrepresented languages.
Once we have token IDs, we map each token to a vector in continuous space. These embedding vectors capture semantic relationships that let the model reason about similarity and analogy.
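A tiny sketch of what "capture semantic relationships" means in practice: related tokens end up with vectors that point in similar directions, which you can measure with cosine similarity. The 3-d vectors below are hand-picked for illustration; trained embeddings live in hundreds or thousands of dimensions and are learned, not chosen.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors: "king" and "queen" point in similar directions, "apple" does not.
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.8, 0.9])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: related words have similar vectors
print(cosine_similarity(king, apple))  # lower: unrelated words diverge
```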
Attention lets the model dynamically focus on relevant parts of the input. Every position can attend to every other position with learned weights, which is the key innovation that made modern LLMs possible.
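Here is a minimal single-head sketch of scaled dot-product attention in numpy. The random projection matrices stand in for the learned ones a real transformer would use, and this version omits masking and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query position mixes the values, weighted by
    how strongly it matches every key position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                             # (seq, seq) pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                          # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
# In a real transformer Q, K, V come from learned projections of X; random ones here.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8): one mixed vector per position
```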
After attention mixes information, each token passes through feed-forward networks. These layers add non-linearity and act as the model's "memory store."
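A minimal sketch of that position-wise feed-forward block, with random weights in place of trained ones: expand each token's vector to a wider hidden size, apply a non-linearity, and project back down.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the activation used in many transformer FFNs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand, apply a non-linearity, project back down.
    Applied independently to every token position."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                          # hidden layer is typically ~4x wider
x = rng.normal(size=(4, d_model))              # 4 token positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (4, 8)
```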
Now that we have the architecture, how do we make it learn? This part covers the complete training pipeline: data preparation, optimization algorithms, regularization techniques, and the practical tricks that make the difference between success and failure.
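As an illustration of the core loop only (not the full pipeline), here is gradient descent on a toy regression problem with synthetic data. Real LLM training layers batching, mixed precision, checkpointing, and distributed execution on top of this skeleton, but the forward pass / loss / gradient / update rhythm is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)    # synthetic data for the sketch

w = np.zeros(3)
lr = 0.1
for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)            # forward pass: compute the loss
    grad = 2 * X.T @ (pred - y) / len(y)       # backward pass: gradient of the loss
    w -= lr * grad                             # optimizer step: move against the gradient
    if step % 50 == 0:
        print(step, round(float(loss), 4))

print(w)  # close to [2.0, -1.0, 0.5]
```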
Data quality determines model capability. Master curation, evaluation metrics, and how to avoid the pitfalls that cause real-world failures.
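One tiny example of a curation step, exact-duplicate removal by hashing normalized text; production pipelines also apply near-duplicate detection (e.g., MinHash), quality filters, and contamination checks. The documents below are made up.

```python
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())      # collapse case and whitespace before hashing

def deduplicate(documents):
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:                 # keep only the first copy of each document
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat on the mat.", "the cat  sat on the mat.", "A different document."]
print(len(deduplicate(docs)))  # 2
```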
Plain gradient descent is rarely enough for modern deep learning. You'll master advanced optimizers that adapt learning rates per-parameter, momentum techniques that accelerate convergence, and scheduling strategies that help navigate complex loss landscapes.
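To ground those ideas, here is a from-scratch sketch of one Adam update plus a warmup-then-cosine learning-rate schedule, applied to a toy quadratic loss. Hyperparameter values are illustrative defaults, not recommendations.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient (m) plus a per-parameter
    adaptive scale from the running average of squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction for the running averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def cosine_lr(step, total_steps, base_lr=1e-3, warmup=100):
    """Linear warmup followed by cosine decay, a common schedule for transformers."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + np.cos(np.pi * progress))

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    grad = 2 * (w - np.array([2.0, -1.0, 0.5]))            # toy quadratic loss gradient
    w, m, v = adam_step(w, grad, m, v, t, lr=cosine_lr(t, 500, base_lr=0.05))
print(w)  # approaches [2.0, -1.0, 0.5]
```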
Large models can easily memorize training data instead of learning generalizable patterns. You'll learn the techniques that encourage models to learn robust representations that transfer to new data.
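Two of those techniques in miniature: inverted dropout and decoupled weight decay. This is a sketch of the ideas, not a production implementation; values like p=0.5 and wd=0.01 are illustrative.

```python
import numpy as np

def dropout(x, p=0.1, training=True, rng=None):
    """Inverted dropout: randomly zero a fraction p of activations during training
    and scale the survivors so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)

def sgd_with_weight_decay(w, grad, lr=1e-3, wd=0.01):
    """Decoupled weight decay (the idea behind AdamW): shrink weights toward zero
    each step, separately from the gradient, to discourage memorization."""
    return w - lr * grad - lr * wd * w

x = np.ones((2, 4))
print(dropout(x, p=0.5, rng=np.random.default_rng(0)))  # ~half zeroed, survivors scaled by 2
```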
Once you can build and train base models, the real power comes from adaptation and alignment. This part covers the frontier techniques used in production systems: fine-tuning, RLHF, multimodal models, and massive scale.
Modern AI goes beyond basic training: fine-tuning, RLHF, and multimodal architectures enable powerful, aligned systems at massive scale.
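As one concrete taste of this space, here is a minimal sketch of the idea behind LoRA-style parameter-efficient fine-tuning: leave the large pretrained weight matrix frozen and learn only a low-rank correction on top of it. Shapes and initializations below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 512, 8

W = rng.normal(size=(d_model, d_model))       # frozen pretrained weight matrix
A = rng.normal(size=(rank, d_model)) * 0.01   # small trainable low-rank factor
B = np.zeros((d_model, rank))                 # zero-initialized so training starts from W

x = rng.normal(size=(1, d_model))
y = x @ (W + B @ A).T                          # adapted layer: W plus a low-rank update

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs {full_params} ({lora_params / full_params:.1%})")
```

The payoff is in the last line: the trainable correction is a small fraction of the full matrix, which is what makes adapting very large models practical on modest hardware.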