Matrix Calculus for Machine Learning and Beyond
This is the course page for an 18.S096 Special Subject in Mathematics at MIT taught in January 2022 (IAP) by
Professors Alan Edelman and Steven G. Johnson.
Lectures: MWF 11am–1pm, Jan 10–28, virtually via Zoom. 3 units, 2 problem sets due Jan 19 and Jan 26, no exams. Piazza discussions. TA/grader: Gaurav Arya.
Description:
We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning require the next big step, matrix calculus.
This class covers a coherent approach to matrix calculus showing techniques that allow you to think of a matrix holistically (not just as an array of scalars), compute derivatives of important matrix factorizations, and really understand forward and reverse modes of differentiation. We will discuss adjoint methods, custom Jacobian matrix vector products, and how modern automatic differentiation is more computer science than mathematics in that it is neither symbolic nor finite differences.
Prerequisites: Linear Algebra such as 18.06 and multivariate calculus such as 18.02.
Course will involve simple numerical compuations using the Julia language. Ideally install it on your own computer following these instructions, but as a fallback you can run it in the cloud here:
Topics:
The following is a list of topics we intend to cover, and roughly the order in which we will cover them. Subject to revision as the course progresses.
 Multidimensional Taylor expansions, derivatives as linear operators, and multidimensional chain rule.
 Forward and reversemode differentiation for manual and automatic multivariate differentiation. Chain rules on computational graphs (e.g. neural networks).
 Derivatives of matrix functions and factorizations.
 Adjoint methods and vJp/pullback rules.
 Applications in engineering/scientific optimization, machine learning.
 Exterior calculus. Linear operators and Kronecker products.
Lecture 1 (Jan 10)

part 1: introductory slides

part 2: derivatives as linear operators (handwritten notes)

lecture video (unfortunately missing a section of part 2).
Further reading: matrixcalculus.org (linked in the slides) is a fun site to play with derivatives of matrix and vector functions. The Matrix Cookbook has a lot of formulas for these derivatives, but no derivations. Some notes on vector and matrix differentiation were posted for 6.S087 from IAP 2021.
Further reading (fancier math): the perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I’m calling “fancy”) presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The “littleo notation” o(δx) we’re using here for “infinitesimal asymptotics” is closely related to the asymptotic notation used in computer science, but in computer science people are typically taking the limit as the argument (often called “n”) becomes very large instead of very small. A fancy name for a row vector is a “covector” or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to nonEuclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
Lecture 2 (Jan 12)
 video
 part 1: derivatives as linear operators, continued from lecture 1.
 part 2: 2×2 Matrix Jacobians (html) (pluto notebook source code)
 pset 1 (due next Wed)
Continued discussing Jacobian matrices (for vectorvalued functions of vectors), with some example. Switched a streamlined “infinitesimal” notation df=f'(x)dx, where we now simply omit higherorder terms instead of writing o(δx), and f'(x) is taken to be a linear operator acting to the right on dx (≠ dx f'(x)!). Sum, product, and chain rules for derivatives as linear operators.
The chain rule and associativity: forward vs. reverse differentiation. The chain rule leads to a product of Jacobian matrices, and while you can’t rearrange this product (matrix multiplication is not commutative) you can change the order in which you do the multiplications from lefttoright or righttoleft (matrix multiplication is associative). It turns out that this ordering can have a huge impact on the practical speed at which you can compute derivatives in large problem. Explained why, if you have 1 (or few) outputs, then you want to do Jacobian products from lefttoright so that you do vector–matrix products and not matrix–matrix products: this is called reversemode differentiation (also “adjoint” differentiation, or “backpropagation” in machine learning). Conversely, if there is only 1 (or a few) inputs, then you want to do the chain rule from righttoleft, calledd forwardmode differentiation. Gave an example of training neural networks, where there are zillions of inputs (the “fitting” parameters of the NN) but only one output (the “loss” function measuring the error compared to training data), and this leads to reversemode differentiation or “backpropagation”. (Unfortunately somewhat garbled during lecture, but written notes are cleaned up.)
In part 2 (last few minutes), began setting up some example problems involving matrix functions of matrices, and “vectorization” of matrices to vectors and linear operators to Kronecker products. To be continued.
Further reading: The terms “forwardmode” and “reversemode” differentiation are most prevalent in automatic differentiation, which will will cover later in this course. You can find many, many articles online about backpropagation in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx righttoleft) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian lefttoright is called a pullback.
Lecture 3 (Jan 14)
 part 1: slides
 part 1: Matrix Jacobians notebook from lecture 2 (html), (pluto notebook source code)
 part 2: notebook
 video
Revisiting the gradient ∇f. Scalar functions of matrices, matrix dot products, and the trace. Matrix Jacobians, continued from lecture 2. Finitedifference approximation.
Further reading: Wikipedia has a useful list of properties of the matrix trace. The “matrix dot product” introduced today is also called the Frobenius inner product, and the corresponding norm (“length” of the matrix viewed as a vector) is the Frobenius norm. When you “flatten” a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products. The Matrix Cookbook has lots of formulas for derivatives of matrix functions. There is a lot of information online on finite difference approximations, these 18.303 notes, or Section 5.7 of Numerical Recipes. The Julia FiniteDifferences.jl package provides lots of algorithms to compute finitedifference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floatingpoint arithmetic leads to catastrophic cancellation errors.
Lecture 4 (Jan 19)
Coming soon…