# A short tutorial on matrix derivatives

30 April 2020 • 8 min read

The normal equation arises in the context of identifying the line of best fit for a set of points. In this post, I present a short tutorial on calculating derivatives of matrix equations, which I then apply to the residual sum of squares to derive the normal equation.

The normal equation, $$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = \mathbf{0}, \tag{1}$$ describes the value of $\beta$ that minimizes the residual sum of squares, $$\text{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta), \tag{2}$$ and is used to determine the line of best fit through a set of points. Texts that I’ve read will either show the derivation by decomposing the matrix product into summations over the individual matrix elements, or claim without derivation that the normal equation is the derivative of the $\text{RSS}$ set to zero. The former approach works, but I’m not going to use it in this post. For the latter, the conclusion seems straightforward to some people: the $\text{RSS}$ is quadratic in $\beta$ and the derivative is calculated using the product rule. The derivative of the $\text{RSS}$ is then the derivative of the term in the parentheses $(-\mathbf{X}^T)$, times the term in the parentheses $(\mathbf{y} - \mathbf{X}\beta)$, times $2$. The resulting factor of $-2$ is missing from the normal equation above because $-2\cdot\mathbf{A} = \mathbf{0}$ implies that $\mathbf{A} = \mathbf{0}$.

While this seems reasonable, what if we represented the equation above as \begin{align} \text{RSS}(\beta) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\beta - \beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta. \end{align} We can see that $\frac{\partial}{\partial\beta} \mathbf{y}^T\mathbf{y} = 0$ for the first term, since there is no $\beta$ in the term. For the second term, we get that $-\frac{\partial}{\partial\beta} \mathbf{y}^T\mathbf{X}\beta = -\mathbf{y}^T\mathbf{X}$. But, what about the third term? Should it be $-\frac{\partial}{\partial\beta} \beta^T\mathbf{X}^T\mathbf{y} = -\mathbf{X}^T\mathbf{y}$ or $-\frac{\partial}{\partial\beta} \beta^T\mathbf{X}^T\mathbf{y} = -\mathbf{y}^T\mathbf{X}$? How do we take a derivative of an equation in terms of $\beta$ if the equation has $\beta^T$ in it? We don’t have to deal with transposes when taking derivatives of polynomials on the real numbers. And what about the last term? If we use the product rule, $(f \cdot g)' = f \cdot g' + f' \cdot g$, with $f = \beta^T\mathbf{X}^T$ and $g = \mathbf{X}\beta$, we have the same problem of having to determine if $\frac{\partial}{\partial\beta} \beta^T\mathbf{X}^T$ is $\mathbf{X}$ or $\mathbf{X}^T$.

Now, we could look at the dimensions of the matrices to see which version would work, or we could memorize the answer, but I think it would be more helpful to show how we determine these derivatives using the definition of the derivative and some basic properties about vectors and inner products. But first, I’m going to present a quick overview of linear regression to motivate the normal equation for readers who are not familiar with it.

Linear regression is the problem of finding the line of best fit through a set of points. As a simple example, let’s take a set of points defined by values $\mathbf{y}$ and $\mathbf{X}$ in two dimensions. If I draw a line through this set of points, then a question I can ask myself is whether or not this is the best line through these points. There are multiple ways I could define best. If I represent the y-values of this proposed line by $\mathbf{\hat{y}}$, then one proposal for the definition of best could be the line which minimizes the sum of absolute deviations, $\sum_i \lvert y_i - \hat{y}_i \rvert$. This would be the sum of the lengths of the vertical line segments joining each point to the line. For various reasons, it is often more convenient to define the line of best fit as the one which minimizes the residual sum of squares, $\sum_i (y_i - \hat{y}_i)^2$. This is the sum of the squares of the lengths of those same vertical line segments. One justification for this is that I can then easily use calculus to minimize the function, whereas the absolute value function is not differentiable at zero and is therefore less analytically convenient.
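To make the two candidate definitions of "best" concrete, here is a minimal sketch that evaluates both for one proposed line. The five data points and the candidate line $y = 1 + x$ are made up for illustration, and the function names are my own.

```python
import numpy as np

# Hypothetical data: five points in two dimensions.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 2.0, 5.0, 4.0])

def fitted_values(intercept, slope):
    """y-values of the candidate line at each x."""
    return intercept + slope * x

def sum_abs_deviations(y_hat):
    """Sum of the lengths of the vertical segments between points and line."""
    return np.sum(np.abs(y - y_hat))

def residual_sum_of_squares(y_hat):
    """Sum of the squared lengths of the same vertical segments."""
    return np.sum((y - y_hat) ** 2)

y_hat = fitted_values(1.0, 1.0)        # candidate line y = 1 + x
print(sum_abs_deviations(y_hat))       # 5.0
print(residual_sum_of_squares(y_hat))  # 7.0
```

The residuals here are $(0, 2, -1, 1, -1)$, so the larger deviation contributes proportionally more to the squared criterion than to the absolute one.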

Although the example used above considers the problem of finding a line through a set of points in two dimensions, I can use the same method to find a line of best fit in any finite-dimensional space. For this reason I am going to use matrix notation, where $\mathbf{y} \in \mathbb{R}^N$ is an $N$-dimensional vector, $\mathbf{X} \in \mathbb{R}^{N \times (p+1)}$ is an $N \times (p + 1)$ matrix, and $\beta \in \mathbb{R}^{p+1}$ is a $(p+1)$-dimensional parameter vector. Given this representation, the line of best fit is described by the equation

$$\mathbf{\hat{y}} = \mathbf{X}\beta,$$

and the residual sum of squares is defined as in equation $(2)$. The solution to this minimization problem is given by the normal equation $(1)$ and, when $\mathbf{X}^T\mathbf{X}$ is invertible, can be written

$$\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$
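As a rough numerical illustration of this closed form, the sketch below builds a small synthetic data set and solves the normal equation with NumPy. The data, the seed, and the names `true_beta`, `beta_hat`, and `beta_lstsq` are assumptions made for the example. It solves the linear system $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}$ instead of forming the inverse explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 50 observations, p = 2 predictors plus an intercept column.
N, p = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Solve (X^T X) beta = X^T y rather than computing the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))

# The normal equation X^T (y - X beta) = 0 holds up to rounding error.
print(np.allclose(X.T @ (y - X @ beta_hat), 0.0, atol=1e-8))
```

Both checks should agree for a well-conditioned $\mathbf{X}$ like this one; for nearly collinear columns, a least-squares routine is usually the safer choice.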

To minimize the residual sum of squares, I’m going to find the value of $\beta$ where the first derivative of the RSS, $\frac{\partial}{\partial \beta} \text{RSS}$, equals $\mathbf{0}$. To ensure that this value of $\beta$ minimizes the function, I’ll also check that the second derivative, $\frac{\partial^2}{\partial \beta \partial \beta^T} \text{RSS}$, is positive definite.

Now that I’ve set up the problem, it’s time to go back to the definition of the derivative to figure out how to calculate the derivative of the RSS.

We’ll take our definition of the derivative from Rudin’s Principles of Mathematical Analysis. Let $\mathbf{f}$ be a function that maps a subset of $\mathbb{R}^n$ to $\mathbb{R}^m$. Then the derivative (if it exists) is defined as the linear transformation $A$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that

$$\begin{equation} \lim_{\mathbf{h} \rightarrow \mathbf{0}} \frac{\lvert \mathbf{f}(\mathbf{x} + \mathbf{h}) - \mathbf{f}(\mathbf{x}) - A\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} = 0. \end{equation}$$

Note that the norm in the numerator is in $\mathbb{R}^m$ and the norm in the denominator is in $\mathbb{R}^n$.
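One way to get a feel for this definition is to evaluate the ratio numerically for shrinking $\mathbf{h}$. The sketch below does this for a made-up map from $\mathbb{R}^2$ to $\mathbb{R}^2$, taking the matrix of partial derivatives as the candidate $A$. The function `f`, the point `x`, and the direction of `h` are all assumptions chosen for illustration.

```python
import numpy as np

def f(x):
    """Hypothetical map from R^2 to R^2."""
    return np.array([x[0] ** 2, x[0] * x[1]])

def jacobian(x):
    """Candidate linear transformation A at x: the matrix of partial derivatives."""
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]]])

x = np.array([1.0, 2.0])
A = jacobian(x)
direction = np.array([3.0, -4.0]) / 5.0  # unit vector

def definition_ratio(h):
    """|f(x + h) - f(x) - A h| / |h|, with the norms taken in R^2."""
    return np.linalg.norm(f(x + h) - f(x) - A @ h) / np.linalg.norm(h)

for scale in [1e-1, 1e-2, 1e-3]:
    print(scale, definition_ratio(scale * direction))
```

For this particular `f`, the numerator works out to $\lvert h_1 \rvert \lvert \mathbf{h} \rvert$, so the printed ratio shrinks in proportion to the size of $\mathbf{h}$, which is the behaviour the limit asks for.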

To get more familiar with this definition, I’m going to apply it to a few simple examples to determine their derivatives. For the first example, the goal is to find the derivative $\frac{\partial}{\partial\mathbf{x}}$ of $\mathbf{x}^T\mathbf{b}$, a function of $\mathbf{x}$ from $\mathbb{R}^n$ to $\mathbb{R}$, where both $\mathbf{x}$ and $\mathbf{b}$ are vectors in $\mathbb{R}^n$.

I’ll start with the expression inside the limit and, ignoring the limit for now, ask which $A$ makes it vanish:

$$\begin{equation} \frac{\lvert \mathbf{f}(\mathbf{x} + \mathbf{h}) - \mathbf{f}(\mathbf{x}) - A\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} = 0. \end{equation}$$

Substituting $\mathbf{x}^T\mathbf{b}$ for $\mathbf{f}$, I get

\begin{align} \frac{\lvert (\mathbf{x} + \mathbf{h})^T\mathbf{b} - \mathbf{x}^T\mathbf{b} - A\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0,\newline \frac{\lvert \mathbf{x}^T\mathbf{b} + \mathbf{h}^T\mathbf{b} - \mathbf{x}^T\mathbf{b} - A\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0,\newline \frac{\lvert \mathbf{h}^T\mathbf{b} - A\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0. \end{align}

In the last equation, $\mathbf{h}^T\mathbf{b}$ and $A\mathbf{h}$ are equal to the inner products $\langle \mathbf{h}, \mathbf{b} \rangle$ and $\langle A^T, \mathbf{h} \rangle$. As this is defined within a real inner product space, by symmetry, $\langle A^T, \mathbf{h} \rangle = \langle \mathbf{h}, A^T \rangle$. So the last equation above is equal to

$$\frac{\lvert \langle \mathbf{h}, \mathbf{b} \rangle - \langle \mathbf{h}, A^T \rangle \rvert}{\lvert \mathbf{h} \rvert} = 0.$$

The above equation holds for every $\mathbf{h}$ when $A = \mathbf{b}^T$. Therefore $\frac{\partial \mathbf{x}^T\mathbf{b}}{\partial \mathbf{x}} = \mathbf{b}^T$.

By a similar set of calculations you can show that $\frac{\partial \mathbf{b}^T\mathbf{x}}{\partial \mathbf{x}} = \mathbf{b}^T$, as well.
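Both results can be sanity-checked with finite differences. The sketch below is a rough check, not a proof: `numerical_derivative` is a helper written for this example, and the vectors `b` and `x` are random test data. It returns the derivative as a $1 \times n$ row so that it matches the convention $A\mathbf{h}$ used above.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-6):
    """Central-difference approximation of the derivative of a scalar
    function f at x, returned as a 1 x n row."""
    n = x.size
    row = np.zeros((1, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        row[0, i] = (f(x + step) - f(x - step)) / (2 * eps)
    return row

rng = np.random.default_rng(1)
b = rng.normal(size=4)
x = rng.normal(size=4)

d_xTb = numerical_derivative(lambda v: v @ b, x)  # derivative of x^T b
d_bTx = numerical_derivative(lambda v: b @ v, x)  # derivative of b^T x

b_row = b.reshape(1, -1)  # b^T as a 1 x n row
print(np.allclose(d_xTb, b_row))
print(np.allclose(d_bTx, b_row))
```

Because both functions are linear in $\mathbf{x}$, the central difference has no truncation error, so any disagreement would come from rounding alone.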

For the last example, let’s calculate the derivative $\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}}$, where $A$ is a symmetric matrix (i.e. $A = A^T$). Substituting the function into the definition of the derivative, I get

\begin{align} \frac{\lvert (\mathbf{x} + \mathbf{h})^T A (\mathbf{x} + \mathbf{h}) - \mathbf{x}^T A \mathbf{x} - B\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0,\newline \frac{\lvert \mathbf{x}^T A \mathbf{x} + \mathbf{x}^T A \mathbf{h} + \mathbf{h}^T A \mathbf{x} + \mathbf{h}^T A \mathbf{h} - \mathbf{x}^T A \mathbf{x} - B\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0,\newline \frac{\lvert \mathbf{x}^T A \mathbf{h} + \mathbf{h}^T A \mathbf{x} + \mathbf{h}^T A \mathbf{h} - B\mathbf{h} \rvert}{\lvert \mathbf{h} \rvert} &= 0. \end{align}

Representing the vector products as inner products and keeping in mind that $A$ is symmetric, this becomes

\begin{align} \frac{\lvert \langle A\mathbf{x}, \mathbf{h} \rangle + \langle \mathbf{h}, A \mathbf{x} \rangle + \langle A\mathbf{h}, \mathbf{h} \rangle - \langle B^T, \mathbf{h} \rangle \rvert}{\lvert \mathbf{h} \rvert} &= 0,\newline \frac{\lvert \langle 2A\mathbf{x} + A\mathbf{h}, \mathbf{h} \rangle - \langle B^T, \mathbf{h} \rangle \rvert}{\lvert \mathbf{h} \rvert} &= 0, \end{align}

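where the simplification in the second line uses two properties: (1) the symmetry of the inner product in a real inner product space and (2) additivity in the first slot. Remembering that the definition of the derivative involves taking the limit as $\mathbf{h} \rightarrow \mathbf{0}$, the $A\mathbf{h}$ term drops out, since by the Cauchy-Schwarz inequality $\lvert \langle A\mathbf{h}, \mathbf{h} \rangle \rvert / \lvert \mathbf{h} \rvert \le \lvert A\mathbf{h} \rvert \rightarrow 0$. This indicates that $2A\mathbf{x} = B^T$ and, because $A$ is symmetric, that $\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}} = B = 2\mathbf{x}^T A$.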
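The quadratic-form result can be checked numerically in the same spirit. In this sketch the helper `numerical_derivative`, the random matrix `M`, and its symmetrized version `A` are assumptions made for the example. It also shows why the symmetry assumption matters: for a general square matrix the derivative of $\mathbf{x}^T M \mathbf{x}$ is $\mathbf{x}^T(M + M^T)$, which reduces to $2\mathbf{x}^T A$ only in the symmetric case.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-6):
    """Central-difference approximation of the derivative of a scalar
    function f at x, returned as a 1 x n row."""
    n = x.size
    row = np.zeros((1, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        row[0, i] = (f(x + step) - f(x - step)) / (2 * eps)
    return row

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))  # general square matrix
A = (M + M.T) / 2            # symmetric matrix built from M
x = rng.normal(size=4)

# Symmetric case: the derivative of x^T A x is 2 x^T A.
approx_sym = numerical_derivative(lambda v: v @ A @ v, x)
print(np.allclose(approx_sym, 2 * x @ A, atol=1e-6))

# General case: the derivative of x^T M x is x^T (M + M^T).
approx_gen = numerical_derivative(lambda v: v @ M @ v, x)
print(np.allclose(approx_gen, x @ (M + M.T), atol=1e-6))
```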

Now that I’ve determined the values for these derivatives, I’m ready to calculate the derivatives for the RSS to derive the normal equation. First, I can see that the RSS is equal to

\begin{align} \text{RSS}(\beta) &= (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta),\newline &= (\mathbf{y}^T - \beta^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\beta),\newline &= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\beta - \beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta. \end{align}

Using the derivatives derived before and the observation that $\mathbf{X}^T\mathbf{X}$ is symmetric, I can see that the derivative of the RSS is

\begin{align} \frac{\partial \text{RSS}}{\partial \beta} &= 0 - \mathbf{y}^T\mathbf{X} - \mathbf{y}^T\mathbf{X} + 2\beta^T\mathbf{X}^T\mathbf{X},\newline &= -2\mathbf{y}^T\mathbf{X} + 2\beta^T\mathbf{X}^T\mathbf{X}, \newline &= -2(\mathbf{y}^T - \beta^T\mathbf{X}^T)\mathbf{X}, \newline &= -2(\mathbf{y} - \mathbf{X}\beta)^T\mathbf{X}. \end{align}

Setting this row vector to zero and dividing by $-2$ gives the transpose of the column vector in the normal equation as presented in equation $(1)$ at the beginning of the post. I can also see, most clearly from line 2 of the above derivation, that the second derivative, $\frac{\partial^2 \text{RSS}}{\partial \beta \partial \beta^T}$, is equal to $2\mathbf{X}^T\mathbf{X}$. Assuming that $\mathbf{X}$ has full column rank, $\mathbf{X}^T\mathbf{X}$ is positive definite and the RSS has its minimum where

\begin{align} \mathbf{0} &= -2\mathbf{y}^T\mathbf{X} + 2\beta^T\mathbf{X}^T\mathbf{X},\newline 2\beta^T\mathbf{X}^T\mathbf{X} &= 2\mathbf{y}^T\mathbf{X},\newline \beta^T &= \mathbf{y}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1},\newline \beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. \end{align}
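The whole derivation can be spot-checked numerically. The sketch below uses synthetic data (the seed, sizes, and function names are assumptions for the example) to compare the closed-form derivative $-2(\mathbf{y} - \mathbf{X}\beta)^T\mathbf{X}$ against central differences, confirm that it vanishes at the closed-form solution, and check that $2\mathbf{X}^T\mathbf{X}$ has strictly positive eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: N = 30 observations, p = 2 predictors plus an intercept.
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

def rss(beta):
    """Residual sum of squares (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

def rss_derivative(beta):
    """Closed form from the derivation: -2 (y - X beta)^T X."""
    return -2 * (y - X @ beta) @ X

# 1. Closed-form derivative versus central differences at a random beta.
beta = rng.normal(size=p + 1)
eps = 1e-5
approx = np.array([(rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
                   for e in np.eye(p + 1)])
print(np.allclose(approx, rss_derivative(beta), atol=1e-6))

# 2. The derivative vanishes at beta = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(rss_derivative(beta_hat), 0.0, atol=1e-8))

# 3. The second derivative 2 X^T X is positive definite:
#    all eigenvalues of this symmetric matrix are strictly positive.
eigenvalues = np.linalg.eigvalsh(2 * X.T @ X)
print(np.all(eigenvalues > 0))
```

With a full-column-rank $\mathbf{X}$ such as this random one, all three checks should print `True`. If two columns of $\mathbf{X}$ were collinear, the third check would fail and the solve in step 2 would be ill-posed.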

I’m @siclait on Twitter—reach out to continue the conversation.