Derivative of a quadratic form

Let <math>A \in \mathcal M_{n,n}(\mathbf R)</math> be an <math>n</math> by <math>n</math> symmetric real-valued matrix, and let <math>f\colon \mathbf R^n \to \mathbf R</math> be defined by <math>f(x) = x^{\mathrm T}Ax</math>. On this page, we calculate the derivative of <math>f</math> using three methods.
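For concreteness, each method below can be checked against a small numerical example. The following is a minimal sketch assuming numpy; the dimension and the particular symmetric matrix are arbitrary illustrative choices, and the later sketches on this page reuse this <math>f</math> and <math>A</math>.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative setup: a random n-by-n matrix, symmetrized so that A^T = A.
n = 3
rng = np.random.default_rng(0)
B = rng.normal(size=(n, n))
A = (B + B.T) / 2

def f(x):
    """The quadratic form f(x) = x^T A x."""
    return x @ A @ x
</syntaxhighlight>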


==Understanding the problem==
Since <math>f</math> is a real-valued function on <math>\mathbf R^n</math>, the derivative and the gradient coincide (up to transposition, the derivative being a row vector and the gradient a column vector).


==Straightforward method==
This method is the most straightforward, and involves breaking apart the matrix and vector into components and performing the differentiation. While straightforward, it appears messy due to the indices involved.

Let <math>A = (a_{ki})</math> and <math>x = (x_1, \ldots, x_n)</math>.

We expand

:<math>x^{\mathrm T}Ax = x^{\mathrm T}\begin{pmatrix} \sum_{i=1}^n a_{1i}x_i \\ \vdots \\ \sum_{i=1}^n a_{ni}x_i \end{pmatrix} = \sum_{k=1}^n x_k \sum_{i=1}^n a_{ki}x_i</math>

Now we find the partial derivative of the above with respect to <math>x_j</math>. To distinguish the constants from the variable, it makes sense to split the sum:


:<math>\sum_{k=1}^n x_k \sum_{i=1}^n a_{ki}x_i = x_j \sum_{i=1}^n a_{ji}x_i + \sum_{k\ne j} x_k \sum_{i=1}^n a_{ki}x_i = x_j\left(a_{jj}x_j + \sum_{i\ne j} a_{ji} x_i\right) + \sum_{k\ne j} x_k \left(a_{kj}x_j + \sum_{i\ne j} a_{ki} x_i\right)</math>
The first equality comes from splitting the outer summation, and the second comes from splitting the two inner summations.


Now distributing we have


:<math>\begin{align}&a_{jj}x_j^2 + \left(\sum_{i\ne j} a_{ji} x_i\right)x_j + \sum_{k\ne j} \left(a_{kj}x_k x_j + x_k \sum_{i\ne j} a_{ki} x_i\right) \\ &= a_{jj}x_j^2 + \left(\sum_{i\ne j} a_{ji} x_i\right)x_j + \left(\sum_{k\ne j}a_{kj}x_k\right) x_j + \sum_{k\ne j}x_k \sum_{i\ne j} a_{ki} x_i\end{align}</math>


It is now easy to do the differentiation. We obtain


:<math>2a_{jj}x_j + \sum_{i\ne j} a_{ji} x_i + \sum_{k\ne j}a_{kj}x_k</math>
Since the matrix is symmetric, <math>a_{kj} = a_{jk}</math>, so <math>\sum_{k\ne j}a_{kj}x_k = \sum_{k\ne j}a_{jk}x_k = \sum_{i\ne j}a_{ji}x_i</math>. The final equality follows because <math>k</math> is just an indexing variable and we are free to rename it. But now the derivative becomes


:<math>2a_{jj}x_j + 2\sum_{i\ne j} a_{ji} x_i = 2\sum_{i=1}^n a_{ji} x_i</math>
But this is just the <math>j</math>th component of <math>2Ax</math>. It follows that the full derivative is just <math>2Ax</math> (or its transpose, depending on whether we want to view it as a row or column vector).
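As a sanity check, the result can be verified numerically. The sketch below (reusing the setup from the introduction) compares a central finite-difference approximation of the gradient with <math>2Ax</math>; the helper <code>grad_fd</code> is a name introduced here purely for illustration.

<syntaxhighlight lang="python">
def grad_fd(f, x, eps=1e-6):
    """Approximate the gradient of f at x by central finite differences."""
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, -2.0, 0.5])
print(np.allclose(grad_fd(f, x), 2 * A @ x))  # expect True
</syntaxhighlight>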


==Using the definition of the derivative==


This is an expanded version of the answer at [https://math.stackexchange.com/a/189436/35525].
Using the definition, we can compute the derivative from first principles without exposing the components.


The derivative is the linear transformation <math>L</math> such that:

:<math>\lim_{x \to x_0;\, x \ne x_0} \frac{|f(x) - (f(x_0) + L(x - x_0))|}{|x - x_0|} = 0</math>

Using our function, this is:

:<math>\lim_{x \to x_0;\, x \ne x_0} \frac{|x^{\mathrm T}Ax - x_0^{\mathrm T}Ax_0 - L(x - x_0)|}{|x - x_0|} = 0</math>

Defining <math>h = x - x_0</math>, we have <math>x = x_0 + h</math> and

:<math>\frac{|(x_0+h)^{\mathrm T}A(x_0+h) - x_0^{\mathrm T}Ax_0 - L(h)|}{|h|}</math>

Focusing on the subexpression <math>(x_0+h)^{\mathrm T}A(x_0+h)</math>: since <math>A</math> is a matrix, it is a linear transformation, so we obtain <math>(x_0+h)^{\mathrm T}(Ax_0 + Ah)</math>. Since the transpose of a sum is the sum of the transposes, we have <math>(x_0^{\mathrm T} + h^{\mathrm T})(Ax_0 + Ah)</math>. Now using linearity we have <math>x_0^{\mathrm T}Ax_0 + h^{\mathrm T}Ax_0 + x_0^{\mathrm T}Ah + h^{\mathrm T}Ah</math>.

Now the fraction is

:<math>\frac{|x_0^{\mathrm T}Ax_0 + h^{\mathrm T}Ax_0 + x_0^{\mathrm T}Ah + h^{\mathrm T}Ah - x_0^{\mathrm T}Ax_0 - L(h)|}{|h|} = \frac{|h^{\mathrm T}Ax_0 + x_0^{\mathrm T}Ah + h^{\mathrm T}Ah - L(h)|}{|h|}</math>

Focusing on <math>h^{\mathrm T}Ax_0</math>: it is a real number, so taking the transpose leaves it unchanged, giving <math>h^{\mathrm T}Ax_0 = (h^{\mathrm T}Ax_0)^{\mathrm T} = x_0^{\mathrm T}A^{\mathrm T}h</math>.

Now the fraction is

:<math>\frac{|x_0^{\mathrm T}A^{\mathrm T}h + x_0^{\mathrm T}Ah + h^{\mathrm T}Ah - L(h)|}{|h|} = \frac{|x_0^{\mathrm T}(A^{\mathrm T} + A)h + h^{\mathrm T}Ah - L(h)|}{|h|}</math>

In the numerator, <math>h^{\mathrm T}Ah</math> is a higher-order term that disappears when taking the limit (since <math>|h^{\mathrm T}Ah| \le \|A\|\,|h|^2</math>, dividing by <math>|h|</math> leaves a term that goes to <math>0</math> with <math>h</math>), so the linear transformation we are looking for must be <math>L(h) = x_0^{\mathrm T}(A^{\mathrm T} + A)h</math>. Since <math>A</math> is symmetric, we have <math>A^{\mathrm T} + A = 2A</math>, and so <math>L(h) = 2x_0^{\mathrm T}Ah</math>.
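The defining limit can also be probed numerically. In the sketch below (same setup as in the introduction), the difference quotient with <math>L(h) = 2x_0^{\mathrm T}Ah</math> shrinks roughly in proportion to <math>|h|</math>, as expected from the <math>h^{\mathrm T}Ah</math> remainder.

<syntaxhighlight lang="python">
x0 = np.array([1.0, -2.0, 0.5])
h0 = rng.normal(size=n)
for scale in (1e-1, 1e-3, 1e-5):
    h = scale * h0
    # |f(x0 + h) - f(x0) - L(h)| / |h|  with  L(h) = 2 x0^T A h
    q = abs(f(x0 + h) - f(x0) - 2 * x0 @ A @ h) / np.linalg.norm(h)
    print(q)  # decreases roughly in proportion to |h|
</syntaxhighlight>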


==Using the chain rule==
In this approach, we think of <math>f</math> as a composition of <math>g(x,y) = x\cdot y</math> and <math>h(x) = (x, Ax)</math> and use the multivariable chain rule.
Define:
* <math>y = Ax</math>, the last <math>n</math> components of <math>h(x)</math>, i.e. <math>(h(x))_{n+1,\ldots,2n}</math>
* <math>z = x\cdot y = g(x,y)</math>
What is tricky is that <math>y</math> alone is not <math>h(x)</math>; to make the composition work, we must attach <math>x</math> to <math>y</math> to form the pair <math>(x,y)</math> before passing it to <math>g</math>.
Now the multivariable chain rule says:
:<math>\frac{\partial z}{\partial x_j} = \underbrace{\frac{\partial z}{\partial x_1}\frac{\partial x_1}{\partial x_j} + \cdots + \frac{\partial z}{\partial x_n}\frac{\partial x_n}{\partial x_j}}_{\text{first half of terms}} + \underbrace{\frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x_j} + \cdots + \frac{\partial z}{\partial y_n}\frac{\partial y_n}{\partial x_j}}_{\text{second half of terms}}</math>
The notation is confusing because <math>\frac{\partial z}{\partial x_j}</math> means different things on each side of the equation (since <math>x</math> is both the input variable and an intermediate variable).
Looking only at the first half of the terms, <math>\frac{\partial x_k}{\partial x_j}</math> is <math>1</math> if <math>k=j</math> and <math>0</math> otherwise, so we keep only the <math>j</math>th term, where we see <math>\frac{\partial z}{\partial x_j} = y_j</math>.
Now looking at the second half of the terms, <math>\frac{\partial z}{\partial y_k} = x_k</math> and <math>\frac{\partial y_k}{\partial x_j} = a_{kj}</math>.
Putting all the above together, we obtain
:<math>\frac{\partial z}{\partial x_j} = y_j + x_1 a_{1j} + \cdots + x_n a_{nj} = 2y_j</math>
In the last equality we used the fact that <math>A</math> is symmetric.
We now have the <math>j</math>th component of the derivative, so the full derivative is <math>2y = 2Ax</math>.
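The same bookkeeping can be written with explicit Jacobians, which also makes it easy to check numerically. In the sketch below (same setup as in the introduction), the Jacobian of <math>h</math> is the <math>2n \times n</math> matrix obtained by stacking <math>I</math> on top of <math>A</math>, the gradient of <math>g</math> at <math>(x,y)</math> is the concatenation of <math>y</math> and <math>x</math>, and the chain rule becomes a single matrix product.

<syntaxhighlight lang="python">
x = np.array([1.0, -2.0, 0.5])
y = A @ x
Dh = np.vstack([np.eye(n), A])   # Jacobian of h(x) = (x, Ax), shape (2n, n)
Dg = np.concatenate([y, x])      # gradient of g(x, y) = x . y, evaluated at (x, y)
print(np.allclose(Dg @ Dh, 2 * A @ x))  # expect True
</syntaxhighlight>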
See [http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.xhtml] for something similar.
