
Section 14.8 Hessians and the General Second Derivative Test

In Section 14.5, we saw that, just as for functions of a single variable (recall Section 3.1), local extreme values occur at critical points. Definition 14.5.3 defined a critical point \((a,b)\) of a function \(f(x,y)\) to be one where the gradient vanishes:
\begin{equation*} \nabla f(a,b) = \la f_x(a,b),f_y(a,b)\ra = \la 0,0\ra\text{.} \end{equation*}
Given a critical point for a function \(f\) of two variables, Theorem 14.5.13, the Second Derivative Test, tells us how to determine whether that critical point corresponds to a local minimum, local maximum, or saddle point. You might have been left wondering why the second derivative test looks so different in two variables. You might also have been left wondering what this test looks like if we have three or more variables!
The appearance of the quantity
\begin{equation*} D = f_{xx}(a,b)f_{yy}(a,b)-f_{xy}^{\,2}(a,b) \end{equation*}
seems a bit weird at first, but the idea is actually fairly simple, if you’re willing to accept Taylor’s Theorem without proof for functions of more than one variable. We already know that if \(f(x,y)\) is \(C^1\) (continuously differentiable), then we get the linear approximation
\begin{equation*} f(x,y) \approx f(a,b) +\nabla f(a,b)\cdot\langle x-a,y-b\rangle \end{equation*}
near a point \((a,b)\) in the domain of \(f\text{.}\) (Multiplying out the dot product above gives us the differential \(df\) defined in Definition 14.1.1.)
Taylor’s theorem tells us that if \(f\) is \(C^2\) (has continuous second-order derivatives), then we get the quadratic approximation
\begin{align*} f(x,y) \amp \approx f(a,b) + \nabla f(a,b)\cdot \langle x-a,y-b\rangle\\ \amp \, +\frac{1}{2}A(x-a)^2+B(x-a)(y-b)+\frac{1}{2}C(y-b)^2\text{,} \end{align*}
where \(A = \dfrac{\partial ^2 f}{\partial x^2}(a,b)\text{,}\) \(B = \dfrac{\partial^2 f}{\partial x\partial y}(a,b)\text{,}\) and \(C =\dfrac{\partial^2 f}{\partial y^2}(a,b)\text{.}\) (Compare this to the single-variable version: \(f(x)\approx f(a) + f'(a)(x-a)+\frac{1}{2}f''(a)(x-a)^2\text{.}\))
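If you would like to experiment with this approximation, here is a small SymPy sketch (the function \(f(x,y)=e^x\cos y\) and the point \((0,0)\) are chosen purely for illustration) that computes the constants \(A\text{,}\) \(B\text{,}\) \(C\) and assembles the quadratic approximation above:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x) * sp.cos(y)   # illustrative choice of f
a, b = 0, 0                 # illustrative choice of (a, b)

# first- and second-order partial derivatives, evaluated at (a, b)
fx = sp.diff(f, x).subs({x: a, y: b})
fy = sp.diff(f, y).subs({x: a, y: b})
A  = sp.diff(f, x, 2).subs({x: a, y: b})
B  = sp.diff(f, x, y).subs({x: a, y: b})
C  = sp.diff(f, y, 2).subs({x: a, y: b})

# quadratic approximation from the formula above
P2 = (f.subs({x: a, y: b}) + fx*(x - a) + fy*(y - b)
      + sp.Rational(1, 2)*A*(x - a)**2 + B*(x - a)*(y - b)
      + sp.Rational(1, 2)*C*(y - b)**2)
print(sp.expand(P2))        # x**2/2 + x - y**2/2 + 1
```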
Now, if \((a,b)\) is a critical point, then \(\nabla f(a,b)=\vec{0}\text{,}\) and we get the approximation
\begin{equation*} f(x,y) \approx k+ \frac{1}{2}\left(AX^2+2BXY+CY^2\right)\text{,} \end{equation*}
where \(k=f(a,b), X=x-a, Y=y-b\text{.}\) So it’s enough to understand the critical points of the function
\begin{equation*} g(x,y) = Ax^2+2Bxy+Cy^2\text{,} \end{equation*}
since \(f\) locally looks just like \(g\text{.}\) (We’ve basically just shifted the graph, and stretched it by a factor of 2 to get rid of the 1/2.)
Now, we can re-write \(g\) as follows, assuming \(A\neq 0\text{:}\)
\begin{align*} g(x,y) \amp = Ax^2+2Bxy+Cy^2\\ \amp = A\left(x^2+2\frac{B}{A}xy\right) + Cy^2\\ \amp = A\left(x+\frac{B}{A}y\right)^2 - \frac{B^2}{A}y^2+Cy^2\\ \amp = A\left(x+\frac{B}{A}y\right)^2 + \frac{1}{A}(AC-B^2)y^2\text{.} \end{align*}
Now we can see that this is basically just a paraboloid, as long as \(AC-B^2\neq 0\text{.}\) (Otherwise, we end up with a parabolic cylinder.)
If \(AC-B^2\gt0\) (note that this is just the discriminant \(D\text{!}\)), then the coefficients of the two terms have the same sign; if \(A\gt0\) we get an elliptic paraboloid opening upwards (local minimum), and if \(A\lt 0\) we get an elliptic paraboloid opening downwards (local maximum). If \(AC-B^2\lt 0\text{,}\) then the two terms have coefficients with opposite signs, and that gives us a hyperbolic paraboloid (saddle point).
And what if \(A=0\text{?}\) Well, in that case \(AC-B^2=-B^2\leq 0\text{,}\) so there are two cases. If \(B\neq 0\text{,}\) the second derivative test tells us to expect a saddle point, and indeed this is what we get: either \(C=0\) as well, and \(g(x,y) = 2Bxy\text{,}\) which is just a hyperbolic paraboloid rotated by \(\pi/4\) (its contour curves are the hyperbolas \(xy=c\)), or \(C\neq 0\text{,}\) in which case you can complete the square in \(y\) and check that the result is once again a hyperbolic paraboloid (exercise).
The other case is \(B=0\text{,}\) in which case \(D=0\text{,}\) so we can’t draw any conclusions from the second derivative test (although we’ll have \(g(x,y)=Cy^2\text{,}\) which is again a parabolic cylinder if \(C\neq 0\text{,}\) and the zero function if \(C=0\)).
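To see the test in action on a concrete example, take \(f(x,y)=x^3-3x+y^2\text{.}\) The critical points are \((\pm 1,0)\text{,}\) and
\begin{equation*} D = f_{xx}f_{yy}-f_{xy}^{\,2} = (6x)(2)-0^2 = 12x\text{.} \end{equation*}
At \((1,0)\) we have \(D=12\gt0\) and \(f_{xx}(1,0)=6\gt0\text{,}\) so \(f\) has a local minimum there; indeed, the quadratic approximation near \((1,0)\) is \(f(x,y)\approx -2+3(x-1)^2+y^2\text{,}\) an upward-opening elliptic paraboloid. At \((-1,0)\) we have \(D=-12\lt0\text{,}\) so \(f\) has a saddle point there.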
We will now explain how to state the second derivative test in general, for functions of \(n\) variables, where \(n=1,2,3,\ldots\text{.}\) We will also give an outline of the proof of this result. The proof requires the use of Taylor’s theorem for a function of several variables, which we will not prove, and a bit of terminology from linear algebra. Our sketch of the proof follows the exposition given in the text Vector Calculus, 4th edition, by Marsden and Tromba.

Subsection 14.8.1 Taylor Polynomials in Several Variables

Before getting to the general result, let’s take a brief detour and discuss Taylor polynomials. One way of thinking about differentiability of a function \(f:D\subseteq\mathbb{R}^n\to\mathbb{R}\) is to think of the linearization \(L(\vec{x})\) as the degree one Taylor polynomial
\begin{equation*} P_1(\vec{x}) = f(\vec{a})+\nabla f(\vec{a})\cdot(\vec{x}-\vec{a}) = f(\vec{a})+ \frac{\partial f}{\partial x_1}(\vec{a})(x_1-a_1)+\cdots + \frac{\partial f}{\partial x_n}(\vec{a})(x_n-a_n)\text{.} \end{equation*}
The requirement of differentiability is then that the remainder \(R_1(\vec{x}) = f(\vec{x})-P_1(\vec{x})\) goes to zero faster than \(\norm{\vec{x}-\vec{a}}\text{;}\) that is,
\begin{equation*} \lim_{\vec{x}\to\vec{a}}\frac{R_1(\vec{x})}{\norm{\vec{x}-\vec{a}}}=0\text{.} \end{equation*}
Using the terminology from Section 14.6, we say that \(f\) and \(P_1\) “agree to first order”. From here we can go on and ask for degree \(k\) Taylor polynomials \(P_k(\vec{x})\) that give a “\(k\)th-order approximation” of \(f\) near \(\vec{a}\text{.}\)
In other words, we want a polynomial
\begin{align*} P_k(x_1,\ldots, x_n) \amp = a_0 +a_{1}x_1+\cdots+a_{n}x_n+a_{11}x_1^2+a_{12}x_1x_2+\cdots+a_{nn}x_n^2\\ \amp \quad\quad\quad\quad +\cdots+a_{1\cdots 1}x_1^k+a_{1\cdots 12}x_1^{k-1}x_2+\cdots+a_{n\cdots n}x_n^k\text{,} \end{align*}
in \(n\) variables, of degree \(k\text{,}\) such that the remainder \(R_k(\vec{x}) = f(\vec{x})-P_k(\vec{x})\) satisfies \(R_k(\vec{x})\approx C\norm{\vec{x}-\vec{a}}^l\text{,}\) with \(l\gt k\text{.}\) In terms of limits, this means
\begin{equation*} \lim_{\vec{x}\to\vec{a}}\frac{R_k(\vec{x})}{\norm{\vec{x}-\vec{a}}^k}=0\text{.} \end{equation*}
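One practical way to compute such a \(P_k\) for a concrete function is to expand \(f(\vec{a}+t\vec{h})\) as a single-variable Taylor polynomial in \(t\) and then set \(t=1\text{.}\) Here is a SymPy sketch of that idea (the function \(f(x,y)=\sin(x)e^y\text{,}\) the point \((0,0)\text{,}\) and the degree \(k=2\) are illustrative choices):

```python
import sympy as sp

x, y, t, h1, h2 = sp.symbols('x y t h1 h2')
f = sp.sin(x) * sp.exp(y)            # illustrative choice of f
a, b = 0, 0                          # illustrative expansion point

# Expand f(a + t*h1, b + t*h2) in powers of t, keeping terms up to t**2.
g = f.subs({x: a + t*h1, y: b + t*h2})
taylor_t = sp.series(g, t, 0, 3).removeO()

# Setting t = 1 makes h1 = x - a and h2 = y - b, giving the degree-2 polynomial P_2.
P2 = sp.expand(taylor_t.subs(t, 1))
print(P2)                            # h1*h2 + h1, i.e. (x - a)(y - b) + (x - a)
```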
You’ve probably already noticed a problem with talking about higher-order polynomials in several variables: the notation gets really messy, since there are so many more possible terms! For example, even a relatively simple case like a degree 3 polynomial in 3 variables looks like
\begin{align*} P(x,y,z) \amp = a+bx+cy+dz+ex^2+fxy+gxz+hy^2+kyz+lz^2\\ \amp \quad\quad\quad\quad +mx^3+nx^2y+oxy^2+pxyz+qx^2z+rxz^2+sy^3+ty^2z+uyz^2+vz^3 \end{align*}
for constants \(a,b,c,d,e,f,g,h,k,l,m,n,o,p,q,r,s,t,u,v\text{!}\)
Usually we get around this using “multi-index” notation. We let \(\alpha=(a_1,\ldots, a_n)\) denote an \(n\)-tuple of non-negative integers, and then we define \(\vec{x}^\alpha = x_1^{a_1}x_2^{a_2}\cdots x_n^{a_n}\text{,}\) \(\lvert\alpha\rvert=a_1+\cdots +a_n\) (so that \(\vec{x}^\alpha\) is a monomial of order \(\lvert\alpha\rvert\)), and we denote a possible coefficient of \(\vec{x}^\alpha\) by \(a_\alpha\text{.}\) A general \(k^{\text{th}}\)-order polynomial then looks like
\begin{equation*} P_k(\vec{x}) = \sum_{\lvert\alpha\rvert=0}^k a_\alpha \vec{x}^\alpha\text{.} \end{equation*}
For example, in 3 variables, the terms where \(\lvert\alpha\rvert=3\) would involve \(\alpha = (3,0,0)\text{,}\) \((2,1,0)\text{,}\) \((2,0,1)\text{,}\) \((1,2,0)\text{,}\) \((1,1,1)\text{,}\) \((1,0,2)\text{,}\) \((0,3,0)\text{,}\) \((0,2,1)\text{,}\) \((0,1,2)\text{,}\) \((0,0,3)\text{,}\) so in the above polynomial \(m=a_{(3,0,0)},\, n = a_{(2,1,0)}\text{,}\) etc., with \(\vec{x}^{(3,0,0)} = x^3,\, \vec{x}^{(2,1,0)} = x^2y\text{,}\) and so on. (Note that \(\alpha = (0,\ldots, 0)\) is the only multi-index with \(\lvert\alpha\rvert=0\)).
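If you want to enumerate multi-indices programmatically (for instance, to check that there are exactly ten of order 3 in three variables, as listed above), a short Python sketch using only the standard library suffices:

```python
from itertools import product

def multi_indices(n, k):
    """All n-tuples of non-negative integers alpha with |alpha| = k."""
    return [alpha for alpha in product(range(k + 1), repeat=n) if sum(alpha) == k]

order_three = multi_indices(3, 3)
print(len(order_three))   # 10
print(order_three)        # (0, 0, 3), (0, 1, 2), ..., (3, 0, 0)
```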
With all of that notational unpleasantness out of the way, we can say what the \(k^{\textrm{th}}\)-order Taylor polynomial for \(f\) near \(\vec{a}\) should be: Taylor’s Theorem, generalized to \(n\) variables, states that
\begin{equation*} P_k(\vec{x}) = \sum_{\lvert\alpha\rvert=0}^k \frac{f^{(\alpha)}(\vec{a})}{\alpha!}(\vec{x}-\vec{a})^\alpha\text{,} \end{equation*}
where \(\alpha! = a_1!a_2!\cdots a_n!\text{,}\) and
\begin{equation*} f^{(\alpha)}(\vec{a}) = \left(\frac{\partial^{a_1}}{\partial x_1^{a_1}}\frac{\partial^{a_2}}{\partial x_2^{a_2}}\cdots\frac{\partial^{a_n}}{\partial x_n^{a_n}}f\right)(\vec{a})\text{.} \end{equation*}
As an exercise, check that putting \(k=1\) reproduces the linearization \(P_1(\vec{x})\) (note that if \(\lvert\alpha\rvert=1\) we have to have \(\alpha = (1,0,\ldots, 0),\, (0,1,0,\ldots, 0),\) etc.), and that putting \(k=2\) gives the quadratic approximation discussed below.
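For instance, in two variables, the multi-indices with \(\lvert\alpha\rvert=2\) are \((2,0)\text{,}\) \((1,1)\text{,}\) and \((0,2)\text{,}\) with \(\alpha! = 2, 1, 2\) respectively, so taking \(k=2\) gives
\begin{align*} P_2(x,y) \amp = f(a,b)+f_x(a,b)(x-a)+f_y(a,b)(y-b)\\ \amp \, +\frac{1}{2}f_{xx}(a,b)(x-a)^2+f_{xy}(a,b)(x-a)(y-b)+\frac{1}{2}f_{yy}(a,b)(y-b)^2\text{,} \end{align*}
which is exactly the quadratic approximation from the start of this section.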

Subsection 14.8.2 Quadratic Functions in Several Variables

Let \(A=[a_{ij}]\) be an \(n\times n\) matrix. We say that \(A\) is symmetric if \(A^T=A\text{,}\) or equivalently, if \(a_{ij} = a_{ji}\) for each \(i,j\) between 1 and \(n\text{.}\) To each such \(A\) we can associate a quadratic function \(q:\mathbb{R}^n\to \mathbb{R}\) given by
\begin{equation*} q(\vec{x}) = \vec{x}\cdot (A\cdot \vec{x})\text{,} \end{equation*}
or in terms of components,
\begin{equation*} q(x_1,\ldots, x_n) = \sum_{i,j=1}^n a_{ij}x_ix_j\text{.} \end{equation*}
We say that \(A\) is non-degenerate if \(\det A\neq 0\text{;}\) this is equivalent to saying that \(A\) is invertible, or that \(A\vec{x}=\vec{0}\) is possible only if \(\vec{x}=\vec{0}\text{.}\) (Note however that the corresponding property does not hold for \(q\text{:}\) it is possible to have \(q(\vec{x})=0\) for \(\vec{x}\neq \vec{0}\) even if the corresponding matrix \(A\) is non-degenerate.) For example, the quadratic function \(q(x,y)=x^2-y^2\) has \(q(1,1)=0\) and corresponds to the non-degenerate matrix \(\begin{bmatrix}1\amp 0\\0\amp -1\end{bmatrix}\text{.}\)
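As a quick numerical illustration of this last point, here is a NumPy sketch using the matrix from the example above:

```python
import numpy as np

A = np.array([[1.0,  0.0],
              [0.0, -1.0]])          # the matrix of q(x, y) = x**2 - y**2

def q(x):
    """The quadratic function q(x) = x . (A x)."""
    x = np.asarray(x, dtype=float)
    return x @ (A @ x)

print(np.linalg.det(A))   # -1.0, so A is non-degenerate
print(q([1.0, 1.0]))      #  0.0, yet q vanishes on the nonzero vector (1, 1)
```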
A quadratic function \(q\) is called positive-definite if \(q(\vec{x})\geq 0\) for all \(\vec{x}\in\mathbb{R}^n\text{,}\) and \(q(\vec{x})=0\) only for \(\vec{x}=\vec{0}\text{.}\) (Note that the quadratic function \(q(x,y)=x^2-y^2\) given above is not positive definite; however, \(\tilde{q}(x,y) = x^2+y^2\) is.) Similarly, \(q\) is negative-definite if \(q(\vec{x})\leq 0\) for all \(\vec{x}\in\mathbb{R}^n\) with \(q(\vec{x})=0\) for \(\vec{x}=\vec{0}\) only.
If \(q(\vec{x}) = \vec{x}\cdot A\vec{x}\) is positive(negative)-definite, we refer to the corresponding symmetric matrix \(A\) as positive(negative)-definite as well. In general it can be difficult to determine when a given quadratic function (or its corresponding matrix) is positive or negative-definite. In the case of a \(2\times 2\) matrix \(A = \begin{bmatrix} a\amp b\\b\amp c\end{bmatrix}\) we get
\begin{align*} q(x_1,x_2)\amp = ax_1^2+2bx_1x_2+cx_2^2\\ \amp =a\left(x_1+\frac{b}{a}x_2\right)^2+\left(c-\frac{b^2}{a}\right)x_2^2\text{,} \end{align*}
by completing the square (assuming \(a\neq 0\)). If \(q\) is positive-definite, then since \(q(x_1,0)=ax_1^2\gt0\) for \(x_1\neq 0\) we get \(a\gt0\text{,}\) and since \(q\left(-\tfrac{b}{a}x_2,x_2\right)=\left(c-\tfrac{b^2}{a}\right)x_2^2\gt0\) for \(x_2\neq 0\text{,}\) we get \(c-\tfrac{b^2}{a}\gt0\text{,}\) and hence \(ac-b^2=\det A \gt0\text{.}\) Conversely, if \(a\gt0\) and \(\det A\gt0\text{,}\) both terms above are non-negative and vanish only when \(x_1=x_2=0\text{,}\) so \(q\) is positive-definite. Similarly, \(q\) is negative-definite if and only if \(a\lt 0\) and \(\det A\gt0\text{.}\)
For an \(n\times n\) matrix, one test is as follows: consider the sequence of \(j\times j\) matrices \(A_j\text{,}\) for \(j=1,\ldots, n\text{,}\) given by \(A_1=[a_{11}]\text{,}\) \(A_2 = \begin{bmatrix} a_{11}\amp a_{12}\\a_{21}\amp a_{22}\end{bmatrix},\ldots, A_n=A\) (i.e. we take upper-left square sub-matrices of increasing size). Then \(A\) is positive-definite if and only if \(\det A_j\gt0\) for each \(j=1,\ldots, n\text{,}\) and \(A\) is negative-definite if and only if the signs of \(\det A_j\) alternate, starting with a negative sign. (So \(\det A_1 = a_{11}\lt 0, \det A_2\gt0, \det A_3\lt 0,\ldots\text{.}\))
Another approach, which is more illuminating but requires more advanced linear algebra, is to use the fact that for any symmetric matrix \(A\text{,}\) there exists a change of basis such that \(A\) becomes a diagonal matrix \(\tilde{A}\) with respect to that basis. (i.e. \(A\) can be diagonalized.) If the entries \(\tilde{a}_{ii}\) of \(\tilde{A}\) along the main diagonal (that is, the eigenvalues of \(A\)) are all non-zero, then \(A\) is non-degenerate. If they are all positive, then \(A\) is positive-definite. If they are all negative, then \(A\) is negative-definite.
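Both tests are easy to carry out numerically. The following NumPy sketch (with an illustrative symmetric matrix) computes the upper-left determinants and the eigenvalues:

```python
import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])   # an illustrative symmetric matrix

# Test 1: determinants of the upper-left j x j sub-matrices
minors = [np.linalg.det(A[:j, :j]) for j in range(1, A.shape[0] + 1)]
print(minors)                        # approximately [2.0, 3.0, 4.0] -- all positive

# Test 2: eigenvalues (eigvalsh is for symmetric matrices)
print(np.linalg.eigvalsh(A))         # approximately [0.59, 2.0, 3.41] -- all positive

# Both tests agree that A is positive-definite.
```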
We will need the following lemma below, which is a consequence of the Extreme Value Theorem.

Theorem 14.8.1.

If \(q:\mathbb{R}^n\to\mathbb{R}\) is a positive-definite quadratic function, then there is a constant \(M\gt0\) such that \(q(\vec{x})\geq M\lVert\vec{x}\rVert^2\) for all \(\vec{x}\in\mathbb{R}^n\text{.}\)
To see that this is true, consider \(q(\vec{x})\) on the set \(B\) of all \(\vec{x}\) with \(\lVert \vec{x}\rVert = 1\text{.}\) The set \(B\) is closed and bounded and \(q\) is continuous on \(B\text{,}\) so by the Extreme Value Theorem, \(q\) must attain a minimum value \(M\) for some \(\vec{a}\in B\text{.}\) Since \(q\) is positive-definite and \(\vec{a}\neq\vec{0}\text{,}\) we have \(M=q(\vec{a})\gt0\text{.}\) Now, for any constant \(c\in\mathbb{R}\text{,}\) the fact that \(q\) is quadratic implies that \(q(c\vec{x}) = c^2q(\vec{x})\text{.}\) For any non-zero \(\vec{x}\in \mathbb{R}^n\text{,}\) we know that \(\dfrac{\vec{x}}{\lVert\vec{x}\rVert}\in B\text{,}\) and thus, we have
\begin{equation*} q(\vec{x}) = q\left(\lVert\vec{x}\rVert\frac{\vec{x}}{\lVert\vec{x}\rVert}\right)=\lVert\vec{x}\rVert^2q\left(\frac{\vec{x}}{\lVert\vec{x}\rVert}\right)\geq M\lVert\vec{x}\rVert^2\text{.} \end{equation*}
Finally, if \(\vec{x}=\vec{0}\) we get \(q(\vec{0})=0=M\lVert\vec{0}\rVert^2\text{.}\)
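For a symmetric matrix \(A\text{,}\) the constant \(M\) in this argument is just the smallest eigenvalue of \(A\) (the minimum of \(q\) on the unit sphere). A quick numerical sanity check of the inequality, with an illustrative positive-definite matrix built in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B.T @ B + np.eye(3)                 # symmetric and positive-definite by construction

M = np.linalg.eigvalsh(A).min()         # smallest eigenvalue = min of q on the unit sphere

X = rng.standard_normal((1000, 3))                        # a batch of test vectors
q_vals = np.einsum('ij,jk,ik->i', X, A, X)                # q(x) = x . (A x) for each row
print(np.all(q_vals >= M * np.sum(X**2, axis=1) - 1e-9))  # True
```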

Subsection 14.8.3 The Hessian Matrix of a Real-Valued Function

Definition 14.8.2. The Hessian Matrix.

Let \(f:\mathbb{R}^n\to \mathbb{R}\) be a function with continuous second-order partial derivatives. We define the Hessian matrix of \(f\) at a point \(\vec{a}\) in the domain of \(f\) to be the \(n\times n\) symmetric matrix
\begin{equation*} \Hess f(\vec{a}) = \frac{1}{2}\begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(\vec{a}) \amp \frac{\partial^2 f}{\partial x_1\partial x_2}(\vec{a})\amp \cdots \amp \frac{\partial^2 f}{\partial x_1\partial x_n}(\vec{a})\\ \frac{\partial^2 f}{\partial x_2\partial x_1}(\vec{a}) \amp \frac{\partial^2 f}{\partial x_2^2}(\vec{a})\amp \cdots \amp \frac{\partial^2 f}{\partial x_2\partial x_n}(\vec{a})\\ \vdots \amp \vdots \amp \amp \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1}(\vec{a}) \amp \frac{\partial^2 f}{\partial x_n\partial x_2}(\vec{a})\amp \cdots \amp \frac{\partial^2 f}{\partial x_n^2}(\vec{a}) \end{bmatrix}\text{.} \end{equation*}
Note that \(\Hess f(\vec{a})\) is symmetric by Theorem 11.3.15. The factor of \(1/2\) is included for convenience with respect to Taylor’s theorem. Recall that for a function of one variable, the second-order Taylor polynomial of \(f\) about \(x=a\) is
\begin{equation*} P_2(x)=f(a)+f'(a)(x-a)+\frac{1}{2}f''(a)(x-a)^2\text{.} \end{equation*}
For \(\vec{x}\in\mathbb{R}^n\text{,}\) let us define the quadratic function \(h_{f,\vec{a}}(\vec{x}) = \vec{x}\cdot (\Hess f(\vec{a})\cdot\vec{x})\) associated to the Hessian of \(f\) at \(\vec{a}\text{.}\) Taylor’s theorem for functions of several variables tells us that if all the third derivatives of \(f\) are continuous, then near \(\vec{a}\in\mathbb{R}^n\) we have
\begin{equation} f(\vec{x}) = f(\vec{a}) + \nabla f(\vec{a})\cdot (\vec{x}-\vec{a}) + h_{f,\vec{a}}(\vec{x}-\vec{a}) + R(\vec{a},\vec{x})\text{,}\tag{14.8.1} \end{equation}
where the remainder term \(R(\vec{a},\vec{x})\) satisfies
\begin{equation} \lim_{\vec{x}\to\vec{a}}\frac{R(\vec{a},\vec{x})}{\lVert \vec{x}-\vec{a}\rVert^2}=0\text{.}\tag{14.8.2} \end{equation}
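Continuing the earlier illustrative example \(f(x,y)=e^x\cos y\) at \((0,0)\text{,}\) here is a SymPy sketch that builds \(\Hess f(\vec{a})\) (including the factor of \(1/2\)) and assembles the right-hand side of Equation (14.8.1) without the remainder:

```python
import sympy as sp

x, y = sp.symbols('x y')
X = sp.Matrix([x, y])
f = sp.exp(x) * sp.cos(y)                    # same illustrative f as before
a = sp.Matrix([0, 0])                        # expansion point

grad = sp.Matrix([sp.diff(f, v) for v in (x, y)]).subs({x: a[0], y: a[1]})
H = sp.Rational(1, 2) * sp.hessian(f, (x, y)).subs({x: a[0], y: a[1]})  # note the 1/2

d = X - a                                    # the displacement x - a
quadratic = f.subs({x: a[0], y: a[1]}) + (grad.T * d)[0] + (d.T * H * d)[0]
print(sp.expand(quadratic))                  # x**2/2 + x - y**2/2 + 1
```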
Finally, let us define a critical point \(\vec{a}\) for \(f\) to be non-degenerate if \(\Hess f(\vec{a})\) is non-degenerate. Now we’re ready to state the general second derivative test: if \(\vec{a}\) is a non-degenerate critical point of \(f\text{,}\) then \(f\) has a local minimum at \(\vec{a}\) if \(\Hess f(\vec{a})\) is positive-definite, a local maximum at \(\vec{a}\) if \(\Hess f(\vec{a})\) is negative-definite, and a saddle point at \(\vec{a}\) if \(\Hess f(\vec{a})\) is neither positive- nor negative-definite.
The way to think about this intuitively is the following: the matrix \(\Hess f(\vec{a})\) is symmetric. We know from Linear Algebra that every symmetric matrix can be diagonalized. Less obvious (but still true) is that we can make a (linear) change of variables \((u_1,\ldots, u_n) = T(x_1,\ldots, x_n)\) so that the vectors in the direction of the \(u_i\) coordinate axes are eigenvectors for \(\Hess f(\vec{a})\text{.}\) Slightly harder to show (but also true) is that this change of variables can be chosen so that it is orthogonal. That is, we simply have to rotate our coordinate system: lengths and angles are all preserved.
In this new coordinate system, the Hessian matrix is diagonal:
\begin{equation*} \Hess f(\vec{a}) = \begin{bmatrix} \lambda_1 \amp 0 \amp \cdots \amp 0\\ 0 \amp \lambda_2 \amp \cdots \amp 0\\ \vdots \amp \vdots \amp \ddots \amp \vdots\\ 0 \amp 0 \amp \cdots \amp \lambda_n\end{bmatrix}\text{.} \end{equation*}
If each of the eigenvalues \(\lambda_1,\ldots, \lambda_n\) is positive, the Hessian is positive-definite, and our critical point is a local minimum. If all the eigenvalues are negative, our critical point is a local maximum. If some of the eigenvalues are positive and some are negative, we have a saddle point.
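In code, this classification amounts to checking the signs of the eigenvalues. A minimal NumPy sketch (the helper name and the sample Hessian are illustrative):

```python
import numpy as np

def classify(hessian, tol=1e-12):
    """Classify a critical point by the eigenvalue signs of its Hessian matrix."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.any(np.abs(eigenvalues) < tol):
        return "degenerate: test is inconclusive"
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    return "saddle point"

# Hess f(0,0,0) for f(x,y,z) = x**2 + y**2 - z**2 (with the factor of 1/2 included)
print(classify(np.diag([1.0, 1.0, -1.0])))   # saddle point
```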
Proving the result is somewhat more technical. Suppose \(\vec{a}\) is a critical point for \(f\text{,}\) and that \(\Hess f(\vec{a})\) is positive-definite. We know that \(\nabla f(\vec{a})=\vec{0}\) at a critical point, so from Equation (14.8.1) we get
\begin{equation*} f(\vec{x})-f(\vec{a}) = h_{f,\vec{a}}(\vec{x}-\vec{a})+R(\vec{a},\vec{x})\text{.} \end{equation*}
Theorem 14.8.1 tells us that \(h_{f,\vec{a}}(\vec{x}-\vec{a})\geq M\lVert\vec{x}-\vec{a}\rVert^2\) for some \(M\gt0\text{,}\) and by Equation (14.8.2), there exists a \(\delta\gt0\) such that whenever \(0\lt \lVert \vec{x}-\vec{a}\rVert\lt \delta\text{,}\) we get \(\lvert R(\vec{a},\vec{x})\rvert\lt M\lVert \vec{x}-\vec{a}\rVert^2\text{.}\) (Take \(\epsilon=M\) in the definition of the limit.)
If we carefully put all this together, we can show that
\begin{equation*} h_{f,\vec{a}}(\vec{x}-\vec{a})+R(\vec{a},\vec{x})\gt0\text{,} \end{equation*}
since
\begin{equation*} h_{f,\vec{a}}(\vec{x}-\vec{a})\geq M\lVert \vec{x}-\vec{a}\rVert^2\gt \lvert R(\vec{a},\vec{x})\rvert\text{.} \end{equation*}
Substituting this into the above equation, we get \(f(\vec{x})-f(\vec{a})\gt0\) for any \(\vec{x}\) with \(0\lt \lVert\vec{x}-\vec{a}\rVert\lt \delta\text{,}\) and thus \(f\) has a local minimum at \(\vec{a}\in\mathbb{R}^n\text{.}\) The case of a local maximum can be handled similarly (or by replacing \(f\) with \(-f\)).
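Putting everything together, here is a SymPy sketch that finds and classifies the critical points of the three-variable analogue of the earlier example, \(f(x,y,z)=x^3-3x+y^2+z^2\) (the function is again an illustrative choice):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**3 - 3*x + y**2 + z**2                  # illustrative choice of f
variables = (x, y, z)

grad = [sp.diff(f, v) for v in variables]
critical_points = sp.solve(grad, variables, dict=True)

for point in critical_points:
    # scaling the Hessian by 1/2 would not change the eigenvalue signs, so we omit it here
    H = sp.hessian(f, variables).subs(point)
    eigenvalues = list(H.eigenvals())
    if all(ev > 0 for ev in eigenvalues):
        kind = "local minimum"
    elif all(ev < 0 for ev in eigenvalues):
        kind = "local maximum"
    elif all(ev != 0 for ev in eigenvalues):
        kind = "saddle point"
    else:
        kind = "degenerate"
    print(point, kind)
# {x: -1, y: 0, z: 0} saddle point
# {x: 1, y: 0, z: 0} local minimum   (order of the output may vary)
```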