## Section 14.8 Hessians and the General Second Derivative Test

In Section 14.5, we saw that, just as for functions of a single variable (recall Section 3.1), local extreme values occur at critical points. Definition 14.5.3 defined a critical point \((a,b)\) of a function \(f(x,y)\) to be one where the gradient vanishes:

\[
\nabla f(a,b) = \vec{0}\text{.}
\]

Given a critical point for a function \(f\) of two variables, Theorem 14.5.13, the Second Derivative Test, tells us how to determine whether that critical point corresponds to a local minimum, local maximum, or saddle point. You might have been left wondering why the second derivative test looks so different in two variables. You might also have been left wondering what this test looks like if we have three or more variables!

The appearance of the quantity

\[
D = f_{xx}(a,b)f_{yy}(a,b) - \bigl(f_{xy}(a,b)\bigr)^2
\]

seems a bit weird at first, but the idea is actually fairly simple, if you're willing to accept Taylor's Theorem without proof for functions of more than one variable. We already know that if \(f(x,y)\) is \(C^1\) (continuously differentiable), then we get the linear approximation

\[
f(x,y)\approx f(a,b) + \nabla f(a,b)\cdot \langle x-a, y-b\rangle
\]

near a point \((a,b)\) in the domain of \(f\text{.}\) (Multiplying out the dot product above gives us the differential \(df\) defined in Definition 14.1.1.)

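Taylor's theorem tells us that if \(f\) is \(C^2\) (has continuous second-order derivatives), then we get the *quadratic* approximation

\[
f(x,y)\approx f(a,b) + \nabla f(a,b)\cdot \langle x-a, y-b\rangle + \frac{1}{2}\left(A(x-a)^2 + 2B(x-a)(y-b) + C(y-b)^2\right)\text{,}
\]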

where \(A = \dfrac{\partial ^2 f}{\partial x^2}(a,b)\text{,}\) \(B = \dfrac{\partial^2 f}{\partial x\partial y}(a,b)\text{,}\) and \(C =\dfrac{\partial^2 f}{\partial y^2}(a,b)\text{.}\) (Compare this to the single-variable version: \(f(x)\approx f(a) + f'(a)(x-a)+\frac{1}{2}f''(a)(x-a)^2\text{.}\))

Now, if \((a,b)\) is a critical point, then \(\nabla f(a,b)=\vec{0}\text{,}\) and we get the approximation

\[
f(x,y)\approx k + \frac{1}{2}\left(AX^2 + 2BXY + CY^2\right)\text{,}
\]

where \(k=f(a,b), X=x-a, Y=y-b\text{.}\) So it's enough to understand the critical points of the function

\[
g(x,y) = Ax^2 + 2Bxy + Cy^2\text{,}
\]

since \(f\) locally looks just like \(g\text{.}\) (We've basically just done a shift of the graph, and stretched by a factor of 2 to get rid of the 1/2.)

Now, we can re-write \(g\) as follows, assuming \(A\neq 0\text{:}\)

\[
g(x,y) = A\left(x+\frac{B}{A}y\right)^2 + \frac{AC-B^2}{A}\,y^2\text{.}
\]

Now we can see that this is basically just a paraboloid, as long as \(AC-B^2\neq 0\text{.}\) (Otherwise, we end up with a parabolic cylinder.)

If \(AC-B^2\gt0\) (note that this is just the discriminant \(D\text{!}\)), then the coefficients of both terms have the same sign; if \(A\gt0\) we get an elliptic paraboloid opening upwards (local minimum), and if \(A\lt 0\) we get an elliptic paraboloid opening downwards (local maximum). If \(AC-B^2\lt 0\text{,}\) then the two terms have coefficients with opposite signs, and that gives us a hyperbolic paraboloid (saddle point).

And what if \(A=0?\) Well, in that case \(AC-B^2=-B^2\leq 0\text{,}\) so there are two cases: if \(B\neq 0\text{,}\) the second derivative test tells us to expect a saddle point, and indeed this is what we get. Either \(C=0\) as well, and \(g(x,y) = 2Bxy\text{,}\) which is just a hyperbolic paraboloid rotated by \(\pi/4\) (its contour curves are the hyperbolas \(xy=c\)), or \(C\neq 0\text{,}\) in which case you can complete the square in \(y\text{,}\) and check that the result is once again a hyperbolic paraboloid (exercise).

The other case is if \(B=0\text{,}\) in which case \(D=0\text{,}\) so we can't make any conclusions from the second derivative test (although we'll have \(g(x,y)=Cy^2\text{,}\) which is again a parabolic cylinder).
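The case analysis above can be summarized as a short Python sketch. The function name `classify_quadratic` is our own illustrative choice, not something from the text; it takes the second-derivative values \(A\text{,}\) \(B\text{,}\) \(C\) at a critical point and reports what the two-variable test concludes.

```python
def classify_quadratic(A, B, C):
    """Classify the critical point at the origin of
    g(x, y) = A x^2 + 2 B x y + C y^2 using the discriminant D = AC - B^2."""
    D = A * C - B * B
    if D > 0:
        # D > 0 forces A != 0, and both squared terms share the sign of A.
        return "local minimum" if A > 0 else "local maximum"
    if D < 0:
        # Coefficients of opposite sign (or A = 0 with B != 0): a saddle.
        return "saddle point"
    # D = 0: the test gives no conclusion (e.g. a parabolic cylinder).
    return "inconclusive"


# f(x, y) = x^2 + xy + y^2 has A = 2, B = 1, C = 2 at the origin.
print(classify_quadratic(2, 1, 2))
# f(x, y) = x^2 - y^2 has A = 2, B = 0, C = -2.
print(classify_quadratic(2, 0, -2))
# A = 0 with B != 0, as in g(x, y) = 2xy.
print(classify_quadratic(0, 1, 0))
# A = B = 0, as in g(x, y) = C y^2.
print(classify_quadratic(0, 0, 3))
```

Note that the branch for \(D\lt 0\) covers the case \(A=0\text{,}\) \(B\neq 0\) without any special handling, which matches the discussion above.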

We will now explain how to state the second derivative test in general, for functions of \(n\) variables, where \(n=1,2,3,\ldots\text{.}\) We will also give an outline of the proof of this result. The proof requires the use of Taylor's theorem for a function of several variables, which we will not prove, and a bit of terminology from linear algebra. Our sketch of the proof follows the exposition given in the text *Vector Calculus*, 4th edition, by Marsden and Tromba.

### Subsection 14.8.1 Taylor Polynomials in Several Variables

Before getting to the general result, let's take a brief detour and discuss Taylor polynomials. One way of thinking about differentiability of a function \(f:D\subseteq\mathbb{R}^n\to\mathbb{R}\) is to think of the linearization \(L(\vec{x})\) as the degree one Taylor polynomial

\[
P_1(\vec{x}) = L(\vec{x}) = f(\vec{a}) + \nabla f(\vec{a})\cdot(\vec{x}-\vec{a})\text{.}
\]

The requirement of differentiability is then that the remainder \(R_1(\vec{x}) = f(\vec{x})-P_1(\vec{x})\) goes to zero faster than \(\norm{\vec{x}-\vec{a}}\text{;}\) that is,

\[
\lim_{\vec{x}\to\vec{a}}\frac{R_1(\vec{x})}{\norm{\vec{x}-\vec{a}}} = 0\text{.}
\]

Using the terminology from Section 14.6, we say that \(f\) and \(P_1\) “agree to first order”. From here we can go on and ask for degree \(k\) Taylor polynomials \(P_k(\vec{x})\) that give a “\(k\)th-order approximation” of \(f\) near \(\vec{a}\text{.}\)

In other words, we want a polynomial \(P_k(\vec{x}) = P_k(x_1,\ldots, x_n)\) in \(n\) variables, of degree \(k\text{,}\) such that the remainder \(R_k(\vec{x}) = f(\vec{x})-P_k(\vec{x})\) satisfies \(R_k(\vec{x})\approx C\norm{\vec{x}-\vec{a}}^l\text{,}\) with \(l\gt k\text{.}\) In terms of limits, this means

\[
\lim_{\vec{x}\to\vec{a}}\frac{R_k(\vec{x})}{\norm{\vec{x}-\vec{a}}^k} = 0\text{.}
\]

You've probably already noticed a problem with talking about higher-order polynomials in several variables: the notation gets really messy, since there are so many more possible terms! For example, even a relatively simple case like a degree 3 polynomial in 3 variables looks like

\[
\begin{gathered}
P(x,y,z) = a+bx+cy+dz+ex^2+fxy+gxz+hy^2+iyz+jz^2\\
{}+kx^3+lx^2y+mx^2z+nxy^2+oxyz+pxz^2+qy^3+ry^2z+syz^2+tz^3
\end{gathered}
\]

for constants \(a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t\text{!}\)

Usually we get around this using “multi-index” notation. We let \(\alpha=(a_1,\ldots, a_n)\) denote an \(n\)-tuple of non-negative integers, and then we define \(\vec{x}^\alpha = x_1^{a_1}x_2^{a_2}\cdots x_n^{a_n}\text{,}\) \(\lvert\alpha\rvert=a_1+\cdots +a_n\) (so that \(\vec{x}^\alpha\) is a monomial of order \(\lvert\alpha\rvert\)), and we denote a possible coefficient of \(\vec{x}^\alpha\) by \(a_\alpha\text{.}\) A general \(k^{\text{th}}\)-order polynomial then looks like

\[
P(\vec{x}) = \sum_{\lvert\alpha\rvert=0}^{k} a_\alpha\vec{x}^\alpha\text{.}
\]

For example, in 3 variables, the terms where \(\lvert\alpha\rvert=3\) would involve \(\alpha = (3,0,0)\text{,}\) \((2,1,0)\text{,}\) \((2,0,1)\text{,}\) \((1,2,0)\text{,}\) \((1,1,1)\text{,}\) \((1,0,2)\text{,}\) \((0,3,0)\text{,}\) \((0,2,1)\text{,}\) \((0,1,2)\text{,}\) \((0,0,3)\text{,}\) so in the above polynomial \(k=a_{(3,0,0)},\, l = a_{(2,1,0)}\text{,}\) etc., with \(\vec{x}^{(3,0,0)} = x^3,\, \vec{x}^{(2,1,0)} = x^2y\text{,}\) and so on. (Note that \(\alpha = (0,\ldots, 0)\) is the only multi-index with \(\lvert\alpha\rvert=0\text{.}\))
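To make the multi-index bookkeeping concrete, here is a minimal Python sketch that enumerates multi-indices and evaluates monomials \(\vec{x}^\alpha\text{.}\) The helper names `multi_indices` and `monomial` are hypothetical, chosen only for this illustration.

```python
from itertools import product


def multi_indices(n, order):
    """Return all n-tuples of non-negative integers whose entries sum to `order`."""
    return [alpha for alpha in product(range(order + 1), repeat=n)
            if sum(alpha) == order]


def monomial(x, alpha):
    """Evaluate x^alpha = x_1^{a_1} * x_2^{a_2} * ... * x_n^{a_n}."""
    value = 1
    for xi, ai in zip(x, alpha):
        value *= xi ** ai
    return value


cubic = multi_indices(3, 3)
print(len(cubic))                                         # 10 multi-indices of order 3
print(sum(len(multi_indices(3, k)) for k in range(4)))    # 20 terms of degree <= 3
print(monomial((2, 3, 5), (2, 1, 0)))                     # x^2 y at (2, 3, 5) is 12
```

In three variables there are 10 multi-indices of order 3, including \((1,1,1)\) for the monomial \(xyz\text{,}\) and 20 multi-indices of order at most 3, which is the number of coefficients in a general cubic.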

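With all of that notational unpleasantness out of the way, we can say what the \(k^{\textrm{th}}\)-order Taylor polynomial for \(f\) near \(\vec{a}\) should be: Taylor's Theorem, generalized to \(n\) variables, states that

\[
P_k(\vec{x}) = \sum_{\lvert\alpha\rvert=0}^{k}\frac{1}{\alpha!}\,\partial^\alpha f(\vec{a})\,(\vec{x}-\vec{a})^\alpha\text{,}
\]

where \(\alpha! = a_1!a_2!\cdots a_n!\text{,}\) and

\[
\partial^\alpha f = \frac{\partial^{\lvert\alpha\rvert}f}{\partial x_1^{a_1}\partial x_2^{a_2}\cdots\partial x_n^{a_n}}\text{.}
\]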

As an exercise, check that putting \(k=1\) reproduces the linearization \(P_1(\vec{x})\) (note that if \(\lvert\alpha\rvert=1\) we have to have \(\alpha = (1,0,\ldots, 0),\, (0,1,0,\ldots, 0),\) etc.), and that putting \(k=2\) gives the quadratic approximation discussed below.
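As a numerical sanity check of the second-order case, the following sketch compares a function with its degree 2 Taylor polynomial. The sample function \(f(x,y)=e^x\cos y\) and the base point \((0,0)\) are our own assumptions, not taken from the text; its partial derivatives at the origin are computed by hand in the comments.

```python
import math


def f(x, y):
    return math.exp(x) * math.cos(y)


def p2(x, y):
    # Second-order Taylor polynomial of f at (0, 0), using
    # f = 1, f_x = 1, f_y = 0, f_xx = 1, f_xy = 0, f_yy = -1 at the origin.
    return 1 + x + 0.5 * (x * x - y * y)


def remainder_ratio(t):
    """|R_2| / ||x - a||^2 at the point (t, t), with a = (0, 0)."""
    r2 = f(t, t) - p2(t, t)
    return abs(r2) / (2 * t * t)


# Approach the origin along the diagonal and watch the ratio shrink.
ratios = [remainder_ratio(10.0 ** -j) for j in range(1, 5)]
for j, ratio in enumerate(ratios, start=1):
    print(f"t = 1e-{j}: ratio = {ratio:.3e}")
```

The printed ratios decrease towards zero as \(t\) shrinks, which is what "agreeing to second order" requires. This only checks one direction of approach, so it is an illustration rather than a proof.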

### Subsection 14.8.2 Quadratic Functions in Several Variables

Let \(A=[a_{ij}]\) be an \(n\times n\) matrix. We say that \(A\) is *symmetric* if \(A^T=A\text{,}\) or equivalently, if \(a_{ij} = a_{ji}\) for each \(i,j\) between 1 and \(n\text{.}\) To each such \(A\) we can associate a *quadratic function* \(q:\mathbb{R}^n\to \mathbb{R}\) given by

\[
q(\vec{x}) = \vec{x}\cdot (A\vec{x})\text{,}
\]

or in terms of components,

\[
q(x_1,\ldots, x_n) = \sum_{i,j=1}^n a_{ij}x_ix_j\text{.}
\]

We say that \(A\) is *non-degenerate* if \(\det A\neq 0\text{;}\) this is equivalent to saying that \(A\) is invertible, or that \(A\vec{x}=\vec{0}\) is possible only if \(\vec{x}=\vec{0}\text{.}\) (Note however that the corresponding property does not hold for \(q\text{:}\) it is possible to have \(q(\vec{x})=0\) for \(\vec{x}\neq \vec{0}\) even if the corresponding matrix \(A\) is non-degenerate. For example, the quadratic function \(q(x,y)=x^2-y^2\) has \(q(1,1)=0\) and corresponds to the non-degenerate matrix \(\begin{bmatrix}1\amp 0\\0\amp -1\end{bmatrix}\text{.}\))

A quadratic function \(q\) is called *positive-definite* if \(q(\vec{x})\geq 0\) for all \(\vec{x}\in\mathbb{R}^n\text{,}\) and \(q(\vec{x})=0\) only for \(\vec{x}=\vec{0}\text{.}\) (Note that the quadratic function \(q(x,y)=x^2-y^2\) given above is *not* positive definite; however, \(\tilde{q}(x,y) = x^2+y^2\) is.) Similarly, \(q\) is *negative-definite* if \(q(\vec{x})\leq 0\) for all \(\vec{x}\in\mathbb{R}^n\) with \(q(\vec{x})=0\) for \(\vec{x}=\vec{0}\) only.

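If \(q(\vec{x}) = \vec{x}\cdot A\vec{x}\) is positive(negative)-definite, we refer to the corresponding symmetric matrix \(A\) as positive(negative)-definite as well. In general it can be difficult to determine when a given quadratic function (or its corresponding matrix) is positive or negative-definite. In the case of a \(2\times 2\) matrix \(A = \begin{bmatrix} a\amp b\\b\amp c\end{bmatrix}\) with \(a\neq 0\) we get

\[
q(x_1,x_2) = ax_1^2+2bx_1x_2+cx_2^2 = a\left(x_1+\frac{b}{a}x_2\right)^2+\left(c-\frac{b^2}{a}\right)x_2^2
\]

by completing the square. If \(q\) is positive-definite, then \(q(x_1,0)=ax_1^2\gt0\) for \(x_1\neq 0\text{,}\) so \(a\gt0\text{.}\) Choosing \(x_1=-\frac{b}{a}x_2\) with \(x_2\neq 0\) makes the first squared term vanish, so \(\left(c-\frac{b^2}{a}\right)x_2^2\gt0\text{,}\) and it follows that \(ac-b^2=\det A \gt0\text{.}\) Conversely, if \(a\gt0\) and \(\det A\gt0\text{,}\) both coefficients in the completed square are positive, so \(q\) is positive-definite. Similarly, \(q\) is negative-definite if and only if \(a\lt 0\) and \(\det A\gt0\text{.}\)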

For an \(n\times n\) symmetric matrix, one test is as follows: consider the sequence of \(j\times j\) matrices \(A_j\text{,}\) for \(j=1,\ldots , n\text{,}\) given by \(A_1=[a_{11}]\text{,}\) \(A_2 = \begin{bmatrix} a_{11}\amp a_{12}\\a_{21}\amp a_{22}\end{bmatrix},\ldots, A_n=A\text{.}\) (That is, we take upper-left square sub-matrices of increasing size.) Then \(A\) is positive-definite if and only if \(\det A_j\gt0\) for each \(j=1,\ldots, n\text{,}\) and \(A\) is negative-definite if and only if the signs of \(\det A_j\) alternate, starting with a negative sign. (So \(\det A_1 = a_{11}\lt 0, \det A_2\gt0, \det A_3\lt 0,\ldots\text{.}\))
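The test on upper-left sub-matrices can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the function names `det`, `leading_minors` and `definiteness` are hypothetical, the input is assumed to be a symmetric matrix given as a list of rows, and determinants are computed exactly with rational arithmetic so that sign checks are not affected by rounding.

```python
from fractions import Fraction


def det(matrix):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(entry) for entry in row] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign of the determinant
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def leading_minors(matrix):
    """Determinants of the upper-left j x j sub-matrices, j = 1, ..., n."""
    n = len(matrix)
    return [det([row[:j] for row in matrix[:j]]) for j in range(1, n + 1)]


def definiteness(matrix):
    """Classify a symmetric matrix using the signs of its leading minors."""
    minors = leading_minors(matrix)
    if all(d > 0 for d in minors):
        return "positive-definite"
    # Negative-definite: det A_1 < 0, det A_2 > 0, det A_3 < 0, ...
    if all((d < 0) if j % 2 == 1 else (d > 0)
           for j, d in enumerate(minors, start=1)):
        return "negative-definite"
    return "neither"


print(definiteness([[2, 1, 0], [1, 2, 0], [0, 0, 2]]))      # minors 2, 3, 6
print(definiteness([[-2, 1, 0], [1, -2, 0], [0, 0, -1]]))   # minors -2, 3, -3
print(definiteness([[1, 0], [0, -1]]))                      # minors 1, -1
```

The result "neither" does not distinguish between indefinite and degenerate matrices; separating those cases needs more information than the leading minors used here.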

Another approach, which is more illuminating but requires more advanced linear algebra, is to use the fact that for any symmetric matrix \(A\text{,}\) there exists a change of basis such that \(A\) becomes a diagonal matrix \(\tilde{A}\) with respect to that basis. (i.e. \(A\) can be diagonalized.) If the entries \(\tilde{a}_{ii}\) of \(\tilde{A}\) along the main diagonal (that is, the eigenvalues of \(A\)) are all non-zero, then \(A\) is non-degenerate. If they are all positive, then \(A\) is positive-definite. If they are all negative, then \(A\) is negative-definite.

We will need the following result, which is a consequence of the Extreme Value Theorem.

#### Theorem 14.8.1.

If \(q:\mathbb{R}^n\to \mathbb{R}\) is a positive-definite quadratic function, then there exists a real number \(M\gt0\) such that

\[
q(\vec{x})\geq M\lVert\vec{x}\rVert^2
\]

for any \(\vec{x}\in\mathbb{R}^n\text{.}\)

To see that this is true, consider \(q(\vec{x})\) on the set \(B\) of all \(\vec{x}\) with \(\lVert \vec{x}\rVert = 1\text{.}\) The set \(B\) is closed and bounded and \(q\) is continuous on \(B\text{,}\) so by the Extreme Value Theorem, \(q\) must attain a minimum value \(M\) for some \(\vec{a}\in B\text{.}\) Since \(\vec{a}\neq\vec{0}\) and \(q\) is positive-definite, \(M=q(\vec{a})\gt0\text{.}\) Now, for any constant \(c\in\mathbb{R}\text{,}\) the fact that \(q\) is quadratic implies that \(q(c\vec{x}) = c^2q(\vec{x})\text{.}\) For any non-zero \(\vec{x}\in \mathbb{R}^n\text{,}\) we know that \(\dfrac{\vec{x}}{\lVert\vec{x}\rVert}\in B\text{,}\) and thus, we have

\[
q(\vec{x}) = q\left(\lVert\vec{x}\rVert\frac{\vec{x}}{\lVert\vec{x}\rVert}\right) = \lVert\vec{x}\rVert^2q\left(\frac{\vec{x}}{\lVert\vec{x}\rVert}\right)\geq M\lVert\vec{x}\rVert^2\text{.}
\]

Finally, if \(\vec{x}=\vec{0}\) we get \(q(\vec{0})=0=M\lVert\vec{0}\rVert^2\text{.}\)
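The argument can be illustrated numerically in two variables. In this sketch the quadratic function \(q(x,y)=x^2+xy+y^2\) is our own example (not from the text), and the minimum over the unit circle is only approximated by sampling angles, so the comparison below allows a small tolerance.

```python
import math


def q(x, y, a=1.0, b=0.5, c=1.0):
    """q(x, y) = a x^2 + 2 b x y + c y^2; the defaults give x^2 + xy + y^2."""
    return a * x * x + 2 * b * x * y + c * y * y


# Approximate the minimum M of q on the unit circle by sampling angles.
samples = 100000
M = min(
    q(math.cos(2 * math.pi * k / samples), math.sin(2 * math.pi * k / samples))
    for k in range(samples)
)
print(f"approximate minimum on the unit circle: M = {M:.6f}")

# Check q(x) >= M * ||x||^2 at a few points off the unit circle.
points = [(3.0, -3.0), (0.1, 0.2), (-5.0, 1.0), (0.0, 7.0)]
for x, y in points:
    bound = M * (x * x + y * y)
    print((x, y), q(x, y) >= bound - 1e-9)
```

On the unit circle this \(q\) equals \(1+\tfrac{1}{2}\sin(2\theta)\text{,}\) so the minimum is \(\tfrac{1}{2}\text{,}\) attained in the direction of \((1,-1)\text{;}\) the point \((3,-3)\) therefore sits right at the bound.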

### Subsection 14.8.3 The Hessian Matrix of a Real-Valued Function

#### Definition 14.8.2. The Hessian Matrix.

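Let \(f:\mathbb{R}^n\to \mathbb{R}\) be a function with continuous second-order partial derivatives. We define the *Hessian matrix* of \(f\) at a point \(\vec{a}\) in the domain of \(f\) to be the \(n\times n\) symmetric matrix

\[
\Hess f(\vec{a}) = \frac{1}{2}\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2}(\vec{a}) \amp \dfrac{\partial^2 f}{\partial x_1\partial x_2}(\vec{a}) \amp \cdots \amp \dfrac{\partial^2 f}{\partial x_1\partial x_n}(\vec{a})\\
\dfrac{\partial^2 f}{\partial x_2\partial x_1}(\vec{a}) \amp \dfrac{\partial^2 f}{\partial x_2^2}(\vec{a}) \amp \cdots \amp \dfrac{\partial^2 f}{\partial x_2\partial x_n}(\vec{a})\\
\vdots \amp \vdots \amp \ddots \amp \vdots\\
\dfrac{\partial^2 f}{\partial x_n\partial x_1}(\vec{a}) \amp \dfrac{\partial^2 f}{\partial x_n\partial x_2}(\vec{a}) \amp \cdots \amp \dfrac{\partial^2 f}{\partial x_n^2}(\vec{a})
\end{bmatrix}\text{.}
\]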

Note that \(\Hess f(\vec{a})\) is symmetric by Clairaut's theorem (Theorem 13.3.15). The factor of \(1/2\) is included for convenience with respect to Taylor's theorem. Recall that for a function of one variable, the second-order Taylor polynomial of \(f\) about \(x=a\) is

\[
P_2(x) = f(a)+f'(a)(x-a)+\frac{1}{2}f''(a)(x-a)^2\text{.}
\]

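For \(\vec{x}\in\mathbb{R}^n\text{,}\) let us define the quadratic function \(h_{f,\vec{a}}(\vec{x}) = \vec{x}\cdot (\Hess f(\vec{a})\cdot\vec{x})\) associated to the Hessian of \(f\) at \(\vec{a}\text{.}\) Taylor's theorem for functions of several variables tells us that if all the *third* derivatives of \(f\) are continuous, then near \(\vec{a}\in\mathbb{R}^n\) we have

\[
f(\vec{x}) = f(\vec{a})+\nabla f(\vec{a})\cdot(\vec{x}-\vec{a})+h_{f,\vec{a}}(\vec{x}-\vec{a})+R(\vec{a},\vec{x})\text{,}\tag{14.8.1}
\]

where the remainder term \(R(\vec{a},\vec{x})\) satisfies

\[
\lim_{\vec{x}\to\vec{a}}\frac{R(\vec{a},\vec{x})}{\lVert\vec{x}-\vec{a}\rVert^2} = 0\text{.}\tag{14.8.2}
\]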

Finally, let us define a critical point \(\vec{a}\) for \(f\) to be *non-degenerate* if \(\Hess f(\vec{a})\) is non-degenerate. Now we're ready to state our result on the second derivative test.

#### Theorem 14.8.3. The General Second Derivative Test.

Let \(f:\mathbb{R}^n\to\mathbb{R}\) be three times continuously differentiable, and suppose that \(f\) has a non-degenerate critical point at \(\vec{a}\text{.}\) If \(\Hess f(\vec{a})\) is positive-definite, then \(\vec{a}\) is a local minimum for \(f\text{.}\) Similarly, if \(\Hess f(\vec{a})\) is negative-definite, then \(\vec{a}\) is a local maximum for \(f\text{.}\)

The way to think about this intuitively is the following: the matrix \(\Hess f(\vec{a})\) is symmetric. We know from Linear Algebra that every symmetric matrix can be diagonalized. Less obvious (but still true) is that we can make a (linear) change of variables \((u_1,\ldots, u_n) = T(x_1,\ldots, x_n)\) so that the vectors in the direction of the \(u_i\) coordinate axes are eigenvectors for \(\Hess f(\vec{a})\text{.}\) Slightly harder to show (but also true) is that this change of variables can be chosen so that it is orthogonal. That is, we simply have to *rotate* our coordinate system: lengths and angles are all preserved.

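In this new coordinate system, the Hessian matrix is diagonal:

\[
\Hess f(\vec{a}) = \begin{bmatrix}
\lambda_1 \amp 0 \amp \cdots \amp 0\\
0 \amp \lambda_2 \amp \cdots \amp 0\\
\vdots \amp \vdots \amp \ddots \amp \vdots\\
0 \amp 0 \amp \cdots \amp \lambda_n
\end{bmatrix}\text{.}
\]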

If each of the eigenvalues \(\lambda_1,\ldots, \lambda_n\) is positive, the Hessian is positive-definite, and our critical point is a local minimum. If all the eigenvalues are negative, our critical point is a local maximum. If some of the eigenvalues are positive and some are negative, we have a saddle point.

Proving the result is somewhat more technical. Suppose \(\vec{a}\) is a critical point for \(f\text{,}\) and that \(\Hess f(\vec{a})\) is positive definite. We know that \(\nabla f(\vec{a})=0\) at a critical point, so from (14.8.1) we get

\[
f(\vec{x})-f(\vec{a}) = h_{f,\vec{a}}(\vec{x}-\vec{a})+R(\vec{a},\vec{x})\text{.}
\]

Theorem 14.8.1 tells us that \(h_{f,\vec{a}}(\vec{x}-\vec{a})\geq M\lVert\vec{x}-\vec{a}\rVert^2\) for some \(M\text{,}\) and by (14.8.2), there exists a \(\delta\gt0\) such that whenever \(0\lt \lVert \vec{x}-\vec{a}\rVert\lt \delta\text{,}\) we get \(\lvert R(\vec{a},\vec{x})\rvert\lt M\lVert \vec{x}-\vec{a}\rVert^2\text{.}\) (Take \(\epsilon=M\) in the definition of the limit.)

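If we carefully put all this together, we can show that

\[
h_{f,\vec{a}}(\vec{x}-\vec{a})+R(\vec{a},\vec{x})\gt 0
\]

whenever \(0\lt \lVert \vec{x}-\vec{a}\rVert\lt \delta\text{,}\) since

\[
R(\vec{a},\vec{x})\geq -\lvert R(\vec{a},\vec{x})\rvert \gt -M\lVert\vec{x}-\vec{a}\rVert^2\geq -h_{f,\vec{a}}(\vec{x}-\vec{a})\text{.}
\]
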
Substituting this into the above equation, we get \(f(\vec{x})-f(\vec{a})\gt0\) for any \(\vec{x}\) with \(0\lt \lVert\vec{x}-\vec{a}\rVert\lt \delta\text{,}\) and thus \(f\) has a local minimum at \(\vec{a}\in\mathbb{R}^n\text{.}\) The case of a local maximum can be handled similarly (or by replacing \(f\) with \(-f\)).
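The general test can be tried out numerically with a short Python sketch. Everything here is an assumption made for illustration: the helper names, the central-difference step `h`, and the sample functions are ours, not the text's. The matrix of second partial derivatives is approximated by finite differences (the factor of \(1/2\) in \(\Hess f\) is omitted, since it does not affect definiteness), and positive-definiteness is checked by attempting a Cholesky factorization, which is a different technique from the determinant test described earlier.

```python
import math


def second_partials(f, a, h=1e-4):
    """Approximate the matrix of second partials of f at a by central differences."""
    n = len(a)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                x = list(a)
                x[i] += si * h
                x[j] += sj * h
                return f(x)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H


def is_positive_definite(H):
    """Try a Cholesky factorization H = L L^T; for a symmetric matrix it
    succeeds exactly when H is positive-definite."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def classify(f, a):
    """Apply the general second derivative test at a point a that is
    assumed (not checked) to be a critical point of f."""
    H = second_partials(f, a)
    if is_positive_definite(H):
        return "local minimum"
    if is_positive_definite([[-entry for entry in row] for row in H]):
        return "local maximum"
    return "not definite (saddle point or degenerate)"


bowl = lambda x: x[0] ** 2 + x[1] ** 2 + x[2] ** 2 + x[0] * x[1]
saddle = lambda x: x[0] ** 2 - x[1] ** 2 + x[2] ** 2

print(classify(bowl, [0.0, 0.0, 0.0]))
print(classify(lambda x: -bowl(x), [0.0, 0.0, 0.0]))
print(classify(saddle, [0.0, 0.0, 0.0]))
```

Because the second partials are only approximated, this sketch is unreliable when an eigenvalue of the Hessian is very close to zero; in that nearly degenerate situation the theorem itself gives no conclusion either.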