Section 14.6 The Derivative as a Linear Transformation
We defined what it means for a real-valued function of two variables to be differentiable in Definition 14.1.3 in Section 14.1.
The definition there easily extends to real-valued functions of three or more variables, but it leaves unanswered a couple of natural questions:
What about vector-valued functions of several variables? (That is, functions \(f\) with a domain \(D\subseteq \mathbb{R}^n\) and range in \(\mathbb{R}^m\) for some \(m >1\text{.}\))
What is the derivative of a function of several variables? After all, we know how to define \(\fp(x)\) and \(\vrp(t)\) for real or vector-valued functions of one variable.
One might be tempted at first to simply mimic the definition of the derivative from Chapter 2, but we quickly run into trouble, for a reason that is immediately obvious.
Let \(\vec{a}\) be a fixed point in \(\mathbb{R}^n\text{,}\) and let \(\vec{h}\) represent a point \((h_1,h_2,\ldots, h_n)\text{.}\) Since we’re treating \(\vec{h}\) and \(\vec{a}\) as vectors, we can add them, and write down the limit
\begin{equation*}
\lim_{\vec{h}\to\vec{0}}\frac{f(\vec{a}+\vec{h})-f(\vec{a})}{\norm{\vec{h}}}\text{.}
\end{equation*}
(Note that division by a vector is nonsense, so we must divide by \(\norm{\vec{h}}\text{,}\) not \(\vec{h}\text{.}\)) But of course, we know that this limit does not exist, because it depends on the direction in which \(\vec{h}\) approaches \(\vec{0}\text{!}\) Indeed, if \(\vec{h} = h\vec{i}\) or \(h\vec{j}\text{,}\) we get a partial derivative, and for any unit vector \(\vec{u}\text{,}\) setting \(\vec{h}=h\vec{u}\) gives us a directional derivative, and we know from Section 14.3 that a directional derivative depends on \(\vec{u}\text{.}\) It seems this approach is doomed to failure. What can we try instead?
Subsection 14.6.1 The Definition of the Derivative
The key to generalizing the definition of the derivative given in Definition 2.1.7 in Chapter 2 is remembering the following essential property of the derivative: the derivative \(\fp(a)\) is used to compute the best linear approximation to \(f\) at \(a\text{.}\) Indeed, the linearization of \(f\) at \(a\) is the linear function
\begin{equation}
L_a(x) = f(a)+\fp(a)(x-a)\text{.}\tag{14.6.1}
\end{equation}
That this is the best linear approximation of \(f\) at \(a\) can be understood as follows: first, note that the graph \(y=L_a(x)\) is simply the equation of the tangent line to \(y=f(x)\) at \(a\text{.}\) Second, note that the difference between \(f(x)\) and \(L_a(x)\) vanishes faster than the difference \(x-a\) as \(x\) approaches \(a\text{:}\)
\begin{equation*}
\lim_{x\to a}\frac{f(x)-L_a(x)}{x-a} = 0\text{.}
\end{equation*}
While the definition of the derivative doesn’t generalize well to several variables, the notion of linear approximation does. Recall from your first course in linear algebra that, given any \(m\times n\) matrix \(A\text{,}\) we can define a function \(T\text{,}\) called a linear transformation, that takes an \(n\times 1\) column vector as input, and produces an \(m\times 1\) column vector as output:
\begin{equation*}
T(\vec{x}) = A\vec{x}\text{.}
\end{equation*}
In the above definition, the product \(A\vec{x}\) is the usual matrix product of the \(m\times n\) matrix \(A\) with the \(n\times 1\) matrix \(\vec{x}\text{.}\) In this text, we generally do not write our vectors as columns, so for a vector \(\vec{x}=\langle x_1,\ldots, x_n\rangle\) we will use the notation
\begin{equation*}
A\cdot \vec{x}
\end{equation*}
to represent the same product in our notation. (And yes, the dot in this product is intended to remind you of the dot product between vectors: recall that the \((i,j)\)-entry of a matrix product \(AB\) is the dot product of the \(i^{\text{th}}\) row of \(A\) with the \(j^{\text{th}}\) column of \(B\text{.}\)) We can now make the following definition.
Definition 14.6.2. Linear function.
A function \(\ell\) from \(\mathbb{R}^n\) to \(\mathbb{R}^m\) will be called a linear function if \(\ell\) is of the form
\begin{equation*}
\ell(\vec{x}) = M\cdot \vec{x}+\vec{b}
\end{equation*}
for some \(m\times n\) matrix \(M\) and vector \(\vec{b}\in\mathbb{R}^m\text{.}\)
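For example (with a matrix and vector chosen here only for illustration), if
\begin{equation*}
M = \begin{bmatrix} 1 & 2\\ 3 & 4\\ 5 & 6\end{bmatrix}
\qquad\text{and}\qquad \vec{b} = \langle 1, 0, -1\rangle\text{,}
\end{equation*}
then the corresponding linear function \(\ell:\mathbb{R}^2\to\mathbb{R}^3\) is
\begin{equation*}
\ell(x_1,x_2) = M\cdot\langle x_1, x_2\rangle + \vec{b} = \langle x_1+2x_2+1,\; 3x_1+4x_2,\; 5x_1+6x_2-1\rangle\text{,}
\end{equation*}
where each component is the dot product of the corresponding row of \(M\) with \(\langle x_1,x_2\rangle\text{,}\) plus the matching entry of \(\vec{b}\text{.}\)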
If we apply the convention of representing points in terms of their position vectors to the codomain as well as the domain, we can express such a function as \(f=\langle f_1,\ldots, f_m\rangle\text{,}\) where each function \(f_i\) is a real-valued function of \(n\) variables. We want differentiability of \(f\) to mean that \(f\) has a linear approximation \(\ell\) that agrees with \(f\) to first order at \(\vec{a}\text{.}\) Since \(f(\vec{x})\) and \(\ell(\vec{x})\) are now vectors, saying that \(\ell\) is a good approximation of \(f\) requires that the magnitude \(\norm{f(\vec{x})-\ell(\vec{x})}\) is small relative to the size of \(\norm{\vec{x}-\vec{a}}\text{.}\)
Definition 14.6.3. General definition of differentiability.
Let \(D\) be an open subset of \(\mathbb{R}^n\) and let \(f\) be a function with domain \(D\) and values in \(\mathbb{R}^m\text{.}\) We say that \(f\) is differentiable at a point \(\vec{a}\in D\) if there exists a linear function \(\ell:\mathbb{R}^n\to\mathbb{R}^m\) that agrees with \(f\) to first order at \(\vec{a}\text{;}\) that is, if
\begin{equation*}
\lim_{\vec{x}\to\vec{a}}\frac{\norm{f(\vec{x})-\ell(\vec{x})}}{\norm{\vec{x}-\vec{a}}} = 0\text{.}
\end{equation*}
This definition is going to take a lot of unpacking. First of all, what is this function \(\ell\text{?}\) How do we compute it? Does this definition include Definition 14.1.3 from Section 14.1 as a special case? What about differentiability for vector-valued functions of one variable, or real-valued functions of one variable?
We will answer the first two questions in due course. The answer to the rest is, “Yes.” The above definition generalizes all the definitions of differentiability we’ve encountered so far. As a first step, let us note that for \(\ell(\vec{x})=M\cdot \vec{x}+\vec{b}\text{,}\) we must have \(\ell(\vec{a})=f(\vec{a})\text{,}\) or the limit above will not exist. Thus \(M\cdot \vec{a}+\vec{b} = f(\vec{a})\text{,}\) so \(\vec{b}=f(\vec{a})-M\cdot \vec{a}\text{.}\) This tells us that \(\ell\) must have the following form:
\begin{equation}
\ell(\vec{x}) = f(\vec{a})+M\cdot(\vec{x}-\vec{a})\text{.}\tag{14.6.2}
\end{equation}
This should ring some bells: the form of \(\ell\) is very similar to that of the linearization given for a function of one variable in Equation (14.6.1) above, with the matrix \(M\) playing the role of \(\fp(a)\text{.}\) Perhaps this matrix is the derivative we seek?
Subsection 14.6.2 Real-valued functions of several variables
Let \(f:D\subseteq \mathbb{R}^n\to \mathbb{R}\) be a given function of \(n\) variables (you can assume \(n=1, 2\) or 3 if you prefer). Let us denote a point \((x_1,x_2,\ldots, x_n)\in\mathbb{R}^n\) using the vector \(\vec x = \langle x_1,x_2,\ldots, x_n\rangle\text{,}\) so that \(f(\vec x) = f(x_1,x_2,\ldots,x_n)\text{.}\) Let \(\vec a = \langle a_1,a_2,\ldots, a_n\rangle\) denote a fixed point \((a_1,a_2,\ldots, a_n)\in D\text{.}\)
In Section 14.1, we saw that differentiability means that the difference \(\ddz = f(x+dx,y+dy) - f(x,y)\) can be approximated by the differential \(dz = f_x(x,y)\,dx+f_y(x,y)\,dy\text{.}\) Differentiability was defined to mean that the error functions \(E_x\) and \(E_y\text{,}\) defined by
\begin{equation*}
\ddz = dz + E_x\,dx+E_y\,dy\text{,}
\end{equation*}
go to zero as \(\langle dx,dy\rangle\) goes to zero. Let’s rephrase this so that it works for any number of variables. Recall that the gradient of \(f\) at \(\vec{a}\in D\) is the vector \(\nabla f(\vec{a})\) defined by
\begin{equation*}
\nabla f(\vec{a}) = \langle f_{x_1}(\vec{a}), f_{x_2}(\vec{a}),\ldots, f_{x_n}(\vec{a})\rangle\text{.}
\end{equation*}
Definition 14.6.4. The linearization of a function of several variables.
Let \(f\) be continuously differentiable on some open set \(D\subseteq\mathbb{R}^n\text{,}\) and let \(\vec{a}\in D\text{.}\) The linearization of \(f\) at \(\vec{a}\) is the function \(L_{\vec{a}}(\vec{x})\) defined by
\begin{equation*}
L_{\vec{a}}(\vec{x}) = f(\vec{a})+\nabla f(\vec{a})\cdot(\vec{x}-\vec{a})\text{.}
\end{equation*}
When \(n=1\text{,}\) we get the linearization \(L_a(x) = f(a)+f'(a)(x-a)\text{,}\) which is the usual linearization from Calculus I. (You might also notice that \(L_a(x)\) is the first-degree Taylor polynomial of \(f\) about \(x=a\text{.}\) The same is true of the linearization of \(f\) for more than one variable, although we will not be considering Taylor polynomials in several variables.)
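As a quick one-variable illustration (with a function chosen here just for the sake of example), let \(f(x)=\sqrt{x}\) and \(a=4\text{.}\) Then \(f(4)=2\) and \(\fp(4)=\frac14\text{,}\) so
\begin{equation*}
L_4(x) = 2+\tfrac14(x-4)\text{,}
\end{equation*}
and \(L_4(4.1) = 2.025\) approximates \(\sqrt{4.1} = 2.0248\ldots\) quite well.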
For \(n=2\text{,}\) we get the linear approximation associated to the total differential:
\begin{equation*}
L_{\vec{a}}(x,y) = f(a_1,a_2)+f_x(a_1,a_2)(x-a_1)+f_y(a_1,a_2)(y-a_2)\text{.}
\end{equation*}
Compare this with Equation (14.6.2) above. It seems that the gradient \(\nabla f (\vec{a})\) is our matrix \(M\) in this case: for a real-valued function, \(m=1\text{,}\) so we expect a \(1\times n\) row matrix, and the gradient certainly can be interpreted to fit that description.
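For a concrete two-variable check (again with a function chosen only for illustration), let \(f(x,y)=x^2+3y^2\) and \(\vec{a}=\langle 1,2\rangle\text{.}\) Then \(f(1,2)=13\) and \(\nabla f(1,2) = \langle 2,12\rangle\text{,}\) so
\begin{equation*}
L_{\vec{a}}(x,y) = 13 + 2(x-1)+12(y-2) = 13 + \begin{bmatrix} 2 & 12\end{bmatrix}\cdot\begin{bmatrix} x-1\\ y-2\end{bmatrix}\text{,}
\end{equation*}
with the gradient playing the role of the \(1\times 2\) matrix \(M\text{.}\)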
Definition 14.6.5. Differentiability of real-valued functions.
We say that \(f\) is differentiable at \(\vec{a}\in D\) if \(\nabla f(\vec{a})\) exists, and \(f(\vec{x})\) and \(L_{\vec{a}}(\vec{x})\) agree to first order at \(\vec{a}\text{;}\) that is, if
\begin{equation*}
\lim_{\vec{x}\to\vec{a}}\frac{f(\vec{x})-L_{\vec{a}}(\vec{x})}{\norm{\vec{x}-\vec{a}}} = 0\text{.}
\end{equation*}
What this definition says is that the linearization \(L_{\vec{a}}(\vec{x})\) is a good linear approximation to \(f\) at \(\vec{a}\text{.}\) In fact, it’s the only (and hence, best) linear approximation: if a linear approximation exists, it has to be \(L_{\vec{a}}(\vec{x})\text{.}\)
If you want to see why this has to be true, recall that since the above limit exists, we have to be able to evaluate it along any path we like. Suppose we choose the path
\begin{equation*}
\vec{x} = \langle a_1+h, a_2,\ldots, a_n\rangle\text{,}
\end{equation*}
which approaches \(\vec{a}\) as \(h\to 0\text{.}\) Along this path, \(\norm{\vec{x}-\vec{a}} = |h|\) and \(\nabla f(\vec{a})\cdot(\vec{x}-\vec{a}) = f_{x_1}(\vec{a})\,h\text{,}\) so the limit says that
\begin{equation*}
\lim_{h\to 0}\frac{f(a_1+h,a_2,\ldots,a_n)-f(\vec{a})}{h} = f_{x_1}(\vec{a})\text{,}
\end{equation*}
which is just another way of stating the definition of the partial derivative with respect to \(x_1\text{.}\) Of course, approaching along any of the other coordinate directions will similarly produce the other partial derivatives.
Recall that in one variable, the derivative is often written instead in terms of \(h=x-a\text{,}\) so that
\begin{equation*}
\fp(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}\text{.}
\end{equation*}
In more than one variable, we can define \(h_i = x_i-a_i\text{,}\) for \(i=1,\ldots, n\text{,}\) or the corresponding vector \(\vec{h} = \vec{x}-\vec{a}\text{.}\) The definition of differentiability can then be written as
\begin{equation}
\lim_{\vec{h}\to\vec{0}}\frac{f(\vec{a}+\vec{h})-f(\vec{a})-\nabla f(\vec{a})\cdot\vec{h}}{\lVert\vec{h}\rVert} = 0\text{.}\tag{14.6.3}
\end{equation}
Note that we want the difference between \(f(\vec{a}+\vec{h})\) and \(L_{\vec{a}}(\vec{a}+\vec{h}) = f(\vec{a})+\nabla f(\vec{a})\cdot\vec{h}\) to go to zero faster than \(\lVert \vec{h}\rVert\) goes to zero, and that it only makes sense to divide by the length of \(\vec{h}\text{,}\) since division by a vector (or the corresponding point) is not defined.
Let’s return to \(n=2\) and Definition 14.1.3 from Section 14.1. If we write \(\vec{h} = \langle dx, dy\rangle\text{,}\) then \(f(\vec{a}+\vec{h})-f(\vec{a}) = \ddz\) and \(\nabla f(\vec{a})\cdot \vec{h} = dz\text{,}\) and Equation (14.6.3) becomes
\begin{equation*}
\lim_{\langle dx,dy\rangle\to\langle 0,0\rangle}\frac{\ddz - dz}{\lVert\langle dx, dy\rangle\rVert} = 0\text{,}
\end{equation*}
which is another way of saying that the error terms \(E_x,E_y\) must vanish as \(dx\) and \(dy\) approach zero. Success! Definition 14.6.3 is indeed a generalization of Definition 14.1.3.
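To see the definition at work for a specific function (chosen here for simplicity), take \(f(x,y)=x^2+y^2\) and \(\vec{a}=\langle 0,0\rangle\text{.}\) Then \(\nabla f(0,0)=\langle 0,0\rangle\) and \(L_{\vec{a}}(x,y)=0\text{,}\) so
\begin{equation*}
\frac{f(\vec{x})-L_{\vec{a}}(\vec{x})}{\norm{\vec{x}-\vec{a}}} = \frac{x^2+y^2}{\sqrt{x^2+y^2}} = \sqrt{x^2+y^2}\text{,}
\end{equation*}
which goes to zero as \(\vec{x}\to\vec{0}\text{,}\) confirming that \(f\) is differentiable at the origin.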
What if \(n=1\text{?}\) In that case the gradient is just the ordinary derivative, and Equation (14.6.3) reads
\begin{equation*}
\lim_{h\to 0}\frac{f(a+h)-f(a)-\fp(a)\,h}{|h|} = 0\text{,}
\end{equation*}
which is just another way of re-writing the usual definition of the derivative. In fact, we’ve also generalized Definition 13.2.10 from Chapter 13 for differentiability of vector-valued functions: all we have to do is write our vector-valued function as a column matrix. Writing \(\vec{r}(t) = \langle x_1(t),\ldots, x_n(t)\rangle\) as a column, the matrix \(M\) in Equation (14.6.2) becomes the \(n\times 1\) column with entries \(x_1'(a),\ldots, x_n'(a)\text{,}\) and the condition for differentiability at \(a\) becomes
\begin{equation*}
\lim_{h\to 0}\frac{\norm{\vec{r}(a+h)-\vec{r}(a)-\vrp(a)\,h}}{|h|} = 0\text{,}
\end{equation*}
which again reproduces the definition of \(\vrp(a)\text{.}\)
One of the results we learn in Calculus I is that differentiability implies continuity. The situation is no different in general, and with our new definition of differentiability, an easy proof is possible.
Indeed, suppose \(f\) is differentiable at \(\vec{a}\text{.}\) For \(\vec{x}\neq\vec{a}\) we can write
\begin{equation*}
f(\vec{x}) = f(\vec{a}) + \frac{f(\vec{x})-L_{\vec{a}}(\vec{x})}{\norm{\vec{x}-\vec{a}}}\,\norm{\vec{x}-\vec{a}} + \nabla f(\vec{a})\cdot(\vec{x}-\vec{a})\text{.}
\end{equation*}
Thus, taking limits of the above as \(\vec{x}\to\vec{a}\text{,}\) we find \(\displaystyle \lim_{\vec{x}\to\vec{a}}f(\vec{x}) = f(\vec{a})\text{,}\) since the first term is a constant (\(f(\vec{a})\)), the second is the product of two terms that both go to zero (the first factor goes to zero by the definition of differentiability, and clearly \(\lim_{\vec{x}\to\vec{a}}\lVert\vec{x}-\vec{a}\rVert = 0\)), and the last term vanishes since it’s linear (and thus continuous) in \(\vec{x}\text{.}\) In other words, \(\lim_{\vec{x}\to\vec{a}}f(\vec{x})\) can be computed by direct substitution, which is exactly what it means for \(f\) to be continuous at \(\vec{a}\text{.}\)
Subsection 14.6.3 Vector-valued functions of several variables
Let us now consider Definition 14.6.3 for general functions \(f:D\subseteq \mathbb{R}^n\to \mathbb{R}^m\text{.}\) If \(f\) is differentiable at \(\vec{a}\text{,}\) then we must have
\begin{equation*}
\lim_{\vec{x}\to\vec{a}}\frac{\norm{f(\vec{x})-\ell(\vec{x})}}{\norm{\vec{x}-\vec{a}}} = 0
\end{equation*}
for some linear function \(\ell(\vec{x})\text{.}\) Moreover, we’ll see below that (a) the matrix \(M\) is uniquely defined, and (b) \(M\) is deserving of the title of “the” derivative of \(f\text{.}\)
We saw in Equation (14.6.2) above that \(\ell\) must have the form of a linear approximation:
\begin{equation*}
\ell(\vec{x}) = f(\vec{a})+M\cdot(\vec{x}-\vec{a})\text{.}
\end{equation*}
Let’s compare again to the one variable case: \(L_a(x)=f(a)+f'(a)(x-a)\text{.}\) With this in mind, the matrix \(M\text{,}\) whatever it is, certainly seems to play the role of the derivative for general functions from \(\mathbb{R}^n\) to \(\mathbb{R}^m\text{.}\) It remains to determine the matrix \(M\text{,}\) and see that there can only be one possibility. To that end, let us write
\begin{equation*}
M = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n}\\ c_{21} & c_{22} & \cdots & c_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ c_{m1} & c_{m2} & \cdots & c_{mn}\end{bmatrix}\text{,}
\end{equation*}
and consider points of the form \(\vec{x} = \langle a_1+t, a_2,\ldots, a_n\rangle\text{,}\) so that \(\vec{x}-\vec{a} = \langle t, 0,\ldots, 0\rangle\text{.}\) Then \(M\cdot (\vec{x}-\vec{a})\) gives us \(t\) times the first column of \(M\text{,}\) since for each row of \(M\text{,}\) the first entry is multiplied by \(t\text{,}\) and the remaining entries are multiplied by zero. Thus,
\begin{equation*}
M\cdot(\vec{x}-\vec{a}) = t\,\langle c_{11}, c_{21},\ldots, c_{m1}\rangle\text{.}
\end{equation*}
Since \(\langle c_{11}, c_{21}, \ldots, c_{m1}\rangle\) is a constant vector, from differentiability of \(f\text{,}\) together with Definition 14.6.3, we get
\begin{equation*}
\lim_{t\to 0}\frac{f(a_1+t, a_2,\ldots, a_n)-f(\vec{a})}{t} = \langle c_{11}, c_{21},\ldots, c_{m1}\rangle\text{.}
\end{equation*}
But this limit on the left is just the partial derivative of \(f\) with respect to \(x_1\text{!}\) If we write \(f(\vec{x}) = \langle f_1(\vec{x}),f_2(\vec{x}),\ldots, f_m(\vec{x})\rangle\text{,}\) then we have
\begin{equation*}
\left\langle \frac{\partial f_1}{\partial x_1}(\vec{a}), \frac{\partial f_2}{\partial x_1}(\vec{a}),\ldots, \frac{\partial f_m}{\partial x_1}(\vec{a})\right\rangle = \langle c_{11}, c_{21},\ldots, c_{m1}\rangle\text{,}
\end{equation*}
and this gives us the first column of \(M\text{!}\) Repeating this for each variable, we see that the matrix \(M\) is exactly the matrix of all the partial derivatives of \(f\text{.}\) This matrix is important enough to have a name:
Definition 14.6.7. The Jacobian matrix of a differentiable function.
Let \(D\subseteq \mathbb{R}^n\) be an open subset, and let \(f:D\to \mathbb{R}^m\) be a differentiable function. At any point \(\vec{a}\in D\text{,}\) the Jacobian matrix of \(f\) at \(\vec{a}\text{,}\) denoted \(Df(\vec{a})\text{,}\) is the \(m\times n\) matrix defined by
\begin{equation*}
Df(\vec{a}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1}(\vec{a}) & \dfrac{\partial f_1}{\partial x_2}(\vec{a}) & \cdots & \dfrac{\partial f_1}{\partial x_n}(\vec{a})\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial f_m}{\partial x_1}(\vec{a}) & \dfrac{\partial f_m}{\partial x_2}(\vec{a}) & \cdots & \dfrac{\partial f_m}{\partial x_n}(\vec{a})\end{bmatrix}\text{.}
\end{equation*}
The linear transformation \(T_{f,\vec{a}}:\mathbb{R}^n\to \mathbb{R}^m\) defined by \(T_{f,\vec{a}}(\vec{x})=Df(\vec{a})\cdot \vec{x}\) is defined to be the derivative of \(f\) at \(\vec{a}\text{.}\)
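For example (using a function written down just to illustrate the definition), if \(f:\mathbb{R}^2\to\mathbb{R}^2\) is given by \(f(x,y) = \langle x^2y,\; x+\sin y\rangle\text{,}\) then
\begin{equation*}
Df(x,y) = \begin{bmatrix} 2xy & x^2\\ 1 & \cos y\end{bmatrix}\text{,}
\end{equation*}
a \(2\times 2\) matrix whose rows are the gradients of the component functions \(f_1(x,y)=x^2y\) and \(f_2(x,y)=x+\sin y\text{.}\)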
Notice that if \(f\) is differentiable, the Jacobian matrix is the only matrix that can fit the definition: the fact that the limit must be zero along a path parallel to one of the coordinate axes forces the matrix \(M\) to contain the partial derivatives of \(f\text{.}\)
In particular, note that for a function \(f:\mathbb{R}^n\to \mathbb{R}\text{,}\) we recover the gradient vector. Technically, the derivative in this sense is a row vector (some might say dual vector), not a column vector. Note that multiplying a row vector by a column vector is the same as taking the dot product of two column vectors.
This definition also accounts for parametric curves, viewed as vector-valued functions of one variable. If \(\mathbf{r}:\mathbb{R}\to \mathbb{R}^n\) defines a parametric curve, then the derivative \(\mathbf{r}'(t) = \begin{bmatrix}x_1'(t)\\x_2'(t)\\\vdots \\x_n'(t)\end{bmatrix}\) as introduced in Chapter 13 is the same as the one obtained using this definition.
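As a small check (using a curve chosen for illustration), for the helix \(\vec{r}(t) = \la \cos t, \sin t, t\ra\) this definition gives the \(3\times 1\) Jacobian matrix
\begin{equation*}
D\vec{r}(t) = \begin{bmatrix} -\sin t\\ \cos t\\ 1\end{bmatrix}\text{,}
\end{equation*}
which is exactly \(\vrp(t)\) written as a column.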
Subsection 14.6.4 The General Chain Rule
One of the big advantages of representing the derivative of a function of several variables in terms of its Jacobian matrix is that the Chain Rule becomes completely transparent. Arguably, the version of the Chain Rule we’re about to present is even more intuitive than the single-variable version!
Recall that the Chain Rule is all about derivatives of composite functions. In one variable, given \(h=f\circ g\text{,}\) if \(b=g(a)\text{,}\) we have
\begin{equation*}
h'(a) = \fp(g(a))g'(a) = \fp(b)g'(a)\text{.}
\end{equation*}
The derivative of the composition is the product of the derivatives of the functions being composed, as long as we take care to evaluate them at the appropriate points.
In Section 14.2 we saw that in several variables, the Chain Rule comes in various flavours, depending on the number of variables involved in each function being composed. If we think of derivatives in terms of the Jacobian matrix, then each of these flavours says exactly the same thing as the original Chain Rule above!
Theorem 14.6.8. The general Chain Rule (matrix form).
Let \(f:U\subseteq \mathbb{R}^m\to\mathbb{R}^p\) and \(g:V\subseteq \mathbb{R}^n\to \mathbb{R}^m\) be differentiable functions, such that the range of \(g\) is contained in the domain \(U\) of \(f\text{.}\) Then the composite function \(h=f\circ g\) is differentiable on \(V\text{,}\) and for each \(\vec{a}\in V\text{,}\) we have
\begin{equation*}
Dh(\vec{a}) = Df(\vec{b})\cdot Dg(\vec{a})\text{,}
\end{equation*}
where \(\vec{b}=g(\vec{a})\text{,}\) and the product on the right is the usual matrix product of the two Jacobian matrices.
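As a quick check on the sizes involved: since \(h=f\circ g\) maps \(V\subseteq\mathbb{R}^n\) into \(\mathbb{R}^p\text{,}\) its Jacobian \(Dh(\vec{a})\) should be a \(p\times n\) matrix, and indeed
\begin{equation*}
\underbrace{Dh(\vec{a})}_{p\times n} = \underbrace{Df(\vec{b})}_{p\times m}\cdot\underbrace{Dg(\vec{a})}_{m\times n}\text{,}
\end{equation*}
so the two sides of the formula at least have matching shapes.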
This is a remarkable result. Let’s unpack it in a couple of examples.
Example 14.6.9. Applying the general chain rule.
Let \(f:U\subseteq \mathbb{R}^3\to\mathbb{R}\) be a differentiable function of three variables, and let \(\vec{r}(t) = \la x(t),y(t),z(t)\ra\) be a vector-valued function of one variable. Use Theorem 14.6.8 to determine a formula for the derivative of \(h(t) = f(\vec{r}(t))\text{.}\)
Solution.
We already know what this derivative should look like from Section 14.2. The point is to confirm that this is a special case of Theorem 14.6.8. The Jacobian matrix of \(f\) is a \(1\times 3\) matrix and the Jacobian matrix of \(\vec{r}\) is a \(3\times 1\) matrix. They are given, respectively, by
\begin{equation*}
Df(x,y,z) = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} & \dfrac{\partial f}{\partial z}\end{bmatrix}
\qquad\text{and}\qquad
D\vec{r}(t) = \begin{bmatrix} x'(t)\\ y'(t)\\ z'(t)\end{bmatrix}\text{.}
\end{equation*}
By Theorem 14.6.8, the derivative of \(h\) is the matrix product
\begin{equation*}
Dh(t) = Df(\vec{r}(t))\cdot D\vec{r}(t) = \frac{\partial f}{\partial x}x'(t)+\frac{\partial f}{\partial y}y'(t)+\frac{\partial f}{\partial z}z'(t)\text{,}
\end{equation*}
as before. Of course, in this context we usually write \(Df(\vec{x})\) as \(\nabla f(\vec{x})\) and \(D\vec{r}(t)\) as \(\vrp(t)\text{,}\) and instead of a matrix product, we write a dot product. But this is simply a shift in notation — the quantities involved are no different than before.
Example 14.6.10. Applying the general chain rule.
Let \(f:U\subseteq \mathbb{R}^2\to\mathbb{R}\) be a function of 2 variables, and let \(g:V\subseteq \mathbb{R}^2\to\mathbb{R}^2\) be given by
\begin{equation*}
g(s,t) = \langle x(s,t), y(s,t)\rangle\text{.}
\end{equation*}
Use Theorem 14.6.8 to determine a formula for the partial derivatives of \(h(s,t) = f(g(s,t))\text{.}\)
Solution.
The Jacobian matrix of \(f\) is the \(1\times 2\) matrix \(\begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}\end{bmatrix}\text{,}\) and the Jacobian matrix of \(g\) is the \(2\times 2\) matrix
\begin{equation*}
Dg(s,t) = \begin{bmatrix} \dfrac{\partial x}{\partial s} & \dfrac{\partial x}{\partial t}\\ \dfrac{\partial y}{\partial s} & \dfrac{\partial y}{\partial t}\end{bmatrix}\text{.}
\end{equation*}
Theorem 14.6.8 then gives
\begin{equation*}
Dh(s,t) = Df(g(s,t))\cdot Dg(s,t) = \begin{bmatrix} \dfrac{\partial f}{\partial x}\dfrac{\partial x}{\partial s}+\dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial s} & \dfrac{\partial f}{\partial x}\dfrac{\partial x}{\partial t}+\dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial t}\end{bmatrix}\text{,}
\end{equation*}
and the two entries of this \(1\times 2\) matrix are the partial derivatives \(\dfrac{\partial h}{\partial s}\) and \(\dfrac{\partial h}{\partial t}\text{.}\)
Again, this reproduces another instance of the Chain Rule from Section 14.2.
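To see the matrix form of the Chain Rule at work numerically (with functions chosen only for illustration), let \(f(x,y)=xy\) and \(g(s,t) = \langle s+t, st\rangle\text{,}\) so that \(h(s,t) = f(g(s,t)) = (s+t)st = s^2t+st^2\text{.}\) On one hand,
\begin{equation*}
Df(x,y) = \begin{bmatrix} y & x\end{bmatrix}\qquad\text{and}\qquad Dg(s,t) = \begin{bmatrix} 1 & 1\\ t & s\end{bmatrix}\text{,}
\end{equation*}
so, evaluating \(Df\) at \((x,y)=(s+t,st)\text{,}\)
\begin{equation*}
Df(g(s,t))\cdot Dg(s,t) = \begin{bmatrix} st & s+t\end{bmatrix}\begin{bmatrix} 1 & 1\\ t & s\end{bmatrix} = \begin{bmatrix} 2st+t^2 & s^2+2st\end{bmatrix}\text{.}
\end{equation*}
On the other hand, differentiating \(h(s,t)=s^2t+st^2\) directly gives \(\dfrac{\partial h}{\partial s} = 2st+t^2\) and \(\dfrac{\partial h}{\partial t} = s^2+2st\text{,}\) in agreement with the matrix product, just as Theorem 14.6.8 promises.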
With additional experimentation, you will find that every instance of the Chain Rule you have previously encountered can be interpreted as a special case of Theorem 14.6.8. Moreover, a slight shift in interpretation makes this version of the Chain Rule even more obvious! (There’s another detour coming, but stick with us.)
Let us digress briefly and discuss the progression of mathematics from Calculus to higher math. If you continue on to upper-level undergraduate mathematics, you will encounter courses in Analysis and Topology. Analysis deals with the theoretical underpinnings of Calculus: this is where you see all the careful proofs of theorems that have been omitted from this text. Topology is a further abstraction of Analysis. In Topology, one studies continuity (and its consequences) at its most fundamental, abstract level.
The corresponding successors to Calculus in several variables are known as differential geometry and differential topology. You probably won’t encounter these unless you continue on to graduate studies in mathematics. One of the core philosophies in these two (closely related) subjects is the following: functions map points; derivatives map tangent vectors.
This can be understood in our context. At any point \(\vec{a}\) in \(\mathbb{R}^n\text{,}\) we can attach a copy of the vector space \(\mathbb{R}^n\text{,}\) thought of as all the possible tangent vectors to curves passing through that point.
Let \(\vec{r}:(a,b)\to \mathbb{R}^n\) be such a curve, and let \(f:\mathbb{R}^n\to \mathbb{R}^m\) be a differentiable function. The composite function \(\vec{s}=f\circ \vec{r}\) is then a curve in \(\mathbb{R}^m\text{.}\) The point \(\vec{a} = \vec{r}(t_0)\) on our first curve in \(\mathbb{R}^n\) becomes a point
\begin{equation*}
\vec{b} = \vec{s}(t_0) = f(\vec{r}(t_0)) = f(\vec{a})
\end{equation*}
on our new curve in \(\mathbb{R}^m\text{.}\) What about tangent vectors?
At the point \(\vec{a}\text{,}\) we have the tangent vector \(\vec{v} = \vrp(t_0)\text{.}\) What is the tangent vector to \(\vec{s}(t)\) at the point \(\vec{b}\text{?}\) On the one hand, by definition, we have the tangent vector
\begin{equation*}
\vec{w} = \vec{s}\,'(t_0)\text{.}
\end{equation*}
On the other hand, the Chain Rule tells us that
\begin{equation*}
\vec{w} = \vec{s}\,'(t_0) = Df(\vec{r}(t_0))\cdot \vrp(t_0) = Df(\vec{a})\cdot\vec{v}\text{.}
\end{equation*}
Multiplying the original tangent vector by the derivative of \(f\) gives us the new tangent vector. Cool!
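Here is a concrete instance (with a curve and map chosen only for illustration). Let \(\vec{r}(t) = \la t, t^2\ra\) and \(f(x,y) = \la x^2-y^2,\, 2xy\ra\text{,}\) and take \(t_0=1\text{,}\) so \(\vec{a} = \vec{r}(1) = \la 1,1\ra\) and \(\vec{v} = \vrp(1) = \la 1,2\ra\text{.}\) Since
\begin{equation*}
Df(x,y) = \begin{bmatrix} 2x & -2y\\ 2y & 2x\end{bmatrix}\qquad\text{and so}\qquad Df(1,1) = \begin{bmatrix} 2 & -2\\ 2 & 2\end{bmatrix}\text{,}
\end{equation*}
the new tangent vector should be \(Df(\vec{a})\cdot\vec{v} = \la 2-4,\, 2+4\ra = \la -2, 6\ra\text{.}\) Computing directly, \(\vec{s}(t) = f(\vec{r}(t)) = \la t^2-t^4,\, 2t^3\ra\text{,}\) so \(\vec{s}\,'(t) = \la 2t-4t^3,\, 6t^2\ra\) and \(\vec{s}\,'(1) = \la -2, 6\ra\text{,}\) exactly as predicted.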
What’s more, we can view this as a linear transformation. Let \(V\) denote the vector space of all tangent vectors at the point \(\vec{a}\) in \(\mathbb{R}^n\) (this is just a copy of \(\mathbb{R}^n\)) and let \(W\) denote the space of all tangent vectors in \(\mathbb{R}^m\) at the point \(\vec{b}\text{.}\) Then we have the linear transformation \(T:V\to W\) given by
\begin{equation*}
T(\vec{v}) = Df(\vec{a})\cdot\vec{v}\text{.}
\end{equation*}
In more advanced Calculus, or Differential Geometry, we view this linear transformation as the derivative of \(f\) at \(\vec{a}\text{.}\) Now, recall from Linear Algebra that matrix multiplication corresponds to the composition of the corresponding linear transformations: if \(S(\vec{v}) = A\vec{v}\) and \(T(\vec{w}) = B\vec{w}\text{,}\) and the matrices \(A\) and \(B\) are of the appropriate sizes, then
\begin{equation*}
(T\circ S)(\vec{v}) = T(S(\vec{v})) = B(A\vec{v}) = (BA)\vec{v}\text{.}
\end{equation*}
Suppose we have differentiable functions \(f:\mathbb{R}^n\to \mathbb{R}^m\) and \(g:\mathbb{R}^m\to \mathbb{R}^p\text{.}\) Let \(T_f:\mathbb{R}^n\to \mathbb{R}^m\) be the linear function given by the derivative of \(f\text{,}\) and let \(T_g:\mathbb{R}^m\to \mathbb{R}^p\) be the linear function given by the derivative of \(g\text{.}\) The chain rule is then essentially telling us that the derivative of a composition is the composition of the derivatives: we have
\begin{equation*}
T_{g\circ f} = T_g\circ T_f\text{.}
\end{equation*}
(But beware of the dual usage of \(\mathbb{R}^n\) here. In the first composition, we’re thinking of it as a set of points in the domain of a function. In the second composition, we’re thinking of it as the set of tangent vectors at a point.)
This turns out to be an extremely powerful way of looking at derivatives and the Chain Rule. You may want to keep this in mind in later sections, such as when we consider change of variables in multiple integrals at the end of Chapter 15, and when we define integrals over curves and surfaces in Chapter 16. We won’t use this language when we get there, but many of the results in those sections (for example, the formula for surface area of a parametric surface) can be understood according to the two principles we have just seen: functions map points, while derivatives map tangent vectors, and the derivative of a composition is the composition of the derivatives.