Derivatives, Gradients, Jacobians and Hessians

原始链接: https://blog.demofox.org/2025/08/16/derivatives-gradients-jacobians-and-hessians-oh-my/

## Calculus Basics: Derivatives, Gradients, Jacobians, and Hessians

This article explains four key concepts from calculus and their applications, starting with the **derivative**. A derivative measures how a function changes at any given point, which is essential for **optimization** (finding minimums or maximums). Optimization can be done iteratively (as in gradient descent, which simulates rolling down a hill) or by directly finding the points where the derivative is zero. For functions of multiple variables we use the **gradient**, a vector of partial derivatives (the change with respect to each variable) that points in the direction of steepest ascent; gradient descent uses it to find minimums. When a function outputs multiple values, the **Jacobian matrix** comes into play. It is essentially a collection of gradients, one per output value, and reveals how the function warps space; its determinant tells you whether the function expands, shrinks, or flips space. Finally, the **Hessian matrix** holds the *second* derivatives, describing the function's curvature in detail, which allows for faster optimization. However, computing the Hessian can be computationally expensive, which leads to quasi-Newton methods being used instead. These tools are foundational not only in mathematics but also in fields like machine learning, computer graphics, and rendering.

## Derivatives, Gradients, Jacobians, and Hessians: Discussion Summary

This Hacker News discussion centers on understanding key calculus concepts (derivatives, gradients, Jacobians, and Hessians) and how they relate to optimization algorithms. One key point is visualizing gradients as "arrow plots" to better understand the optimization process. The conversation digs into mathematical nuances, clarifying that the Jacobian represents a collection of gradients for a multivariable function, while the Hessian describes a function's local curvature (like a parabola). There is some debate about the best way to understand these concepts, with some advocating a function-based understanding rather than purely symbolic definitions. Advanced techniques such as automatic differentiation (using tools like Enzyme) and the importance of considering tangent spaces also come up. A recurring theme is the contrast between human intuition for finding minimums in low dimensional spaces and the challenges algorithms face in high dimensional spaces, which highlights why gradient-based methods matter for complex problems like training large language models. Ultimately, the discussion underscores the power of different mental models and tools for understanding these mathematical concepts.

## Original Article

This article explains how these four things fit together and shows some examples of what they are used for.

Derivatives

Derivatives are the most fundamental concept in calculus. If you have a function, a derivative tells you how much that function changes at each point.

If we start with the function y=x^2-6x+13, we can calculate the derivative as y'=2x-6. Here are those two functions graphed.

One use of derivatives is for optimization – also known as finding the lowest part on a graph.

If you were at x = 1 and wanted to know whether you should go left or right to get lower, the derivative can tell you. Plugging 1 into 2x-6 gives the value -4. A negative derivative means taking a step to the right will make the y value go down, so going right is down hill. We could take a step to the right and check the derivative again to see if we’ve walked far enough. As we are taking steps, if the derivative becomes positive, that means we went too far and need to turn around, and start going left. If we shrink our step size whenever we go too far in either direction, we can get arbitrarily close to the actual minimum point on the graph.

What I just described is an iterative optimization method that is similar to gradient descent. Gradient descent simulates a ball rolling down hill to find the lowest point it can, adjusting the step size, and even adding momentum to try not to get stuck in places that are not the true minimum.
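To make that iterative procedure concrete, here is a minimal Python sketch (not from the original article) that walks downhill on y = x^2 - 6x + 13 using its derivative; the step size and iteration count are arbitrary choices for illustration.

```python
# Minimal derivative-driven descent on y = x^2 - 6x + 13, whose derivative is 2x - 6.
def f(x):
    return x * x - 6 * x + 13

def f_prime(x):
    return 2 * x - 6

x = 1.0          # starting point from the example above
step_size = 0.1  # illustrative step size

for _ in range(100):
    slope = f_prime(x)
    x -= step_size * slope  # a negative slope moves us right, a positive slope moves us left

print(x, f(x))  # converges toward x = 3, y = 4
```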

We can make an observation though: The minimum of a function is flat, and has a derivative of 0. If it didn't, that would mean the point is on a slope, so stepping in one of the two directions (left or right) would give a lower value, and it couldn't be the minimum.

Armed with this knowledge, another way to use derivatives to find the minimum is to find where the derivative is 0. We can do that by solving the equation 2x-6 = 0 and getting the value x=3. Without iteration, we found that the minimum of the function is at x=3 and we can plug 3 into the original equation y=x^2-6x+13 to find out that the minimum y value is 4.
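The same answer can be checked symbolically. Here is a short sketch using sympy (assuming it is available); it differentiates the function, solves for where the derivative is zero, and plugs that point back in.

```python
# Symbolic check: minimize y = x^2 - 6x + 13 by solving y' = 0.
import sympy as sp

x = sp.symbols('x')
y = x**2 - 6*x + 13
dy = sp.diff(y, x)                  # 2*x - 6
critical_points = sp.solve(dy, x)   # [3]
print(critical_points, y.subs(x, critical_points[0]))  # [3] 4
```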

Things get more complicated when the functions are higher order than quadratic. Higher order functions can have both minimums and maximums (possibly several of each), and all of those have 0 derivatives. Also, if the coefficient of the x^2 term of a quadratic is negative, then it only has a maximum, instead of a minimum.

Higher dimensional functions also get more complex, where for instance you could have a point on a two dimensional function z=f(x,y) that is a local minimum along x but a local maximum along y (a saddle point, such as z=x^2-y^2 at the origin). The gradient will be zero in each direction, despite it not being a minimum, and the simulated ball will get stuck.

Gradients

Speaking of higher dimensional functions, that is where gradients come in.

If you have a function w=f(x,y,z), a gradient is a vector of derivatives, where you consider changing only one variable at a time, leaving the other variables constant. The notation for a gradient looks like this:

\nabla f(x,y,z) = \begin{bmatrix} \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} & \frac{\partial w}{\partial z} \end{bmatrix}

Looking at a single entry in the vector, \frac{\partial w}{\partial x}, that means “The derivative of w with respect to x”. Another way of saying that is “if you nudged x up by a tiny amount before plugging it into the function, this is the rate at which w would change”. These are called partial derivatives, because they are derivatives with respect to one variable in a function that takes multiple variables.

Let’s work through calculating the gradient of the function w=3x^2+6yz^3+4.

To calculate the derivative of w with regard to x (\frac{\partial w}{\partial x}), we take the derivative of the function as usual, but we only treat x as a variable, and all other variables as constants. That gives us 6x.

Calculating the derivative of w with regard to y, we treat y as a variable and all others as constants to get: 6z^3.

Lastly, to calculate the derivative of w with regard to z, we treat z as a variable and all others as constants. That gives us 18yz^2.

The full gradient of the function is: \begin{bmatrix} 6x & 6z^3 & 18yz^2 \end{bmatrix}.
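As a quick check, the same partial derivatives can be computed with sympy; this is a sketch under the assumption that sympy is installed, not part of the original article.

```python
# Verifying the gradient of w = 3x^2 + 6yz^3 + 4 symbolically.
import sympy as sp

x, y, z = sp.symbols('x y z')
w = 3*x**2 + 6*y*z**3 + 4
gradient = [sp.diff(w, var) for var in (x, y, z)]
print(gradient)  # [6*x, 6*z**3, 18*y*z**2]
```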

An interesting thing about gradients is that when you calculate them for a specific point, they give a vector that points in the direction of the biggest increase in the function, or equivalently, in the steepest uphill direction. The opposite direction of the gradient is the biggest decrease of the function, or the steepest downhill direction. This is why gradients are used in the optimization method “Gradient Descent”. The gradient (multiplied by a step size) is subtracted from a point to move it down hill.
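Here is a minimal sketch of that update rule in two variables, using an illustrative function of my own choosing, f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1).

```python
# Gradient descent in 2D: repeatedly subtract step_size * gradient from the point.
def gradient(x, y):
    return (2 * (x - 3), 2 * (y + 1))  # partial derivatives of (x-3)^2 + (y+1)^2

x, y = 0.0, 0.0   # starting point
step_size = 0.1

for _ in range(200):
    gx, gy = gradient(x, y)
    x -= step_size * gx  # move opposite the gradient: the steepest downhill direction
    y -= step_size * gy

print(x, y)  # approaches (3.0, -1.0)
```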

Besides optimization, gradients can also be used in rendering. For instance, here the gradient is used for rendering anti-aliased signed distance fields: https://iquilezles.org/articles/distance/

Jacobian Matrix

Let’s say you had a function that took in multiple values and gave out multiple values: v, w = f(x,y,z).

We could calculate the gradient of this function for v, and we could calculate it for w. If we put those two gradient vectors together to make a matrix, we would get the Jacobian matrix! You can also think of a gradient vector as being the Jacobian matrix of a function that outputs a single scalar value, instead of a vector.

Here is the Jacobian for v, w = f(x,y,z):

\mathbb{J} = \begin{bmatrix} \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z} \\ \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} & \frac{\partial w}{\partial z} \end{bmatrix}

If that’s hard to read, the top row is the gradient for v, and the bottom row is the gradient for w.

When you evaluate the Jacobian matrix at a specific point in space (of whatever space the input parameters are in), it tells you how the space is warped in that location – like how much it is rotated and squished. When the Jacobian is square (the function has as many outputs as inputs), you can also take its determinant to see if things in that area get bigger (determinant greater than 1), smaller (determinant less than 1 but greater than 0), or if they get flipped inside out (determinant is negative). If the determinant is zero, it means the function squishes that neighborhood down into something lower dimensional (a plane, a line, or a point), and also means that the operation can’t be reversed there (the matrix can’t be inverted).
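As an example of a square Jacobian (not from the article), the polar-to-Cartesian map (r, θ) → (r cos θ, r sin θ) has a 2×2 Jacobian whose determinant works out to r, which matches the intuition that a small patch of polar space gets stretched more the farther it is from the origin. Here is a sympy sketch:

```python
# Jacobian and determinant of the polar-to-Cartesian map.
import sympy as sp

r, theta = sp.symbols('r theta')
outputs = sp.Matrix([r * sp.cos(theta), r * sp.sin(theta)])  # (x, y)
J = outputs.jacobian([r, theta])
print(J)                      # [[cos(theta), -r*sin(theta)], [sin(theta), r*cos(theta)]]
print(sp.simplify(J.det()))   # r
```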

Here’s a great 10 minute video that goes into Jacobian Matrices a little more deeply and shows how they can be useful in machine learning: https://www.youtube.com/watch?v=AdV5w8CY3pw

Since Jacobians describe warping of space, they are also useful in computer graphics, where for instance, you might want to use alpha transparency to fade an object out over a specific number of pixels to perform anti-aliasing, but the object may be described in polar coordinates, or be warped in a way that makes it hard to know how many units to fade out over in that modified space. This has come up for me when doing 2D SDF rendering in shadertoy.

Hessian Matrix

If you take all partial derivatives (aka make a gradient) of a function w=f(x,y,z), that will give you a vector with three partial derivatives out – one for x, one for y, one for z.

What if we wanted to get the 2nd derivatives? In other words, what if we wanted to take the derivative of the derivatives?

You could just take the derivative with respect to the same variables again, but to really understand the second derivatives of the function, we should take all three partial derivatives (one for x, one for y, one for z) of EACH of those three derivatives in the gradient.

That would give us 9 derivatives total, and that is exactly what the Hessian Matrix is.

\mathbb{H} = \begin{bmatrix} \frac{\partial^2 w}{\partial x^2} & \frac{\partial^2 w}{\partial x \partial y} & \frac{\partial^2 w}{\partial x \partial z} \\ \frac{\partial^2 w}{\partial y \partial x} & \frac{\partial^2 w}{\partial y^2} & \frac{\partial^2 w}{\partial y \partial z} \\ \frac{\partial^2 w}{\partial z \partial x} & \frac{\partial^2 w}{\partial z \partial y} & \frac{\partial^2 w}{\partial z^2} \end{bmatrix}

If that is hard to read, each row is the gradient, but then the top row is differentiated with respect to x, the middle row is differentiated with respect to y, and the bottom row is differentiated with respect to z.

Another way to think about the Hessian is that it’s the transpose of the Jacobian matrix of the gradient. That’s a mouthful, but it hopefully helps you better see how these things fit together.
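To tie the pieces together, here is a sketch (assuming sympy is available) that computes the Hessian of the earlier example function w = 3x^2 + 6yz^3 + 4.

```python
# Hessian of w = 3x^2 + 6yz^3 + 4: a 3x3 symmetric matrix of second partial derivatives.
import sympy as sp

x, y, z = sp.symbols('x y z')
w = 3*x**2 + 6*y*z**3 + 4
H = sp.hessian(w, (x, y, z))
print(H)
# Matrix([[6, 0,       0      ],
#         [0, 0,       18*z**2],
#         [0, 18*z**2, 36*y*z ]])
```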

Taking the 2nd derivative of a function tells you how the function curves, which can be useful (again!) for optimization.

This 11 minute video talks about how the Hessian is used in optimization to get the answer faster, by knowing the curvature of the functions: https://www.youtube.com/watch?v=W7S94pq5Xuo
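One way curvature speeds things up is Newton's method, which steps by -H^{-1} times the gradient instead of just the negative gradient. Below is a minimal numpy sketch on an illustrative quadratic of my own choosing, f(x, y) = x^2 + 3y^2; for a quadratic, a single Newton step lands exactly on the minimum.

```python
# One Newton step: p_new = p - H^(-1) * grad(p).
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, 6 * y])        # gradient of x^2 + 3y^2

def hessian(p):
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])          # constant Hessian of a quadratic

p = np.array([5.0, -4.0])
p = p - np.linalg.solve(hessian(p), grad(p))  # solve H * step = grad, then subtract
print(p)  # [0. 0.], the minimum
```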

Calculating the Hessian can be quite costly, both computationally and in terms of memory, for machine learning problems that have millions of parameters or more. In those cases, there are quasi-Newton methods, which you can watch an 11 minute video about here: https://www.youtube.com/watch?v=UvGQRAA8Yms

Thanks for reading and hopefully this helps clear up some scary sounding words!
