To find out why that is, we start with the directional derivative.
Directional Derivative
I expect you understand the partial derivatives and have a somewhat intuitive understanding of them. Otherwise we would have to start there.
The idea is to find out the rate of change in an arbitrary direction. So let's say we look at the point P(x0, y0) of the function f(x,y).
We want to know the rate of change in the direction of the vector u = <A, B>.
How do we do this? Let's define u to be a unit vector, so that A² + B² = 1.
Well the change in the function is of course Δf = f(x0+Ah,y0+Bh)-f(x0,y0)
This is the fundamental idea that we need to grasp. The vector controls the increase in x and y by its A and B components. The h variable is a continuous variable. In fact, the function
L(h) = f(x0+Ah,y0+Bh) describes all the values of the function f(x,y) that lie on the line through P parallel to the vector u.
If we accept this, then the rate of change can be defined by how much f(x,y) is changing, with respect to changes to this variable h.
So we have Δf/h = (f(x0+Ah,y0+Bh)-f(x0,y0))/h
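If you want to see this quotient in action, here is a small numerical sketch. The function f(x,y) = x²y, the point (1, 2) and the unit vector <3/5, 4/5> are all my own picks for illustration, nothing special about them.

```python
# Illustration only: f, the point P and the vector u are assumed examples.

def f(x, y):
    return x**2 * y

x0, y0 = 1.0, 2.0        # the point P(x0, y0)
A, B = 3 / 5, 4 / 5      # unit vector u = <A, B>, since (3/5)^2 + (4/5)^2 = 1

def difference_quotient(h):
    # Δf/h = (f(x0+Ah, y0+Bh) - f(x0, y0)) / h
    return (f(x0 + A * h, y0 + B * h) - f(x0, y0)) / h

# Shrink h and watch the quotient settle down
for h in (0.1, 0.01, 0.001, 0.0001):
    print(h, difference_quotient(h))
```

For this particular choice the printed values should creep toward 3.2 as h gets smaller.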
When we take the limit as h goes to 0, we of course end up with df/dh, which is the derivative of L(h) at h = 0. But let's find out what that fella actually looks like. Well to do this we have to use the dreaded Chain Rule!
Chain Rule
You said you had a problem understanding that a change due to change in y plus a change due to change in x is equal to the total change. Well let's look at that now.
If we want to approximate the change that happens in a function f, how can we do it?
Well if we know the partial derivatives, we know the rate of change in each direction (x and y). So we can compute the approximate change by the formula
Δf ≈ ∂f/∂x Δx + ∂f/∂y Δy
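Why is this? Well if we say that
Δf1 = f(x0+Δx, y0) - f(x0, y0) and
Δf2 = f(x0, y0+Δy) - f(x0, y0)
then Δf1 ≈ ∂f/∂x Δx and Δf2 ≈ ∂f/∂y Δy, straight from the definition of the partial derivatives. Now split the total change into two steps, first moving in x and then in y:
Δf = f(x0+Δx, y0+Δy) - f(x0, y0) = [f(x0+Δx, y0) - f(x0, y0)] + [f(x0+Δx, y0+Δy) - f(x0+Δx, y0)]
The first bracket is exactly Δf1. The second bracket is the change due to y, but measured starting from x0+Δx instead of x0. So Δf ≈ Δf1 + Δf2 can only be true if
f(x0+Δx, y0+Δy) - f(x0+Δx, y0) ≈ f(x0, y0+Δy) - f(x0, y0)
Is this true? Well yes, as long as the partial derivative ∂f/∂y is continuous near the point. If the change in x is very, very small, then ∂f/∂y will be almost (but not quite) the same at x0+Δx as at x0, and both sides will be approximately ∂f/∂y Δy.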
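You can check the approximation Δf ≈ ∂f/∂x Δx + ∂f/∂y Δy with a few numbers. This is just a sketch: the function f(x,y) = x²y and the point (1, 2) are my own example, and the partial derivatives are worked out by hand for that function.

```python
# Illustration only: f and the point (x0, y0) are assumed examples.

def f(x, y):
    return x**2 * y

def fx(x, y):
    # ∂f/∂x for this particular f, computed by hand
    return 2 * x * y

def fy(x, y):
    # ∂f/∂y for this particular f, computed by hand
    return x**2

x0, y0 = 1.0, 2.0

def compare(dx, dy):
    actual = f(x0 + dx, y0 + dy) - f(x0, y0)           # the true Δf
    estimate = fx(x0, y0) * dx + fy(x0, y0) * dy       # ∂f/∂x Δx + ∂f/∂y Δy
    return actual, estimate

for step in (0.1, 0.01, 0.001):
    actual, estimate = compare(step, step)
    print(step, actual, estimate, actual - estimate)
```

The last column is the error of the approximation. It should shrink much faster than the step itself, which is why the approximation becomes exact in the limit.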
So we can write Δf ≈ ∂f/∂x Δx + ∂f/∂y Δy.
If we then divide it all by h (since in our case Δx and Δy are actually Ah and Bh respectively), we have
Δf/h ≈ ∂f/∂x A + ∂f/∂y B
When we take the limit we end up with
df/dh = ∂f/∂x A + ∂f/∂y B
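Here is a rough way to test that formula on a computer. Everything specific in it is an assumption on my part: the function sin(x)·y², the point (0.5, 1.5), the direction, and the helper names. The partial derivatives are estimated with small central differences rather than computed exactly.

```python
import math

# Illustration only: f, the point and the direction are assumed examples.

def f(x, y):
    return math.sin(x) * y**2

def partials(x, y, eps=1e-6):
    # Numerical estimates of ∂f/∂x and ∂f/∂y (central differences)
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

def directional_derivative(x, y, A, B):
    # df/dh = ∂f/∂x A + ∂f/∂y B
    dfdx, dfdy = partials(x, y)
    return dfdx * A + dfdy * B

def quotient_along_u(x, y, A, B, h=1e-6):
    # Δf/h measured directly along the line through (x, y) in direction u
    return (f(x + A * h, y + B * h) - f(x, y)) / h

x0, y0 = 0.5, 1.5
A, B = math.cos(0.3), math.sin(0.3)   # a unit vector, since cos² + sin² = 1

print(directional_derivative(x0, y0, A, B))
print(quotient_along_u(x0, y0, A, B))
```

The two printed numbers should agree to several decimal places, one coming from the partial derivatives and the other from walking a tiny step along u.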
Gradient
We now have the directional derivative. The next question would be - In which direction do we find the greatest rate of change?
Well we can choose to view the above equation as the dot product between two vectors.
So df/dh = <∂f/∂x , ∂f/∂y> · <A, B> = v · u, where v = <∂f/∂x , ∂f/∂y>.
The dot product is also determined by
<∂f/∂x , ∂f/∂y> · <A, B> = |v||u|cos(θ), where θ is the angle between v and u.
Since |u| = 1 (because it is a unit vector) we have
df/dh = |v|cos(θ)
When is this equation the largest? Well it is the largest when cos(θ) = 1, that is, when the angle θ is zero. When is it zero? It is zero when the two vectors are parallel and point the same way. Thus the greatest rate of change is in the direction of the vector v, and that greatest rate is |v|. This vector we call the gradient and signify by ∇f.
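If you don't quite trust the argument, you can brute-force it: try a lot of unit vectors and see which one gives the biggest rate of change. A minimal sketch, again with my own example function f(x,y) = x²y at the point (1, 2), a hand-computed gradient, and an arbitrary choice of 3600 directions:

```python
import math

# Illustration only: f, the point and the number of directions are assumed.

def gradient(x, y):
    # <∂f/∂x, ∂f/∂y> for f(x, y) = x**2 * y, computed by hand
    return 2 * x * y, x**2

x0, y0 = 1.0, 2.0
gx, gy = gradient(x0, y0)

def rate(direction):
    # df/dh for the unit vector u = <cos(direction), sin(direction)>
    A, B = math.cos(direction), math.sin(direction)
    return gx * A + gy * B

# Try 3600 evenly spaced directions around the full circle
n = 3600
best_direction = max((2 * math.pi * k / n for k in range(n)), key=rate)

gradient_direction = math.atan2(gy, gx)   # angle of the gradient vector

print(best_direction, gradient_direction)
print(rate(best_direction), math.hypot(gx, gy))
```

The best direction found by the search should land right next to the direction of the gradient, and the rate in that direction should be very close to the length of the gradient.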
I hope this helped :)