Is my math okay for gradient descent?

Discussion Overview

The discussion revolves around the application of gradient descent to minimize a function involving multiple variables, specifically focusing on the optimization of parameters ##K##, ##V_d##, and ##V_m##. Participants explore the mathematical formulation of the gradient descent algorithm, the derivation of partial derivatives, and the challenges faced in implementing the algorithm in code.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • The original poster describes their approach to minimizing a function ##R## using gradient descent for parameters ##V_d## and ##V_m##, while seeking to include ##K## in the same framework.
  • Some participants suggest using matrix vector notation for clarity and efficiency in calculations, indicating that this could simplify the gradient descent process.
  • Concerns are raised about the clarity of the notation used, particularly regarding the indices of variables like ##V_{0j}## and ##X_{1j}##, which some find confusing.
  • One participant emphasizes the importance of verifying gradient calculations through finite difference methods to ensure accuracy before troubleshooting code implementation.
  • There is a discussion about the potential convergence issues in gradient descent, with participants noting that these can arise from various factors, including the correctness of the mathematical formulation.
  • Participants express uncertainty about the relationships between the variables, particularly regarding the derivatives of ##V_d## and ##V_m## with respect to ##K##, with some suggesting they may be zero.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the clarity of the mathematical formulation or the correctness of the gradient calculations. Multiple competing views on notation and the approach to gradient descent remain present, and the discussion is unresolved regarding the implementation issues faced by the original poster.

Contextual Notes

Participants highlight limitations in notation clarity and dimensionality understanding, which may affect the interpretation of the mathematical expressions. There is also mention of unresolved issues regarding the relationships between variables and their derivatives.

fahraynk
I am trying to minimize the function below, ##R##, to find the optimum ##K##, ##V_d##, and ##V_m##.
Currently I minimize ##V_d## and ##V_m## with gradient descent and find the best ##K## through a binary search, but, if possible, I would like to drop the binary search and use only gradient descent. I am having trouble, though, and I'm not sure whether the problem is in my code or my math, so I am hoping someone can check my math.

I have ##N## experiments. ##K## is an unknown independent variable that, along with a given ##H_j##, controls my dependent ##X## variables through a function ##X_n=F(H,K); n \in[1,2]##. ##V_d## and ##V_m## are coefficients within a certain range which must be found through minimization. ##H_j## and ##V_{0j}## are given experimental values.

$$R=\sum_{j=0}^N[V_{0j}-\frac{1}{H_{j}}(V_mX_{1j}+2V_dX_{2j})]^2$$
$$\phi_j=V_{0j}-\frac{1}{H_j}(V_mX_{1j}+2V_dX_{2j})$$

$$\frac{\partial R}{\partial V_m}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}X_{1j}$$

$$\frac{\partial R}{\partial V_d}=\sum_{j=0}^N-\frac{4\phi_j}{H_j}X_{2j}$$

Per gradient descent, I update ##V_m## and ##V_d## using the partial derivatives of ##R## and a learning rate ##\alpha##.
$$V_m^{i+1}=V_m^i-\alpha\frac{\partial R}{\partial V_m}$$
$$V_d^{i+1}=V_d^i-\alpha\frac{\partial R}{\partial V_d}$$
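These two update rules can be sketched in code. Below is a minimal NumPy version (the thread's actual code is in MATLAB; the array names `V0`, `H`, `X1`, `X2` and the learning rate are placeholders, not the original implementation):

```python
import numpy as np

def residuals(V0, H, X1, X2, Vm, Vd):
    # phi_j = V0_j - (Vm*X1_j + 2*Vd*X2_j) / H_j
    return V0 - (Vm * X1 + 2.0 * Vd * X2) / H

def grads(V0, H, X1, X2, Vm, Vd):
    # dR/dVm = sum_j -2*phi_j*X1_j/H_j,   dR/dVd = sum_j -4*phi_j*X2_j/H_j
    phi = residuals(V0, H, X1, X2, Vm, Vd)
    return np.sum(-2.0 * phi * X1 / H), np.sum(-4.0 * phi * X2 / H)

def step(V0, H, X1, X2, Vm, Vd, alpha=0.001):
    # one gradient-descent update of Vm and Vd with learning rate alpha
    g_m, g_d = grads(V0, H, X1, X2, Vm, Vd)
    return Vm - alpha * g_m, Vd - alpha * g_d
```

Iterating `step` until the gradients are small is exactly the pair of update rules above.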

The above calculations work, but I run into trouble when I try to do the same for ##K##. This is my math for taking the partial derivative of ##R## with respect to ##K##:
$$\frac{\partial R}{\partial K}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}(\frac{\partial V_m}{\partial K}X_{1j}+\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial V_d}{\partial K}X_{2j}+2\frac{\partial X_{2j}}{\partial K}V_d)$$

Because ##K## controls ##X_j##, for a small change in ##K## the optimal ##V_d## and ##V_m## might change a bit through the minimization of ##R## by Gradient Descent, but I don't have a relationship for it. I am not sure, but I think ##\frac{\partial V_d}{\partial K} = 0## and ##\frac{\partial V_m}{\partial K} = 0##. Furthermore, for ##X## :
$$\frac{\partial X_n}{\partial K}=\frac{\partial F}{\partial K}\approx\frac{F(H,K+\delta K)-F(H,K)}{\delta K}$$

So, the partial of ##R## becomes :
$$\frac{\partial R}{\partial K}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}(\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial X_{2j}}{\partial K}V_d)$$

and the update step for ##K## :
$$K_{i+1}=K_i-\alpha\frac{\partial R}{\partial K}$$
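A sketch of this ##K## step in NumPy (again a hedged translation, not the MATLAB code from the thread; `F` stands in for whatever forward model produces ##X_1## and ##X_2## from ##H## and ##K##):

```python
import numpy as np

def dR_dK(V0, H, K, Vm, Vd, F, dK=1e-6):
    # F(H, K) -> (X1, X2): placeholder for the forward model X_n = F(H, K)
    X1, X2 = F(H, K)
    X1p, X2p = F(H, K + dK)
    dX1 = (X1p - X1) / dK  # forward difference for dX1/dK
    dX2 = (X2p - X2) / dK  # forward difference for dX2/dK
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    # dR/dK = sum_j -2*phi_j/H_j * (Vm*dX1_j + 2*Vd*dX2_j)
    return np.sum(-2.0 * phi / H * (Vm * dX1 + 2.0 * Vd * dX2))

def step_K(V0, H, K, Vm, Vd, F, alpha=1e-3):
    # one gradient-descent update of K, holding Vm and Vd fixed
    return K - alpha * dR_dK(V0, H, K, Vm, Vd, F)
```

Note that this treats ##V_m## and ##V_d## as constants during the ##K## step, which is exactly the ##\frac{\partial V_d}{\partial K}=\frac{\partial V_m}{\partial K}=0## assumption.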

Anyway, it does not converge correctly when I implement this in MATLAB. I am not sure whether I am wrong about the partials of ##V_d## and ##V_m##, whether the fact that ##V_m## multiplies ##X## has some effect, or whether my math is right and the problem is in my code. Any help is appreciated.
 
I found this rather hard to parse.

It would be customary to use matrix-vector notation at each step and make this a simple exercise in matrix calculus. (More generally I think there's a 3-D tensor here, but if you look at it one data point at a time, via the simplest form of stochastic gradient descent, things simplify a lot, in particular down to matrix-vector operations.)

For instance, I don't know what ##V_{0j}## is and why it has a double index like an array while all the others have single indices. There are similar issues with the ##X##'s: you say they are ##X_n## for ##n \in [1,2]##, but you write ##X_{2j}##, which seems to contradict that.

- - - -

A more general idea: gradient descent can have convergence problems for a lot of different reasons -- you should have the math nailed down long before that. A simple check: compute your gradient at some sampled data points and compare it against small perturbed values, i.e. a finite difference -- if the two don't essentially match, your gradient calculation has problems. You can repeat this with many (random) trials as needed.
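As a concrete sketch of that check (NumPy here, not the thread's MATLAB; `R` and the analytic gradients are taken from the opening post, the rest is illustrative):

```python
import numpy as np

def R(V0, H, X1, X2, Vm, Vd):
    # R = sum_j phi_j^2, with phi_j as defined in the opening post
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    return np.sum(phi ** 2)

def grad_check(V0, H, X1, X2, Vm, Vd, eps=1e-6):
    # analytic gradients, as derived in the opening post
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    g = np.array([np.sum(-2.0 * phi * X1 / H),
                  np.sum(-4.0 * phi * X2 / H)])
    # central finite differences in Vm and Vd
    g_fd = np.array([
        (R(V0, H, X1, X2, Vm + eps, Vd) - R(V0, H, X1, X2, Vm - eps, Vd)) / (2 * eps),
        (R(V0, H, X1, X2, Vm, Vd + eps) - R(V0, H, X1, X2, Vm, Vd - eps)) / (2 * eps),
    ])
    # largest relative mismatch; should be tiny if the math is right
    return np.max(np.abs(g - g_fd) / (np.abs(g_fd) + 1e-12))
```

If this returns something like 1e-7 over many random draws of the data and parameters, the analytic gradient is almost certainly right; if it returns something of order one, the math (or its transcription into code) is wrong.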

Note: there are automatic differentiation tools out there that you can use if your goal is to just get gradient descent 'right'. If your goal is to practice the math, that is a different goal.
 
StoneTemplePython said:
I found this rather hard to parse.

It would be customary to use matrix-vector notation at each step and make this a simple exercise in matrix calculus. (More generally I think there's a 3-D tensor here, but if you look at it one data point at a time, via the simplest form of stochastic gradient descent, things simplify a lot, in particular down to matrix-vector operations.)

For instance, I don't know what ##V_{0j}## is and why it has a double index like an array while all the others have single indices. There are similar issues with the ##X##'s: you say they are ##X_n## for ##n \in [1,2]##, but you write ##X_{2j}##, which seems to contradict that.

- - - -

A more general idea: gradient descent can have convergence problems for a lot of different reasons -- you should have the math nailed down long before that. A simple check: compute your gradient at some sampled data points and compare it against small perturbed values, i.e. a finite difference -- if the two don't essentially match, your gradient calculation has problems. You can repeat this with many (random) trials as needed.

Note: there are automatic differentiation tools out there that you can use if your goal is to just get gradient descent 'right'. If your goal is to practice the math, that is a different goal.
There are two ##X## variables and ##N## experiments; ##X_{1j}## is the value of ##X_1## for experiment ##j##, ##j \in [1,N]##.
##V_0## is a given experimental value, which is known, and there are ##N## experiments, so ##V_{0j}##, ##j \in [1,N]##, are all known.
What exactly do you want me to turn into a matrix notation? I'm not sure what you mean, but I am willing to try. I am a math major, so I should learn!
 
fahraynk said:
There are two ##X## variables and ##N## experiments; ##X_{1j}## is the value of ##X_1## for experiment ##j##, ##j \in [1,N]##.
##V_0## is a given experimental value, which is known, and there are ##N## experiments, so ##V_{0j}##, ##j \in [1,N]##, are all known.
What exactly do you want me to turn into a matrix notation? I'm not sure what you mean, but I am willing to try. I am a math major, so I should learn!

It should be immediately clear to people that ##X## is a matrix / array; if you invert the values of ##H##, they form a vector (traditionally written in lower-case bold); I think ##V_d## and ##V_m## are just scalars (Greek letters would be preferred here); and despite the double index, ##V_{0j}## seems to just be a vector (again lower-case bold). Notation is by no means standardized, but after reading your original post twice I still struggled to figure out basic things about the dimensionality of these parameters. This really isn't the main point, though -- the bigger issue, I think, lurks in ##K##.

High level: it looks like you are differentiating with respect to those two values ##V_d## and ##V_m## and applying the chain rule correctly in the process.

Again, the fundamental point: did you compute a finite difference and compare it against your gradient? What were the results? If you don't know how to compute a good finite difference and compare it against your gradient over some sampled data points, that itself tells you you're in trouble -- you don't have a ruler to measure your gradient's accuracy against.
 
StoneTemplePython said:
It should be immediately clear to people that ##X## is a matrix / array; if you invert the values of ##H##, they form a vector (traditionally written in lower-case bold); I think ##V_d## and ##V_m## are just scalars (Greek letters would be preferred here); and despite the double index, ##V_{0j}## seems to just be a vector (again lower-case bold). Notation is by no means standardized, but after reading your original post twice I still struggled to figure out basic things about the dimensionality of these parameters. This really isn't the main point, though -- the bigger issue, I think, lurks in ##K##.

High level: it looks like you are differentiating with respect to those two values ##V_d## and ##V_m## and applying the chain rule correctly in the process.

Again, the fundamental point: did you compute a finite difference and compare it against your gradient? What were the results? If you don't know how to compute a good finite difference and compare it against your gradient over some sampled data points, that itself tells you you're in trouble -- you don't have a ruler to measure your gradient's accuracy against.

I'll try that later today or Thursday and let you know.
So, when you say I applied the chain rule correctly, do you mean you believe my assumption that ##\frac{\partial V_d}{\partial K}=0## and ##\frac{\partial V_m}{\partial K}=0## is correct?
 
fahraynk said:
I'll try that later today or Thursday and let you know.
So, when you say I applied the chain rule correctly, do you mean you believe my assumption that ##\frac{\partial V_d}{\partial K}=0## and ##\frac{\partial V_m}{\partial K}=0## is correct?

Look, your original post says that you already use gradient descent to tune your model with respect to those two ##V##'s. At some point you just need to come out and assert that they are independent of ##K## -- and state the other features of your model.

The very basic idea underlying all this is to state clearly how your model works, then trace the implications for the derivatives, trace the (finite-difference) implications of small perturbations, and compare.
 
