Is my math okay for gradient descent?

Discussion Overview

The discussion revolves around the application of gradient descent to minimize a function involving multiple variables, specifically focusing on the optimization of parameters ##K##, ##V_d##, and ##V_m##. Participants explore the mathematical formulation of the gradient descent algorithm, the derivation of partial derivatives, and the challenges faced in implementing the algorithm in code.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • The original poster describes their approach to minimizing a function ##R## using gradient descent for parameters ##V_d## and ##V_m##, while seeking to include ##K## in the same framework.
  • Some participants suggest using matrix vector notation for clarity and efficiency in calculations, indicating that this could simplify the gradient descent process.
  • Concerns are raised about the clarity of the notation used, particularly regarding the indices of variables like ##V_{0j}## and ##X_{1j}##, which some find confusing.
  • One participant emphasizes the importance of verifying gradient calculations through finite difference methods to ensure accuracy before troubleshooting code implementation.
  • There is a discussion about the potential convergence issues in gradient descent, with participants noting that these can arise from various factors, including the correctness of the mathematical formulation.
  • Participants express uncertainty about the relationships between the variables, particularly regarding the derivatives of ##V_d## and ##V_m## with respect to ##K##, with some suggesting they may be zero.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the clarity of the mathematical formulation or the correctness of the gradient calculations. Multiple competing views on notation and the approach to gradient descent remain present, and the discussion is unresolved regarding the implementation issues faced by the original poster.

Contextual Notes

Participants highlight limitations in notation clarity and dimensionality understanding, which may affect the interpretation of the mathematical expressions. There is also mention of unresolved issues regarding the relationships between variables and their derivatives.

fahraynk
I am trying to minimize the function below, ##R##, to find the optimum ##K##, ##V_d##, and ##V_m##.
Currently I minimize ##V_d## and ##V_m## with gradient descent and find the best ##K## through a binary search, but, if possible, I would like to drop the binary search and use only gradient descent. I am having trouble, though, and I'm not sure whether the problem is in my code or my math, so I am hoping someone can check my math.

I have ##N## experiments. ##K## is an unknown independent variable that, along with a given ##H_j##, controls my dependent ##X## variables through a function ##X_n=F(H,K); n \in[1,2]##. ##V_d## and ##V_m## are coefficients within a certain range which must be found through minimization. ##H_j## and ##V_{0j}## are given experimental values.

$$R=\sum_{j=0}^N[V_{0j}-\frac{1}{H_{j}}(V_mX_{1j}+2V_dX_{2j})]^2$$
$$\phi_j=V_{0j}-\frac{1}{H_j}(V_mX_{1j}+2V_dX_{2j})$$

$$\frac{\partial R}{\partial V_m}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}X_{1j}$$

$$\frac{\partial R}{\partial V_d}=\sum_{j=0}^N-\frac{4\phi_j}{H_j}X_{2j}$$

Per gradient descent, I update ##V_m## and ##V_d## using the partial derivatives of ##R## and a learning rate ##\alpha##.
$$V_m^{i+1}=V_m^i-\alpha\frac{\partial R}{\partial V_m}$$
$$V_d^{i+1}=V_d^i-\alpha\frac{\partial R}{\partial V_d}$$
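These two update rules can be sketched in code. Below is a minimal NumPy version (the thread's actual code is in MATLAB; the array names `V0`, `H`, `X1`, `X2` and the learning rate are placeholders, not the original implementation):

```python
import numpy as np

def residuals(V0, H, X1, X2, Vm, Vd):
    # phi_j = V0_j - (Vm*X1_j + 2*Vd*X2_j) / H_j
    return V0 - (Vm * X1 + 2.0 * Vd * X2) / H

def grads(V0, H, X1, X2, Vm, Vd):
    # dR/dVm = sum_j -2*phi_j*X1_j/H_j,   dR/dVd = sum_j -4*phi_j*X2_j/H_j
    phi = residuals(V0, H, X1, X2, Vm, Vd)
    return np.sum(-2.0 * phi * X1 / H), np.sum(-4.0 * phi * X2 / H)

def step(V0, H, X1, X2, Vm, Vd, alpha=0.001):
    # one gradient-descent update of Vm and Vd with learning rate alpha
    g_m, g_d = grads(V0, H, X1, X2, Vm, Vd)
    return Vm - alpha * g_m, Vd - alpha * g_d
```

Iterating `step` until the gradients are small is exactly the pair of update rules above.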

The above calculations work, but I run into trouble when I try to do the same for ##K##. This is my math for taking the partial derivative of ##R## with respect to ##K##:
$$\frac{\partial R}{\partial K}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}(\frac{\partial V_m}{\partial K}X_{1j}+\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial V_d}{\partial K}X_{2j}+2\frac{\partial X_{2j}}{\partial K}V_d)$$

Because ##K## controls ##X_j##, for a small change in ##K## the optimal ##V_d## and ##V_m## might change a bit through the minimization of ##R## by Gradient Descent, but I don't have a relationship for it. I am not sure, but I think ##\frac{\partial V_d}{\partial K} = 0## and ##\frac{\partial V_m}{\partial K} = 0##. Furthermore, for ##X## :
$$\frac{\partial X_n}{\partial K}=\frac{\partial F}{\partial K}\approx\frac{F(H,K+\delta K)-F(H,K)}{\delta K}$$

So, the partial of ##R## becomes :
$$\frac{\partial R}{\partial K}=\sum_{j=0}^N-\frac{2\phi_j}{H_j}(\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial X_{2j}}{\partial K}V_d)$$

and the update step for ##K## :
$$K_{i+1}=K_i-\alpha\frac{\partial R}{\partial K}$$
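A sketch of this ##K## step in NumPy (again a hedged translation, not the MATLAB code from the thread; `F` stands in for whatever forward model produces ##X_1## and ##X_2## from ##H## and ##K##):

```python
import numpy as np

def dR_dK(V0, H, K, Vm, Vd, F, dK=1e-6):
    # F(H, K) -> (X1, X2): placeholder for the forward model X_n = F(H, K)
    X1, X2 = F(H, K)
    X1p, X2p = F(H, K + dK)
    dX1 = (X1p - X1) / dK  # forward difference for dX1/dK
    dX2 = (X2p - X2) / dK  # forward difference for dX2/dK
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    # dR/dK = sum_j -2*phi_j/H_j * (Vm*dX1_j + 2*Vd*dX2_j)
    return np.sum(-2.0 * phi / H * (Vm * dX1 + 2.0 * Vd * dX2))

def step_K(V0, H, K, Vm, Vd, F, alpha=1e-3):
    # one gradient-descent update of K, holding Vm and Vd fixed
    return K - alpha * dR_dK(V0, H, K, Vm, Vd, F)
```

Note that this treats ##V_m## and ##V_d## as constants during the ##K## step, which is exactly the ##\frac{\partial V_d}{\partial K}=\frac{\partial V_m}{\partial K}=0## assumption.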

Anyway, it does not converge correctly when I implement this in MATLAB. I am not sure whether I am wrong about the partials of ##V_d## and ##V_m##, whether the fact that ##V_m## multiplies ##X## has some effect, or whether my math is right and the problem is in my code. Any help is appreciated.
 
I found this rather hard to parse.

It would be customary to use matrix-vector notation at each step and make this a simple exercise in matrix calculus. (More generally I think there's a 3-D tensor here, but if you look at it one data point at a time, via the simplest form of stochastic gradient descent, things simplify a lot, in particular down to matrix-vector operations.)

For instance, I don't know what ##V_{0j}## is and why it has a double index like an array while all the others have single indices. There are similar issues with the ##X##'s: you say they are ##X_n## for ##n \in [1,2]##, but you write ##X_{2j}##, which seems to contradict that.

- - - -

A more general idea: gradient descent can have convergence problems for a lot of different reasons -- you should have the math nailed down long before that. A simple check: compute your gradient at some sampled data points and compare it against small perturbed values, i.e. a finite difference -- if the two don't essentially match, your gradient calculation has problems. You can repeat this with many (random) trials as needed.
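As a concrete sketch of that check (NumPy here, not the thread's MATLAB; `R` and the analytic gradients are taken from the opening post, the rest is illustrative):

```python
import numpy as np

def R(V0, H, X1, X2, Vm, Vd):
    # R = sum_j phi_j^2, with phi_j as defined in the opening post
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    return np.sum(phi ** 2)

def grad_check(V0, H, X1, X2, Vm, Vd, eps=1e-6):
    # analytic gradients, as derived in the opening post
    phi = V0 - (Vm * X1 + 2.0 * Vd * X2) / H
    g = np.array([np.sum(-2.0 * phi * X1 / H),
                  np.sum(-4.0 * phi * X2 / H)])
    # central finite differences in Vm and Vd
    g_fd = np.array([
        (R(V0, H, X1, X2, Vm + eps, Vd) - R(V0, H, X1, X2, Vm - eps, Vd)) / (2 * eps),
        (R(V0, H, X1, X2, Vm, Vd + eps) - R(V0, H, X1, X2, Vm, Vd - eps)) / (2 * eps),
    ])
    # largest relative mismatch; should be tiny if the math is right
    return np.max(np.abs(g - g_fd) / (np.abs(g_fd) + 1e-12))
```

If this returns something like 1e-7 over many random draws of the data and parameters, the analytic gradient is almost certainly right; if it returns something of order one, the math (or its transcription into code) is wrong.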

Note: there are automatic differentiation tools out there that you can use if your goal is to just get gradient descent 'right'. If your goal is to practice the math, that is a different goal.
 
StoneTemplePython said:
I found this rather hard to parse.

It would be customary to use matrix-vector notation at each step and make this a simple exercise in matrix calculus. (More generally I think there's a 3-D tensor here, but if you look at it one data point at a time, via the simplest form of stochastic gradient descent, things simplify a lot, in particular down to matrix-vector operations.)

For instance, I don't know what ##V_{0j}## is and why it has a double index like an array while all the others have single indices. There are similar issues with the ##X##'s: you say they are ##X_n## for ##n \in [1,2]##, but you write ##X_{2j}##, which seems to contradict that.

- - - -

A more general idea: gradient descent can have convergence problems for a lot of different reasons -- you should have the math nailed down long before that. A simple check: compute your gradient at some sampled data points and compare it against small perturbed values, i.e. a finite difference -- if the two don't essentially match, your gradient calculation has problems. You can repeat this with many (random) trials as needed.

Note: there are automatic differentiation tools out there that you can use if your goal is to just get gradient descent 'right'. If your goal is to practice the math, that is a different goal.
There are two ##X## variables and ##N## experiments; ##X_{1j}## is the value of ##X_1## for experiment ##j##, ##j \in [1,N]##.
##V_0## is a given experimental value, which is known, and there are ##N## experiments, so ##V_{0j}##, ##j \in [1,N]##, are all known.
What exactly do you want me to turn into a matrix notation? I'm not sure what you mean, but I am willing to try. I am a math major, so I should learn!
 
fahraynk said:
There are two ##X## variables and ##N## experiments; ##X_{1j}## is the value of ##X_1## for experiment ##j##, ##j \in [1,N]##.
##V_0## is a given experimental value, which is known, and there are ##N## experiments, so ##V_{0j}##, ##j \in [1,N]##, are all known.
What exactly do you want me to turn into a matrix notation? I'm not sure what you mean, but I am willing to try. I am a math major, so I should learn!

It should be immediately clear to people that ##X## is a matrix / array; if you invert the values of ##H##, they form a vector (traditionally written in lower-case bold); I think ##V_d## and ##V_m## are just scalars (Greek letters would be preferred here); and despite the double index, ##V_{0j}## seems to just be a vector (again lower-case bold). Notation is by no means standardized, but after reading your original post twice I still struggled to figure out basic things about the dimensionality of these parameters. This really isn't the main point, though -- the bigger issue, I think, lurks in ##K##.

High level: it looks like you are differentiating with respect to those two values ##V_d## and ##V_m## and applying the chain rule correctly in the process.

Again, the fundamental point: did you compute a finite difference and compare it against your gradient? What were the results? If you don't know how to compute a good finite difference and compare it against your gradient over some sampled data points, that itself tells you you're in trouble -- you don't have a ruler to measure your gradient's accuracy against.
 
StoneTemplePython said:
It should be immediately clear to people that ##X## is a matrix / array; if you invert the values of ##H##, they form a vector (traditionally written in lower-case bold); I think ##V_d## and ##V_m## are just scalars (Greek letters would be preferred here); and despite the double index, ##V_{0j}## seems to just be a vector (again lower-case bold). Notation is by no means standardized, but after reading your original post twice I still struggled to figure out basic things about the dimensionality of these parameters. This really isn't the main point, though -- the bigger issue, I think, lurks in ##K##.

High level: it looks like you are differentiating with respect to those two values ##V_d## and ##V_m## and applying the chain rule correctly in the process.

Again, the fundamental point: did you compute a finite difference and compare it against your gradient? What were the results? If you don't know how to compute a good finite difference and compare it against your gradient over some sampled data points, that itself tells you you're in trouble -- you don't have a ruler to measure your gradient's accuracy against.

I'll try that later today or Thursday and let you know.
So, when you say I applied the chain rule correctly, do you mean you believe my assumption that ##\frac{\partial V_d}{\partial K}=0## and ##\frac{\partial V_m}{\partial K}=0## is correct?
 
fahraynk said:
I'll try that later today or Thursday and let you know.
So, when you say I applied the chain rule correctly, do you mean you believe my assumption that ##\frac{\partial V_d}{\partial K}=0## and ##\frac{\partial V_m}{\partial K}=0## is correct?

Look, your original post says that you already use gradient descent to tune your model with respect to those two ##V##'s. At some point you just need to come out and assert that they are independent of ##K## -- and state the other features of your model.

The very basic idea underlying all this is to state clearly how your model works, then trace the implications for the derivatives, trace the (finite-difference) implications of small perturbations, and compare.
 
