Is my math okay for gradient descent?

In summary, the conversation is about minimizing a function ##R## to find the optimum values of ##K##, ##V_d##, and ##V_m##. The current method uses gradient descent for ##V_d## and ##V_m## and a binary search for ##K##, but the poster wants to eliminate the binary search and use only gradient descent. ##R## involves an unknown independent variable ##K##, which controls the dependent variables ##X_1## and ##X_2## through a function ##F##, while the coefficients ##V_d## and ##V_m## must be found through minimization. Gradient descent updates ##V_d## and ##V_m## correctly, but the update for ##K## does not converge, and the poster is unsure whether the problem lies in the math or the code and is seeking help.
  • #1
fahraynk
I am trying to minimize the function below, ##R##, to find the optimum ##K##, ##V_d##, and ##V_m##.
Currently I minimize over ##V_d## and ##V_m## with gradient descent and find the best ##K## through a binary search but, if possible, I would like to get rid of the binary search and use only gradient descent. I am having trouble, though, and I'm not sure if it's my code or my math, so I am hoping someone can help check my math.

I have ##N## experiments. ##K## is an unknown independent variable that, along with a given ##H_j##, controls my dependent ##X## variables through a function ##X_n=F(H,K);\ n \in \{1,2\}##. ##V_d## and ##V_m## are coefficients within a certain range which must be found through minimization. ##H_j## and ##V_{0j}## are given experimental values.

$$R=\sum_{j=1}^N\Big[V_{0j}-\frac{1}{H_{j}}(V_mX_{1j}+2V_dX_{2j})\Big]^2$$
$$\phi_j=V_{0j}-\frac{1}{H_j}(V_mX_{1j}+2V_dX_{2j})$$

$$\frac{\partial R}{\partial V_m}=\sum_{j=1}^N-\frac{2\phi_j}{H_j}X_{1j}$$

$$\frac{\partial R}{\partial V_d}=\sum_{j=1}^N-\frac{4\phi_j}{H_j}X_{2j}$$

Per gradient descent, I update ##V_m## and ##V_d## using the partial derivatives of ##R## and a learning rate ##\alpha##.
$$V_m^{i+1}=V_m^i-\alpha\frac{\partial R}{\partial V_m}$$
$$V_d^{i+1}=V_d^i-\alpha\frac{\partial R}{\partial V_d}$$
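
In Matlab, one step of these updates looks roughly like the sketch below. It is a minimal sketch, assuming ##V_0## and ##H## are stored as ##N##-by-1 vectors, ##X## as an ##N##-by-2 matrix, and an illustrative learning rate; the names are placeholders for my actual variables.

```matlab
% One gradient-descent step for Vm and Vd (sketch).
% Assumes: V0, H are N-by-1 vectors; X is N-by-2 with X(:,1) = X_1j, X(:,2) = X_2j.
alpha  = 1e-3;                                  % learning rate (illustrative value)
phi    = V0 - (Vm*X(:,1) + 2*Vd*X(:,2))./H;     % residual phi_j, one entry per experiment
dR_dVm = sum(-2*phi.*X(:,1)./H);                % dR/dVm from the formula above
dR_dVd = sum(-4*phi.*X(:,2)./H);                % dR/dVd
Vm     = Vm - alpha*dR_dVm;                     % update steps
Vd     = Vd - alpha*dR_dVd;
```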

The above calculations work, but I run into trouble when I try to do the same for ##K##. This is my math for taking the partial derivative of ##R## with respect to ##K##:
$$\frac{\partial R}{\partial K}=\sum_{j=1}^N-\frac{2\phi_j}{H_j}\left(\frac{\partial V_m}{\partial K}X_{1j}+\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial V_d}{\partial K}X_{2j}+2\frac{\partial X_{2j}}{\partial K}V_d\right)$$

Because ##K## controls the ##X## variables, a small change in ##K## would shift the optimal ##V_d## and ##V_m## found by minimizing ##R## with gradient descent, but I don't have a relationship for that dependence. I am not sure, but I think ##\frac{\partial V_d}{\partial K} = 0## and ##\frac{\partial V_m}{\partial K} = 0##. Furthermore, for ##X##:
$$\frac{\partial X_n}{\partial K}=\frac{\partial X_n}{\partial F}\frac{\partial F}{\partial K}=1\cdot\frac{\partial F}{\partial K}\approx\frac{F(H,K+\delta K)-F(H,K)}{\delta K}$$

So, the partial of ##R## becomes:
$$\frac{\partial R}{\partial K}=\sum_{j=1}^N-\frac{2\phi_j}{H_j}\left(\frac{\partial X_{1j}}{\partial K}V_m+2\frac{\partial X_{2j}}{\partial K}V_d\right)$$

and the update step for ##K##:
$$K^{i+1}=K^i-\alpha\frac{\partial R}{\partial K}$$
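
In code, my ##K## step looks roughly like this sketch, where ##F## stands in for my actual Matlab function computing ##X=F(H,K)## as an ##N##-by-2 matrix and deltaK is an illustrative finite-difference step:

```matlab
% One gradient-descent step for K (sketch); F(H,K) returns [X_1j, X_2j] as N-by-2.
deltaK = 1e-6;                                  % finite-difference step (illustrative)
X      = F(H, K);
dX_dK  = (F(H, K + deltaK) - X) / deltaK;       % N-by-2 forward difference dX/dK
phi    = V0 - (Vm*X(:,1) + 2*Vd*X(:,2))./H;     % residual, as above
dR_dK  = sum(-2*phi.*(Vm*dX_dK(:,1) + 2*Vd*dX_dK(:,2))./H);
K      = K - alpha*dR_dK;                       % update step
```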

Anyway, it does not converge correctly when I implement this in Matlab. I am not sure if I am wrong about the partials of ##V_d## and ##V_m##, if the fact that ##V_m## multiplies ##X## has some effect, or if my math is right and the problem is in my code. Any help is appreciated.
 
  • #2
I found this rather hard to parse.

It would be customary to use matrix-vector notation at each step and make this a simple exercise in matrix calculus. (More generally I think there's a 3-D tensor lurking, but if you look at it one data point at a time, via the simplest form of stochastic gradient descent, things simplify a lot -- in particular down to matrix-vector operations.)

For instance, I don't know what ##V_{0j}## is and why it has a double index like an array while all the others have single indices. There are similar issues with the ##X##'s: you say they are ##X_n## for ##n \in \{1,2\}##, but you write ##X_{2j}##, which seems to contradict that.

- - - -

A more general idea: gradient descent can have convergence problems for a lot of different reasons -- you should have the math bolted down long before worrying about those. A simple check to use: compute your gradient at some sampled data points and compare it against small perturbed values, i.e. a finite difference -- if they don't essentially match, your gradient calculation has problems. You can repeat this with as many (random) trials as needed.
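
For concreteness, such a check in Matlab might look like the sketch below; the function handles R and gradR are placeholders for your loss and your analytic gradient, taking a parameter vector such as theta = [Vm; Vd; K]:

```matlab
% Finite-difference gradient check (sketch). R(theta) returns the scalar loss,
% gradR(theta) the analytic gradient; both names are illustrative placeholders.
theta = randn(3,1);                      % a random sample point
g     = gradR(theta);                    % analytic gradient
h     = 1e-6;                            % perturbation size
g_fd  = zeros(size(theta));
for i = 1:numel(theta)
    e       = zeros(size(theta));  e(i) = h;
    g_fd(i) = (R(theta + e) - R(theta - e)) / (2*h);   % central difference
end
relErr = norm(g - g_fd) / max(norm(g), norm(g_fd))     % should be tiny, ~1e-7 or less
```

Repeat over many random points; if relErr is not consistently tiny, the analytic gradient is wrong.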

Note: there are automatic differentiation tools out there that you can use if your goal is to just get gradient descent 'right'. If your goal is to practice the math, that is a different goal.
 
  • #3
There are 2 ##X## variables and ##N## experiments. ##X_{1j}## is the value of ##X_1## for experiment ##j##, ##j \in [1,N]##.
##V_0## is a given experimental value, which is known, and there are ##N## experiments, so ##V_{0j}##, ##j\in [1,N]##, are all known.
What exactly do you want me to turn into matrix notation? I'm not sure what you mean, but I am willing to try. I am a math major, so I should learn!
 
  • #4
It should be immediately clear to people that ##X## is a matrix / array; if you invert the values of ##H##, they form a vector (traditionally written in lower-case bold); I think ##V_d## and ##V_m## are just scalars (Greek letters are preferred here); and despite the double index, ##V_{0j}## seems to just be a vector (again lower-case bold). Notation is by no means standardized, but after looking at your original post twice I still struggled to figure out basic things about the dimensionality of these parameters. This really isn't the main point, though -- the bigger issue, I think, lurks in ##K##.

High level -- it looks like you are differentiating with respect to those two values ##V_d## and ##V_m## and applying the chain rule correctly in the process.

Again, the fundamental point: did you compute a finite difference and compare it against your gradient? What were the results? If you don't know how to compute a good finite difference and compare it against the gradient over some sampled data points, that tells you you're in trouble -- i.e. you don't have a ruler to measure / estimate your gradient's accuracy against.
 
  • #5
I'll try that later today or Thursday and let you know.
So, when you say I'm applying the chain rule correctly, do you believe my assumption that ##\frac{\partial V_d}{\partial K}=0## and ##\frac{\partial V_m}{\partial K}=0## was correct?
 
  • #6
Look, your original post says that you already use gradient descent to tune your model with respect to those two ##V##'s. At some point you just need to come out and assert that they are independent -- and state the other features of your model.

The very basic idea underlying all this is to clearly state how your model works, then trace the implications of the derivatives and trace the (finite difference) implications of small perturbations, and compare.
 

1. Can I use any type of math for gradient descent?

No. Correct formulas, and a correct implementation of them, are needed for gradient descent to work accurately and efficiently. Incorrect math can produce wrong or unstable results.

2. What type of math is necessary for gradient descent?

Linear algebra, calculus (in particular partial derivatives and the chain rule), and basic optimization techniques are typically used in gradient descent. A strong understanding of these areas is important for implementing it successfully.

3. How can I test if my math is correct for gradient descent?

One way to test your math is to run your implementation on a known data set and compare the results of your gradient descent calculations to the expected results. Comparing your analytic gradient against a finite-difference approximation at sampled points is another standard check. Consulting a mentor or colleague who has experience with gradient descent can also help verify the accuracy of your math.
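
For example, the following Matlab sketch (all names illustrative) fits a line to synthetic data where the answer is known and compares the gradient-descent result against the closed-form least-squares solution:

```matlab
% Sanity check on a known data set: fit y = a*x + b by gradient descent,
% then compare with the closed-form least-squares answer.
x = (0:0.1:10)';  y = 3*x + 2;          % ground truth: a = 3, b = 2
A = [x, ones(size(x))];                 % design matrix
w = [0; 0];  alpha = 1e-3;              % parameters [a; b] and learning rate
for it = 1:1e5
    w = w - alpha * (2/numel(y)) * (A' * (A*w - y));   % mean-squared-error gradient step
end
w_exact = A \ y;                        % closed-form solution
disp([w, w_exact])                      % the two columns should agree closely
```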

4. Can I use alternative methods for gradient descent?

While gradient descent is a widely used and effective method for optimization, there are alternatives, such as Newton-type and quasi-Newton methods. It is important to understand the limitations and potential drawbacks of these alternatives before implementing them.

5. How can I improve my math skills for gradient descent?

Practicing and applying mathematical concepts, as well as seeking out resources such as textbooks, online courses, and tutorials, can help improve your math skills for gradient descent. Collaborating with other experts in the field can also provide valuable insights and feedback.
