Newton-Raphson in Least Squares: How is it used? Cost Function?

  • #1
WWGD
TL;DR Summary
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used.
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used. Just curious: how is it used? I assume it is used to minimize the cost function, but this function was not made explicit. Does anyone know?
Thanks.
 
  • #2
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
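Written out explicitly (a standard statement of the OLS objective, using the notation that appears later in the thread, with ##\mathbf b## the vector of fitted coefficients and ##\hat y_i(\mathbf b)## the fitted values):
$$E(\mathbf b) = \sum_{i=1}^{n} \left( y_i - \hat y_i(\mathbf b) \right)^2$$
and the OLS estimate is the ##\mathbf b## that makes ##E## as small as possible.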
 
  • #3
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
 
  • #4
RPinPA said:
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
I think it may have been used with squared residuals, as Dale mentioned.
 
  • #5
Dale said:
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
 
  • #6
WWGD said:
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
Yes, or you can use any other optimization method, or a linear solver, which is faster. They should all give the same answer to within numerical precision.
 
  • #7
WWGD said:
I think it may have been used with squared residuals, as Dale mentioned.

Yes, that's the definition of least-squares. You write the expression for the sum of the squared residuals in terms of the data values and the unknown parameters. You take the gradient of that sum of squared residuals with respect to the parameters. And you set that gradient equal to 0.

That leads to a linear system, which can be solved in a single step with a linear solver. No need for an iterative method.
 
  • #8
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix, usually called the design matrix:
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
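As an illustration of that recipe, here is a minimal NumPy sketch (the function names, basis functions, and data are made up for the example; in practice a routine like np.linalg.lstsq or a QR factorization is preferred to forming ##\mathbf{A^T A}## explicitly, since it is better conditioned):

```python
import numpy as np

# Minimal sketch of general linear least squares via the normal equations.
# The basis list below gives phi_1(x) = 1 and phi_2(x) = x (simple linear
# regression); any other functions of x can be swapped in for polynomial or
# more general linear-in-the-parameters fits.

def design_matrix(x, basis):
    """Build the n x p matrix A with A[i, j] = basis[j](x[i])."""
    return np.column_stack([phi(x) for phi in basis])

def linear_least_squares(x, y, basis):
    """Return b solving (A^T A) b = A^T y for the given basis functions."""
    A = design_matrix(x, basis)
    # np.linalg.lstsq(A, y, rcond=None) is better conditioned in practice,
    # but solving the normal equations mirrors the derivation above.
    return np.linalg.solve(A.T @ A, A.T @ y)

# Synthetic example: y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
basis = [lambda t: np.ones_like(t), lambda t: t]
b = linear_least_squares(x, y, basis)
print(b)  # roughly [2, 3]
```

With the two basis functions shown this reproduces the usual intercept/slope fit; fitting a polynomial or any other linear combination of functions of ##x## changes only the list passed to design_matrix.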
 
  • #9
RPinPA said:
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix, usually called the design matrix:
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
I understand. But I am using ML and not standard stats. I am not clear on just how ML regression algorithms use iteration, but they do.
 
  • #10
Which Matlab function are you talking about exactly? It still sounds like you're using a nonlinear solver.

The equations I wrote above work out because E is quadratic in the parameters, and therefore its gradient is linear in the parameters, so you get a linear system to solve. I've implemented precisely that method in Matlab many times, since it has a linear solver (the backslash operator) built in. So all you have to do for your parameter estimate is b = (A'*A) \ (A' * y);

In general nonlinear least squares you're still trying to find the zero of the gradient, but it's no longer linear in the parameters. Thus you're trying to find roots of a general nonlinear function. And that's what Newton-Raphson does.

In one dimension, in its simplest form, the iteration to find a zero of ##f(x)## is simply ##x_{i+1} = x_i - f(x_i)/f'(x_i)##. Since our ##f## is the gradient of ##E## with respect to the parameter vector ##\mathbf b##, its derivative generalizes to the Hessian of ##E##, the matrix of second derivatives. Newton-Raphson then becomes, in its more general vectorized form:
$$\mathbf b_{i+1} = \mathbf b_i - \alpha \mathbf H^{-1}(E) \nabla E$$
where in general there's an adaptive stepsize parameter ##\alpha## which grows and shrinks in response to the local properties of the function.

You can apply this to a quadratic function such as arises in linear least squares, but then it usually converges in a single step. The update rule with ##\alpha = 1## leads you to the zero of the gradient in one step.
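To make that last point concrete, here is a small sketch (illustrative names and synthetic data, not anyone's production code) showing that a single Newton step on the quadratic least-squares objective lands exactly on the normal-equation solution:

```python
import numpy as np

def newton_step_ols(A, y, b0):
    """One Newton-Raphson step on E(b) = ||y - A b||^2.

    grad E = 2 (A^T A b - A^T y) and the Hessian is 2 A^T A, so with
    step size alpha = 1 a single step lands on the normal-equation solution.
    """
    grad = 2.0 * (A.T @ A @ b0 - A.T @ y)
    hess = 2.0 * (A.T @ A)
    return b0 - np.linalg.solve(hess, grad)

# Illustrative straight-line data (intercept + slope).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])

b0 = np.zeros(2)                                # arbitrary starting guess
b1 = newton_step_ols(A, y, b0)                  # one Newton step
b_exact = np.linalg.solve(A.T @ A, A.T @ y)     # normal-equation solution
print(np.allclose(b1, b_exact))                 # True: one step suffices
```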
 
  • #11
We're using Matplotlib and Graphlab. I realized the iterations are being done over all partitions of the data points into test data and training data, and we stop when the loss function is below a threshold. But I still don't see why we iterate. I think we are tweaking the coefficients to reduce MSE, but not sure how.
 
  • #12
I haven't done the analysis but I suspect that any or all of the following may be true
  1. the linear solver has an ##O(n^2)## memory requirement whereas the iterative method is ##O(n)## and is therefore more scalable
  2. the linear solver is only applicable to linear regression whereas an iterative method can be applied to any curve and is therefore more flexible
  3. Newton-Raphson is very efficient for quadratic problems, so there is little if any computational penalty versus forming and solving the linear system (and for a non-linear fit it can easily be extended to higher-order Householder methods if desired)
 
  • #13
WWGD said:
I think we are tweaking the coefficients to reduce MSE, but not sure how.
Whichever method you use to solve for the coefficients, the solution minimizes the Mean Squared Error (that is the definition of Ordinary Least Squares regression, as RPinPA pointed out in #7), so I'm not sure either. You could be tweaking them to achieve something else? Or perhaps this is NOT Ordinary Least Squares; you don't seem sure in your original post.
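For what it's worth, here is a generic sketch of what iteratively "tweaking the coefficients to reduce MSE" usually looks like: plain batch gradient descent on the OLS loss. This is only an illustration with made-up data and a made-up stopping rule, not how Graphlab (or any particular library) is actually implemented, but for OLS it approaches the same answer as the direct linear solve:

```python
import numpy as np

def ols_gradient_descent(A, y, lr=0.1, tol=1e-10, max_iter=100_000):
    """Generic sketch: nudge the coefficients down the MSE gradient until the
    loss stops improving.  This is NOT any particular library's algorithm; it
    only illustrates iteratively 'tweaking the coefficients to reduce MSE'."""
    n, p = A.shape
    b = np.zeros(p)
    prev_mse = np.inf
    for _ in range(max_iter):
        residual = A @ b - y
        mse = residual @ residual / n
        if prev_mse - mse < tol:                 # improvement below threshold
            break
        b -= lr * (2.0 / n) * (A.T @ residual)   # gradient of the MSE
        prev_mse = mse
    return b

# Same kind of illustrative data as in the earlier sketches.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])
print(ols_gradient_descent(A, y))            # iterative answer
print(np.linalg.solve(A.T @ A, A.T @ y))     # direct answer, essentially equal
```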
 
  • #14
If there are polynomial terms of higher degree than squared, or you have interaction terms, then I think simple linear algebra techniques will not work. A numerical (iterative) method may converge to a solution, though.
 

1. What is the Newton-Raphson method?

The Newton-Raphson method is an iterative algorithm used to find the roots of a function. It involves using the derivative of the function to approximate the location of the root and then repeating the process until a desired level of accuracy is achieved.
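In its standard one-dimensional form, starting from an initial guess ##x_0##, the iteration for a root of ##f## is
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)},$$
repeated until successive iterates agree to the desired tolerance.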

2. How is the Newton-Raphson method used in least squares?

In least squares, the Newton-Raphson method can be used to find the minimum of the cost function, which measures the difference between the model's predicted values and the actual data. Minimizing the cost function yields the estimated parameters of the model. For linear least squares the minimum can also be found directly by solving a linear system (the normal equations), so Newton-Raphson is mainly needed for nonlinear least squares.

3. What is the cost function in the Newton-Raphson method?

The cost function in the Newton-Raphson method is a measure of the difference between the predicted values of a model and the actual values of the data. It is typically represented as the sum of squared errors and is minimized to find the best fit parameters for the model.

4. How does the Newton-Raphson method improve upon the least squares method?

The Newton-Raphson method does not replace least squares; it is one way to carry out the minimization of the least-squares cost function. By using the first and second derivatives of the cost function to locate the minimum, rather than searching by trial and error, it can estimate the model parameters efficiently, which is especially useful for nonlinear models that have no closed-form solution.

5. What are the limitations of using Newton-Raphson in least squares?

One limitation of using Newton-Raphson in least squares is that it requires the cost function to be differentiable, which may not always be the case. Additionally, the method may converge to a local minimum rather than the global minimum, resulting in a suboptimal solution.
