Newton-Raphson in Least Squares: How is it used? Cost Function?

Click For Summary

Discussion Overview

The discussion revolves around the use of Newton's method in the context of least squares regression, specifically questioning its application in minimizing the cost function associated with ordinary least squares (OLS) and whether it is necessary given the linear nature of the problem.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant inquires about the use of Newton's method in minimizing the cost function for a data set analyzed using OLS.
  • Another participant states that the cost function for OLS is the sum of squared residuals, which is a standard definition.
  • Some participants express surprise that Newton's method is used, suggesting that OLS can be solved directly using linear solvers without iteration.
  • There is a suggestion that Newton's method may be applicable in the context of nonlinear least squares, which is a broader optimization problem.
  • One participant confirms that Newton's method can be used to minimize the sum of squared residuals, but also notes that other optimization methods or linear solvers could yield the same results more efficiently.
  • Another participant elaborates on the mathematical formulation of the cost function and the conditions under which the gradient is set to zero, leading to a linear system that can be solved directly.
  • One participant mentions using machine learning (ML) regression algorithms, indicating uncertainty about how these algorithms incorporate iteration in their processes.
  • Another participant describes the iterative nature of nonlinear least squares and how Newton-Raphson is applied in this context, including the update rules for parameter estimation.
  • There is a mention of using tools like Matlab and Graphlab, with a focus on the iterative process over data partitions in ML contexts.

Areas of Agreement / Disagreement

Participants express differing views on the necessity and appropriateness of using Newton's method for OLS, with some advocating for its use in nonlinear contexts while others argue that it is unnecessary for linear least squares. The discussion remains unresolved regarding the specific application of iteration in ML regression algorithms.

Contextual Notes

Some participants highlight the potential for confusion regarding the application of iterative methods in linear versus nonlinear least squares, as well as the computational efficiency of different approaches. There are references to specific implementations in software that may have implications for memory requirements and performance.

WWGD
Science Advisor
Homework Helper
Messages
7,806
Reaction score
13,120
TL;DR
I just went over analysis of a data set that was analzed using Linear Regression (OLS, I believe) and I saw Newton's method was used.
I just went over analysis of a data set that was analzed using Linear Regression (OLS, I believe) and I saw Newton's method was used. Just curious, how is it used? I assume to minimize the cost function, but this function was not made explicit. Anyone know?
Thanks.
 
Physics news on Phys.org
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
 
  • Like
Likes   Reactions: WWGD
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
 
RPinPA said:
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
I think it may have been used with squared residuals, as Dale mentioned.
 
Dale said:
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
Thanks.So, to verify, we use Newton's to minimize the sum of squared residuals?
 
WWGD said:
Thanks.So, to verify, we use Newton's to minimize the sum of squared residuals?
Yes, or you can use any other optimization method also or a linear solver which is faster. They should all get the same answer to within numerical precision.
 
  • Like
Likes   Reactions: WWGD
WWGD said:
I think it may have been used with squared residuals, as Dale mentioned.

Yes, that's the definition of least-squares. You write the expression for the sum of the squared residuals in terms of the data values and the unknown parameters. You take the gradient of that sum of squared residuals with respect to the parameters. And you set that gradient equal to 0.

That leads to a linear system, which can be solved in a single step with a linear solver. No need for an iterative method.
 
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1)## and then define a ##n \times p## matrix (I forget the name for this matrix)
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
 
RPinPA said:
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function E = ##\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1)## and then define a ##n \times p## matrix (I forget the name for this matrix)
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
I understand. But I am using ML and not standard stats. I am not clear on just how ML regression algorithms use iteration, but they do.
 
  • #10
Which Matlab function are you talking about exactly? It still sounds like you're using a nonlinear solver.

The equations I wrote above work out because E is quadratic in the parameters, and therefore its gradient is linear in the parameters. Thus you get a linear system to solve. I've implemented precisely that method in Matlab on numerous equations since it does have a solver (the backslash operator) built in. So all you have to do for your parameter estimate is b = (A'*A) \ (A' * y);

In general nonlinear least squares you're still trying to find the zero of the gradient, but it's no longer linear in the parameters. Thus you're trying to find roots of a general nonlinear function. And that's what Newton-Raphson does.

In one dimension in its simplest form, the iteration to find the zero of ##f(x)## is simply: ##x_{i+1} = x_i - f(x)/f'(x)##. Since our vector ##f(x)## is the gradient of ##E## with respect to the parameter vector ##\mathbf b## its derivative generalizes to the Hessian of ##E##, the second-derivative matrix. And Newton-Raphson becomes in its more general vectorized form:
$$\mathbf b_{i+1} = \mathbf b_i - \alpha \mathbf H^{-1}(E) \nabla E$$
where in general there's an adaptive stepsize parameter ##\alpha## which grows and shrinks in response to the local properties of the function.

You can apply this to a quadratic function such as arises in linear least squares, but then it usually converges in a single step. The update rule with ##\alpha = 1## leads you to the zero of the gradient in one step.
 
Last edited:
  • Like
Likes   Reactions: pbuk
  • #11
We're using Matplotlib and Graphlab. I realized the iterations are being done over all partitions of data points into test data and training data and we stop when loss function is below threshold. But I still don't see why we iterate. I think we are tweaking the coefficients to reduce MSE, but not sure how.
 
  • #12
I haven't done the analysis but I suspect that any or all of the following may be true
  1. the linear solver has a memory requirement ## O(n^2) ## whereas the iterative method is ## O(n) ## and is therefore more scalable
  2. the linear solver is only applicable to linear regression whereas an iterative method can be applied to any curve and is therefore more flexible
  3. Newton-Raphson is very efficient for solving quadratic systems so there is little if any computational penalty vs the linear transformation and solution (and for a non-linear fit can easily be extended to higher order Householder's method if desired)
 
  • #13
WWGD said:
I think we are tweaking the coefficients to reduce MSE, but not sure how.
Whichever method you are using to solve for the coefficients, the solution minimizes the Mean Squared Error (that is the definition of Ordinary Least Squares regression as RPinPA pointed out in #7) so I'm not sure either. You could be tweaking them to achieve something else? Or perhaps this is NOT Ordinary Least Squares, you don't seem sure in your original post?
 
  • #14
If there are higher levels of polynomial than just squared, or you have interaction terms, then I think simple linear algebra techniques will not work. A numerical (iterative) method may converge to a solution, though.
 

Similar threads

  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 23 ·
Replies
23
Views
4K
  • · Replies 2 ·
Replies
2
Views
2K
Replies
3
Views
3K
  • · Replies 11 ·
Replies
11
Views
4K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K