I Newton-Raphson in Least Squares: How is it used? Cost Function?

AI Thread Summary
The discussion centers on the application of Newton-Raphson in the context of least squares regression, particularly questioning its necessity for ordinary least squares (OLS) versus nonlinear least squares. It is clarified that OLS minimizes the sum of squared residuals, typically solvable via linear methods without iteration. However, Newton-Raphson is relevant for nonlinear least squares, where the cost function is not linear in parameters, necessitating iterative solutions. The conversation also touches on the efficiency and scalability of iterative methods compared to linear solvers, particularly in machine learning contexts where model complexity may increase. Ultimately, the use of Newton-Raphson is confirmed for nonlinear problems, while OLS can be efficiently solved using direct linear algebra techniques.
WWGD
TL;DR Summary
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used.
I just went over analysis of a data set that was analyzed using Linear Regression (OLS, I believe) and I saw Newton's method was used. Just curious: how is it used? I assume it is used to minimize the cost function, but this function was not made explicit. Does anyone know?
Thanks.
 
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
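In symbols, for data points ##(x_i, y_i)## and fitted values ##\hat y_i## (the same notation used in the derivation later in this thread), that cost is
$$E = \sum_{i=1}^{n} \left( y_i - \hat y_i \right)^2,$$
and OLS chooses the parameter values that minimize ##E##.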
 
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
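For concreteness, here is a minimal NumPy sketch of that direct approach (toy data, purely illustrative; not from the data set in question):

Code:
import numpy as np

# Toy data: y is roughly 2 + 3x plus noise (made-up values, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)

# Design matrix: a column of ones (intercept) and the x values
A = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (A^T A) b = A^T y with a single direct linear solve;
# no Newton-Raphson iteration needed
b = np.linalg.solve(A.T @ A, A.T @ y)
print(b)  # approximately [2, 3]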
 
RPinPA said:
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
I think it may have been used with squared residuals, as Dale mentioned.
 
Dale said:
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
 
WWGD said:
Thanks. So, to verify: we use Newton's method to minimize the sum of squared residuals?
Yes, or you can use any other optimization method, or a linear solver, which is faster. They should all give the same answer to within numerical precision.
 
WWGD said:
I think it may have been used with squared residuals, as Dale mentioned.

Yes, that's the definition of least-squares. You write the expression for the sum of the squared residuals in terms of the data values and the unknown parameters. You take the gradient of that sum of squared residuals with respect to the parameters. And you set that gradient equal to 0.

That leads to a linear system, which can be solved in a single step with a linear solver. No need for an iterative method.
 
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function ##E = \sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix (often called the design matrix)
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
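As a concrete illustration, here is a minimal NumPy sketch of that recipe, with a hypothetical helper name and toy data (illustrative only, not code from the analysis being discussed):

Code:
import numpy as np

def linear_least_squares(x, y, basis):
    # Build the design matrix A with columns phi_j(x), then solve the
    # normal equations (A^T A) b = A^T y with a direct linear solver.
    A = np.column_stack([phi(x) for phi in basis])
    return np.linalg.solve(A.T @ A, A.T @ y)

# Example: quadratic fit with phi_1 = 1, phi_2 = x, phi_3 = x^2 (toy data)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2
b = linear_least_squares(x, y, [np.ones_like, lambda t: t, lambda t: t**2])
print(b)  # approximately [1, -2, 0.5]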
 
RPinPA said:
General linear least squares:
Let ##\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)##

For standard linear regression ##p = 2##, ##\phi_1(x) = 1## and ##\phi_2(x) = x## but this generalizes easily to polynomials or arbitrary linear combinations of functions of ##x##.

Define the cost function ##E = \sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2## where the sum is over ##n## data points.

We want the vector ##\mathbf b = (a_1, a_2, ... , a_p)^T## that minimizes E.

In matrix form we can define ##\mathbf x## and ##\mathbf y## as our vectors of data values (so they are ##n \times 1##) and then define an ##n \times p## matrix (often called the design matrix)
$$\mathbf A = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\
\vdots \\
\phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n)
\end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your ##x## values.

Then ##\mathbf {\hat y} = \mathbf {Ab}## and ##E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})## ##= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}## ##= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}##

We want ##\nabla_b E = \mathbf 0##, the gradient with respect to vector ##\mathbf b## to be 0.

That works out to the equation ##\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0## although you might have to do some work to convince yourself of that. And so the optimum ##\mathbf b## is the solution of the ##p \times p## linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix ##\mathbf A##, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
I understand. But I am using ML and not standard stats. I am not clear on just how ML regression algorithms use iteration, but they do.
 
  • #10
Which Matlab function are you talking about exactly? It still sounds like you're using a nonlinear solver.

The equations I wrote above work out because E is quadratic in the parameters, and therefore its gradient is linear in the parameters. Thus you get a linear system to solve. I've implemented precisely that method in Matlab on numerous occasions, since it has a linear solver (the backslash operator) built in. So all you have to do for your parameter estimate is b = (A'*A) \ (A' * y);

In general nonlinear least squares you're still trying to find the zero of the gradient, but it's no longer linear in the parameters. Thus you're trying to find roots of a general nonlinear function. And that's what Newton-Raphson does.

In one dimension, in its simplest form, the iteration to find a zero of ##f(x)## is simply ##x_{i+1} = x_i - f(x_i)/f'(x_i)##. Here the role of ##f## is played by the gradient of ##E## with respect to the parameter vector ##\mathbf b##, so the derivative ##f'## generalizes to the Hessian of ##E##, the second-derivative matrix. Newton-Raphson then becomes, in its more general vectorized form:
$$\mathbf b_{i+1} = \mathbf b_i - \alpha \mathbf H^{-1}(E) \nabla E$$
where in general there's an adaptive stepsize parameter ##\alpha## which grows and shrinks in response to the local properties of the function.

You can apply this to a quadratic cost such as arises in linear least squares, but then the update with ##\alpha = 1## lands exactly on the zero of the gradient, so it converges in a single step.
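To see that single-step behaviour, here is a small NumPy sketch (toy data, illustrative only) applying the Newton update with ##\alpha = 1## to the quadratic OLS cost:

Code:
import numpy as np

# Toy straight-line data: y is roughly 1.5 + 0.7x plus noise
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = 1.5 + 0.7 * x + 0.05 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])

# For E(b) = ||y - Ab||^2: gradient = 2 A^T (Ab - y), Hessian = 2 A^T A (constant)
b = np.zeros(2)          # arbitrary starting guess
H = 2.0 * A.T @ A
for _ in range(3):
    grad = 2.0 * A.T @ (A @ b - y)
    b = b - np.linalg.solve(H, grad)  # Newton step with alpha = 1
    print(b)             # all three printouts agree: it converged on the first step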
 
  • #11
We're using Matplotlib and Graphlab. I realized the iterations are being done over all partitions of the data points into test data and training data, and we stop when the loss function is below a threshold. But I still don't see why we iterate. I think we are tweaking the coefficients to reduce the MSE, but I am not sure how.
 
  • #12
I haven't done the analysis but I suspect that any or all of the following may be true
  1. the linear solver has a memory requirement ## O(n^2) ## whereas the iterative method is ## O(n) ## and is therefore more scalable (see the sketch after this list)
  2. the linear solver is only applicable to linear regression whereas an iterative method can be applied to any curve and is therefore more flexible
  3. Newton-Raphson is very efficient for solving quadratic systems so there is little if any computational penalty vs the linear transformation and solution (and for a non-linear fit can easily be extended to higher order Householder's method if desired)
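For reference, here is a toy sketch of the kind of iterative "tweak the coefficients to reduce the MSE" loop an ML library might run (plain gradient descent on a straight-line fit; purely illustrative, not Graphlab's actual algorithm):

Code:
import numpy as np

# Made-up data: y is roughly 4 + 2x plus noise
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
y = 4.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)

b = np.zeros(2)   # coefficients [intercept, slope], start at zero
lr = 0.5          # learning rate (step size)
for step in range(2000):
    resid = b[0] + b[1] * x - y
    grad = np.array([resid.mean(), (resid * x).mean()])  # gradient of MSE/2
    b -= lr * grad                                       # tweak the coefficients
    if grad @ grad < 1e-12:                              # stop at a small-gradient threshold
        break
print(b)  # approaches the OLS answer, roughly [4, 2]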
 
  • #13
WWGD said:
I think we are tweaking the coefficients to reduce MSE, but not sure how.
Whichever method you are using to solve for the coefficients, the solution minimizes the Mean Squared Error (that is the definition of Ordinary Least Squares regression, as RPinPA pointed out in #7), so I'm not sure either. You could be tweaking them to achieve something else? Or perhaps this is NOT Ordinary Least Squares; you didn't seem sure in your original post.
 
  • #14
If there are higher-order polynomial terms than just squared ones, or you have interaction terms, then I think simple linear algebra techniques will not work. A numerical (iterative) method may still converge to a solution, though.
 