# I Newton-Raphson in Least Squares: How is it used? Cost Function?

#### WWGD

Gold Member
Summary
I just went over analysis of a data set that was analzed using Linear Regression (OLS, I believe) and I saw Newton's method was used.
I just went over analysis of a data set that was analzed using Linear Regression (OLS, I believe) and I saw Newton's method was used. Just curious, how is it used? I assume to minimize the cost function, but this function was not made explicit. Anyone know?
Thanks.

Related Set Theory, Logic, Probability, Statistics News on Phys.org

#### Dale

Mentor
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.

#### RPinPA

Homework Helper
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.

#### WWGD

Gold Member
That's a little surprising since the equations of linear least squares are a linear system and can be solved with a linear solver, using Gaussian elimination for instance.

No need for an iterative method like Newton-Raphson. Perhaps this was nonlinear least squares? That's a more general nonlinear optimization problem.
I think it may have been used with squared residuals, as Dale mentioned.

#### WWGD

Gold Member
The cost function for OLS is always the sum of the squared residuals. That is what ordinary least squares (OLS) means.
Thanks.So, to verify, we use Newton's to minimize the sum of squared residuals?

#### Dale

Mentor
Thanks.So, to verify, we use Newton's to minimize the sum of squared residuals?
Yes, or you can use any other optimization method also or a linear solver which is faster. They should all get the same answer to within numerical precision.

#### RPinPA

Homework Helper
I think it may have been used with squared residuals, as Dale mentioned.
Yes, that's the definition of least-squares. You write the expression for the sum of the squared residuals in terms of the data values and the unknown parameters. You take the gradient of that sum of squared residuals with respect to the parameters. And you set that gradient equal to 0.

That leads to a linear system, which can be solved in a single step with a linear solver. No need for an iterative method.

#### RPinPA

Homework Helper
General linear least squares:
Let $\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)$

For standard linear regression $p = 2$, $\phi_1(x) = 1$ and $\phi_2(x) = x$ but this generalizes easily to polynomials or arbitrary linear combinations of functions of $x$.

Define the cost function E = $\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2$ where the sum is over $n$ data points.

We want the vector $\mathbf b = (a_1, a_2, ... , a_p)^T$ that minimizes E.

In matrix form we can define $\mathbf x$ and $\mathbf y$ as our vectors of data values (so they are $n \times 1)$ and then define a $n \times p$ matrix (I forget the name for this matrix)
$$\mathbf A = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\ \vdots \\ \phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n) \end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your $x$ values.

Then $\mathbf {\hat y} = \mathbf {Ab}$ and $E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})$ $= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}$ $= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}$

We want $\nabla_b E = \mathbf 0$, the gradient with respect to vector $\mathbf b$ to be 0.

That works out to the equation $\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0$ although you might have to do some work to convince yourself of that. And so the optimum $\mathbf b$ is the solution of the $p \times p$ linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix $\mathbf A$, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.

#### WWGD

Gold Member
General linear least squares:
Let $\hat y = a_1 \phi_1(x) + a_2 \phi_2(x) + ... + a_p \phi_p(x)$

For standard linear regression $p = 2$, $\phi_1(x) = 1$ and $\phi_2(x) = x$ but this generalizes easily to polynomials or arbitrary linear combinations of functions of $x$.

Define the cost function E = $\sum_i (y_i - \hat y(x_i))^2 = \sum_i\left (y_i - a_1 \phi_1(x_i) - ... - a_p \phi_p(x_i) \right )^2$ where the sum is over $n$ data points.

We want the vector $\mathbf b = (a_1, a_2, ... , a_p)^T$ that minimizes E.

In matrix form we can define $\mathbf x$ and $\mathbf y$ as our vectors of data values (so they are $n \times 1)$ and then define a $n \times p$ matrix (I forget the name for this matrix)
$$\mathbf A = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & ... & \phi_p(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & ... & \phi_p(x_2) \\ \vdots \\ \phi_1(x_n) & \phi_2(x_n) & ... & \phi_p(x_n) \end{pmatrix}$$

and again for simple linear regression, the first column is all 1's and the second column is your $x$ values.

Then $\mathbf {\hat y} = \mathbf {Ab}$ and $E = (\mathbf y - \mathbf {\hat y})^T (\mathbf y - \mathbf {\hat y})$ $= \mathbf {y^T y + \hat y^T \hat y - 2 y^T \hat y}$ $= \mathbf{ y^T y + b^T (A^T A) b - 2 (y^T A) b}$

We want $\nabla_b E = \mathbf 0$, the gradient with respect to vector $\mathbf b$ to be 0.

That works out to the equation $\mathbf { 2A^TA b - 2(A^T y)} = \mathbf 0$ although you might have to do some work to convince yourself of that. And so the optimum $\mathbf b$ is the solution of the $p \times p$ linear system

$$\mathbf {(A^TA)b} = \mathbf {A^T y}$$

And that's how you implement general linear least squares in systems that have a linear solver. You define the matrix $\mathbf A$, calculate the coefficient matrix and right-hand side of that system, and feed them to your solver.
I understand. But I am using ML and not standard stats. I am not clear on just how ML regression algorithms use iteration, but they do.

#### RPinPA

Homework Helper
Which Matlab function are you talking about exactly? It still sounds like you're using a nonlinear solver.

The equations I wrote above work out because E is quadratic in the parameters, and therefore its gradient is linear in the parameters. Thus you get a linear system to solve. I've implemented precisely that method in Matlab on numerous equations since it does have a solver (the backslash operator) built in. So all you have to do for your parameter estimate is b = (A'*A) \ (A' * y);

In general nonlinear least squares you're still trying to find the zero of the gradient, but it's no longer linear in the parameters. Thus you're trying to find roots of a general nonlinear function. And that's what Newton-Raphson does.

In one dimension in its simplest form, the iteration to find the zero of $f(x)$ is simply: $x_{i+1} = x_i - f(x)/f'(x)$. Since our vector $f(x)$ is the gradient of $E$ with respect to the parameter vector $\mathbf b$ its derivative generalizes to the Hessian of $E$, the second-derivative matrix. And Newton-Raphson becomes in its more general vectorized form:
$$\mathbf b_{i+1} = \mathbf b_i - \alpha \mathbf H^{-1}(E) \nabla E$$
where in general there's an adaptive stepsize parameter $\alpha$ which grows and shrinks in response to the local properties of the function.

You can apply this to a quadratic function such as arises in linear least squares, but then it usually converges in a single step. The update rule with $\alpha = 1$ leads you to the zero of the gradient in one step.

Last edited:

#### WWGD

Gold Member
We're using Matplotlib and Graphlab. I realized the iterations are being done over all partitions of data points into test data and training data and we stop when loss function is below threshold. But I still don't see why we iterate. I think we are tweaking the coefficients to reduce MSE, but not sure how.

#### pbuk

I haven't done the analysis but I suspect that any or all of the following may be true
1. the linear solver has a memory requirement $O(n^2)$ whereas the iterative method is $O(n)$ and is therefore more scalable
2. the linear solver is only applicable to linear regression whereas an iterative method can be applied to any curve and is therefore more flexible
3. Newton-Raphson is very efficient for solving quadratic systems so there is little if any computational penalty vs the linear transformation and solution (and for a non-linear fit can easily be extended to higher order Householder's method if desired)

#### pbuk

I think we are tweaking the coefficients to reduce MSE, but not sure how.
Whichever method you are using to solve for the coefficients, the solution minimizes the Mean Squared Error (that is the definition of Ordinary Least Squares regression as RPinPA pointed out in #7) so I'm not sure either. You could be tweaking them to achieve something else? Or perhaps this is NOT Ordinary Least Squares, you don't seem sure in your original post?

#### scottdave

Homework Helper
Gold Member
If there are higher levels of polynomial than just squared, or you have interaction terms, then I think simple linear algebra techniques will not work. A numerical (iterative) method may converge to a solution, though.

"Newton-Raphson in Least Squares: How is it used? Cost Function?"

### Physics Forums Values

We Value Quality
• Topics based on mainstream science
• Proper English grammar and spelling
We Value Civility
• Positive and compassionate attitudes
• Patience while debating
We Value Productivity
• Disciplined to remain on-topic
• Recognition of own weaknesses
• Solo and co-op problem solving