MHB Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

AI Thread Summary
Feature normalization is a standard pre-processing step for multivariate linear regression fitted by gradient descent: each feature column is centered by subtracting its mean and scaled by its sample standard deviation, which markedly improves convergence. To predict after normalization, the input location must be transformed in the same way; the prediction is then the dot product of the transformed input with the fitted parameter vector, and the result comes out directly in the original engineering units, because a theta vector fitted against untransformed targets already carries the necessary scaling.
Ackbach
Cross-posted on SE.DS Beta.

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform to $\tilde{X}$, where, on a column-by-column basis,
$$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$
Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of a column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.
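
In NumPy, that pre-processing step might look like the sketch below; the function name `normalize_features` and the convention of returning `mu` and `sigma` for later use are my own choices, not from the thread:

```python
import numpy as np

def normalize_features(X):
    """Center each column of X by its mean, scale by its sample standard
    deviation, and prepend a column of ones for the constant offset."""
    mu = X.mean(axis=0)                # column means (x-bar)
    sigma = X.std(axis=0, ddof=1)      # sample standard deviations (s_X)
    X_norm = (X - mu) / sigma          # column-by-column normalization
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X_norm]), mu, sigma
```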

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!
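
For concreteness, the gradient-descent loop the question refers to might look like this minimal sketch; the learning rate `alpha` and iteration count are illustrative placeholders, not values from the thread:

```python
import numpy as np

def gradient_descent(X_norm, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for least squares on the ones-augmented,
    normalized feature matrix X_norm."""
    m, n = X_norm.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residuals = X_norm @ theta - y       # shape (m,)
        grad = X_norm.T @ residuals / m      # gradient of (1/2m)||X theta - y||^2
        theta -= alpha * grad
    return theta
```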
 
If I'm not mistaken, you're trying to find $\theta$, such that $X\theta$ is as close as possible to some $y$.
That is, find the $\theta$ that minimizes $\|X\theta - y\|$.
Afterwards, the prediction is $\hat y = X\theta$.
Is that correct?

If so, then after normalization we find $\tilde \theta$ such that $\tilde X\tilde\theta$ is as close as possible to the similarly normalized $\tilde y = \frac{y-\bar y}{s_y}$, yes?
In that case the prediction, back in engineering units, becomes:
$$\hat y = \bar y + s_y\,\tilde X\tilde\theta.$$
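
A sketch of that back-transformation, assuming the targets were normalized the same way as the features; the names `y_bar` and `s_y` stand for $\bar y$ and $s_y$, and the function name is hypothetical:

```python
import numpy as np

def predict_denormalized(X_tilde, theta_tilde, y_bar, s_y):
    """Undo the target normalization: if y was trained as (y - y_bar)/s_y,
    the prediction in engineering units is y_bar + s_y * (X_tilde @ theta_tilde)."""
    return y_bar + s_y * (X_tilde @ theta_tilde)
```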
 
Actually, I just learned the answer (see the SE.DS Beta link): you transform the prediction location $x$ in precisely the same way you did for the columns of $X$, but component-wise. So you do this:

1. From each element of $x$, subtract the mean of the corresponding column of $X$.
2. Divide the result by the standard deviation of the corresponding column of $X$.
3. Prepend a $1$ to the $x$ vector to allow for the bias. Call the result $\tilde{x}$.
4. Perform $\tilde{x}\cdot\theta$, which is the prediction value.

As it turns out, if you do feature normalization, then the $\theta$ vector DOES contain all the scaling and engineering units you need. And that actually makes sense, because you're not doing any transform to the outputs of the training data.
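
In code, those four steps might look like the following sketch, where `mu` and `sigma` are the per-column means and standard deviations saved from training and the function name is my own:

```python
import numpy as np

def predict_at(x, theta, mu, sigma):
    """Predict at a raw location x: subtract the training-column means,
    divide by the training-column standard deviations, prepend a 1 for
    the bias, and dot with theta."""
    x_norm = (x - mu) / sigma                   # steps 1-2
    x_tilde = np.concatenate(([1.0], x_norm))   # step 3
    return x_tilde @ theta                      # step 4: already in engineering units
```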
 