Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

SUMMARY

This discussion focuses on the implementation of feature normalization/scaling in linear regression using gradient descent, specifically in the multivariate case. The feature matrix $X$ is transformed column-by-column to $\tilde{X}$ via $\tilde{X}=\frac{X-\bar{X}}{s_{X}}$, where $\bar{X}$ is the column mean and $s_{X}$ is the column sample standard deviation. To predict after normalization, the query location $x$ is transformed in the same manner; if the targets were also normalized, the prediction in engineering units is $\hat{y} = \bar{y} + \tilde{X}\tilde{\theta}\,s_{y}$, while if the targets were left untouched, $\tilde{x}\cdot\theta$ is already in engineering units. In either case, the $\theta$ vector produced by gradient descent on the normalized features retains the scaling and units needed for accurate predictions.

PREREQUISITES
  • Understanding of linear regression and gradient descent algorithms
  • Familiarity with feature normalization techniques
  • Knowledge of matrix operations and vector notation
  • Basic statistics, including mean and standard deviation calculations
NEXT STEPS
  • Study the implementation of feature normalization in Python using libraries like NumPy or scikit-learn
  • Explore the differences between gradient descent and the normal equation for linear regression
  • Learn about the implications of feature scaling on model performance and convergence rates
  • Investigate advanced optimization techniques for large datasets in linear regression
USEFUL FOR

Data scientists, machine learning practitioners, and statisticians who are implementing linear regression models and seeking to optimize their predictions through effective feature scaling techniques.

Ackbach
Cross-posted on SE.DS Beta.

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform to $\tilde{X}$, where, on a column-by-column basis,
$$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$
Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of a column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.
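In NumPy, that pre-processing step might look like the following rough sketch (the data and variable names are purely illustrative):

```python
import numpy as np

# X: (n_samples, n_features) raw feature matrix, features in columns
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

X_mean = X.mean(axis=0)         # column means, \bar{X}
X_std = X.std(axis=0, ddof=1)   # column sample standard deviations, s_X

X_tilde = (X - X_mean) / X_std  # normalized features

# Prepend a column of 1's to allow for the constant offset in theta
X_tilde = np.hstack([np.ones((X_tilde.shape[0], 1)), X_tilde])
```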

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!
 
If I'm not mistaken, you're trying to find $\theta$, such that $X\theta$ is as close as possible to some $y$.
That is, find the $\theta$ that minimizes $\|X\theta - y\|$.
Afterwards, the prediction is $\hat y = X\theta$.
Is that correct?

If so, then after normalization we find $\tilde \theta$ such that $\tilde X\tilde\theta$ is as close as possible to the likewise-normalized $\tilde y = \frac{y-\bar y}{s_y}$, yes?
In that case the prediction becomes:
$$\hat y = \bar y + \tilde X\tilde\theta s_y$$
which is in engineering units.
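As a rough NumPy sketch of that version (assuming the targets were normalized with their own mean `y_mean` and sample standard deviation `y_std`, and `theta_tilde` was fit by gradient descent on the normalized problem):

```python
import numpy as np

def predict_denormalized(x_new, X_mean, X_std, y_mean, y_std, theta_tilde):
    """Prediction in engineering units when both X and y were normalized."""
    x_tilde = (x_new - X_mean) / X_std          # same transform as the columns of X
    x_tilde = np.concatenate(([1.0], x_tilde))  # prepend a 1 for the bias term
    return y_mean + (x_tilde @ theta_tilde) * y_std
```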
 
Actually, I just learned the answer (see the SE.DS Beta link): you transform the prediction location $x$ in precisely the same way you did for the columns of $X$, but component-wise. So you do this:

1. From each element of $x$, you subtract the mean of the corresponding column of $X$.
2. Divide the result by the standard deviation of the corresponding column of $X$.
3. Prepend a $1$ to the $x$ vector to allow for the bias. Call the result $\tilde{x}$.
4. Perform $\tilde{x}\cdot\theta$, which is the prediction value.

As it turns out, if you do feature normalization, then the $\theta$ vector DOES contain all the scaling and engineering units you need. And that actually makes sense, because you're not doing any transform to the outputs of the training data.
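A minimal NumPy sketch of those four steps (names are illustrative; the training targets were left in engineering units, so no de-normalization is needed):

```python
import numpy as np

def predict(x_new, X_mean, X_std, theta):
    """Prediction at raw location x_new, with theta fit by gradient descent
    on the mean/std normalized features and untouched targets."""
    x_tilde = (x_new - X_mean) / X_std          # steps 1-2: same transform as X's columns
    x_tilde = np.concatenate(([1.0], x_tilde))  # step 3: prepend a 1 for the bias
    return x_tilde @ theta                      # step 4: prediction, already in engineering units
```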
 
