Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

SUMMARY

This discussion focuses on the implementation of feature normalization/scaling in linear regression using gradient descent, specifically in the multivariate case. The feature matrix $X$ is transformed column-by-column to $\tilde{X}$ via $\tilde{X}=\frac{X-\bar{X}}{s_{X}}$, where $\bar{X}$ is the column mean and $s_{X}$ is the column sample standard deviation. To predict after normalization, the query location $x$ is transformed in the same manner; if the targets were also normalized, the prediction in engineering units is $\hat{y} = \bar{y} + \tilde{X}\tilde{\theta}\,s_{y}$, while if the targets were left untouched, $\tilde{x}\cdot\theta$ is already in engineering units. In either case, the $\theta$ vector produced by gradient descent on the normalized features retains the scaling and units needed for accurate predictions.

PREREQUISITES
  • Understanding of linear regression and gradient descent algorithms
  • Familiarity with feature normalization techniques
  • Knowledge of matrix operations and vector notation
  • Basic statistics, including mean and standard deviation calculations
NEXT STEPS
  • Study the implementation of feature normalization in Python using libraries like NumPy or scikit-learn
  • Explore the differences between gradient descent and the normal equation for linear regression
  • Learn about the implications of feature scaling on model performance and convergence rates
  • Investigate advanced optimization techniques for large datasets in linear regression
USEFUL FOR

Data scientists, machine learning practitioners, and statisticians who are implementing linear regression models and seeking to optimize their predictions through effective feature scaling techniques.

Ackbach
Cross-posted on SE.DS Beta.

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform to $\tilde{X}$, where, on a column-by-column basis,
$$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$
Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of a column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.
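In NumPy, that pre-processing step might look like the following rough sketch (the data and variable names are purely illustrative):

```python
import numpy as np

# X: (n_samples, n_features) raw feature matrix, features in columns
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

X_mean = X.mean(axis=0)         # column means, \bar{X}
X_std = X.std(axis=0, ddof=1)   # column sample standard deviations, s_X

X_tilde = (X - X_mean) / X_std  # normalized features

# Prepend a column of 1's to allow for the constant offset in theta
X_tilde = np.hstack([np.ones((X_tilde.shape[0], 1)), X_tilde])
```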

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!
 
If I'm not mistaken, you're trying to find $\theta$, such that $X\theta$ is as close as possible to some $y$.
That is, find the $\theta$ that minimizes $\|X\theta - y\|$.
Afterwards, the prediction is $\hat y = X\theta$.
Is that correct?

If so, then after normalization we find $\tilde \theta$ such that $\tilde X\tilde\theta$ is as close as possible to the likewise-normalized $\tilde y = \frac{y-\bar y}{s_y}$, yes?
In that case the prediction becomes:
$$\hat y = \bar y + \tilde X\tilde\theta s_y$$
which is in engineering units.
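As a rough NumPy sketch of that version (assuming the targets were normalized with their own mean `y_mean` and sample standard deviation `y_std`, and `theta_tilde` was fit by gradient descent on the normalized problem):

```python
import numpy as np

def predict_denormalized(x_new, X_mean, X_std, y_mean, y_std, theta_tilde):
    """Prediction in engineering units when both X and y were normalized."""
    x_tilde = (x_new - X_mean) / X_std          # same transform as the columns of X
    x_tilde = np.concatenate(([1.0], x_tilde))  # prepend a 1 for the bias term
    return y_mean + (x_tilde @ theta_tilde) * y_std
```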
 
Actually, I just learned the answer (see the SE.DS Beta link): you transform the prediction location $x$ in precisely the same way you did for the columns of $X$, but component-wise. So you do this:

1. From each element of $x$, you subtract the mean of the corresponding column of $X$.
2. Divide the result by the standard deviation of the corresponding column of $X$.
3. Prepend a $1$ to the $x$ vector to allow for the bias. Call the result $\tilde{x}$.
4. Perform $\tilde{x}\cdot\theta$, which is the prediction value.

As it turns out, if you do feature normalization, then the $\theta$ vector DOES contain all the scaling and engineering units you need. And that actually makes sense, because you're not doing any transform to the outputs of the training data.
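A minimal NumPy sketch of those four steps (names are illustrative; the training targets were left in engineering units, so no de-normalization is needed):

```python
import numpy as np

def predict(x_new, X_mean, X_std, theta):
    """Prediction at raw location x_new, with theta fit by gradient descent
    on the mean/std normalized features and untouched targets."""
    x_tilde = (x_new - X_mean) / X_std          # steps 1-2: same transform as X's columns
    x_tilde = np.concatenate(([1.0], x_tilde))  # step 3: prepend a 1 for the bias
    return x_tilde @ theta                      # step 4: prediction, already in engineering units
```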
 
