Linear Regression Gradient Descent: Feature Normalization/Scaling for Prediction

In summary, the conversation discusses feature normalization for linear regression fit by gradient descent in the multivariate case. The original feature matrix is transformed, column by column, into a normalized feature matrix, and gradient descent then finds the theta vector that minimizes the difference between the predicted and actual outcomes. A new prediction location must be transformed with the same column means and standard deviations before taking the dot product with theta; because the training outputs are left untransformed, the resulting theta vector already carries the scaling needed to return predictions in engineering units. Normalization matters for gradient descent on large feature sets, but is unnecessary when using the normal equation.
  • #1
Ackbach
Cross-posted on SE.DS Beta.

I'm just doing a simple linear regression with gradient descent in the multivariate case. Feature normalization/scaling is a standard pre-processing step in this situation, so I take my original feature matrix $X$, organized with features in columns and samples in rows, and transform to $\tilde{X}$, where, on a column-by-column basis,
$$\tilde{X}=\frac{X-\bar{X}}{s_{X}}.$$
Here, $\bar{X}$ is the mean of a column, and $s_{X}$ is the sample standard deviation of a column. Once I've done this, I prepend a column of $1$'s to allow for a constant offset in the $\theta$ vector. So far, so good.
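A minimal NumPy sketch of that preprocessing step (the function name and the choice to return the column statistics are my own; the statistics are kept so that new locations can be scaled the same way later):

```python
import numpy as np

def normalize_features(X):
    """Column-wise z-score scaling of the feature matrix; also returns the
    column means and sample standard deviations needed for later predictions."""
    mu = X.mean(axis=0)            # column means, X-bar
    sigma = X.std(axis=0, ddof=1)  # sample standard deviations, s_X
    X_tilde = (X - mu) / sigma
    # Prepend a column of ones so the first entry of theta is the constant offset.
    X_tilde = np.column_stack([np.ones(X.shape[0]), X_tilde])
    return X_tilde, mu, sigma
```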

If I did not do feature normalization, then my prediction, once I found my $\theta$ vector, would simply be $x\cdot\theta$, where $x$ is the location at which I want to predict the outcome. But now, if I am doing feature normalization, what does the prediction look like? I suppose I could take my location $x$ and transform it according to the above equation on an element-by-element basis. But then what? The outcome of $\tilde{x}\cdot\theta$ would not be in my desired engineering units. Moreover, how do I know that the $\theta$ vector I've generated via gradient descent is correct for the un-transformed locations? I realize all of this is a moot point if I'm using the normal equation, since feature scaling is unnecessary in that case. However, as gradient descent typically works better for very large feature sets ($> 10k$ features), this would seem to be an important step. Thank you for your time!
 
  • #2
If I'm not mistaken, you're trying to find $\theta$, such that $X\theta$ is as close as possible to some $y$.
That is, find the $\theta$ that minimizes $\|X\theta - y\|$.
Afterwards, the prediction is $\hat y = X\theta$.
Is that correct?

If so, then after normalization, we find $\tilde \theta$, such that $\tilde X\tilde\theta$ is as close as possible to the also normalized $\tilde y = \frac{y-\bar y}{s_y}$, yes?
In that case the prediction becomes:
$$\hat y = \bar y + s_y\,\tilde X\tilde\theta,$$
which is in engineering units.
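To make that convention concrete, here is a hedged sketch assuming $\tilde X$ already carries the leading column of ones and $\tilde\theta$ was fit against the normalized targets; the function and variable names are illustrative only:

```python
import numpy as np

def predict_in_engineering_units(X_tilde, theta_tilde, y_bar, s_y):
    """Undo the target normalization: the model was fit against
    (y - y_bar) / s_y, so rescale its output by s_y and shift by y_bar."""
    return y_bar + s_y * (X_tilde @ theta_tilde)
```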
 
  • #3
Actually, I just learned the answer (see the SE.DS Beta link): you transform the prediction location $x$ in precisely the same way you did for the columns of $X$, but component-wise. So you do this:

1. From each element of $x$, you subtract the mean of the corresponding column of $X$.
2. Divide the result by the standard deviation of the corresponding column of $X$.
3. Prepend a $1$ to the $x$ vector to allow for the bias. Call the result $\tilde{x}$.
4. Perform $\tilde{x}\cdot\theta$, which is the prediction value.

As it turns out, if you do feature normalization, then the $\theta$ vector DOES contain all the scaling and engineering units you need. And that actually makes sense, because you're not doing any transform to the outputs of the training data.
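A small sketch of that prediction recipe, assuming `mu` and `sigma` are the column means and sample standard deviations saved from training, and `theta` was fit on the normalized features against the un-normalized targets (the names are mine, not from the thread):

```python
import numpy as np

def predict(x_new, theta, mu, sigma):
    """Scale a new location with the training-time column statistics,
    prepend the bias term, and dot with theta. Because the targets were
    never transformed, the result is already in engineering units."""
    x_tilde = (x_new - mu) / sigma             # subtract mean, divide by std
    x_tilde = np.concatenate(([1.0], x_tilde))  # prepend the 1 for the bias
    return x_tilde @ theta
```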
 

1. What is linear regression gradient descent?

Gradient descent for linear regression is an optimization algorithm used to fit a model that predicts numerical values from input data. It works by iteratively adjusting the parameters to minimize the cost function, which measures the difference between the predicted values and the actual values.
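As a rough sketch of that minimization for the squared-error cost (batch updates; the step size and iteration count are arbitrary illustrative choices, not recommendations):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent on the mean-squared-error cost
    J(theta) = (1 / (2 * m)) * sum((X @ theta - y) ** 2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J at the current theta
        theta -= alpha * grad              # step downhill
    return theta
```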

2. What is feature normalization/scaling?

Feature normalization or scaling is the process of transforming the numerical values of features in a dataset to a similar scale. This is done to prevent features with larger values from dominating the training process and to improve the performance of the model.

3. Why is feature normalization/scaling important in linear regression gradient descent?

Feature normalization is important in linear regression gradient descent because it helps the algorithm to converge faster and more accurately. It also prevents features with larger values from having a higher impact on the model, making the predictions more balanced.

4. How do you perform feature normalization/scaling in linear regression gradient descent?

Feature normalization in linear regression gradient descent is typically done by subtracting the mean of the feature values from each data point and then dividing by the standard deviation. This ensures that the feature values have a mean of 0 and a standard deviation of 1.
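A quick numerical check of that claim, using hypothetical values and the sample standard deviation (`ddof=1`) to match the $s_X$ defined above:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
x_scaled = (x - x.mean()) / x.std(ddof=1)
print(x_scaled.mean())        # ~0.0 (up to floating-point error)
print(x_scaled.std(ddof=1))   # 1.0
```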

5. Are there any disadvantages to feature normalization/scaling in linear regression gradient descent?

One potential disadvantage of feature normalization in linear regression gradient descent is that it can be computationally expensive, especially for datasets with a large number of features. Additionally, normalization may not be necessary for all datasets, and in some cases, it may even decrease the performance of the model.
