Ridge Regression Cross Validation

In summary: the penalized cost function is minimized on the training data only; the quality of each candidate ##\lambda## is then judged by the mean-squared error the fitted model achieves on held-out (validation) data, and the ##\lambda## with the lowest average validation MSE is chosen.
  • #1
SchroedingersLion
Hello guys,

I have some difficulties understanding the procedure of cross validation to estimate the hyperparameter ## \lambda ## in Ridge Regression.

The Ridge Regression yields the weight vector w from
$$ \min_w \left( \|Y-Xw\|^2 + \lambda \|w\|^2 \right) $$
X is the data matrix that stores N data vectors in its rows, Y is the N-vector of the targets that belong to the N data vectors.

Now, as far as I understand it, the advantage of Ridge Regression as opposed to ordinary least squares, where ##\lambda=0##, is that we suppress the influence of statistical outliers in our given data.

However, I have read that a prominent way of finding the optimal ##\lambda ## is via cross-validation.
We split the data into training and test sets, estimate w on the training data, and calculate the mean-squared error of the model's predictions ##Xw## on the test data. Then we repeat this for different training-test splits and average the resulting MSEs.

We do this for a range of ##\lambda## and then choose the one ##\lambda## that leads to the least MSE in the cross validation. Fine.
But how is it legit to judge the quality of our model by looking at the mean-squared-error? If I want a minimal mean-squared-error I would have to set ##\lambda=0## and arrive at the ordinary least squares again.

Second question: I have just used a random data set from a machine learning website and performed Ridge Regression on it, using the Python scikit-learn package and its linear_model.RidgeCV class.
https://scikit-learn.org/stable/mod...del.RidgeCV.html#sklearn.linear_model.RidgeCV
It gives me a very small optimal ##\lambda=0.02## (it uses cross validation).
What does that mean? That ordinary least squares would have been good enough?
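For reference, my call looked roughly like this (just a sketch: the synthetic data below stands in for the set I downloaded, and the grid of alphas is a placeholder, not the exact values I used):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the dataset from the website.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# RidgeCV tries every candidate lambda (called "alpha" in scikit-learn)
# via cross validation and keeps the one with the best average score.
alphas = np.logspace(-3, 3, 50)      # placeholder grid of candidate lambdas
model = RidgeCV(alphas=alphas).fit(X, y)

print("chosen alpha (lambda):", model.alpha_)
```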

Regards!
SL
 
  • #2
SchroedingersLion said:
Hello guys,

I have some difficulties understanding the procedure of cross validation to estimate the hyperparameter ## \lambda ## in Ridge Regression.

The Ridge Regression yields the weight vector w from
$$ \min_w \left( \|Y-Xw\|^2 + \lambda \|w\|^2 \right) $$
X is the data matrix that stores N data vectors in its rows, Y is the N-vector of the targets that belong to the N data vectors.

Now, as far as I understand it, the advantage of Ridge Regression as opposed to ordinary least squares, where ##\lambda=0##, is that we suppress the influence of statistical outliers in our given data.
Have you worked through any books or courses on this? I see a lot of different ideas in here, blurring together. But thus far it seems ok. A better way to put it: we don't care about minimizing our cost function in sample. We care quite a bit about how our cost function performs out of sample. The former is available (and split via training and validation sets -- though the wording of in sample vs. out of sample can be a little awkward for validation sets) while the latter is not directly observable, though we can estimate it via performance on test data.

The reason you impose a regularization penalty is, in effect, to tone down how much your model fits to the idiosyncrasies of your training data. You get a less expressive model in sample so that it avoids over-fitting and does better out of sample.
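As a toy illustration of that trade-off (nothing to do with your dataset -- everything below is synthetic, and the degree-12 features are only there to make the unpenalized fit overly expressive), you would typically see the training MSE creep up and the test MSE improve, at least up to a point, as ##\lambda## grows:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Few noisy points plus very expressive (degree-12 polynomial) features,
# so an unpenalized fit can chase the noise.
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=40)
X = PolynomialFeatures(degree=12).fit_transform(x.reshape(-1, 1))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for lam in [1e-6, 1.0, 100.0]:   # 1e-6 is essentially ordinary least squares
    model = Ridge(alpha=lam, solver="svd").fit(X_tr, y_tr)
    print(f"lambda={lam:g}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```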

SchroedingersLion said:
However, I have read that a prominent way of finding the optimal ##\lambda ## is via cross-validation.
We split the data into training and test sets, estimate w on the training data, and calculate the mean-squared error of the model's predictions ##Xw## on the test data. Then we repeat this for different training-test splits and average the resulting MSEs.

Well, no. You split the data into (i) training, (ii) validation, (iii) test.
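In code, a plain (non-cross-validated) version of that three-way split might look like the sketch below; the synthetic data and the roughly 60/20/20 proportions are arbitrary placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Placeholder data; in practice X, y are your own dataset.
X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

# First carve off the test set and lock it away until the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split what remains into training and validation data;
# the validation set is what gets reused while tuning lambda.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```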

SchroedingersLion said:
We do this for a range of ##\lambda## and then choose the one ##\lambda## that leads to the least MSE in the cross validation. Fine.
But how is it legit to judge the quality of our model by looking at the mean-squared-error? If I want a minimal mean-squared-error I would have to set ##\lambda=0## and arrive at the ordinary least squares again.

Again, the cost function of what? If you want a minimal cost function as applied to your training data, then yes, go for ##\lambda = 0## and add some more parameters while you're at it. But your goal should be a lower expected cost out of sample, i.e. when making real-world predictions -- and again, we use things like test data to estimate this.

You need to consciously and explicitly say whether you're referring to in-sample performance or out-of-sample. They are not the same thing.
 
  • #3
Thank you for your response!

I am taking a course on Machine Learning, but it goes through all the ML topics pretty quickly and superficially.
StoneTemplePython said:
Well, no. You split the data into (i) training, (ii) validation, (iii) test.
Hmm, this is not how the Python package seems to work. On Wikipedia, it also says we split the data set into two: training and test, where the test set is the same as the validation set.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)

StoneTemplePython said:
Again, the cost function of what? If you want a minimal cost function as applied to your training data, then yes, go for ##\lambda = 0## and add some more parameters while you're at it. But your goal should be a lower expected cost out of sample, i.e. when making real-world predictions -- and again, we use things like test data to estimate this.

You need to consciously and explicitly say whether you're referring to in-sample performance or out-of-sample. They are not the same thing.

I am just unsure why we consider MSEs when our cost function is something different. The Python CV methods calculate MSEs, even though we use Ridge Regression.
Are these the mean squared errors that we expect on unseen data? For example, suppose I do 2-fold cross validation.
At first I train my model on the first half of the data points and estimate the MSE on the second half. Then I train my model on the second half and measure the MSE on the first half. Would the average of these MSEs be considered "out of sample" and be a legitimate estimate of the error on unseen data?
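In code, what I have in mind is roughly this (a sketch with made-up data; ##\lambda = 0.5## is just an arbitrary candidate):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Made-up data standing in for the real set.
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

lam = 0.5                                   # some candidate lambda
fold_mses = []
for train_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Train on one half, measure the MSE on the other half.
    model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print("average 2-fold MSE for lambda=0.5:", np.mean(fold_mses))
```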
 
  • #4
SchroedingersLion said:
Thank you for your response!

I am taking a course on Machine Learning, but it goes through all the ML topics pretty quickly and superficially.

Hmm, this is not how the Python package seems to work. On Wikipedia, it also says we split the data set into two: training and test, where the test set is the same as the validation set.

If you want to learn this the right way, do something like Caltech's course "Learning From Data", which is available sometimes on edX, as well as here:

https://work.caltech.edu/telecourse.html

- - - -
There is some decent stuff on Wikipedia as well as a lot of junk. In this case you seem to be learning from junk.

There are some similarities between validation and test sets as well as massive differences (immediate things that come to mind: (i) concerns related to data snooping and (ii) goals -- the former is used for tuning, whereas the latter is used for estimating -- at the very end -- out-of-sample performance. Validation sets may be used many times, whereas test sets may be used only once.).

SchroedingersLion said:
I am just unsure why we consider MSEs when our cost function is something different. The Python CV methods calculate MSEs, even though we use Ridge Regression.
Are these the mean squared errors that we expect on unseen data? For example, suppose I do 2-fold cross validation.
At first I train my model on the first half of the data points and estimate the MSE on the second half. Then I train my model on the second half and measure the MSE on the first half. Would the average of these MSEs be considered "out of sample" and be a legitimate estimate of the error on unseen data?

I'm afraid I can't give a clean answer to this because I think you fundamentally don't understand what a test data set is. What you've said here is kind of close but not really right.

The right way to learn this, in my view, is to start simple and build -- i.e. begin by only having training and test data, fit increasingly precise models, then discover overfitting (in-sample performance goes up while out-of-sample performance goes down), and then be introduced to a refinement of this where (a) penalties are introduced to capture the hidden (overfitting) cost of very finely calibrated models and (b) a portion of the training data is held out as a validation set -- and then have someone show how this can be used to avoid overfitting.
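A toy version of that first step -- fitting increasingly precise models with no penalty and watching the out-of-sample error deteriorate -- might look like this (purely illustrative; the polynomial degree stands in for model complexity and all the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=30)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=1)

for degree in [1, 3, 8, 15]:
    poly = PolynomialFeatures(degree=degree)
    X_tr = poly.fit_transform(x_tr.reshape(-1, 1))
    X_te = poly.transform(x_te.reshape(-1, 1))
    model = LinearRegression().fit(X_tr, y_tr)
    # Overfitting shows up as a shrinking train MSE and a growing test MSE.
    print(f"degree {degree:2d}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```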
 
  • #5
Thanks for the resource, I might take a look. The course thus far has been disappointing. The professor goes through all the topics (linear regression, SVM, kernel methods, PCA, neural networks), but I would rather learn one topic in depth than all of them at a superficial level.

StoneTemplePython said:
I'm afraid I can't give a clean answer to this because I think you fundamentally don't understand what a test data set is. What you've said here is kind of close but not really right.

The right way to learn this, in my view, is to start simple and build -- i.e. begin by only having training and test data, fit increasingly precise models, then discover overfitting (in-sample performance goes up while out-of-sample performance goes down), and then be introduced to a refinement of this where (a) penalties are introduced to capture the hidden (overfitting) cost of very finely calibrated models and (b) a portion of the training data is held out as a validation set -- and then have someone show how this can be used to avoid overfitting.

The scikit-learn documentation also explains it like this. In practice, one has ONE large data set, and for validation one splits it up into training and test. Or maybe training and validation, in your terms. It's just that I did not come across the distinction between validation and test, not on Wikipedia, not in scikit-learn, and not in my lecture...
Because for test data, it is only important that it was obtained in the same way as the training data, so each data set can be used for both: training on one part of the set and validating/testing on the other.
[edit: I just googled a bit and found https://stats.stackexchange.com/que...ifference-between-test-set-and-validation-set. So test data would refer to data on which one tests the completely finished model, whereas what I am doing in cross-validation, where I try to find my best hyperparameter, is still validating and creating the model. The error on the validation set will be biased, since it was used to create the model. Thank you for making me acquainted with these terms. Up until now, we were always talking about test sets: either the set on which we test a model, or the set we use for hyperparameter estimation. But they are to be distinguished.]

We did cover the picture of overfitting. Minimizing the MSE on the training data might lead to overfitting, and then the model performs badly on test/validation data (i.e. data that does not belong to the training set). Therefore, we penalize large weights. And now I have it: this leads to a larger MSE on the TRAINING data, but with the possibility that the MSE on the unseen data is reduced. We still measure MSEs on the unseen data, not our cost function. The cost function is only evaluated on the training data. This is in accordance with what you said earlier, that we want the model to perform well on unseen data. And there, the MSE is still the way to go.
 

What is Ridge Regression Cross Validation?

Ridge Regression Cross Validation is a statistical method used to select the best model for predicting outcomes in a dataset. It combines the techniques of ridge regression, which helps to reduce the impact of multicollinearity in a dataset, and cross-validation, which helps to evaluate the performance of a model.

When is Ridge Regression Cross Validation used?

Ridge Regression Cross Validation is used when there is a need to select the best model for predicting outcomes in a dataset with high multicollinearity. It is commonly used in fields such as economics, finance, and social sciences, where there are often multiple variables that are highly correlated with each other.

How does Ridge Regression Cross Validation work?

Ridge Regression Cross Validation works by fitting ridge regression models with different penalty strengths, each including a penalty term to reduce the impact of multicollinearity. Each candidate model is then evaluated using cross-validation: the dataset is split into multiple subsets, and the model is trained and tested across these subsets. The penalty value whose model shows the best average performance is then selected for the final model.
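As a rough sketch of this procedure in Python (the data, the 5-fold setup, and the grid of penalty values are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data with strongly correlated features (low effective rank).
X, y = make_regression(n_samples=150, n_features=8, effective_rank=3,
                       noise=5.0, random_state=0)

best_lambda, best_score = None, -np.inf
for lam in np.logspace(-3, 3, 30):           # candidate penalty strengths
    # 5-fold cross validation: train on 4 folds, score on the held-out fold.
    scores = cross_val_score(Ridge(alpha=lam), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    if scores.mean() > best_score:
        best_lambda, best_score = lam, scores.mean()

print("selected lambda:", best_lambda)
```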

What are the benefits of using Ridge Regression Cross Validation?

One of the main benefits of using Ridge Regression Cross Validation is that it helps to reduce the impact of multicollinearity in a dataset, which can lead to more accurate predictions. Additionally, it allows for the comparison of different models and helps to select the best one for a given dataset.

Are there any limitations to Ridge Regression Cross Validation?

One limitation of Ridge Regression Cross Validation is that it can be computationally expensive, especially when working with large datasets. Additionally, it may not be suitable for datasets with a small number of observations or when the relationship between the variables is not linear.
