FactChecker said:
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.
Basically, imagine two somewhat extreme cases:
(1) You are a scientist seeking explainable knowledge about the nature of an observable. Your goal is to find something very close to the true model that nature uses, and you don't care whether it will get you rich in the short term. Then you will use what you know to come up with a parametric family of models you think should be the right one. Maybe you use all kinds of scientific theory to guess what that should be, you carefully consider the assumptions, and you try to choose a simple family using Occam's razor. Now you assume you've got the right one, and you need to fit the model's parameters ##\theta## to the data. This is a clear case where you are doing parameter estimation. And this is fine; the only drawback is that you've basically got to know the system you're predicting down to a science. Scientists tend to like this approach for obvious reasons.
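To make that concrete, here is a minimal sketch of the workflow in Python, with the parametric family and all the numbers invented purely for illustration: suppose theory says the observable should follow an exponential decay ##y = a e^{-b x}##, and we estimate ##\theta = (a, b)## from noisy measurements by least squares.

```python
# Sketch of approach (1): theory hands us a parametric family (here an exponential
# decay, chosen purely as an illustration); the data only has to fix theta = (a, b).
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    """The assumed parametric family; theory gives the form, the fit gives (a, b)."""
    return a * np.exp(-b * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = model(x, 2.5, 1.3) + rng.normal(scale=0.1, size=x.size)  # "nature" plus measurement noise

theta_hat, theta_cov = curve_fit(model, x, y, p0=(1.0, 1.0))
print("estimated (a, b):", theta_hat)                 # close to the true (2.5, 1.3)
print("standard errors:", np.sqrt(np.diag(theta_cov)))
```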
(2) You don't care much about science or knowledge, and you either don't know or don't care what the true model is that nature uses. You can't, or don't have time to, figure out what assumptions you can rely on. You only care about getting results. Then you will probably start with a complicated model, throwing every possible predictor you can think of at the problem as a candidate variable (e.g. day of the week, temperature, the day the groundhog woke up, the alignments of the planets, the lengths and titles of the headline news articles, etc., whatever you want/have). Then you try to simultaneously simplify the model and minimize the error you get on test data. Simplifying the model usually means reducing the number of variables, combining variables using PCA, or using regularization to force the weights of redundant or poorly performing variables toward 0.

The more you simplify the model, the more you bias it, and the more you reduce its variance (how much the model changes if you train it on different subsets of the data). So a more biased model is more stable/consistent. The reason that simplifying the model reduces the variance is that each additional variable introduces its own noise, and a complex/unbiased model has more freedom to fit that noise. Because of the curse of dimensionality, the more variables you have, the greater the chance that the random noise alone can perfectly predict ##y## on the training data, and that chance grows rapidly. In fact, it can be demonstrated that at some level of model complexity relative to sample size (common in neuroscience, for example), you can build a perfect predictor of the training data using only variables which are pure random noise with absolutely no true correlation with ##y##.

For example, take each pixel of a TV displaying static to be a separate variable, and each frame to be one data point. With a small enough number of frames, your unbiased model will perfectly fit the price of the stock market on your training data, but on your test data it will have no predictive power at all. And if you had one additional variable that was a true predictor, it would be drowned out by all of the chance correlations, and you would still get a terrible model. Since you have so many pixels, it will take a large number of frames before your model starts fitting that one true predictor rather than some combination of random fluctuations.
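Here is a small synthetic sketch of that TV-static scenario (all numbers are made up for illustration): with more noise "pixels" than training frames, an unregularized least-squares fit reproduces the training targets exactly, yet has no predictive power on held-out frames.

```python
# Sketch of approach (2) gone wrong: more pure-noise "pixels" than training "frames".
# An unbiased (unregularized) linear model interpolates the training data perfectly
# even though no variable has any true relation to y, and it fails on test data.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_pixels = 50, 500, 200         # fewer frames than pixels

X_train = rng.normal(size=(n_train, n_pixels))   # TV static, training frames
X_test = rng.normal(size=(n_test, n_pixels))     # more static, test frames
y_train = rng.normal(size=n_train)               # "stock price": unrelated to the pixels
y_test = rng.normal(size=n_test)

# Minimum-norm least-squares solution; with n_pixels > n_train it fits the training set exactly.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print("train MSE:", np.mean((X_train @ w - y_train) ** 2))   # essentially 0, a "perfect" fit
print("test  MSE:", np.mean((X_test @ w - y_test) ** 2))     # about var(y): no predictive power
```

Regularization (e.g. ridge or lasso), PCA, or simply dropping variables are the usual ways of biasing this model back toward something with lower variance.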
One major problem with approach (2) is that you often end up with a model you can't explain. It may also be disconnected from any established theory about the domain and too troublesome to integrate with theory. It also depends on having large amounts of data in order to avoid over- or under-fitting, especially as the number of variables grows.
Scientists also end up using an approach closer to (2) more and more nowadays, because the first approach is hard, and some data scientists are interested in highly complex data that seems to have no simple model. In some cases people have decided that results matter even if we don't know why, as in clinical diagnosis and in the development of more efficient technology; neuroscience is a good example. Now that people are throwing massive amounts of data at such problems, however, they are also trying to work out how to use this approach and still end up with an explainable model (have their cake and eat it too).
In reality, people tend to use an approach somewhere in between. For example, you might know that the phases of the moon are unlikely to predict the stock market, so you will exclude them as candidate variables from the start. Everything you know about which variables might be better predictors than others helps to reduce the number of variables up front, which in turn reduces the amount of data you need and the risk of fitting to noise.
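As a final sketch of that middle ground (again with invented numbers), suppose one genuine predictor is hiding among 150 implausible candidates. With a modest training set, fitting every candidate drowns the true signal in chance correlations, while pre-screening the candidate list with domain knowledge recovers it:

```python
# A single true predictor hidden among many implausible candidates (all synthetic).
# With limited data, fitting everything drowns the signal; pre-screening recovers it.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_noise = 60, 200, 150

true_x = rng.normal(size=n_train + n_test)               # the one real predictor
noise = rng.normal(size=(n_train + n_test, n_noise))     # moon phases, groundhogs, ...
y = 2.0 * true_x + rng.normal(scale=0.5, size=n_train + n_test)

X_all = np.column_stack([true_x, noise])   # approach (2): throw every candidate in
X_screened = true_x[:, None]               # domain knowledge keeps only the plausible variable

def test_mse(X):
    """Fit by least squares on the training rows, report error on the test rows."""
    w, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return np.mean((X[n_train:] @ w - y[n_train:]) ** 2)

print("test MSE, all 151 candidates:", test_mse(X_all))       # large: mostly fit to noise
print("test MSE, pre-screened:      ", test_mse(X_screened))  # near the 0.25 noise floor
```

Regularization would do part of this job automatically, but every variable you can rule out up front is data you don't have to spend ruling it out statistically.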