A question about regressions.

  • #1
RufusDawes
I notice that the examples always seem to talk about linear regressions, as in y = mx + c.

So, as I understand it, you take the sum of squared residuals and find the parameter values that minimise it (the SSE).

Would this mean that you could essentially set up an Excel spreadsheet that would let you fit almost ANY expression using Solver or VBA? Even if you can't solve it algebraically, you can get close enough.
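For concreteness, here is a minimal sketch of that idea in Python (made-up toy data; scipy's numerical minimiser playing the role of Excel's Solver):

```python
import numpy as np
from scipy.optimize import minimize

# made-up toy data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 8.2, 19.8, 55.0])

# any model expression you like, e.g. y = a * exp(b * x)
def model(params, x):
    a, b = params
    return a * np.exp(b * x)

# sum of squared residuals (SSE), the quantity Solver would minimise
def sse(params):
    return np.sum((y - model(params, x)) ** 2)

result = minimize(sse, x0=[1.0, 0.5])  # numerical search, no algebra needed
print(result.x)  # fitted parameters (a, b)
```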

Why do they focus on the situation that is least likely to be useful? If the data really did follow a straight line, why would you need such a complicated method at all?
 
  • #2
Stephen Tashi
RufusDawes said:
Would this mean that you could essentially set up an Excel spreadsheet that would let you fit almost ANY expression using Solver or VBA? Even if you can't solve it algebraically, you can get close enough.

Your question is unclear. I think you are asking whether one can fit almost any sort of non-linear curve to data by finding constants that minimize the sum of the squared errors between the curve and the data.

Yes, you can fit non-linear curves to data that way. (You can't necessarily do this by solving simple equations, but it can be done numerically.) Whether you use linear or non-linear curves is a subjective matter unless you know some theoretical model for the data that directs your attention to curves with a particular shape.

Sometimes the sum of squared errors is not a good measure of the utility of a fit. For example, if you are estimating the volume of a water tank needed to serve a city with a certain population, the disutility of underestimating the volume is different from the disutility of overestimating it by the same amount.
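To make that concrete, here is a hedged sketch (the data and the penalty factor are made up) where underestimates are penalised more heavily than overestimates, so the minimiser lands above the plain least-squares answer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# hypothetical required volumes observed on different days
demand = np.array([95.0, 102.0, 98.0, 110.0, 105.0])

# asymmetric loss: underestimating costs 10x more than overestimating
def loss(v):
    err = demand - v                  # positive err means we underestimated
    return np.sum(np.where(err > 0, 10.0 * err**2, err**2))

best = minimize_scalar(loss, bounds=(50.0, 200.0), method="bounded")
print(best.x)  # sits above the plain mean (102.0) of the data
```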
 
  • #3
FactChecker
Here is one of the important uses of linear regression: when data looks "sort of" linear, it is natural to ask whether the apparent linear trend is just random luck or whether it really explains something about the data. To answer that, you want the best linear function you can put through the data. Then you can compare how much of the scatter of the data is explained by the linear part versus how much unexplained scatter remains. Statistical tests can tell you whether that linear trend should be attributed to random luck or not.
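A short sketch of that workflow with made-up data, using scipy.stats.linregress: r squared is the share of the scatter the line explains, and the p-value tests whether the slope could plausibly be random luck:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2.0 * x + 5.0 + rng.normal(scale=4.0, size=x.size)  # linear trend + noise

fit = linregress(x, y)
print(fit.slope, fit.intercept)  # close to the true 2.0 and 5.0
print(fit.rvalue ** 2)           # fraction of scatter explained by the line
print(fit.pvalue)                # tiny p-value: trend very unlikely to be luck
```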
 
  • #4
RufusDawes
Stephen Tashi said:
Your question is unclear. I think you are asking whether one can fit almost any sort of non-linear curve to data by finding constants that minimize the sum of the squared errors between the curve and the data.

Yes, you can fit non-linear curves to data that way. (You can't necessarily do this by solving simple equations, but it can be done numerically.) Whether you use linear or non-linear curves is a subjective matter unless you know some theoretical model for the data that directs your attention to curves with a particular shape.

Sometimes the sum of squared errors is not a good measure of the utility of a fit. For example, if you are estimating the volume of a water tank needed to serve a city with a certain population, the disutility of underestimating the volume is different from the disutility of overestimating it by the same amount.

Yes, I was thinking of solving it numerically, iterating in steps of some small number.

Yes, my terminology will be off, sorry about that. I'm not an academic.

The reason is that from points a to b the gradient is steep, but from points c to d the gradient becomes shallow, which means that when you try to fit a straight line you get a compromise gradient and it introduces error that doesn't need to be there.

Does this in effect mean we're violating an assumption of regression analysis because of heteroscedasticity, and if you numerically fit a non-linear curve, does that get around the problem?
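For what it's worth, a quick numerical illustration of that steep-then-shallow problem (made-up curved data): a straight-line fit leaves residuals with a systematic sign pattern, which is a symptom of the wrong model shape rather than of heteroscedasticity (which is about the spread of the errors varying):

```python
import numpy as np

x = np.linspace(0.0, 5.0, 30)
y = np.sqrt(x)                  # steep gradient at first, shallow later

m, c = np.polyfit(x, y, 1)      # least-squares straight line
resid = y - (m * x + c)
print(np.sign(resid))           # negative, then positive, then negative:
                                # a systematic pattern a non-linear model removes
```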
 
  • #5
AlephZero
Often you have some reason for choosing a particular "shape" or regression model before you do any curve fitting (and sometimes even before you collect any data, because the "shape" of the curve will influence the best choice of data points to test whether it is a good model or a poor one).

Otherwise, you can fall into the trap of just going on a fishing expedition to see if something in a collection of data happens to fit a straight line. That doesn't "prove" anything, because correlation is not the same as causation. Reading too much into the word "explains" in FactChecker's quote

When data looks "sort of" linear, it is natural to ask if the apparent linear trend is just random luck or if it really explains something about the data.
is a dangerous game to play. That way, you can easily "prove" nonsense like "owning a washing machine causes diabetes" (compare the death rate from diabetes with the percentage of families owning washing machines over the last 100 years), or whatever other piece of crackpottery appeals to you.

IMO the reason that simple examples "always seem to" talk about linear regression is probably that linear regression leads to the simplest version of the math, and many apparently non-linear curve-fitting problems can be mathematically transformed into linear ones.

If you just want to fit a "smooth curve" through some data points, you don't have to use regression at all - for example you can fit a spline curve.
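For example, a cubic-spline sketch with scipy (made-up points):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# made-up data points
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.7, 5.8, 6.1, 6.3])

spline = CubicSpline(x, y)  # smooth curve through every point exactly
print(spline(1.5))          # interpolated value between the samples
```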
 
  • #6
RufusDawes
AlephZero said:
Often you have some reason for choosing a particular "shape" or regression model before you do any curve fitting (and sometimes even before you collect any data, because the "shape" of the curve will influence the best choice of data points to test whether it is a good model or a poor one).

Otherwise, you can fall into the trap of just going on a fishing expedition to see if something in a collection of data happens to fit a straight line. That doesn't "prove" anything, because correlation is not the same as causation. Reading too much into the word "explains" in FactChecker's quote

When data looks "sort of" linear, it is natural to ask if the apparent linear trend is just random luck or if it really explains something about the data.

is a dangerous game to play. That way, you can easily "prove" nonsense like "owning a washing machine causes diabetes" (compare the death rate from diabetes with the percentage of families owning washing machines over the last 100 years), or whatever other piece of crackpottery appeals to you.

IMO the reason that simple examples "always seem to" talk about linear regression is probably that linear regression leads to the simplest version of the math, and many apparently non-linear curve-fitting problems can be mathematically transformed into linear ones.

If you just want to fit a "smooth curve" through some data points, you don't have to use regression at all - for example you can fit a spline curve.


Awesome, thanks.

I just went with y = p*exp(r*t), where t is time, r is the variable of interest, and p is a constant.

Plugged it into Excel, calculated the MAPE, and used Solver to minimise it.
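In Python the same setup might look like this (made-up numbers; scipy's minimiser standing in for Solver):

```python
import numpy as np
from scipy.optimize import minimize

# made-up series: t is time, y is the observed values
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 13.4, 18.3, 24.7, 33.1])

# model: y = p * exp(r * t); MAPE = mean absolute percentage error
def mape(params):
    p, r = params
    pred = p * np.exp(r * t)
    return np.mean(np.abs((y - pred) / y)) * 100.0

# Nelder-Mead copes better with the non-smooth absolute values in MAPE
result = minimize(mape, x0=[10.0, 0.3], method="Nelder-Mead")
print(result.x, result.fun)  # fitted (p, r) and the minimised MAPE
```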

If I wanted to include 3 or 4 variables to do a multiple regression, could I simply go:

y = p*( exp(r1) + exp(r2) + exp(r3) )

In theory, if this is minimising the error, it should be a better fit than just using the average?

Thanks so much for your help so far. What do you think is the advantage of regression, and what techniques are better?
 
  • #7
There is a deep statistical foundation behind linear regression that is based on the errors following a normal distribution, so there is a lot to gain from sticking with that type of model. One advantage is that it can tell you when adding more complexity to the model is statistically justified. You can get different shapes for the model and still retain the advantages of linear regression.

Suppose:
1) the general trend of data is exponential, and
2) the random scatter of data seems to grow proportional to the trend line (when the values get bigger, the scatter gets bigger), and
3) the y data values are all positive.

Then take the logarithm of all the y values and perform a linear regression to get a model
log(y) = a + b*x + c*x^2 + ...
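A sketch with made-up data: after taking logs, an exponential trend with proportional scatter becomes a straight-line problem with roughly constant scatter, so ordinary linear regression applies:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 40)
# exponential trend whose scatter grows in proportion to the trend line
y = 5.0 * np.exp(0.8 * x) * rng.lognormal(sigma=0.1, size=x.size)

# after logs the model is linear: log(y) = log(5) + 0.8*x + noise
b, a = np.polyfit(x, np.log(y), 1)  # returns [slope, intercept]
print(b, a)  # slope near 0.8, intercept near log(5), about 1.61
```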
 

1. What is a regression?

A regression is a statistical method used to analyze the relationship between two or more variables. It is often used to predict the value of one variable based on the values of other variables.

2. What types of regressions are there?

There are several types of regressions, including linear regression, logistic regression, polynomial regression, and multiple regression. Each type is used for different types of data and research questions.

3. How is a regression different from correlation?

Regression and correlation are both used to analyze the relationship between variables. However, regression allows for the prediction of one variable based on the values of other variables, while correlation only measures the strength and direction of the relationship between variables.

4. What is the purpose of a regression analysis?

The purpose of a regression analysis is to determine the strength and direction of the relationship between variables, as well as to make predictions about the value of one variable based on the values of other variables. It is often used in research and data analysis to understand and predict patterns and trends in data.

5. What are some limitations of regression analysis?

Regression analysis can be limited by outliers in the data, non-linear relationships between variables, and the unwarranted assumption of causality between variables. It is important to carefully consider the data and the research question before conducting a regression analysis to ensure accurate and meaningful results.
